Scipy 2020 Day Three

Deep Learning from Scratch, Pandera, Awkward Arrays, and Scalable ML with Ray

The third day of SciPy 2020 was filled with interesting and foundational tutorial content on deep learning, including a short primer on the PyTorch library, and I also found the time to watch some interesting SciPy talks from Enthought's SciPy YouTube channel. Fortunately, my internet provider Vodafone decided late in the evening to have massive connectivity problems all over Mannheim, so I didn't miss too much.

My third day began with the talks Pandera: Statistical Data Validation of Pandas Dataframes from Niels Bantilan, Ray: A System for Scalable Python and ML held by Robert Nishihara, and Machine Learning Model Serving: The Next Step from Simon Mo, before I participated in the tutorial Deep Learning from Scratch with PyTorch. After the tutorial I managed to watch some additional content: the talk Analyzing the Performance of Python Applications Using Multiple Levels of Parallelism from Christian Harold, a fellow researcher from Dresden, and the introductory talk Awkward Array: Manipulating JSON-like Data with NumPy-like Idioms held by Jim Pivarski.

Together with the content from the former two days, my reading and research list is starting to grow extremely fast, and I guess I need to find some time slots for isolated batch processing 😄

So, let’s start with a little bit of content review.

Pandera: Statistical Data Validation of Pandas Dataframes | Niels Bantilan

pandera is a data validation library for correctness and hypothesis tests on Pandas dataframes in tidy (long-form) and wide data format (see Tidy Data, Tidy Types, and Tidy Operations for more information on tidy data). For runtime validation of pandas data structures, pandera uses the information pandas provides.

Pandera is able to check types and properties of columns in a pd.DataFrame or values in a pd.Series.

Furthermore, pandera is able to perform more complex statistical validations like hypothesis testing and integrates with existing data science processing pipelines through function decorators.

Make sure to watch the very good talk from Niels Bantilan.

Ray: A System for Scalable Python and ML | Robert Nishihara

Robert Nishihara presented Ray, which tries to consolidate the converging areas of big data stores, deep learning, web services, and high performance computing by helping with distributed computing and model serving.

Ray allows executing and utilizing parallelism on your local computer as well as scaling out to computing clusters on Legion, AWS, Azure, and Google Cloud, and proves to be similar to tools like Dask while maintaining ease of use.

Machine Learning Model Serving: The Next Step | Simon Mo

RayServe, an independent component of the formerly introduced Ray, addresses the problem of model serving for interactive scoring or batch predictions. Simon Mo does a great job highlighting the issues one may experience with the obvious solution of serving a model via, e.g., a Flask web service, and furthermore discusses the common alternative of using an externalized tensor prediction service as utilized by TFServing, SageMaker, and others, along with its shortcomings.
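For reference, the "obvious solution" he critiques looks roughly like this sketch (the model here is a trivial stand-in, not a real predictor): a single-process Flask app that wraps the model in one endpoint, which gives you no batching, no scaling across replicas, and couples the web server's lifecycle to the model's.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def model_predict(features):
    # Hypothetical placeholder; in practice this would be a loaded
    # scikit-learn or PyTorch model.
    return sum(features)

@app.route("/predict", methods=["POST"])
def predict():
    # One request, one synchronous prediction, one process:
    # exactly the pattern that breaks down at scale.
    features = request.get_json()["features"]
    return jsonify({"prediction": model_predict(features)})
```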

RayServe claims to overcome the deficiencies of both approaches via an easy-to-use CLI and recipes. Make sure to watch his video if you are interested in model serving at scale!

Tutorial: Deep Learning from Scratch with PyTorch

The four-hour tutorial Deep Learning from Scratch with PyTorch is a really great resource about the foundations of implementing neural networks with NumPy, the error-prone complexity this involves, and why PyTorch should be used in the first place. Additionally, one of the most noteworthy and commendable facts is that the tutors Hugo Bowne-Anderson and Dhavide Aruliah started by discussing possible social issues deep learning could pose.

Unfortunately, after the basics there were only 15 minutes left for the PyTorch part, so deep learning itself, with its own advantages and disadvantages, was barely introduced anymore. Nevertheless, the linked materials are a great learning resource with many facts, tips, and tricks.
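The core argument for PyTorch over hand-rolled NumPy can be compressed into a few lines: autograd derives the gradients that you would otherwise have to code (and debug) by hand. A minimal sketch of one linear-regression step, assuming PyTorch is installed:

```python
import torch

# Tiny dataset: y = 2x, so the optimal weight is 2.
x = torch.tensor([[1.0], [2.0], [3.0]])
y = torch.tensor([[2.0], [4.0], [6.0]])

w = torch.zeros(1, requires_grad=True)

loss = ((x * w - y) ** 2).mean()
loss.backward()  # autograd fills w.grad with d(loss)/dw -- no manual calculus

with torch.no_grad():
    w -= 0.1 * w.grad  # one gradient-descent step toward w = 2
```

In plain NumPy, the `loss.backward()` line is where the hand-derived, error-prone gradient formula would live; with deeper networks that derivation quickly becomes the hardest part.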

As with the other tutorials, the YouTube video will be linked as soon as it is generally available. In the meantime, be sure to check out the Jupyter notebooks; they are awesome ❤️

[Deep Learning from Scratch with PyTorch]

Analyzing the Performance of Python Applications Using Multiple Levels of Parallelism | Christian Harold

Score-P introduces a scalable performance measurement infrastructure for parallel codes and comes with a practical Python module, Score-P Python, for instrumenting Python code. In general, Score-P provides several instrumentation wrappers for process-level, thread-level, and accelerator-based parallelism, as well as wrappers for external code from OpenBLAS, LAPACK, and FFTW. Basically, Score-P provides a view on parallelism and resource usage for highly distributed application code. Additionally, there are multiple extensions providing different views on application behavior, and it integrates with Vampir and Cube.
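In practice, the entry point is pleasantly small: the Python bindings instrument an unmodified script from the command line (a sketch, assuming Score-P and its `scorep` Python module are installed; `my_app.py` is a hypothetical script name):

```shell
# Run an existing Python application under Score-P instrumentation;
# profiles/traces are written for later analysis.
python -m scorep my_app.py

# The recorded data can then be inspected with the tools the talk
# mentions, e.g. Vampir (traces) or Cube (profiles).
```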

Awkward Array: Manipulating JSON-like Data with NumPy-like Idioms | Jim Pivarski

Awkward Array is a library from the scikit-hep environment that introduces NumPy-like idioms for JSON-like data. Awkward Array is useful if you need to have JSON or dictionary-like data in array format for scalability and performance purposes while still being able to do mathematics and slicing.

Additionally, Awkward Array integrates with Numba optimizations, e.g., Numba-compiled functions. The library is exceptionally well suited to handle GeoJSON data at scale!

Kind regards,

Henrik Hain

Henrik Hain
Data Scientist / Data Engineer

My (research) interests revolve around the practical and theoretical aspects of software engineering, (self-)learning systems and algorithms, especially (deep) reinforcement learning, spatio-temporal event detection, and computer vision approaches.