SciPy 2020 Day Two

Half of the Dask tutorial, Continuous Integration for Scientific Projects and more…

Day two of the SciPy 2020 conference was also very informative. Except for some connectivity issues with my internet provider, which led to me missing the latter half of the awesome Dask tutorial and prevented me from listening to other talks, everything went equally smoothly. My SciPy day started nicely with a very informative conversation with my assigned mentor Hongsup Shin, an experienced data scientist at Arm, who gave me tips and hints on how to get more involved in open source scientific development, which was very helpful for me. At this point I would also like to thank Hongsup for his support. Apart from these highlights, I watched a talk about boost-histogram and one about Continuous Integration for Scientific Python Projects, and followed the daily Welcome, SciPy Tools Plenary Session, and HPC Python Q&A sessions, which I will briefly introduce in the next sections.

Boost-histogram: High-Performance Histograms as Objects |SciPy 2020| Schreiner, Pivarski & Dembinski

Schreiner et al. generalize the usual single operation, which produces several arrays that together make up a histogram, by introducing histograms as high-performance Python objects. boost-histogram is built on top of the Boost libraries’ Histogram in C++14. Their work is primarily meant as a foundation others are able to build on. In fact, within the Scikit-HEP project there are already a physicist-friendly front-end called ‘Hist’ and a conversion package named ‘Aghast’ designed around boost-histogram.

One of the benefits is that boost-histograms are Python objects allowing fancy indexing with tidy names and callbacks, in addition to simple manipulation through NumPy, and that they are easy to handle in a threaded program workflow. Furthermore, they are fast, usually twice as fast as naive, unoptimized NumPy alternatives, and even faster when threaded. Make sure to watch the talk!
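To give a feel for what the object-based approach looks like, here is a minimal sketch using the boost_histogram package; the axis range, bin count, and sample data are purely illustrative.

```python
# Minimal boost-histogram sketch: a histogram as an object, not just two arrays.
import numpy as np
import boost_histogram as bh

rng = np.random.default_rng(42)
data = rng.normal(size=100_000)

# Build a histogram object from an axis definition and fill it.
hist = bh.Histogram(bh.axis.Regular(50, -3, 3))
hist.fill(data)

# The counts are still available as a plain NumPy array ...
counts = hist.view()

# ... and the object itself supports UHI-style slicing, e.g. rebinning by a factor of 2.
rebinned = hist[:: bh.rebin(2)]
print(counts.sum(), rebinned.view().sum())
```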

Tutorial: Parallel and Distributed Computing in Python with Dask

This section will be updated with content and the video link as soon as I have seen the part I missed due to connectivity issues 😄.

Continuous Integration for Scientific Python Projects |SciPy 2020| Stanley Seibert

This talk is essentially a collection of good software engineering continuous integration best practices, tips, and tricks applied to scientific Python projects. For me, there wasn’t anything surprising or unknown here, but it is very useful to recall all the necessary balancing decisions from time to time. The key takeaway is that continuous integration is essentially a process whose tasks should be ordered along the importance/impact axis and ticked off in sequence. If you want to make your life easier and don’t know the best practices, or simply want to recall the important parts -> watch this talk! 😄

Welcome & SciPy Tools Plenary Session

Today I also participated in the ‘Welcome & SciPy Tools Plenary Session’. During the Welcome session, the remainder of the daily schedule was presented and the gold sponsors were honoured. It was a relatively quick session, mainly intended to introduce the SciPy Tools Plenary session, during which the following scientific Python libraries were covered.

Pandas

Pandas version 1.1 will be released in July 2020, and funding was secured with the help of CZI and NumFOCUS. With the Pandas 1.0 release, a new Pandas logo was introduced, along with improvements to the Pandas website theme and the documentation.

Pandas received several larger and smaller improvements w.r.t. Numba integration and new data types, like the dedicated nullable pd.NA for missing data and ExtensionArrays. User-defined functions (UDFs) are now able to reduce Python overhead when using the Numba engine.
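As a rough sketch of what these two additions look like in practice (the column name and data are made up, and the Numba engine assumes Numba is installed), something along these lines works with Pandas 1.x:

```python
# Nullable integers with pd.NA, and a rolling UDF compiled via the Numba engine.
import numpy as np
import pandas as pd

# Dedicated missing-value marker pd.NA via the nullable "Int64" extension dtype.
s = pd.array([1, 2, None, 4], dtype="Int64")
print(s)  # [1, 2, <NA>, 4]

# A user-defined rolling function executed through Numba instead of pure Python.
def window_mean(window):
    return window.mean()

df = pd.DataFrame({"x": np.arange(10, dtype="float64")})
result = df["x"].rolling(3).apply(window_mean, raw=True, engine="numba")
print(result.tail())
```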

Numba

Numba has been adopted by more large projects, like Awkward and Pandas, and underwent massive refactorings as the code base was reorganized and support for old Python and NumPy versions, along with the corresponding hacks, was removed. Numba now has better error reporting too and benefits from a growing user community with > 1.4M downloads/month. It is noteworthy that the people behind Numba are working on a new governance model to encourage new maintainers and contributors to participate in development.
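For readers who haven’t used Numba before, here is a minimal, self-contained usage sketch (unrelated to any of the projects above); the function and data are made up for illustration:

```python
# Compile a plain Python loop to machine code with Numba's @njit decorator.
import numpy as np
from numba import njit

@njit
def pairwise_sum(a, b):
    # Compiled on first call; subsequent calls bypass the Python interpreter.
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        out[i] = a[i] + b[i]
    return out

x = np.arange(1_000_000, dtype=np.float64)
print(pairwise_sum(x, x)[:3])
```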

Plotly

Plotly, a plotting library which integrates with Python, R, and Julia and forms the core of the Dash framework, has introduced a lot of improvements, like the Express high-level API, removal of the binding to external servers, improved tile maps with support for all tile servers, hierarchical data plots, imshow, and several speedups. Furthermore, they integrated a new Jupytext backend for documentation purposes and a new Python API reference interface.

Furthermore, they announced a new static image export system, Python-to-JavaScript access, an expansion of Plotly Express, and better support for sequence and periodic data.

Currently, Plotly has about 3M downloads/month and more than 50 GitHub contributors.
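To give a feel for the Express high-level API mentioned above, here is a small sketch using Plotly’s built-in gapminder sample data; the exact column choices are just illustrative.

```python
# A one-call interactive scatter plot with Plotly Express.
import plotly.express as px

df = px.data.gapminder().query("year == 2007")
fig = px.scatter(
    df,
    x="gdpPercap",
    y="lifeExp",
    size="pop",
    color="continent",
    hover_name="country",
    log_x=True,
)
fig.show()
```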

xarray

xarray, a library for working with labelled multi-dimensional arrays, improved through sparse and other NEP-18 array integrations as well as support for dataset argmin/argmax and weighted mean functionality.
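A quick sketch of what the new weighted and argmin/argmax functionality looks like on a DataArray (the dimensions, coordinates, weights, and data are invented for illustration):

```python
# Weighted reductions and dimension-aware argmax in xarray.
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.random.rand(4, 3),
    dims=("time", "space"),
    coords={"time": np.arange(4), "space": ["a", "b", "c"]},
)
weights = xr.DataArray([0.1, 0.2, 0.7], dims="space")

# Weighted mean along the labelled 'space' dimension.
print(da.weighted(weights).mean(dim="space"))

# Index of the maximum along 'time', returned per 'space' label.
print(da.argmax(dim="time"))
```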

They are now funded by CZI and currently need help with improving the documentation.

sklearn

sklearn refreshed their website and further improved the user guide. A new plotting API based on Matplotlib was introduced. sklearn now supports stacked classifiers (the output of one classifier becomes a feature for the next one) and HistGradientBoosting with missing value support and monotonic constraints, and nearest neighbor graphs no longer need a complete recomputation when new data points are introduced. Additionally, there are minor improvements regarding k-means scalability, and generalized linear models were implemented.

One awesome feature is that sklearn now has rich, visual, interactive representations of estimators and computation pipelines!
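Here is a hedged sketch combining a stacked classifier, HistGradientBoosting, and the new diagram representation; the toy dataset and estimator choices are arbitrary (and note that older scikit-learn versions still required the experimental enable_hist_gradient_boosting import):

```python
# Stacking: base estimator outputs become features for the final logistic regression.
from sklearn import set_config
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (
    HistGradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("hgb", HistGradientBoostingClassifier()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)
print(stack.score(X_test, y_test))

# Enable the rich visual representation of estimators (rendered e.g. in a notebook).
set_config(display="diagram")
stack
```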

matplotlib

Matplotlib received one year of funding from the Chan Zuckerberg Initiative and introduced test baseline image relocation (GSoC). There were multiple bugfix releases last year, and they are preparing for a 3.4 release in September 2020 with new bar3d light source support, simplified tick formatters, a new mosaic subplot, and post-hoc axes sharing. They are also working on improving the current documentation.
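The mosaic subplot is probably easiest to grasp from a tiny sketch; this uses plt.subplot_mosaic as available in recent Matplotlib releases, and the panel labels and data are made up:

```python
# Build a named subplot layout from an ASCII sketch of the figure.
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)

layout = """
AA
BC
"""
# 'A' spans the whole top row; 'B' and 'C' share the bottom row.
fig, axes = plt.subplot_mosaic(layout)
axes["A"].plot(x, np.sin(x))
axes["B"].plot(x, np.cos(x))
axes["C"].hist(np.random.randn(1000), bins=30)
plt.show()
```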

bokeh

Bokeh now supports static and dynamic plots and improved WebGL support using ReGL. They are starting to support server-side rendering of large datasets, are promoting the Bokeh protocol for interactions between Python and JavaScript, and are trying to decouple Bokeh from Tornado and WebSockets. Furthermore, they want to allow Python callbacks without a separate Bokeh server. For future updates, they want to introduce generalized and abstract widgets and interactions, improve the user experience of building dashboards and tools, and adopt LaTeX and MathText support. They also intend to tighten the BokehJS libraries and improve memory performance.
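As a minimal sketch of the WebGL rendering path mentioned above, a standalone plot can opt into the WebGL backend like this (data, sizes, and file name are placeholders):

```python
# A standalone Bokeh scatter plot rendered via the WebGL output backend.
import numpy as np
from bokeh.plotting import figure, output_file, show

x = np.random.rand(10_000)
y = np.random.rand(10_000)

# output_backend="webgl" asks BokehJS to draw the glyphs with WebGL.
p = figure(width=600, height=400, output_backend="webgl", title="WebGL scatter")
p.scatter(x, y, size=3, alpha=0.3)

output_file("webgl_scatter.html")
show(p)
```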

Lightning Talks

I also watched the punalicious lightning talks on day 2 of SciPyConf 2020 - many thanks to the hilariously funny hosts. I especially remember the talks about RAPIDS, an end-to-end GPU-accelerated data science library, Frappuccino, a library helping with API changes, and a study about visually pleasing color cycles intended to improve coloring for people suffering from deuteranopia, protanopia, or tritanopia. Colors must be accessible to everybody!

Kind regards,

Henrik Hain

Henrik Hain
Data Scientist / Data Engineer

My (research) interests evolve around the practical and theoretical aspects of software engineering, (self-) learning systems and algorithms, especially (deep) reinforcement learning, spatio-temporal event detection, and computer vision approaches.
