SciPy 2020 Day Two
Half of the Dask tutorial, Continuous Integration for Scientific Projects and more…
Day two of the SciPy 2020 conference was also very informative. Apart from some connectivity issues with my internet provider, which made me miss the latter half of the awesome Dask tutorial and prevented me from listening to some other talks, everything went smoothly. My SciPy day started nicely with a very informative conversation with my assigned mentor Hongsup Shin, an experienced data scientist at Arm, who gave me helpful tips on how to get more involved in open source scientific software development. At this point I would like to thank Hongsup for his support. Apart from these highlights, I watched talks about boost-histogram and Continuous Integration for Scientific Python Projects, and attended the daily Welcome, the SciPy Tools Plenary Session, and the HPC Python Q&A session, which I will briefly introduce in the next sections.
Boost-histogram: High-Performance Histograms as Objects |SciPy 2020| Schreiner, Pivarski & Dembinski
Schreiner et al. generalized the usual single operation, which produces several arrays that together form a histogram, by introducing histograms as high-performance Python objects. boost-histogram is built on top of the Boost C++ libraries' Histogram library, which is written in C++14. Their work is meant as a foundation that others can build on. In fact, the Scikit-HEP project already has a physicist-friendly front-end called ‘Hist’ and a conversion package named ‘Aghast’, both designed around boost-histogram.
One of the benefits is that boost-histograms are Python objects that allow fancy indexing with tidy names and callbacks, in addition to simple manipulation through NumPy, and that they are easy to handle in a threaded program workflow. Furthermore, they are fast: usually twice as fast as naive, unoptimized NumPy alternatives, and even faster when threaded. Make sure to watch the talk!
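To make the "histogram as an object" idea more concrete, here is a minimal sketch using only NumPy. The class `SimpleHist` and its methods are hypothetical names I made up for illustration. This is not the boost-histogram API, just a toy showing why an object that can be filled in chunks and merged is handy compared to the loose arrays returned by `np.histogram`.

```python
import numpy as np

# The "usual single operation": NumPy returns two loose arrays.
data = np.array([0.1, 0.2, 0.2, 0.7, 0.9])
counts, edges = np.histogram(data, bins=4, range=(0.0, 1.0))


class SimpleHist:
    """Toy 1D histogram object (hypothetical, NOT the boost-histogram API)."""

    def __init__(self, bins, start, stop):
        self.edges = np.linspace(start, stop, bins + 1)
        self.counts = np.zeros(bins, dtype=np.int64)

    def fill(self, values):
        # Accumulate into the existing bins, so filling can happen in chunks.
        new_counts, _ = np.histogram(values, bins=self.edges)
        self.counts += new_counts
        return self

    def __add__(self, other):
        # Merging two histograms is what makes per-worker filling practical.
        if not np.array_equal(self.edges, other.edges):
            raise ValueError("histograms must share the same binning")
        merged = SimpleHist(len(self.counts), self.edges[0], self.edges[-1])
        merged.counts = self.counts + other.counts
        return merged


# Fill two histograms independently (e.g. one per worker) and merge them.
h1 = SimpleHist(4, 0.0, 1.0).fill(data[:2])
h2 = SimpleHist(4, 0.0, 1.0).fill(data[2:])
total = h1 + h2
```

The merged `total.counts` matches what a single `np.histogram` call over all the data produces, but the binning and the counts travel together in one object.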
Tutorial: Parallel and Distributed Computing in Python with Dask
Continuous Integration for Scientific Python Projects |SciPy 2020| Stanley Seibert
This talk is a collection of good software engineering practices, tips, and tricks for continuous integration, applied to scientific Python projects. For me, there wasn’t anything surprising or unknown here, but it is good to be reminded of the important trade-offs from time to time. The key takeaway is that continuous integration is essentially a process whose tasks should be ordered by importance and impact and ticked off in sequence. If you want to make your life easier and don’t know the best practices, or want to recall the important parts -> watch this talk! 😄
Welcome & SciPy Tools Plenary Session
Today I also attended the ‘Welcome’ and ‘SciPy Tools Plenary’ sessions. During the Welcome session, the remainder of the daily schedule was presented and the gold sponsors were honoured. It was a relatively quick session, mainly intended to introduce the SciPy Tools Plenary session. During the SciPy Tools Plenary session, the following scientific Python libraries were covered.
Pandas version 1.1 will be released in July 2020, and funding was secured with the help of CZI and NumFOCUS. The Pandas 1.0 release introduced a new Pandas logo along with improvements to the website theme and the documentation.
Pandas got several larger and smaller improvements regarding Numba integration and new data types, such as the dedicated nullable pd.NA for missing data and ExtensionArrays. User-defined functions (UDFs) can now reduce Python overhead by using the Numba engine.
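As a small illustration of the nullable data types, here is a sketch assuming pandas 1.0 or newer. The variable names are my own. It shows an integer column that keeps its integer dtype even though one value is missing.

```python
import pandas as pd

# Nullable integer column: the missing entry is pd.NA and the dtype stays integer.
s = pd.Series([1, None, 3], dtype="Int64")

missing_mask = s.isna()   # boolean mask marking the missing entry
total = s.sum()           # missing values are skipped by default
filled = s.fillna(0)      # still Int64 after filling the gap
```

With the classic NumPy-backed integer dtype, the same data would have been silently converted to floats with NaN.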
Numba is being adopted by more large projects, such as Awkward and Pandas, and underwent a massive refactoring: the code base was reorganized, and support and hacks for old Python and NumPy versions were removed. Numba now has better error reporting too and benefits from a growing user community with > 1.4M downloads/month. It is noteworthy that the people behind Numba are working on a new governance model to encourage new maintainers and contributors to participate in development.
Plotly, a plotting library which integrates with Python, R, and Julia and forms the core of the Dash framework, has introduced a lot of improvements, such as the high-level Express API, removal of the binding to external servers, improved tile maps with support for all tile servers, hierarchical data plots, imshow, and several speedups. Furthermore, they integrated a new Jupytext backend for documentation purposes and a new Python API reference interface.
Currently Plotly has about 3M downloads/month and more than 50 GitHub contributors.
xarray, a library for working with labelled multi-dimensional arrays, improved by implementing sparse array integration and other NEP-18 array integrations, and added support for dataset argmin and argmax as well as dataset weighted mean functionality.
The project is now funded by CZI and currently needs help with improving the documentation.
sklearn refreshed their website and further improved the user guide. A new plotting API based on Matplotlib was introduced. sklearn now supports stacked classifiers (the output of one classifier is a feature for the next one) and HistGradientBoosting with missing value support and monotonic constraints, and nearest neighbor graphs no longer need a complete recomputation when new data points are introduced. Additionally, there are minor improvements regarding the scalability of the k-means algorithm, and generalized linear models were implemented.
One awesome feature is that sklearn now has rich, interactive visual representations of estimators and computation pipelines!
Matplotlib got one year of funding from the Chan Zuckerberg Initiative and introduced test baseline image relocation (GSoC). There were multiple bugfix releases last year, and they are preparing for a 3.4 release in September 2020 with new bar3d light source support, simplified tick formatters, and a new mosaic subplot as well as post-hoc axes sharing. They are also working on improving the current documentation.
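The mosaic subplot and post-hoc axes sharing can be sketched as follows, assuming a Matplotlib version that already ships `plt.subplot_mosaic` and `Axes.sharex`. The layout labels and the plotted numbers are made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt

# Each string names an axes; repeating a name makes that axes span cells.
fig, axes = plt.subplot_mosaic(
    [["left", "right"],
     ["bottom", "bottom"]]
)

axes["left"].plot([0, 1, 2], [0, 1, 4])
axes["right"].plot([0, 10], [1, 0])

# Post-hoc sharing: link the x-axis after both axes already exist.
axes["right"].sharex(axes["left"])
```

The returned `axes` is a dictionary keyed by the labels used in the layout, which reads much nicer than indexing into a grid of axes.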
I also watched the punalicious lightning talks on day 2 of SciPyConf 2020 - many thanks to the hilariously funny hosts. I especially remember the fun talks about RAPIDS, an end-to-end GPU-accelerated data science library, Frappuccino, a library that helps with API changes, and a study about visually pleasing color cycles intended to improve coloring for people with deuteranopia, protanopia, or tritanopia. Colors must be accessible to everybody!