SciPy 2020 Day One

Dabbl(ing) with New Libraries and an Introduction to Bayesian Data Science

I am very happy 😄 to participate in the 2020 edition of the SciPy conference, which is being held online due to the measures preventing the spread of COVID-19. Although it is the first online edition of SciPy, everything runs smoothly thanks to the tremendous work of the organizers and the community. The conference talks were pre-recorded and uploaded to YouTube by Enthought on Sunday, whereas tutorials and sessions are held live. Q&A sessions on specific topics follow the live sessions; today, Monday, was dedicated to the machine learning sessions.

Since I participated in the awesome four-hour Bayesian statistical modelling session held by Eric Ma, I only had time for some additional content and the plenary discussions. The remainder of this blog gives a quick introduction to the sessions I attended.

Note: This blog and the references will be extended, as it currently only represents a quick summary of my first day at SciPy 2020!

dabl: Automatic Machine Learning with a Human in the Loop

Scikit-learn core developer Andreas C. Müller's library dabl, short for data analysis baseline library, aims to make supervised learning and fast AutoML easier for machine learning beginners by reducing common boilerplate code. The library addresses the clean, visualize, and explain stages of the data science life cycle. One of the main advantages of dabl is that it avoids the long runtimes typical of AutoML approaches, which makes it very suitable for prototyping.

To get a quick introduction to dabl, make sure to watch the following SciPy 2020 talk.

Bayesian Stats Modelling Tutorial

Wow, what can I say? Eric Ma, a biomedical researcher at the Novartis Institutes, did a wonderful job introducing classical frequentist statistical modelling with NumPy, pandas, and Matplotlib for visualization, followed by the more modern ;) Bayesian inference approach using PyMC3. The tutorial lasted four hours, and he delivered such a plethora of great references and advice that I will need some time to sort and evaluate it all. In the references section you will find the links and notebooks I have collected.
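To illustrate the frequentist-versus-Bayesian contrast in miniature, here is a small example of my own (not from the tutorial): estimating a coin's head probability with a plain sample mean versus a Bayesian posterior, computed with a NumPy grid approximation instead of PyMC3:

```python
import numpy as np

data = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])  # 7 heads in 10 flips

# Frequentist point estimate: the sample mean (maximum likelihood).
p_mle = data.mean()  # 0.7

# Bayesian: uniform Beta(1, 1) prior over p, updated on a grid of p values.
grid = np.linspace(0.0, 1.0, 1001)
prior = np.ones_like(grid)                        # flat prior
heads, n = data.sum(), data.size
likelihood = grid**heads * (1.0 - grid)**(n - heads)
posterior = prior * likelihood
posterior /= posterior.sum()                      # normalize the grid weights

# The posterior summarizes uncertainty, not just a single number.
posterior_mean = (grid * posterior).sum()
# Conjugacy gives the exact answer: Beta(1+7, 1+3), mean 8/12 ≈ 0.667.
```

The grid approximation is only practical for one or two parameters; for real models this is exactly where a probabilistic programming library like PyMC3 takes over.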

The accompanying repository, where you will find all the covered materials, is available on GitHub: bayesian-stats-modelling-tutorial.

Sadly, the video of his session is not yet available online, but I will add it as soon as it appears on YouTube.

Optimizing Humans and Machines to Advance Science

Finally, Ana Comesana, a Scientific Engineering Associate at Lawrence Berkeley National Laboratory, presented a systematic and robust approach to feature selection, i.e. interpretable dimensionality reduction, applied to some 1,800 molecular descriptors for predicting performance properties of bio-jet fuels. She addressed fundamental problems that arise when machine learning meets scientific research, especially the "correlation is not causation" problem. Make sure to watch her talk! 😄
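As a toy illustration of interpretable dimensionality reduction (my own sketch, not Ana Comesana's actual method), one simple, interpretable step is to drop features that are nearly duplicates of features already kept, based on pairwise correlation, so the surviving columns retain their original meaning, unlike PCA components:

```python
import numpy as np

# Synthetic data: x2 is a near-duplicate of x1, x3 is independent.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Greedy correlation filter: keep a feature only if it is not highly
# correlated (|r| >= 0.95) with any feature already kept.
corr = np.corrcoef(X, rowvar=False)
keep = []
for j in range(X.shape[1]):
    if all(abs(corr[j, k]) < 0.95 for k in keep):
        keep.append(j)

# keep -> [0, 2]: the redundant near-duplicate column 1 is dropped.
```

Because the kept columns are original descriptors rather than linear mixtures, a domain scientist can still reason about what each selected feature physically means.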

References

Henrik Hain
Data Scientist / Data Engineer

My (research) interests revolve around the practical and theoretical aspects of software engineering, (self-)learning systems and algorithms, especially (deep) reinforcement learning, spatio-temporal event detection, and computer vision approaches.
