Stochastic regression imputation refines plain regression imputation: it addresses the correlation bias by adding noise from the regression residuals to the estimates of the missing values. This post discusses the advantages of stochastic regression imputation with examples in Python.
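To make the idea concrete, here is a minimal sketch of stochastic regression imputation for a single predictor. It is not taken from the post itself: the function name `stochastic_regression_impute`, the simple linear model, and the choice of normally distributed noise scaled by the residual standard deviation are all assumptions made for illustration.

```python
import numpy as np


def stochastic_regression_impute(x, y, rng=None):
    """Fill NaNs in y using a linear fit on x plus residual-scaled noise.

    Hypothetical helper for illustration; assumes x is fully observed
    and at least three (x, y) pairs are complete.
    """
    rng = np.random.default_rng(rng)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = ~np.isnan(y)

    # Fit y = intercept + slope * x on the complete cases only.
    slope, intercept = np.polyfit(x[observed], y[observed], deg=1)

    # Estimate the residual spread with n - 2 degrees of freedom.
    residuals = y[observed] - (intercept + slope * x[observed])
    sigma = np.sqrt(np.sum(residuals**2) / (observed.sum() - 2))

    # Deterministic prediction plus random noise for each missing value.
    y_imputed = y.copy()
    n_missing = int((~observed).sum())
    noise = rng.normal(0.0, sigma, size=n_missing)
    y_imputed[~observed] = intercept + slope * x[~observed] + noise
    return y_imputed


# Usage: two values of y are missing and get imputed.
x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[[2, 7]] = np.nan
y_filled = stochastic_regression_impute(x, y, rng=42)
```

Without the noise term, every imputed value would lie exactly on the regression line, which overstates the correlation between `x` and `y`. Adding residual-scaled noise restores some of the natural scatter.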
Deploying descriptive, predictive and prescriptive machine learning solutions using complete data is difficult, and even more difficult in the face of missing data. A gentle introduction to the reasons for missing data and the difficulties it creates.
The third day of SciPy 2020 was filled with interesting and foundational tutorial content on deep learning, including a short primer on the PyTorch library. I also found the time to watch some interesting SciPy talks on Enthought's SciPy YouTube channel.
I am very happy 😄 to participate in the 2020 edition of the SciPy conference, which is held online because of the measures to prevent the spread of the COVID-19 virus. Although it is the first online edition of the SciPy conference, everything runs smoothly thanks to the tremendous help from the organizers and the community.
Missing data, especially missing data points in time series, are a pervasive issue for many applications relying on precise and complete data sets. For example, in the financial sector, missing tick data can lead to inaccurate forecasts and thus to wrong decisions with high losses.
Effective and efficient time series representation learning is an important topic for a vast array of applications such as clustering. However, many currently used approaches are difficult to interpret. In many areas it is important that intermediate learned representations are easy to interpret for efficient downstream processing.
Because time series are ubiquitous, it is often of interest to identify clusters of similar time series in order to gain better insight into the structure of the available data. However, unsupervised learning from time series data has its own stumbling blocks. For this reason, the following article presents some helpful time-series-specific distance metrics and basic procedures to work successfully with time series data.
The notion of tidy data is a concept known from R and used in many available libraries and frameworks today with great success. Tidy data together with proper data types and semantically allowed operations simplifies data science, machine learning and data stewardship by a large margin. In this article we will highlight the core properties of "Tidy Data, Tidy Types, and Tidy Operations" with the help of a concise example and how those properties can be successively achieved and maintained.
Learning to optimally rank and personalize search results is a difficult and important topic in scientific information retrieval as well as in online retail, where we typically want to bias customer query results towards specific preferences in order to increase revenue. Reinforcement learning, as a generic and flexible learning model, is able to bias, e.g. personalize, learning-to-rank results at scale, so that externally specified goals, e.g. an increase in sales and probably revenue, can be achieved. This article introduces learning-to-rank and reinforcement learning in a problem-specific way and is accompanied by the example project 'cli-ranker', a command line tool that uses reinforcement learning principles to learn user information retrieval preferences for text document ranking.
Cyclic data is often not sufficiently transformed for machine learning algorithms: the feature representation misses the implicit properties of cyclic features, which often results in wrong distance measures. This article introduces cyclic feature transformation for time-based features as a mini-howto.
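As a rough illustration of the distance problem, the sketch below uses one common approach, which is assumed here and not confirmed by the teaser: mapping a cyclic value onto the unit circle with a sine and a cosine component. The helper name `encode_cyclic` and the hour-of-day example are hypothetical.

```python
import numpy as np


def encode_cyclic(values, period):
    """Map cyclic values to (sin, cos) coordinates on the unit circle.

    Hypothetical helper for illustration; `period` is the cycle length,
    e.g. 24 for hours of the day.
    """
    angle = 2.0 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack((np.sin(angle), np.cos(angle)))


# Hours 23 and 0 are neighbours on the clock, hours 0 and 12 are opposite.
hours = np.array([23, 0, 12])
encoded = encode_cyclic(hours, period=24)

# Euclidean distances in the encoded space respect the cycle.
dist_23_to_0 = np.linalg.norm(encoded[0] - encoded[1])
dist_0_to_12 = np.linalg.norm(encoded[1] - encoded[2])
```

On the raw hour values, 23 and 0 are 23 units apart, although they are only one hour apart on the clock. In the encoded space, `dist_23_to_0` is smaller than `dist_0_to_12`, which matches the intuition.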