Missing Data Imputation Using Generative Adversarial Nets

Missing data, especially missing data points in time series, are a pervasive issue for many applications that rely on precise and complete data sets. In the financial sector, for example, missing tick data can skew forecasts and thus lead to wrong decisions and high losses. It is therefore not surprising that neural networks, and deep generative adversarial networks in particular, are used for missing data imputation, with the goal of generating precise and reliable imputed values.

To follow this post, it helps to know that missing data falls into three distinct categories.

  1. Data may be missing completely at random (MCAR)
  2. Data may be missing at random (MAR)
  3. Data may be missing not at random (MNAR)

Data is missing completely at random (MCAR) if the missingness occurs entirely at random, which means that it does not depend on any of the variables. Data is missing at random (MAR) if the missingness depends only on the observed variables. Data is missing not at random (MNAR) if neither of the former two mechanisms applies, meaning that the missingness depends on both the observed and the unobserved variables.
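To make the three mechanisms more tangible, here is a minimal simulation sketch. The variables `x1` and `x2`, the logistic drop probabilities, and the 30% MCAR rate are assumptions chosen for illustration only, and the MNAR case shown is the simplest one, where the missingness of `x2` depends on its own value.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)  # fully observed covariate
x2 = rng.normal(size=n)  # variable whose values may go missing


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# MCAR: every value of x2 is dropped with the same probability.
miss_mcar = rng.random(n) < 0.3
# MAR: the drop probability depends only on the observed x1.
miss_mar = rng.random(n) < sigmoid(2.0 * x1)
# MNAR: the drop probability depends on the (unobserved) value of x2 itself.
miss_mnar = rng.random(n) < sigmoid(2.0 * x2)


def observed_mean(values, missing):
    """Mean of the entries that remain observed."""
    return float(values[~missing].mean())


for name, missing in [("MCAR", miss_mcar), ("MAR", miss_mar), ("MNAR", miss_mnar)]:
    print(name, round(observed_mean(x2, missing), 3))
```

In this toy setup `x1` and `x2` are independent, so the observed mean of `x2` stays close to zero under MCAR and MAR. Under MNAR it is pulled clearly below zero, because large values of `x2` are dropped more often.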

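Furthermore, it is helpful to know that imputation methods can be categorized as either discriminative or generative. Representatives of the discriminative category include MICE, MissForest, and matrix completion, while generative imputation mechanisms include algorithms based on expectation maximization (EM) and deep learning approaches such as denoising auto-encoders.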

In this post we review a generative approach published in 2018 by Yoon et al. They use a customized generative adversarial network for missing data imputation, called GAIN, which outperforms MICE, MissForest, matrix completion, auto-encoders, and EM by a rather large margin on several different datasets with respect to the following performance metrics:

  • Area under the receiver operating characteristic curve (AUROC)
  • Mean bias
  • Mean square error (MSE).
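As a rough sketch of how the error-based metrics could be computed, the function below evaluates only the entries that were missing and then imputed. The function name `imputation_errors`, the mask convention (1 = observed, 0 = missing), and the definition of mean bias as the average signed error are assumptions made for this example, not taken from the paper. AUROC is left out because it needs a downstream prediction task.

```python
import numpy as np


def imputation_errors(x_true, x_imputed, mask):
    """Errors on the imputed entries only (mask: 1 = observed, 0 = missing)."""
    missing = mask == 0
    diff = x_imputed[missing] - x_true[missing]
    return {
        "mse": float(np.mean(diff ** 2)),   # mean square error
        "mean_bias": float(np.mean(diff)),  # average signed error (assumed definition)
    }


x_true = np.array([[1.0, 2.0], [3.0, 4.0]])
mask = np.array([[1, 0], [0, 1]])
x_imputed = np.array([[1.0, 2.5], [2.0, 4.0]])

errors = imputation_errors(x_true, x_imputed, mask)
print(errors)
```

Such an evaluation is only possible on benchmark data where values are removed artificially, so that the ground truth for the missing entries is known.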

How does it work?

GAIN is an imputation method that builds on and generalizes the well-known generative adversarial network (GAN) of Goodfellow et al., and it can operate successfully even when complete data is not available. The goal of the generator is to accurately impute missing data, while the discriminator’s goal is to distinguish between observed and imputed components. This means that the discriminator is trained to minimize a classification loss, while the generator tries to maximize the misclassification rate of the discriminator. GAIN adapts the standard GAN architecture by providing the discriminator with a so-called hint matrix to ensure that the adversarial process optimizes the desired target.
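The data flow and the two objectives can be sketched in plain NumPy. This is a simplified illustration, not the authors' implementation: the networks themselves are replaced by stand-in arrays (`g_out`, `d_prob`), missing entries are assumed to be pre-filled with zeros, and the weight `alpha` on the reconstruction term as well as the small uniform noise are illustrative choices.

```python
import numpy as np


def combine(x, mask, noise):
    """Generator input: observed values where mask == 1, noise elsewhere."""
    return mask * x + (1 - mask) * noise


def impute(x, mask, g_out):
    """Final imputation: keep observed values, take generator output elsewhere."""
    return mask * x + (1 - mask) * g_out


def discriminator_loss(d_prob, mask, eps=1e-8):
    """Cross-entropy between the predicted 'observed' probabilities and the mask."""
    return -np.mean(mask * np.log(d_prob + eps)
                    + (1 - mask) * np.log(1 - d_prob + eps))


def generator_loss(d_prob, mask, x, g_out, alpha=10.0, eps=1e-8):
    """Fool the discriminator on imputed entries, reconstruct the observed ones."""
    adversarial = -np.mean((1 - mask) * np.log(d_prob + eps))
    reconstruction = np.mean((mask * x - mask * g_out) ** 2) / np.mean(mask)
    return adversarial + alpha * reconstruction


rng = np.random.default_rng(0)
x_raw = np.array([[0.2, np.nan, 0.7], [np.nan, 0.5, 0.1]])
mask = (~np.isnan(x_raw)).astype(float)   # 1 = observed, 0 = missing
x = np.nan_to_num(x_raw, nan=0.0)         # fill missing entries with zeros

noise = rng.uniform(0.0, 0.01, size=x.shape)
x_tilde = combine(x, mask, noise)
g_out = rng.uniform(size=x.shape)         # stand-in for the generator output
x_hat = impute(x, mask, g_out)
d_prob = np.full(x.shape, 0.5)            # stand-in for the discriminator output

print(discriminator_loss(d_prob, mask), generator_loss(d_prob, mask, x, g_out))
```

A discriminator that outputs 0.5 everywhere cannot tell observed from imputed components, and its loss equals log 2. In a real training loop the two losses would be minimized alternately with respect to the discriminator and generator parameters.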

The architecture of GAIN. Image: GAIN: Missing Data Imputation using Generative Adversarial Nets

Because GAIN models the distribution of the data, it is possible to draw multiple imputations and thereby capture the uncertainty of the imputed values.
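The idea of multiple imputation draws can be sketched independently of the network. In the example below, `noisy_column_mean` is a hypothetical stand-in for a trained stochastic imputer such as a GAIN generator fed with fresh noise, and the function names and the mask convention (1 = observed) are assumptions for illustration.

```python
import numpy as np


def noisy_column_mean(x, mask, rng):
    """Hypothetical stand-in sampler: observed column mean plus Gaussian noise."""
    col_mean = (mask * x).sum(axis=0) / mask.sum(axis=0)
    return col_mean + rng.normal(scale=0.1, size=x.shape)


def multiple_imputation(x, mask, sample_fn, n_draws, rng):
    """Draw several imputations and summarize them per entry."""
    draws = np.stack([
        mask * x + (1 - mask) * sample_fn(x, mask, rng)
        for _ in range(n_draws)
    ])
    return draws.mean(axis=0), draws.std(axis=0)


rng = np.random.default_rng(0)
x = np.array([[1.0, 0.0], [3.0, 4.0], [0.0, 6.0]])   # zeros mark missing entries
mask = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])

mean_imp, std_imp = multiple_imputation(x, mask, noisy_column_mean, 20, rng)
print(mean_imp)
print(std_imp)
```

The per-entry standard deviation is zero for observed values and positive for imputed ones, which gives a simple measure of how uncertain each imputed value is.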

Another interesting aspect of GAIN is that the discriminator does not try to reject a time series produced by the generator as wrong in its entirety. Instead, it tries to discriminate real values from imputed ones, component by component. For this to work, the authors introduce a so-called hint mechanism as an additional input to the discriminator.
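One way to picture the hint is as a partially revealed copy of the mask. The sketch below assumes a Bernoulli-sampled reveal matrix `b` controlled by a hypothetical `hint_rate` parameter: revealed entries carry the true mask value, and hidden entries are set to 0.5, which tells the discriminator nothing. The exact sampling scheme used by the authors may differ.

```python
import numpy as np


def sample_hint(mask, hint_rate, rng):
    """Reveal each mask entry with probability hint_rate, else emit 0.5."""
    b = (rng.random(mask.shape) < hint_rate).astype(float)
    return b * mask + 0.5 * (1 - b)


rng = np.random.default_rng(0)
mask = np.array([[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])  # 1 = observed, 0 = missing

hint = sample_hint(mask, hint_rate=0.9, rng=rng)
print(hint)
```

With `hint_rate=1.0` the discriminator would see the full mask and have nothing left to learn, while with `hint_rate=0.0` it would receive no hint at all. Values in between leave some components for the discriminator to decide on.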

Yoon et al.’s GAIN implementation is also available on GitHub. Have a look at it!

Kind regards,

Henrik Hain

Senior Data Scientist / Data Engineer

My (research) interests revolve around the practical and theoretical aspects of software engineering, (self-) learning systems and algorithms, especially (deep) reinforcement learning, spatio-temporal event detection, and computer vision approaches.