Handling Missing Data with Graph Representation Learning

GRAPE is a general framework for feature imputation and label prediction in the presence of missing data. We show that the missing data problem (imputing missing values and learning downstream tasks), although seemingly unrelated to graphs, can be solved naturally with a graph representation, and we propose the first graph-based solution to it.

Motivation

Issues with learning from incomplete data arise in many domains including computational biology, clinical studies, survey research, finance, and economics. The missing data problem has previously been approached in two different ways:

Feature imputation: missing feature values are estimated based on observed values.
Label prediction: downstream labels are learned directly from incomplete data.

While existing approaches have rich theoretical groundings, they often share common limitations:

They often make assumptions on the data distribution.
They often cannot adapt to downstream tasks.
Many approaches cannot generalize to unseen data without retraining.

Importantly, a unified solution to the missing data problem, covering both feature imputation and label prediction, is still lacking.

Method

Here we propose GRAPE, a general framework for feature imputation and label prediction in the presence of missing data. Our key innovation is to formulate the problem using a bipartite graph representation, where observations and features are two types of nodes, and the observed feature values are attributed edges between the nodes. Under this graph representation, we formulate feature imputation as an edge-level prediction task and label prediction as a node-level prediction task, as shown in Figure 1.
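The bipartite construction can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function name build_bipartite_edges, the node numbering (observations first, then features), and the use of NaN to mark missing entries are all assumptions made for this example.

```python
import numpy as np

def build_bipartite_edges(X, observed):
    """Turn an n x m data matrix into a bipartite edge list.

    Hypothetical helper for illustration only.
    Observation nodes get ids 0..n-1, feature nodes get ids n..n+m-1.
    Each observed entry X[i, j] becomes one edge (i, n + j) whose
    attribute is the observed value.
    """
    X = np.asarray(X, dtype=float)
    observed = np.asarray(observed, dtype=bool)
    n, m = X.shape
    rows, cols = np.nonzero(observed)       # positions of observed entries
    edge_index = np.stack([rows, cols + n]) # shape (2, num_edges)
    edge_attr = X[rows, cols]               # shape (num_edges,)
    return edge_index, edge_attr

# Two observations, three features; NaN marks a missing value (an assumption).
X = np.array([[0.3, np.nan, 0.8],
              [np.nan, 0.5, 0.1]])
observed = ~np.isnan(X)
edge_index, edge_attr = build_bipartite_edges(X, observed)
```

In this toy example the four observed entries become four edges. Imputing the missing value X[0, 1] then corresponds to predicting the attribute of the absent edge between observation node 0 and feature node 3, and predicting a label for the first observation corresponds to a prediction on node 0.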

Figure 1: Overview of the proposed GRAPE framework

GRAPE solves both tasks with a graph neural network (GNN). The architecture is inspired by GraphSAGE and adds three innovations:

We use augmented node features to initialize observation and feature nodes, which provides greater representation power and maintains inductive learning capabilities.
We employ an edge dropout technique that greatly improves model generalization.
We introduce edge embeddings during message passing to fully utilize the rich attribute information.
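The first two ideas can be sketched in a few lines of NumPy. The sketch rests on assumptions that this page does not spell out: observation nodes are initialized with a constant all-ones vector and feature nodes with one-hot vectors, and edge dropout removes each edge independently with a fixed probability during training. The function names and the drop rate are hypothetical; please check the paper and the released code for the actual choices.

```python
import numpy as np

def init_node_features(n_obs, n_feat):
    """Assumed initialization: all-ones vectors for observation nodes,
    one-hot vectors for feature nodes. Neither depends on the identity of
    an observation, so unseen observations can be added without retraining."""
    obs_init = np.ones((n_obs, n_feat))
    feat_init = np.eye(n_feat)
    return np.vstack([obs_init, feat_init])  # (n_obs + n_feat, n_feat)

def drop_edges(edge_index, edge_attr, drop_rate, rng):
    """Randomly hide a fraction of edges for one training step.
    Returns the kept edges plus the boolean keep-mask."""
    keep = rng.random(edge_attr.shape[0]) >= drop_rate
    return edge_index[:, keep], edge_attr[keep], keep

rng = np.random.default_rng(0)
# Toy graph: 2 observation nodes (0, 1) and 3 feature nodes (2, 3, 4).
edge_index = np.array([[0, 0, 1, 1], [2, 4, 3, 4]])
edge_attr = np.array([0.3, 0.8, 0.5, 0.1])

h0 = init_node_features(n_obs=2, n_feat=3)
kept_index, kept_attr, keep = drop_edges(edge_index, edge_attr, 0.3, rng)
```

One plausible way to use the keep-mask is to run message passing on the kept edges only and to treat the dropped edges as prediction targets, so that the model cannot simply copy an edge value it was given as input.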

Figure 2: GRAPE forward computation
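As a rough illustration of how edge embeddings can enter message passing, the following NumPy sketch implements one GraphSAGE-style layer in which each message is computed from the sender's node embedding concatenated with the embedding of the connecting edge. The exact update rule, the mean aggregator, the ReLU nonlinearity, and the weight matrices P, Q, and W are assumptions made for this example, not a transcription of the released model.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def message_passing_layer(h, e, edge_index, P, Q, W):
    """One edge-aware layer on an undirected bipartite graph (illustrative).

    h: (num_nodes, d_h) node embeddings
    e: (num_edges, d_e) edge embeddings
    edge_index: (2, num_edges) observation node ids, feature node ids
    P: (d_h + d_e, d_msg), Q: (d_h + d_msg, d_out), W: (d_e + 2 * d_out, d_e_out)
    """
    src, dst = edge_index
    # Each undirected edge carries a message in both directions.
    senders = np.concatenate([src, dst])
    receivers = np.concatenate([dst, src])
    e_both = np.concatenate([e, e])

    # Message: transform of [sender embedding ; edge embedding].
    msg = relu(np.concatenate([h[senders], e_both], axis=1) @ P)

    # Mean-aggregate incoming messages per receiving node.
    agg = np.zeros((h.shape[0], msg.shape[1]))
    np.add.at(agg, receivers, msg)
    deg = np.bincount(receivers, minlength=h.shape[0]).reshape(-1, 1)
    agg = agg / np.maximum(deg, 1)

    # Node update from [own embedding ; aggregated messages].
    h_new = relu(np.concatenate([h, agg], axis=1) @ Q)
    # Edge update from [edge embedding ; both endpoint embeddings].
    e_new = relu(np.concatenate([e, h_new[src], h_new[dst]], axis=1) @ W)
    return h_new, e_new

rng = np.random.default_rng(0)
edge_index = np.array([[0, 0, 1, 1], [2, 4, 3, 4]])  # toy bipartite graph
h = rng.normal(size=(5, 3))                           # 5 nodes, 3-dim embeddings
e = np.array([[0.3], [0.8], [0.5], [0.1]])            # observed values as 1-dim edge embeddings
P = rng.normal(size=(3 + 1, 8))
Q = rng.normal(size=(3 + 8, 8))
W = rng.normal(size=(1 + 2 * 8, 4))
h1, e1 = message_passing_layer(h, e, edge_index, P, Q, W)
```

After a few such layers, an edge-level head could predict a missing value from the embeddings of the corresponding observation and feature nodes, and a node-level head could predict a label from an observation node embedding. Both heads are omitted here.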

Key Results

We evaluate GRAPE on 9 UCI datasets from different domains, comparing it against 5 commonly used imputation methods and state-of-the-art deep learning models. Figure 3 shows that GRAPE yields 20% lower mean absolute error (MAE) for feature imputation and 10% lower MAE for label prediction than the best baselines.

Figure 3: Overall comparisons between GRAPE and baselines.

Additionally, we vary the missing data ratio from 10% to 70%. For both tasks, we observe consistent gains under different missing data ratios.
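To make the evaluation setup concrete, here is a small sketch of how imputation MAE could be measured at different missing data ratios. It assumes entries are hidden uniformly at random, uses synthetic data instead of the UCI datasets, and substitutes column-mean imputation as a simple stand-in for the imputation model; all function names are hypothetical.

```python
import numpy as np

def random_missing_mask(shape, missing_ratio, rng):
    """Boolean mask where True means the entry is observed.
    Each entry is hidden independently with probability missing_ratio."""
    return rng.random(shape) >= missing_ratio

def mean_impute(X, observed):
    """Baseline: fill hidden entries with the column mean of observed ones.
    Assumes every column keeps at least one observed entry."""
    col_means = np.nanmean(np.where(observed, X, np.nan), axis=0)
    return np.where(observed, X, col_means)

def imputation_mae(X_true, X_imputed, observed):
    """Mean absolute error over the hidden entries only."""
    hidden = ~observed
    return np.abs(X_true[hidden] - X_imputed[hidden]).mean()

rng = np.random.default_rng(0)
X = rng.random((200, 5))  # synthetic stand-in data with values in [0, 1)
for ratio in (0.1, 0.3, 0.5, 0.7):
    observed = random_missing_mask(X.shape, ratio, rng)
    mae = imputation_mae(X, mean_impute(X, observed), observed)
    print(f"missing ratio {ratio:.1f}: mean-imputation MAE = {mae:.3f}")
```

Replacing mean_impute with a learned model and repeating the loop over several random masks would give curves comparable in spirit to the comparison across missing data ratios.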

Figure 4: Comparisons when the missing data ratio varies.

Please refer to our paper for detailed explanations and more results.

Code and Datasets

We release GRAPE on GitHub. The datasets are included in the code repository.

Contributors

Jiaxuan You*
Xiaobai Ma*
Daisy Yi Ding*
Mykel Kochenderfer
Jure Leskovec

References

Handling Missing Data with Graph Representation Learning. Jiaxuan You*, Xiaobai Ma*, Daisy Yi Ding*, Mykel Kochenderfer, Jure Leskovec. Neural Information Processing Systems (NeurIPS), 2020.