Handling Missing Data with Graph Representation Learning
GRAPE is a general framework for feature imputation and label prediction in the presence of missing data.
We show that the missing data problem (imputing missing values and learning downstream tasks), which at first seems unrelated to graphs, can be solved naturally with a graph formulation, and we propose the first graph-based solution to it.
Motivation
Issues with learning from incomplete data arise in many domains including computational biology, clinical studies, survey research, finance, and economics.
The missing data problem has previously been approached in two different ways:
- Feature imputation: missing feature values are estimated based on observed values.
- Label prediction: downstream labels are learned directly from incomplete data.
While existing approaches have rich theoretical groundings, they often share common limitations:
- They often make assumptions about the data distribution.
- They often cannot adapt to downstream tasks.
- Many approaches cannot generalize to unseen data without retraining.
Importantly, a unified solution for the missing data problem is still lacking.
Method
Here we propose GRAPE, a general framework for feature imputation and label prediction in the presence of missing data.
Our key innovation is to formulate the problem using a bipartite graph representation,
where observations and features are two types of nodes,
and the observed feature values are attributed edges between the nodes.
Under this graph representation, we formulate feature imputation as an edge-level prediction task
and label prediction as a node-level prediction task, as shown in Figure 1.
Figure 1: Overview of the proposed GRAPE framework
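As a rough illustration of the bipartite formulation, the sketch below turns a small data matrix into attributed edges between observation nodes and feature nodes. The function names and the encoding of missing values as `None` or NaN are assumptions made for this example; it is not the released implementation.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing (an assumption about the input encoding)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def build_bipartite_graph(data):
    """Split a data matrix into observed edges and missing (observation, feature) pairs.

    Observation node i and feature node j are linked by an edge carrying
    data[i][j] whenever that entry is observed.
    """
    observed_edges = []  # (observation index, feature index, value)
    missing_pairs = []   # (observation index, feature index): imputation targets
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if is_missing(value):
                missing_pairs.append((i, j))
            else:
                observed_edges.append((i, j, float(value)))
    return observed_edges, missing_pairs


# Two observations, three features, two missing entries.
data = [
    [0.3, None, 0.8],
    [float("nan"), 0.5, 0.1],
]
observed_edges, missing_pairs = build_bipartite_graph(data)
```

In this view, feature imputation means predicting a value for each pair in `missing_pairs` (an edge-level task), while label prediction attaches an output to each observation node (a node-level task).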
GRAPE solves both tasks with a Graph Neural Network.
Its architecture is inspired by GraphSAGE and adds three innovations:
- We use augmented node features to initialize observation and feature nodes,
which provide greater representation power and maintain inductive learning capabilities.
- We employ an edge dropout technique that greatly improves model generalization.
- We introduce edge embeddings during message passing to fully utilize the rich attribute information.
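The first two items can be sketched in a few lines of plain Python. The concrete choices here are assumptions for illustration only: observation nodes start from a constant all-ones vector, feature nodes from a one-hot vector, and dropped edges are held out of message passing so they can serve as prediction targets. The names `init_node_features` and `edge_dropout` are hypothetical.

```python
import random


def init_node_features(num_observations, num_features):
    """Initial node vectors, each of length num_features.

    Assumed scheme: all-ones for observation nodes, one-hot for feature nodes.
    Neither depends on observation identity, so new observations can be added
    without retraining.
    """
    observation_init = [[1.0] * num_features for _ in range(num_observations)]
    feature_init = [
        [1.0 if k == j else 0.0 for k in range(num_features)]
        for j in range(num_features)
    ]
    return observation_init, feature_init


def edge_dropout(edges, drop_rate, rng):
    """Randomly split edges into kept edges and dropped (held-out) edges."""
    if not 0.0 <= drop_rate <= 1.0:
        raise ValueError("drop_rate must be in [0, 1]")
    kept, dropped = [], []
    for edge in edges:
        if rng.random() < drop_rate:
            dropped.append(edge)
        else:
            kept.append(edge)
    return kept, dropped


edges = [(0, 0, 0.3), (0, 2, 0.8), (1, 1, 0.5), (1, 2, 0.1)]
observation_init, feature_init = init_node_features(2, 3)
kept, dropped = edge_dropout(edges, drop_rate=0.3, rng=random.Random(0))
```

Drawing a fresh random split at every training step means the model never sees the same full graph twice, which is one plausible reading of why dropout helps generalization.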
Figure 2: GRAPE forward computation
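To make the role of edge attributes in message passing concrete, here is a deliberately simplified single step. It assumes a GraphSAGE-style mean aggregator and omits the learned linear maps and nonlinearities a real model would apply; the node names and the function `message_passing_step` are hypothetical.

```python
def mean_vector(vectors, dim):
    """Element-wise mean of equal-length vectors; zeros if there are none."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]


def message_passing_step(node_emb, neighbors):
    """One round of mean aggregation in which each message carries the edge value.

    node_emb:  dict mapping node name -> embedding (list of floats, same length)
    neighbors: dict mapping node name -> list of (neighbor name, edge value)
    Each message is the neighbor embedding with the edge value appended.
    The new embedding is the old one concatenated with the mean message.
    """
    new_emb = {}
    for node, h in node_emb.items():
        messages = [
            node_emb[nbr] + [edge_value]
            for nbr, edge_value in neighbors.get(node, [])
        ]
        new_emb[node] = h + mean_vector(messages, len(h) + 1)
    return new_emb


# One observation node linked to two feature nodes by observed values.
node_emb = {"o0": [1.0, 1.0], "f0": [1.0, 0.0], "f1": [0.0, 1.0]}
neighbors = {
    "o0": [("f0", 0.25), ("f1", 0.75)],
    "f0": [("o0", 0.25)],
    "f1": [("o0", 0.75)],
}
updated = message_passing_step(node_emb, neighbors)
```

Without the appended edge value, node `o0` would receive the same aggregate regardless of which feature values were observed, so the attribute information would be lost.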
Key Results
We experiment on 9 UCI datasets from different domains.
We compare GRAPE against 5 commonly used imputation methods,
and state-of-the-art deep learning models.
We show in Figure 3 that GRAPE yields 20% lower mean absolute error (MAE) for feature imputation,
and 10% lower MAE for label prediction, compared with the best baselines.
Figure 3: Overall comparisons between GRAPE and baselines.
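For readers unfamiliar with the metric, the snippet below computes MAE and the relative reduction between two error values. It assumes the percentages quoted above are relative reductions with respect to the best baseline; the numbers in the example are made up for illustration and are not results from the paper.

```python
def mean_absolute_error(predictions, targets):
    """Average absolute difference between predictions and ground truth."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def relative_reduction(baseline_error, model_error):
    """Fraction by which model_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - model_error) / baseline_error


# Illustrative values only: imputed vs. true values on held-out entries.
mae = mean_absolute_error([0.5, 1.0], [0.0, 2.0])
# Illustrative values only: a baseline MAE of 0.50 against a model MAE of 0.40.
reduction = relative_reduction(0.50, 0.40)
```

For imputation, the error is naturally measured only on entries that were masked out, since those are the ones with known ground truth that the model did not see.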
Additionally, we vary the missing data ratio from 10% to 70%.
For both tasks, we observe consistent gains under different missing data ratios.
Figure 4: Comparisons when the missing data ratio varies.
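A sweep over missing data ratios could be set up along the following lines. This sketch assumes entries are removed independently and uniformly at random, which this page does not state; `mask_entries` is a hypothetical helper, and the ratio grid is one possible choice within the 10% to 70% range.

```python
import random


def mask_entries(data, missing_ratio, rng):
    """Replace each entry with None independently with probability missing_ratio."""
    if not 0.0 <= missing_ratio <= 1.0:
        raise ValueError("missing_ratio must be in [0, 1]")
    return [
        [None if rng.random() < missing_ratio else value for value in row]
        for row in data
    ]


full_data = [
    [0.3, 0.7, 0.8],
    [0.2, 0.5, 0.1],
    [0.9, 0.4, 0.6],
]

# One masked copy of the data per missing ratio.
masked_by_ratio = {}
for ratio in (0.1, 0.3, 0.5, 0.7):
    masked_by_ratio[ratio] = mask_entries(full_data, ratio, random.Random(0))
```

Because masking is random per entry, the realized fraction of missing values on a small matrix can differ noticeably from the nominal ratio; on larger datasets the two converge.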
Please refer to our paper for detailed explanations and more results.
Code and Datasets
We release GRAPE on GitHub.
The datasets are included in the code repository.
Contributors
Jiaxuan You*
Xiaobai Ma*
Daisy Yi Ding*
Mykel Kochenderfer
Jure Leskovec
References
Handling Missing Data with Graph Representation Learning. Jiaxuan You*, Xiaobai Ma*, Daisy Yi Ding*, Mykel Kochenderfer, Jure Leskovec.
Neural Information Processing Systems (NeurIPS), 2020.