<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Michael Galkin on Medium]]></title>
        <description><![CDATA[Stories by Michael Galkin on Medium]]></description>
        <link>https://medium.com/@mgalkin?source=rss-4d4f8ddd1e68------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/2*R6303tLavDAf6jJAsMlaJQ.jpeg</url>
            <title>Stories by Michael Galkin on Medium</title>
            <link>https://medium.com/@mgalkin?source=rss-4d4f8ddd1e68------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 07 Apr 2026 07:19:43 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@mgalkin/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Foundation Models in Graph & Geometric Deep Learning]]></title>
            <link>https://medium.com/data-science/foundation-models-in-graph-geometric-deep-learning-f363e2576f58?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/f363e2576f58</guid>
            <category><![CDATA[graph-neural-networks]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <category><![CDATA[geometric-deep-learning]]></category>
            <category><![CDATA[editors-pick]]></category>
            <category><![CDATA[foundation-models]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Tue, 18 Jun 2024 18:18:06 GMT</pubDate>
            <atom:updated>2024-06-18T18:18:06.570Z</atom:updated>
<content:encoded><![CDATA[<p>Foundation Models in language, vision, and audio have been among the primary research topics in Machine Learning in 2024, whereas FMs for graph-structured data have somewhat lagged behind. In this post, we argue that the era of Graph FMs has already begun and provide a few examples of how one can use them today.</p><p><em>This post was written and edited by </em><a href="https://twitter.com/michael_galkin"><em>Michael Galkin</em></a><em> and </em><a href="https://twitter.com/mmbronstein"><em>Michael Bronstein</em></a><em> with significant contributions from </em><a href="https://twitter.com/AndyJiananZhao"><em>Jianan Zhao</em></a><em>, </em><a href="https://twitter.com/haitao_mao_"><em>Haitao Mao</em></a><em>, </em><a href="https://twitter.com/zhu_zhaocheng"><em>Zhaocheng Zhu</em></a><em>.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*TGMQ_AUPRWSDfRcZ" /><figcaption>The timeline of emerging foundation models in graph- and geometric deep learning. Image by Authors.</figcaption></figure><h3><strong>Table of Contents</strong></h3><ol><li><a href="#6f0a">What are Graph Foundation Models and how to build them?</a></li><li><a href="#b4b7">Node Classification: GraphAny</a></li><li><a href="#89a1">Link Prediction: Not yet</a></li><li><a href="#bda1">Knowledge Graph Reasoning: ULTRA and UltraQuery</a></li><li><a href="#11c3">Algorithmic Reasoning: Generalist Algorithmic Learner</a></li><li><a href="#65b0">Geometric and AI4Science Foundation Models</a><strong><br></strong>a. <a href="#4d3e">ML Potentials: JMP-1, DPA-2 for molecules, MACE-MP-0 and MatterSim for inorganic crystals </a><br>b. <a href="#b2cd">Protein LMs: ESM-2</a><br>c. <a href="#cc8c">2D Molecules: MiniMol and MolGPS</a></li><li><a href="#1443">Expressivity &amp; Scaling Laws: Do Graph FMs scale?</a></li><li><a href="#40c5">The Data Question: What should be scaled? 
Is there enough graph data to train Graph FMs?</a></li><li><a href="#2add">👉 Key Takeaways 👈</a></li></ol><h3>What are Graph Foundation Models and how to build them?</h3><p>Since there is a certain degree of ambiguity in what counts as a “foundational” model, it would be appropriate to start with a definition to establish common ground:</p><blockquote>“A Graph Foundation Model is a single (neural) model that learns transferable graph representations that can generalize to any new, previously unseen graph”</blockquote><p>One of the challenges is that graphs come in all shapes and forms and their connectivity and feature structure can be very different. Standard Graph Neural Networks (GNNs) are not “foundational” because, at best, they work only on graphs with the same type and dimension of features. Graph heuristics like <a href="https://en.wikipedia.org/wiki/Label_propagation_algorithm">Label Propagation</a> or <a href="https://en.wikipedia.org/wiki/PageRank">Personalized PageRank</a> that can run on any graph cannot be considered Graph FMs either, because they do not involve any learning. As much as we love Large Language Models, it is still unclear whether parsing graphs into sequences that can then be passed to an LLM (like in <a href="https://arxiv.org/abs/2310.01089">GraphText</a> or <a href="https://openreview.net/forum?id=IuXR1CCrSi">Talk Like A Graph</a>) is a suitable approach for retaining graph symmetries and scaling to anything larger than toy-sized datasets (we leave LLMs + Graphs to a separate post).</p><p>Perhaps the most important question in designing Graph FMs is that of transferable graph representations. LLMs, as suggested in the recent <a href="https://arxiv.org/abs/2402.02216">ICML 2024 position paper by Mao, Chen et al</a>., can squash any text in any language into tokens from a fixed-size vocabulary. Vision-language FMs resort to patches that can always be extracted from an image (one always has RGB channels in any image or video). 
It is not immediately clear what a universal featurization (à la tokenization) scheme could be for graphs, which might have very diverse characteristics, e.g.:</p><ul><li>One large graph with node features and some given node labels (typical for node classification tasks)</li><li>One large graph without node features and classes, but with meaningful edge types (typical for link prediction and KG reasoning)</li><li>Many small graphs with/without node/edge features, with graph-level labels (typical for graph classification and regression)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gDBGn9ckqNeNqTLD" /><figcaption>🦄 An ideal graph foundation model that takes any graph with any node/edge/graph features and performs any node- / edge- / graph-level task. Such Graph FMs do not exist in pure form as of mid-2024. Image by Authors</figcaption></figure><p>So far, there is a handful of open research questions for the graph learning community when designing Graph FMs:</p><p><strong>1️⃣ How to generalize across graphs with heterogeneous node/edge/graph features? </strong>For example, the popular <a href="https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Planetoid.html#torch_geometric.datasets.Planetoid">Cora</a> dataset for node classification is one graph with node features of dimension 1,433, whereas the Citeseer dataset has 3,703-dimensional features. How can one define a single representation space for such diverse graphs?</p><p><strong>2️⃣ How to generalize across prediction tasks?</strong> Node classification tasks may have a different number of node classes (e.g., Cora has 7 classes and Citeseer 6). Even further, can a node classification model perform well in link prediction?</p><p><strong>3️⃣ What should the foundational model expressivity be?</strong> Much research has been done on the expressive power of GNNs, typically resorting to the analogy with Weisfeiler-Lehman isomorphism tests. 
Since graph foundation models should ideally handle a broad spectrum of problems, the right expressive power is elusive. For instance, in node classification tasks, node features are important along with graph homophily or heterophily. In link prediction, structural patterns and breaking automorphisms are more important (node features often don’t give a huge performance boost). In graph-level tasks, graph isomorphism starts to play a crucial role. In 3D geometric tasks like molecule generation, there is an additional complexity of continuous symmetries to take care of (see the <a href="https://arxiv.org/abs/2312.07511">Hitchhiker’s Guide to Geometric GNNs</a>).</p><p>In the following sections, we will show that at least in some tasks and domains, Graph FMs are already available. We will highlight their design choices when it comes to transferable features and practical benefits when it comes to inductive inference on new unseen graphs.</p><p><strong>📚Read more in references [1][2] and </strong><a href="https://github.com/CurryTang/Awesome_Graph_Foundation_Models"><strong>Github Repo</strong></a></p><h3>Node Classification: GraphAny</h3><p>For years, GNN-based node classifiers have been limited to a single graph dataset. That is, given e.g. the Cora graph with 2.7K nodes, 1433-dimensional features, and 7 classes, one has to train a GNN specifically on the Cora graph with its labels and run inference on the same graph. Applying a trained model to another graph, e.g. Citeseer with 3703-dimensional features and 6 classes, would run into an insurmountable difficulty: how would one model generalize to different input feature dimensions and a different number of classes? 
Usually, prediction heads are hardcoded to a fixed number of classes.</p><p><a href="https://arxiv.org/abs/2405.20445"><strong>GraphAny</strong></a> is, to the best of our knowledge, the first Graph FM where a single pre-trained model can perform node classification on any graph with any feature dimension and any number of classes. A single GraphAny model pre-trained on 120 nodes of the standard <a href="https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.WebKB.html#torch_geometric.datasets.WebKB">Wisconsin</a> dataset successfully generalizes to 30+ other graphs of different sizes and features and, on average, outperforms GCN and GAT graph neural network architectures trained from scratch on each of those graphs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*N1oNlLJBZgbR-NzO" /><figcaption>Overview of GraphAny: LinearGNNs are used to perform non-parametric predictions and derive the entropy-normalized distance features. The final prediction is generated by fusing multiple LinearGNN predictions on each node with attention learned based on the distance features. Source: <a href="https://arxiv.org/abs/2405.20445">Zhao et al</a>.</figcaption></figure><p><strong>Setup: </strong>Semi-supervised node classification: given a graph G, node features X, and a few labeled nodes from C classes, predict labels of target nodes (binary or multi-class classification). The dimension of node features and the number of unique classes are not fixed and are graph-dependent.</p><p><strong>What is transferable: </strong>Instead of modeling a universal latent space for all possible graphs (which is quite cumbersome or maybe even practically impossible), GraphAny bypasses this problem and focuses on the <em>interactions between predictions of spectral filters</em>. 
Given a collection of high-pass and low-pass filters akin to <a href="https://arxiv.org/abs/1902.07153">Simplified Graph Convolutions</a> (for instance, operations of the form AX and (I-A)X, dubbed “LinearGNNs” in the paper) and known node labels:</p><p>0️⃣ GraphAny applies the filters to all nodes;</p><p>1️⃣ GraphAny obtains optimal weights for each predictor from nodes with known labels by solving a least squares optimization problem in closed form (the optimal weights are expressed as a pseudoinverse);</p><p>2️⃣ GraphAny applies the optimal weights to unknown nodes to get tentative prediction logits;</p><p>3️⃣ GraphAny computes pair-wise distances between those logits and applies entropy regularization (so that different graph and feature sizes do not affect the distribution). For example, for 5 LinearGNNs, this would result in 5 x 4 = 20 combinations of logit scores;</p><p>4️⃣ GraphAny learns an inductive attention matrix over those logits to weight the predictions most effectively (e.g., putting more attention on high-pass filters for heterophilic graphs).</p><p>In the end, the only learnable component in the model is the parameterization of attention (via an MLP), which <em>does not depend</em> on the target number of unique classes, but only on the number of LinearGNNs used. 
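</p><p><em>To make steps 1️⃣ and 2️⃣ concrete, here is a minimal numpy sketch of a single LinearGNN: filter the features, fit class weights on the labeled nodes in closed form via the pseudoinverse, and score every node. The toy graph and labels are invented for illustration; GraphAny additionally fuses several such predictors with learned attention.</em></p>

```python
import numpy as np

def linear_gnn_predict(A, X, train_idx, Y_train, hops=1):
    """One 'LinearGNN': apply a fixed graph filter (here A^k X), then
    fit class weights on the labeled nodes in closed form."""
    F = X.copy()
    for _ in range(hops):             # non-parametric low-pass filter
        F = A @ F
    W = np.linalg.pinv(F[train_idx]) @ Y_train  # least-squares solution
    return F @ W                      # tentative logits for every node

# toy path graph with 4 nodes, 2 classes, nodes 0 and 3 labeled
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
              [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
A = A / A.sum(1, keepdims=True)       # row-normalized adjacency
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
Y_train = np.eye(2)                   # one-hot labels of nodes 0 and 3
logits = linear_gnn_predict(A, X, [0, 3], Y_train)
print(logits.argmax(1))               # -> [0 0 1 1]
```

<p>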
In the same vein, all LinearGNN predictors are non-parametric; their updated node features and optimal weights can be pre-computed for faster inference.</p><p><strong>📚Read more in references [3]</strong></p><h3>Link Prediction: Not yet</h3><p><strong>Setup</strong>: given a graph G, with or without node features, predict whether a link exists between a pair of nodes (v1, v2).</p><p>😢 For graphs with node features, we are not aware of any single transferable model for link prediction.</p><p>For non-featurized graphs (or when you decide to omit node features deliberately), there is more to say — basically, all GNNs with a labeling trick can <em>potentially</em> transfer to new graphs thanks to the uniform node featurization strategy.</p><p>It is known that in link prediction, the biggest hurdle is the presence of automorphic nodes (nodes that have the same structural roles) — vanilla GNNs assign them the same feature, making two links (v1, v2) and (v1, v3) in the image below 👇 indistinguishable. <a href="https://arxiv.org/abs/2010.16103">Labeling tricks</a> like <a href="https://proceedings.neurips.cc/paper/2018/hash/53f0d7c537d99b3824f0f99d62ea2428-Abstract.html">Double Radius Node Labeling</a> or <a href="https://proceedings.neurips.cc/paper_files/paper/2020/hash/2f73168bf3656f697507752ec592c437-Abstract.html">Distance Encoding</a> are node featurization strategies that break such automorphism symmetries.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/912/0*ni6nDAFit_qzND1l" /><figcaption>v2 and v3 are automorphic nodes, and standard GNNs score (v1,v2) and (v1,v3) equally. When we predict (v1, v2), we will label these two nodes differently from the rest, so that a GNN is aware of the target link when learning v1 and v2’s representations. Similarly, when predicting (v1, v3), nodes v1 and v3 will be labeled differently. 
This way, the representation of v2 in the left graph will be different from that of v3 in the right graph, enabling GNNs to distinguish the non-isomorphic links (v1, v2) and (v1, v3). Source: <a href="https://arxiv.org/abs/2010.16103">Zhang et al</a>.</figcaption></figure><p>Perhaps the only approach with a labeling trick (for non-featurized graphs) that was evaluated on link prediction on unseen graphs is <a href="https://arxiv.org/abs/2402.07738">UniLP</a>. UniLP is an in-context, contrastive learning model that requires a set of positive and negative samples for each target link to be predicted. Practically, UniLP uses <a href="https://proceedings.neurips.cc/paper/2018/hash/53f0d7c537d99b3824f0f99d62ea2428-Abstract.html">SEAL</a> as a backbone GNN and learns attention over a fixed number of positive and negative samples. However, SEAL is notoriously slow, so the first step towards making UniLP scale to large graphs is to replace subgraph mining with more efficient approaches like <a href="https://arxiv.org/abs/2209.15486">ELPH</a> and <a href="https://arxiv.org/abs/2209.15486">BUDDY</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1S7-wj4N-5zLgUlG" /><figcaption>Overview of the Universal Link Predictor framework. (a) For predicting a query link 𝑞, we initially sample positive (𝑠+) and negative (𝑠-) in-context links from the target graph. Both the query link and these in-context links are independently processed through a shared subgraph GNN encoder. An attention mechanism then calculates scores based on the similarity between the query link and the in-context links. (b) The final representation of the query link, contextualized by the target graph, is obtained through a weighted summation, which combines the representations of the in-context links with their respective labels. 
Source: <a href="https://arxiv.org/abs/2402.07738">Dong et al.</a></figcaption></figure><p><strong>What is transferable: </strong>structural patterns learned by labeling-trick GNNs — it is proven that methods like <a href="https://arxiv.org/abs/2106.06935">Neural Bellman-Ford</a> capture metrics over node pairs, e.g., Personalized PageRank or the Katz index (often used for link prediction).</p><p>Now, as we know how to deal with automorphisms, the only remaining step towards a single graph FM for link prediction would be to add support for heterogeneous node features — perhaps GraphAny-style approaches might be an inspiration?</p><p><strong>📚Read more in references [4][5][6][7]</strong></p><h3>Knowledge Graph Reasoning: ULTRA and UltraQuery</h3><p>Knowledge graphs have graph-specific sets of entities and relations, e.g., common encyclopedia facts from Wikipedia / Wikidata or biomedical facts in Hetionet; those relations have different semantics and are not directly mappable to each other. For years, KG reasoning models were hardcoded to a given vocabulary of relations and could not transfer to new, unseen KGs with completely new entities and relations.</p><p><a href="https://openreview.net/forum?id=jVEoydFOl9">ULTRA</a> is the first foundation model for KG reasoning that transfers to any KG at inference time in a zero-shot manner. That is, a single pre-trained model can run inference on any multi-relational graph of any size and entity/relation vocabulary. Averaged over 57 graphs, ULTRA significantly outperforms baselines trained specifically on each graph. Recently, ULTRA was extended to <a href="https://arxiv.org/abs/2404.07198">UltraQuery</a> to support even more complex logical queries on graphs involving conjunctions, disjunctions, and negation operators. 
UltraQuery transfers to unseen graphs and to 10+ complex query patterns on those unseen graphs, outperforming much larger baselines trained from scratch.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9StOqWdoahiif204" /><figcaption>Given a query (Michael Jackson, genre, ?), ULTRA builds a graph of relations (edge types) to capture their interactions in the original graph conditioned on the query relation (genre) and derives relational representations from this smaller graph. Those features are then used as edge type features in the original bigger graph to answer the query. Source: <a href="https://openreview.net/forum?id=jVEoydFOl9">Galkin et al</a>.</figcaption></figure><p><strong>Setup: </strong>Given a multi-relational graph G with |E| nodes and |R| edge types and no node features, answer simple KG completion queries <em>(head, relation, ?)</em> or complex queries involving logical operators by returning a probability distribution over all nodes in the given graph. The set of nodes and relation types depends on the graph and can vary.</p><p><strong>What is transferable: </strong>ULTRA relies on modeling relational interactions. Forgetting about relation identities and the target graph domain for a second, if we see that relations “authored” and “collaborated” can share the same starting node in one graph, and relations “student” and “coauthor” can share a starting node in another graph, then the relative, structural representations of those two pairs of relations might be similar. This holds for any multi-relational graph in any domain, be it encyclopedic or biomedical KGs. ULTRA goes further and captures 4 such “fundamental” interactions between relations. 
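</p><p><em>A rough sketch of how those four interaction types can be extracted from raw triples (the toy triples below are invented for illustration; ULTRA then runs a GNN over the resulting relation graph):</em></p>

```python
from itertools import product

# toy KG triples (head, relation, tail)
triples = [("a", "authored", "p1"), ("a", "collaborated", "b"),
           ("b", "authored", "p2"), ("p1", "cites", "p2")]

def relation_graph(triples):
    """Connect two relations whenever they share a node, labeling the
    edge with one of four fundamental types: h2h, h2t, t2h, t2t."""
    heads, tails = {}, {}
    for h, r, t in triples:
        heads.setdefault(r, set()).add(h)
        tails.setdefault(r, set()).add(t)
    ends = {"h": heads, "t": tails}
    rels = sorted(heads.keys() | tails.keys())
    edges = set()
    for r1, r2 in product(rels, rels):
        if r1 == r2:
            continue
        for e1, e2 in product("ht", "ht"):
            if ends[e1].get(r1, set()) & ends[e2].get(r2, set()):
                edges.add((r1, r2, f"{e1}2{e2}"))
    return edges

edges = relation_graph(triples)
# "authored" and "collaborated" share the head node "a"
print(("authored", "collaborated", "h2h") in edges)  # -> True
```

<p>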
Those fundamental interactions are transferable to any KG (together with learned GNN weights) — this way, a single pre-trained model is ready for inference on any unseen graph and any simple or complex reasoning query.</p><p>Read more in the dedicated Medium post:</p><p><a href="https://towardsdatascience.com/ultra-foundation-models-for-knowledge-graph-reasoning-9f8f4a0d7f09">ULTRA: Foundation Models for Knowledge Graph Reasoning</a></p><p><strong>📚Read more in references [8][9]</strong></p><h3>Algorithmic Reasoning: Generalist Algorithmic Learner</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*whg61mwz584iEdBV" /><figcaption>A generalist neural algorithmic learner is a single processor GNN P, with a single set of weights, capable of solving several algorithmic tasks in a shared latent space (each of which is attached to P with simple encoders/decoders f and g). Among others, the processor network is capable of sorting (top), shortest path-finding (middle), and convex hull finding (bottom). Source: <a href="https://proceedings.mlr.press/v198/ibarz22a/ibarz22a.pdf">Ibarz et al.</a></figcaption></figure><p><strong>Setup: </strong><a href="https://arxiv.org/abs/2105.02761">Neural algorithmic reasoning</a> (NAR) studies the execution of standard algorithms (e.g., sorting, searching, dynamic programming) in the latent space and generalization to inputs of arbitrary size. Many such algorithms can be represented with a graph input and pointers. Given a graph G with node and edge features, the task is to simulate the algorithm and produce the correct output. Optionally, you can get access to hints — time series of intermediate states of the algorithm that can act as an intermediate supervision signal. Obviously, different algorithms require a different number of steps to execute, so the length is not fixed here.</p><p><strong>What is transferable: </strong>Homogeneous feature space and similar control flow for similar algorithms. 
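</p><p><em>This shared control flow is easy to show in code: a single best-first skeleton, parameterized only by a key function, yields both shortest paths (Dijkstra) and minimum-spanning-tree growth (Prim). A toy sketch, not the CLRS benchmark implementation:</em></p>

```python
import heapq

def best_first(graph, source, key):
    """Shared skeleton: a priority-queue sweep over the graph; only
    `key` differs between Dijkstra (path length) and Prim (edge weight)."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        for v, w in graph[u]:
            new = key(d, w)
            if v not in visited and new < best.get(v, float("inf")):
                best[v] = new
                heapq.heappush(heap, (new, v))
    return best

# toy weighted triangle graph as adjacency lists of (neighbor, weight)
graph = {0: [(1, 4.0), (2, 1.0)], 1: [(0, 4.0), (2, 2.0)],
         2: [(0, 1.0), (1, 2.0)]}
dijkstra = best_first(graph, 0, key=lambda d, w: d + w)  # shortest paths
prim     = best_first(graph, 0, key=lambda d, w: w)      # MST edge keys
print(dijkstra[1], prim[1])  # -> 3.0 2.0
```

<p>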
For instance, Prim’s and Dijkstra’s algorithms share a similar structure, differing only in the choice of the key function and the edge relaxation subroutine. Besides, there are <a href="https://arxiv.org/abs/1905.13211">several</a> <a href="https://arxiv.org/abs/2203.15544">proofs</a> of a direct alignment between message passing and dynamic programming. This is the main motivation behind one “processor” neural network that updates latent states for all considered algorithms (<a href="https://github.com/google-deepmind/clrs">30 classic algos</a> from the CLRS book).</p><p><a href="https://proceedings.mlr.press/v198/ibarz22a/ibarz22a.pdf">Triplet-GMPNN</a> was the first such universal processor neural net (by 2024 it became rather standard in the NAR literature) — it is a GNN that operates on triples of nodes and their features (akin to <a href="https://arxiv.org/abs/2112.00578">Edge Transformers</a> and triangular attention in AlphaFold). The model is trained in the multi-task mode on all algorithmic tasks in the benchmark with a handful of optimization tricks. A single model bumps the average performance on 30 tasks by over 20% (in absolute numbers) compared to single-task specialist models.</p><p>Still, encoders and decoders are parameterized specifically for each task — one of the ways to unify the input and output formats may well be text with LLM processors, as done in the recent <a href="https://arxiv.org/abs/2406.04229">text version of CLRS</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*T4qoNOb0Q0DYiozX" /><figcaption><strong>Top</strong>: The graph algorithmic trace of insertion sorting a list <em>[5, 2, 4, 3, 1]</em> in graph form. <strong>Bottom</strong>: The same algorithmic trace, represented textually, by using the CLRS-Text generator. 
The model receives as input (depicted in green) the input array (key) and the initial value of the sorting trace (initial_trace), using which it is prompted to predict the trace (depicted in blue) of gradually sorting the list, by inserting one element at a time into a partially sorted list, from left to right. At the end, the model needs to output the final sorted array (depicted in red), and it is evaluated on whether this array is predicted correctly. Source: <a href="https://arxiv.org/abs/2406.04229">Markeeva, McLeish, Ibarz, et al.</a></figcaption></figure><p>Perhaps the most interesting question of 2024 and 2025 in NAR is:</p><blockquote><em>Can algorithmic reasoning ideas for OOD generalization be the key to generalizable LLM reasoning?</em></blockquote><p>LLMs notoriously struggle with complex reasoning problems; dozens of papers appear on arXiv every month trying a new prompting method to bump benchmark performance by another percentage point or two, but most of them do not transfer across tasks with similar graph structures (see the example below). There is a need for more principled approaches, and NAR has the potential to fill this gap!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*myXuP9-sJXrcjlOc" /><figcaption>Failure of LLMs on reasoning problems with similar graph structures. Image by Authors.</figcaption></figure><p><strong>📚Read more in references [10][11]</strong></p><h3>Geometric and AI4Science Foundation Models</h3><p>In the world of Geometric Deep Learning and scientific applications, foundation models are becoming prevalent as universal ML potentials, protein language models, and universal molecular property predictors. 
Although a universal vocabulary exists in most such cases (e.g., atom types in small molecules or amino acids in proteins) and we do not have to think about universal featurization, the main complexity lies in the real-world physical nature of atomistic objects — they have a pronounced 3D structure and properties (like energy), which have theoretical justifications rooted in chemistry, physics, and quantum mechanics.</p><h3>ML Potentials: JMP-1, DPA-2 for molecules, MACE-MP-0 and MatterSim for inorganic crystals</h3><p><strong>Setup</strong>: given a 3D structure, predict the energy of the structure and per-atom forces;</p><p><strong>What is transferable</strong>: a vocabulary of atoms from the periodic table.</p><p>ML potentials estimate the potential energy of a chemical compound — like molecules or periodic crystals — given their 3D coordinates and optional inputs (like periodic boundary conditions for crystals). For any atomistic model, the vocabulary of possible atoms is always bound by the <a href="https://en.wikipedia.org/wiki/Periodic_table">Periodic Table</a>, which currently includes 118 elements. The “foundational” aspect of ML potentials is to generalize to any atomistic structure (there can be combinatorially many) and be stable enough to be used in molecular dynamics (MD), drug-, and materials-discovery pipelines.</p><p><a href="https://arxiv.org/abs/2310.16802">JMP-1</a> and <a href="https://arxiv.org/abs/2312.15492">DPA-2</a>, released around the same time, aim to be such universal ML potential models — they are trained on a wide variety of structures, from organic molecules to crystals to MD trajectories. For example, a single pre-trained JMP-1 excels at QM9 and rMD17 for small molecules, MatBench and QMOF for crystals, and MD22 and SPICE for large molecules, being on par with or better than specialized per-dataset models. 
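</p><p><em>The interface such models expose is uniform: Cartesian coordinates in, a scalar energy and per-atom forces (the negative energy gradient) out. Below is a toy stand-in using an analytic Lennard-Jones pair potential; real FMs replace this closed form with a trained equivariant GNN.</em></p>

```python
import numpy as np

def lj_energy_forces(pos, eps=1.0, sigma=1.0):
    """Toy pair potential: total energy and per-atom forces F_i = -dE/dx_i.
    Foundation-model potentials expose the same positions->(energy, forces)
    interface, with a learned model instead of this analytic form."""
    n = len(pos)
    energy = 0.0
    forces = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            rij = pos[i] - pos[j]
            r = np.linalg.norm(rij)
            sr6 = (sigma / r) ** 6
            energy += 4 * eps * (sr6 ** 2 - sr6)
            # pair force on atom i, directed along rij
            f = 24 * eps * (2 * sr6 ** 2 - sr6) / r ** 2 * rij
            forces[i] += f
            forces[j] -= f
    return energy, forces

# two atoms at the LJ equilibrium distance 2^(1/6) * sigma
pos = np.array([[0.0, 0.0, 0.0], [2 ** (1 / 6), 0.0, 0.0]])
e, f = lj_energy_forces(pos)
print(round(e, 6))  # -> -1.0 (energy minimum, forces vanish)
```

<p>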
Similarly, <a href="https://arxiv.org/abs/2401.00096">MACE-MP-0</a> and <a href="https://arxiv.org/abs/2405.04967">MatterSim</a> are the most advanced FMs for inorganic crystals (MACE-MP-0 is already available with weights), evaluated on 20+ crystal tasks ranging from multicomponent alloys to combustion and molten salts. Equivariant GNNs are at the heart of those systems, helping to process equivariant features (Cartesian coordinates) and invariant features (like atom types).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QaoSPDK2f0EhOiGt" /><figcaption>Sources: (1) Pre-training and fine-tuning of <strong>JMP-1</strong> for molecules and crystals, <a href="https://arxiv.org/abs/2310.16802">Shoghi et al</a> (2) <strong>MACE-MP-0</strong> is trained only on the Materials Project data and transfers to molecular dynamics simulation across a wide variety of chemistries in the solid, liquid and gaseous phases, <a href="https://arxiv.org/abs/2401.00096">Batatia, Benner, Chiang, Elena, Kovács, Riebesell et al</a>.</figcaption></figure><p>The next frontier seems to be ML-accelerated molecular dynamics simulations — traditional computational methods work at the femtosecond scale (10<sup>−15</sup> s) and require millions to billions of steps to simulate a molecule, crystal, or protein. Speeding up such computations would have an immense scientific impact.</p><p><strong>📚Read more in references [12][13][14][15]</strong></p><h3>Protein LMs: ESM-2</h3><p><strong>Setup</strong>: given a protein sequence, predict the masked tokens akin to masked language modeling;</p><p><strong>What is transferable</strong>: a vocabulary of 20 (22) amino acids.</p><p>Protein sequences resemble natural language with amino acids as tokens, and Transformers excel at encoding sequence data. Although the vocabulary of amino acids is relatively small, the space of possible proteins is enormous, so training on large volumes of known proteins might hint at the properties of unseen combinations. 
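</p><p><em>A toy illustration of that masked-token objective over the 20-amino-acid vocabulary (the sequence below is an arbitrary invented example; real tokenizers also add special and rare-residue tokens):</em></p>

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def mask_sequence(seq, rate=0.15, mask_token="<mask>", seed=0):
    """BERT-style masking: hide a fraction of residues; the training
    objective is to recover the original amino acids at those positions."""
    rng = random.Random(seed)
    tokens, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < rate:
            targets[i] = aa           # ground truth the model must predict
            tokens.append(mask_token)
        else:
            tokens.append(aa)
    return tokens, targets

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # invented example sequence
assert all(aa in AMINO_ACIDS for aa in seq)
tokens, targets = mask_sequence(seq)
print(len(targets), "positions masked out of", len(seq))
```

<p>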
<a href="https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1">ESM-2</a> is perhaps the most popular protein LM thanks to its pre-training data size, the variety of available checkpoints, and informative features.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*J2R_VoLXp6ct8Arm" /><figcaption>ESM2 as a masked LM and ESMFold for protein structure prediction. Source: <a href="https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1">Lin, Akin, Rao, Hie, et al.</a></figcaption></figure><p>ESM features are used in countless applications from predicting 3D structure (in <a href="https://github.com/facebookresearch/esm">ESMFold</a>) to protein-ligand binding (<a href="https://arxiv.org/abs/2210.01776">DiffDock</a> and its descendants) to protein structure generative models (like the recent <a href="https://www.dreamfold.ai/blog/foldflow-2">FoldFlow 2</a>). Bigger transformers and more data are likely to increase protein LMs’ performance even further — at this scale, however, the data question becomes more prevalent (we also discuss the interplay between architecture and data in the dedicated section), e.g., the <a href="https://esmatlas.com/">ESM Metagenomic Atlas</a> already encodes 700M+ structures, including proteins found outside humans — in soil, oceans, or hydrothermal vents. Is there a way to get to trillions of tokens as in common LLM training datasets?</p><p><strong>📚Read more in references [16][17]</strong></p><h3>2D Molecules: MiniMol and MolGPS</h3><p><strong>Setup</strong>: given a 2D graph structure with atom types and bond types, predict molecular properties</p><p><strong>What is transferable</strong>: a vocabulary of atoms from the periodic table and bond types</p><p>With 2D graphs (without 3D atom coordinates), universal encoding and transferability come from a fixed vocabulary of atom and bond types, which you can send to any GNN or Transformer encoder. 
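</p><p><em>A minimal sketch of such vocabulary-based featurization (the vocabularies below are truncated examples for illustration; real models cover the full periodic table and all bond types):</em></p>

```python
# truncated example vocabularies; real models use far larger ones
ATOM_VOCAB = ["C", "N", "O", "F", "S", "Cl"]
BOND_VOCAB = ["single", "double", "triple", "aromatic"]

def one_hot(item, vocab):
    """Fixed-vocabulary encoding: the same feature space for any molecule."""
    vec = [0] * len(vocab)
    vec[vocab.index(item)] = 1
    return vec

# toy ethanol-like fragment: atoms plus (src, dst, bond type) edges
atoms = ["C", "C", "O"]
bonds = [(0, 1, "single"), (1, 2, "single")]

node_feats = [one_hot(a, ATOM_VOCAB) for a in atoms]
edge_feats = [one_hot(b, BOND_VOCAB) for _, _, b in bonds]
print(node_feats[2])  # -> [0, 0, 1, 0, 0, 0] (the oxygen atom)
```

<p>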
Although molecular fingerprints have been used since the 1960s (<a href="https://pubs.acs.org/doi/abs/10.1021/c160017a018">Morgan fingerprints</a> [18]), their primary goal was to evaluate similarity, not to model a latent space. The task of a single (large) neural encoder is to learn useful representations that might hint at certain physical molecular properties.</p><p>Recent examples of generalist models for learning molecular representations are <a href="https://arxiv.org/pdf/2404.14986">MiniMol</a> and <a href="https://arxiv.org/abs/2404.11568v1">MolGPS</a>, which have been trained on a large corpus of molecular graphs and probed on dozens of downstream tasks. That said, you still need to fine-tune a separate task-specific decoder/predictor on top of the models’ representations — in that sense, one single pre-trained model will not be able to run zero-shot inference on all possible unseen tasks, but rather on those for which decoders have been trained. Fine-tuning is still a good, cheap option, though, since those models are orders of magnitude smaller than LLMs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pvGmT5XrDSNuzmks" /><figcaption>Source: (1) Workflow overview of the <a href="https://arxiv.org/pdf/2404.14986">MiniMol</a> pre-training and downstream task evaluation. (2) Criteria of the scaling study of <a href="https://arxiv.org/abs/2404.11568v1">MolGPS</a></figcaption></figure><p><strong>📚Read more in references [19][20]</strong></p><h3>Expressivity &amp; Scaling Laws: Do Graph FMs scale?</h3><p>Transformers in LLMs and multi-modal frontier models are rather standard, and we know some basic scaling principles for them. Do transformers (as an architecture, not LLMs) work equally well on graphs? 
What are the general challenges when designing a backbone for Graph FMs?</p><p>If you categorize the models highlighted in the previous sections, only two areas feature transformers — protein LMs (ESM) with a natural sequential bias and small molecules (MolGPS). The rest are GNNs. There are several reasons for that:</p><ul><li>Vanilla transformers do not scale to graphs larger than a standard context length (&gt;4–10k nodes). Anything above that range requires tricks like feeding only subgraphs (losing the whole graph structure and long-range dependencies) or linear attention (which might not have good scaling properties). In contrast, GNNs are linear in the number of edges, and, in the case of sparse graphs (V ~ E), linear in the number of nodes.</li><li>Vanilla transformers without positional encodings are <a href="https://arxiv.org/abs/2302.04181">less expressive than GNNs</a>. Computing positional encodings like Laplacian PEs on a graph with V nodes is O(V³).</li><li>What should be a “token” when encoding graphs via transformers? There is no clear winner in the literature, e.g., <a href="https://arxiv.org/abs/2106.05234">nodes</a>, <a href="https://arxiv.org/abs/2406.03148">nodes + edges</a>, or <a href="https://arxiv.org/abs/2212.13350">subgraphs</a> are all viable options.</li></ul><p><strong>➡️ </strong>Touching upon <strong>expressivity</strong>, different graph tasks need to deal with different symmetries, e.g., automorphic nodes in link prediction lead to indistinguishable representations, whereas in graph classification/regression going beyond 1-WL is necessary for distinguishing molecules which otherwise might look isomorphic to vanilla GNNs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZriY6GNKtUx4ufHK" /><figcaption>Different tasks need to deal with different symmetries. Image by Authors. 
Sources of graphs: (1) <a href="https://arxiv.org/abs/2010.16103">Zhang et al</a>, (2) <a href="https://arxiv.org/abs/2112.09992">Morris et al</a></figcaption></figure><p>This fact raises two questions:</p><blockquote><em>How expressive should GFMs be? What is the trade-off between expressivity and scalability?</em></blockquote><p>Ideally, we want a single model to resolve all those symmetries equally well. However, more expressive models lead to more computationally expensive architectures both in training and inference. We agree with the recent <a href="https://arxiv.org/abs/2402.02287">ICML’24 position paper on the future directions in Graph ML theory</a> that the community should seek the balance between expressivity, generalization, and optimization.</p><p>Still, it is worth noting that with the growing availability of training data, it might be computationally cheaper to let models learn complex symmetries and invariances directly from the data (instead of baking them into the architecture). Good recent examples of this thesis are <a href="https://www.nature.com/articles/s41586-024-07487-w">AlphaFold 3</a> and <a href="https://arxiv.org/abs/2311.17932">Molecular Conformer Fields</a> that reach SOTA in many generative applications <em>without</em> expensive equivariant geometric encoders.</p><p><strong>📚Read more in reference [21]</strong></p><p><strong>➡️ </strong>When it comes to <strong>scaling</strong>, both model and data should be scaled up. However:</p><p>❌ Non-geometric graphs: There is no principled study on scaling GNNs or Transformers to large graphs and common tasks like node classification and link prediction. A 2-layer GraphSAGE is often not very far away from huge 16-layer graph transformers. Similarly, in the KG reasoning domain, a single ULTRA model (discussed above) with &lt;200k parameters outperforms million-sized shallow embedding models on 50+ graphs. Why is it happening? 
We’d hypothesize the crux is in 1️⃣ the task nature — most non-geometric graphs are noisy similarity graphs not grounded in a concrete physical phenomenon the way molecules are; and 2️⃣ the feature regime — given rich node and edge features, models have to learn <em>representations of graph structures</em> (common for link prediction) or just <em>functions over given features</em> (a good example is <a href="https://ogb.stanford.edu/docs/leader_nodeprop/">node classification in OGB</a>, where most gains are achieved by adding an LLM feature encoder).</p><p>✅ Geometric graphs: There are several recent works focusing on molecular graphs:</p><ul><li><a href="https://www.nature.com/articles/s42256-023-00740-3">Frey et al</a> (2023) study scaling of geometric GNNs for ML potentials;</li><li><a href="https://arxiv.org/abs/2404.11568v1">Sypetkowski, Wenkel et al</a> (2024) introduce MolGPS and study scaling MPNNs and Graph Transformers up to 1B parameters on a large dataset of 5M molecules;</li><li><a href="https://arxiv.org/abs/2402.02054">Liu et al</a> (2024) probe GCN, GIN, and GraphGPS up to 100M parameters on molecular datasets up to 4M molecules.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1gK0G_v_Kj_aO4ef" /><figcaption>Scaling molecular GNNs and GTs. Sources: (1) <a href="https://arxiv.org/abs/2404.11568v1">Sypetkowski, Wenkel et al</a>, (2) <a href="https://arxiv.org/abs/2402.02054">Liu et al</a></figcaption></figure><h3>The Data Question: What should be scaled? Is there enough graph data to train Graph FMs?</h3><p>1️⃣ <strong>What should be scaled in graph data? </strong>Nodes? Edges? The number of graphs? Something else?</p><p>There is no clear winner in the literature; we would rather gravitate towards the broader term <strong><em>diversity</em></strong>, that is, a diversity of patterns in the graph data. 
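</p><p>One such pattern is label homophily, the tendency of edges to connect same-label nodes. A minimal sketch (pure Python; the toy graph and labels are invented for illustration) of the edge-homophily ratio commonly used to characterize node-classification benchmarks:</p>

```python
def edge_homophily(edge_list, labels):
    """Fraction of edges joining same-label endpoints: one simple,
    scale-free 'pattern' a pre-training corpus can be diversified over."""
    same = sum(labels[u] == labels[v] for u, v in edge_list)
    return same / len(edge_list)

# Toy graph: two label-pure triangles joined by a single bridge edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = [0, 0, 0, 1, 1, 1]
h = edge_homophily(edges, labels)  # 6 of the 7 edges are intra-class
```

<p>A ratio near 1 means neighbors mostly share labels (homophily); a ratio near 0 means they mostly do not (heterophily). 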
For example, in node classification on large product graphs, it likely would not matter much whether you train on a graph with 100M nodes or 10B nodes, since both share the same user-item nature. However, showing examples with homophily and heterophily on different scales and sparsities might be quite beneficial. In <strong>GraphAny</strong>, showing examples of such graphs made it possible to build a robust node classifier that generalizes to different graph distributions.</p><p>In KG reasoning with <strong>ULTRA</strong>, it was found that the <strong><em>diversity of relational patterns</em></strong> in pre-training plays the biggest role in inductive generalization, e.g., one large dense graph is worse than a collection of smaller but sparse, dense, few-relational, and many-relational graphs.</p><p>In molecular graph-level tasks, e.g., in <strong>MolGPS</strong>, scaling the number of unique molecules with different physical properties helps a lot (as shown in the charts above 👆).</p><p>Besides, <a href="https://arxiv.org/abs/2406.01899">UniAug</a> finds that increased coverage of the structural patterns in pre-training data adds to the performance across different downstream tasks from various domains.</p><p><strong>2️⃣ Is there enough data to train Graph FMs?</strong></p><p>Openly available graph data is orders of magnitude smaller than natural language, image, or video corpora, and that is fine. This very article includes thousands of language and image tokens and no explicit graphs (unless you try to parse this text into a graph like an <a href="https://en.wikipedia.org/wiki/Abstract_Meaning_Representation">abstract meaning representation</a> graph). The number of ‘good’ proteins with known structures in PDB is small, and the number of known ‘good’ molecules for drugs is small.</p><blockquote>Are Graph FMs doomed because of data scarcity?</blockquote><p>Well, not really. 
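</p><p>Synthetic graphs are one reason why not: a generator can be dialed to produce arbitrarily diverse structure. A toy sketch (pure Python, standard library only; the block sizes and probabilities are invented for illustration) of a stochastic block model sampler in the spirit of synthetic graph benchmarks:</p>

```python
import random

def sample_sbm(block_sizes, p_in, p_out, seed=0):
    """Tiny stochastic block model sampler: edges are denser inside blocks
    (probability p_in) than between them (probability p_out)."""
    rng = random.Random(seed)            # fixed seed -> reproducible graph
    block = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    edges = []
    for u in range(len(block)):
        for v in range(u + 1, len(block)):
            if rng.random() < (p_in if block[u] == block[v] else p_out):
                edges.append((u, v))
    return edges

g = sample_sbm([5, 5], p_in=0.9, p_out=0.05)  # 10 nodes, two communities
```

<p>Sweeping p_in and p_out yields graphs ranging from strongly clustered to near-random, i.e., a controllable source of structural diversity. 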
The two open avenues are: (1) more sample-efficient architectures; (2) using more black-box and synthetic data.</p><p>Synthetic benchmarks like <a href="https://arxiv.org/abs/2203.00112">GraphWorld</a> can help increase the diversity of training data and improve generalization to real-world datasets. Black-box data obtained from scientific experiments, in turn, is likely to become the key factor in building successful foundation models in AI4Science — those who master it will prevail on the market.</p><p><a href="https://towardsdatascience.com/the-road-to-biology-2-0-will-pass-through-black-box-data-bbd00fabf959">The Road to Biology 2.0 Will Pass Through Black-Box Data</a></p><p><strong>📚Read more in references [20][22][23]</strong></p><h3>👉 Key Takeaways 👈</h3><p><strong>➡️ How to generalize across graphs with heterogeneous node/edge/graph features?</strong></p><ul><li>Non-geometric graphs: Relative information transfers (such as prediction differences in <em>GraphAny</em> or relational interactions in <em>ULTRA</em>); absolute information does not.</li><li>Geometric graphs: transfer is easier thanks to the fixed set of atoms, but models have to learn some notion of physics to be reliable.</li></ul><p><strong>➡️ How to generalize across prediction tasks?</strong></p><ul><li>To date, there is no single model (among non-geometric GNNs) that would be able to perform node classification, link prediction, and graph classification in the zero-shot inference mode.</li><li>Framing all tasks through the lens of one might help, e.g., node classification can be framed as link prediction.</li></ul><p><strong>➡️ What is the optimal model expressivity?</strong></p><ul><li>Node classification, link prediction, and graph classification leverage different symmetries.</li><li>Blunt application of maximally expressive models quickly leads to exponential runtime complexity or enormous memory costs — need to maintain the <em>expressivity vs efficiency</em> 
balance.</li><li>The link between expressivity, sample complexity (how much training data you need), and inductive generalization is still unknown.</li></ul><p><strong>➡️ Data</strong></p><ul><li>Openly available graph data is orders of magnitude smaller than text/vision data, so models have to be sample-efficient.</li><li>Scaling laws are still at an emerging stage, and it is unclear what to scale — the number of nodes? Edges? Motifs? What is the notion of a token in graphs?</li><li>Geometric GNNs: there is much more experimental data available that makes little sense to domain experts but might be of value to neural nets.</li></ul><ol><li>Mao, Chen, et al. <a href="https://arxiv.org/abs/2402.02216">Graph Foundation Models Are Already Here</a>. ICML 2024</li><li>Morris et al. <a href="https://arxiv.org/abs/2402.02287">Future Directions in Foundations of Graph Machine Learning</a>. ICML 2024</li><li>Zhao et al. <a href="https://arxiv.org/abs/2405.20445">GraphAny: A Foundation Model for Node Classification on Any Graph</a>. arXiv 2024. <a href="https://github.com/DeepGraphLearning/GraphAny">Code on GitHub</a></li><li>Dong et al. <a href="https://arxiv.org/abs/2402.07738">Universal Link Predictor By In-Context Learning on Graphs</a>. arXiv 2024</li><li>Zhang et al. <a href="https://arxiv.org/abs/2010.16103">Labeling Trick: A Theory of Using Graph Neural Networks for Multi-Node Representation Learning</a>. NeurIPS 2021</li><li>Chamberlain, Shirobokov, et al. <a href="https://arxiv.org/abs/2209.15486">Graph Neural Networks for Link Prediction with Subgraph Sketching</a>. ICLR 2023</li><li>Zhu et al. <a href="https://arxiv.org/abs/2106.06935">Neural Bellman-Ford Networks: A General Graph Neural Network Framework for Link Prediction</a>. NeurIPS 2021</li><li>Galkin et al. <a href="https://openreview.net/forum?id=jVEoydFOl9">Towards Foundation Models for Knowledge Graph Reasoning</a>. ICLR 2024</li><li>Galkin et al. 
<a href="https://arxiv.org/abs/2404.07198">Zero-shot Logical Query Reasoning on any Knowledge Graph</a>. arXiv 2024. <a href="https://github.com/DeepGraphLearning/ULTRA">Code on GitHub</a></li><li>Ibarz et al. <a href="https://proceedings.mlr.press/v198/ibarz22a/ibarz22a.pdf">A Generalist Neural Algorithmic Learner</a>. LoG 2022</li><li>Markeeva, McLeish, Ibarz, et al. <a href="https://arxiv.org/abs/2406.04229">The CLRS-Text Algorithmic Reasoning Language Benchmark</a>. arXiv 2024</li><li>Shoghi et al. <a href="https://arxiv.org/abs/2310.16802">From Molecules to Materials: Pre-training Large Generalizable Models for Atomic Property Prediction</a>. ICLR 2024</li><li>Zhang, Liu et al. <a href="https://arxiv.org/abs/2312.15492">DPA-2: Towards a universal large atomic model for molecular and material simulation</a>. arXiv 2023</li><li>Batatia et al. <a href="https://arxiv.org/abs/2401.00096">A foundation model for atomistic materials chemistry</a>. arXiv 2024</li><li>Yang et al. <a href="https://arxiv.org/abs/2405.04967">MatterSim: A Deep Learning Atomistic Model Across Elements, Temperatures and Pressures</a>. arXiv 2024</li><li>Rives et al. <a href="https://www.pnas.org/doi/full/10.1073/pnas.2016239118">Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences</a>. PNAS 2021</li><li>Lin, Akin, Rao, Hie, et al. <a href="https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1">Language models of protein sequences at the scale of evolution enable accurate structure prediction</a>. Science 2023. <a href="https://github.com/facebookresearch/esm">Code</a></li><li>Morgan HL (1965). <a href="https://pubs.acs.org/doi/abs/10.1021/c160017a018">The generation of a unique machine description for chemical structures — a technique developed at chemical abstracts service</a>. J Chem Doc 5:107–113.</li><li>Kläser, Banaszewski, et al. 
<a href="https://arxiv.org/pdf/2404.14986">MiniMol: A Parameter Efficient Foundation Model for Molecular Learning</a>. arXiv 2024</li><li>Sypetkowski, Wenkel et al. <a href="https://arxiv.org/abs/2404.11568v1">On the Scalability of GNNs for Molecular Graphs</a>. arXiv 2024</li><li>Morris et al. <a href="https://arxiv.org/abs/2402.02287">Future Directions in Foundations of Graph Machine Learning</a>. ICML 2024</li><li>Liu et al. <a href="https://arxiv.org/abs/2402.02054">Neural Scaling Laws on Graphs</a>. arXiv 2024</li><li>Frey et al. <a href="https://www.nature.com/articles/s42256-023-00740-3">Neural scaling of deep chemical models</a>. Nature Machine Intelligence 2023</li></ol><hr><p><a href="https://medium.com/data-science/foundation-models-in-graph-geometric-deep-learning-f363e2576f58">Foundation Models in Graph &amp; Geometric Deep Learning</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph & Geometric ML in 2024: Where We Are and What’s Next (Part II — Applications)]]></title>
            <link>https://medium.com/data-science/graph-geometric-ml-in-2024-where-we-are-and-whats-next-part-ii-applications-1ed786f7bf63?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/1ed786f7bf63</guid>
            <category><![CDATA[aritificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[graph-machine-learning]]></category>
            <category><![CDATA[geometric-deep-learning]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Tue, 16 Jan 2024 05:59:01 GMT</pubDate>
            <atom:updated>2024-01-18T19:45:00.945Z</atom:updated>
            <content:encoded><![CDATA[<h4>State-of-the-Art Digest</h4><h3>Graph &amp; Geometric ML in 2024: Where We Are and What’s Next (Part II — Applications)</h3><h4>Following the tradition from previous years, we interviewed a cohort of distinguished and prolific academic and industrial experts in an attempt to summarise the highlights of the past year and predict what is in store for 2024. Past 2023 was so ripe with results that we had to break this post into two parts. This is Part II focusing on applications, see also <a href="https://towardsdatascience.com/graph-geometric-ml-in-2024-where-we-are-and-whats-next-part-i-theory-architectures-3af5d38376e1">Part I</a> for theory &amp; new architectures.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Lz_A1l6i036AtJ-FBFOe2w.png" /><figcaption>Image by Authors with some help from DALL-E 3.</figcaption></figure><p><em>The post is written and edited by </em><a href="https://twitter.com/michael_galkin"><em>Michael Galkin</em></a><em> and </em><a href="https://twitter.com/mmbronstein"><em>Michael Bronstein</em></a><em> with significant contributions from </em><a href="https://twitter.com/dom_beaini"><em>Dominique Beaini</em></a><em>, </em><a href="https://twitter.com/nathanbenaich"><em>Nathan Benaich</em></a><em>, </em><a href="https://twitter.com/bose_joey"><em>Joey Bose</em></a><em>, </em><a href="https://twitter.com/jo_brandstetter"><em>Johannes Brandstetter</em></a><em>, </em><a href="https://twitter.com/befcorreia"><em>Bruno Correia</em></a><em>, </em><a href="https://twitter.com/Ahmed_AI035"><em>Ahmed Elhag</em></a><em>, </em><a href="https://twitter.com/KexinHuang5"><em>Kexin Huang</em></a><em>, </em><a href="https://twitter.com/chaitjo"><em>Chaitanya Joshi</em></a><em>, </em><a href="https://twitter.com/leonklein26"><em>Leon Klein</em></a><em>, </em><a href="https://twitter.com/anoopnm007"><em>N M Anoop Krishnan</em></a><em>, </em><a href="https://twitter.com/WillLin1028"><em>Chen 
Lin</em></a><em>, </em><a href="https://twitter.com/loukasa_tweet"><em>Andreas Loukas</em></a><em>, </em><a href="https://www.linkedin.com/in/santiago-miret"><em>Santiago Miret</em></a><em>, </em><a href="https://twitter.com/NaefLuca"><em>Luca Naef</em></a><em>, </em><a href="https://twitter.com/LProkhorenkova"><em>Liudmila Prokhorenkova</em></a><em>, </em><a href="https://twitter.com/emaros96"><em>Emanuele Rossi</em></a><em>, </em><a href="https://twitter.com/HannesStaerk"><em>Hannes Stärk</em></a><em>, </em><a href="https://twitter.com/AlexanderTong7"><em>Alex Tong</em></a><em>, </em><a href="https://twitter.com/tsitsulin_"><em>Anton Tsitsulin</em></a><em>, </em><a href="https://twitter.com/PetarV_93"><em>Petar Veličković</em></a><em>, </em><a href="https://twitter.com/MinkaiX"><em>Minkai Xu</em></a><em>, and </em><a href="https://twitter.com/zhu_zhaocheng"><em>Zhaocheng Zhu</em></a><em>.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Z5Ncv43RzAe1o-yI" /><figcaption>Geometric ML methods and applications filled the covers of high-profile journals in 2023 (Figure sources: the papers by <a href="https://www.nature.com/articles/s42256-023-00609-5">Wang et al.</a>, <a href="https://www.nature.com/articles/s42256-023-00684-8">Viñas et al.</a>, <a href="https://www.nature.com/articles/s42256-023-00716-3">Deng et al.</a>, <a href="https://www.nature.com/articles/s43588-023-00532-0">Weiss et al.</a>, <a href="https://www.nature.com/articles/s42256-023-00744-z">Lagemann et al.</a>, <a href="https://www.nature.com/articles/s43588-023-00563-7">Duan et al.</a>, and <a href="https://www.science.org/doi/10.1126/science.adi2336">Lam et al.</a>)</figcaption></figure><ol><li><a href="#2626">Structural Biology (Molecules &amp; Proteins)</a><br>a. <a href="#2f16">A Structural Biologist’s Perspective</a><br>b. <a href="#ade6">Industrial Perspective</a><br>c. 
<a href="#7a08">Systems Biology</a></li><li><a href="#6211">Materials Science (Crystals)</a></li><li><a href="#0924">Molecular Dynamics &amp; ML Potentials</a></li><li><a href="#34b8">Geometric Generative Models (Manifolds)</a></li><li><a href="#3f98">BIG Graphs, Scalability: When GNNs are too expensive</a></li><li><a href="#f6d7">Algorithmic Reasoning &amp; Alignment</a></li><li><a href="#5854">Knowledge Graphs: Inductive Reasoning is Solved?</a></li><li><a href="#add1">Temporal Graph Learning</a></li><li><a href="#ad3d">LLMs + Graphs for Scientific Discovery</a></li><li><a href="#3d00">Cool GNN Applications</a></li><li><a href="#986f">Geometric Wall Street Bulletin 💸</a></li></ol><p>The legend we will be using throughout the text:<br>🔥 hot topics <br>💡 year’s highlight<br>🏋️ challenges<br>➡️ current/next developments <br>🔮 predictions/speculations <br>💰 financial transactions</p><h3>Structural Biology (Molecules &amp; Proteins)</h3><p><em>Dominique Beaini (Valence), Joey Bose (Mila &amp; Dreamfold), Michael Bronstein (Oxford), Bruno Correia (EPFL), Michael Galkin (Intel), Kexin Huang (Stanford), Chaitanya Joshi (Cambridge), Andreas Loukas (Genentech), Luca Naef (VantAI), Hannes Stärk (MIT), Minkai Xu (Stanford)</em></p><blockquote>Structural biology was definitely at the forefront of Geometric Deep Learning in 2023.</blockquote><p>Following the 2020 discovery of <a href="https://pubmed.ncbi.nlm.nih.gov/32084340/">halicin</a> as a potential new antibiotic, in 2023, two new antibiotics were discovered with the help of GNNs! First, it is <a href="https://www.nature.com/articles/s41589-023-01349-8">abaucin</a> (by McMaster and MIT), which targets a stubborn pathogen resistant to many drugs. 
Second, MIT and Harvard researchers <a href="https://www.nature.com/articles/s41586-023-06887-8">discovered a new structural class of antibiotics</a> where the screening process was supported by <a href="https://github.com/chemprop/chemprop">ChemProp</a>, a suite of GNNs for molecular property prediction. We also observe a convergence of ML and experimental techniques (“lab-in-the-loop”) in the recent work on <a href="https://www.science.org/doi/10.1126/science.adi1407">autonomous molecular discovery</a> (a trend we will also see in Materials Design in the following sections).</p><p><strong>Flow Matching</strong> has been one of the biggest generative ML trends of 2023, allowing for faster sampling and deterministic sampling trajectories compared to diffusion models. The most prominent examples of Flow Matching models we have seen in biological applications are <strong>FoldFlow</strong> (<a href="https://arxiv.org/abs/2310.02391">Bose, Akhound-Sadegh, et al</a>.) for protein backbone generation, <strong>FlowSite</strong> (<a href="https://arxiv.org/abs/2310.05764">Stärk et al</a>.) for protein binding site design, and <strong>EquiFM</strong> (<a href="https://openreview.net/forum?id=hHUZ5V9XFu">Song, Gong, et al</a>.) for molecule generation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UQCCpeJV1xdx7ncGx8CdZw.png" /><figcaption><em>Conditional probability paths learned by different versions of FoldFlow, visualizing the rotation trajectory of a single residue by the action of SO(3) on its homogeneous space </em>𝕊²<em>. 
Figure source: </em><a href="https://arxiv.org/abs/2310.02391"><em>Bose, Akhound-Sadegh, et al</em></a><em>.</em></figcaption></figure><p>Efficient Flow Matching on complex geometries with necessary equivariances became possible thanks to a handful of theory papers including Riemannian Flow Matching (<a href="https://arxiv.org/abs/2302.03660">Chen and Lipman</a>), Minibatch Optimal Transport (<a href="https://arxiv.org/abs/2302.00482">Tong et al</a>), and Simulation-Free Schrödinger bridges (<a href="https://arxiv.org/abs/2307.03672">Tong, Malkin, Fatras, et al</a>). Great resources to learn Flow Matching with code examples and notebooks are the <a href="https://github.com/atong01/conditional-flow-matching">TorchCFM</a> repo on GitHub as well as talks by <a href="https://www.youtube.com/watch?v=5ZSwYogAxYg">Yaron Lipman</a>, <a href="https://www.youtube.com/watch?v=EPxDI0ytfQU">Joey Bose</a>, <a href="https://www.youtube.com/watch?v=Xl7YNR1-CN8">Hannes Stärk</a>, and <a href="https://www.youtube.com/watch?v=UhDtH7Ia9Ag">Alex Tong</a>.</p><p><strong>Diffusion models</strong> nevertheless continue to be the main workhorse of generative modeling in structural biology. 
In 2023, we saw several landmark works: <strong>FrameDiff</strong> (<a href="https://arxiv.org/abs/2302.02277">Yim, Trippe, De Bortoli, Mathieu, et al</a>) for protein backbone generation, <strong>EvoDiff</strong> (<a href="https://www.biorxiv.org/content/10.1101/2023.09.11.556673v1">Alamdari et al</a>) for generating protein sequences with discrete diffusion, <strong>AbDiffuser</strong> (<a href="https://arxiv.org/abs/2308.05027">Martinkus et al</a>) for full-atom antibody design with frame averaging and discrete diffusion (and with successful wet lab experiments), <strong>DiffMaSIF</strong> (<a href="https://www.mlsb.io/papers_2023/DiffMaSIF_Surface-based_Protein-Protein_Docking_with_Diffusion_Models.pdf">Sverrison, Akdel, et al</a>) and <strong>DiffDock-PP</strong> (<a href="https://arxiv.org/abs/2304.03889">Ketata, Laue, Mammadov, Stärk, et al</a>) for protein-protein docking, <strong>DiffPack</strong> (<a href="https://arxiv.org/abs/2306.01794">Zhang, Zhang, et al</a>) for side-chain packing, and the <strong>RFDiffusion</strong> <strong>all-atom</strong> version published by the Baker Lab (<a href="https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1.full">Krishna, Wang, Ahern, et al</a>). Among latent diffusion models (like Stable Diffusion in image generation applications), <strong>GeoLDM</strong> (<a href="https://arxiv.org/abs/2305.01140">Xu et al</a>) was the first for 3D molecule conformations, followed by <a href="https://openreview.net/forum?id=DP4NkPZOpD">OmniProt</a> for protein sequence-structure generation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gzs5s5iJug9SXAyF" /><figcaption>FrameDiff: parameterization of the backbone frame with rotation, translation, and torsion angle for the oxygen atom. 
Figure Source: <a href="https://arxiv.org/abs/2302.02277">Yim, Trippe, De Bortoli, Mathieu, et al</a></figcaption></figure><p>Finally, Google DeepMind and Isomorphic Labs <a href="https://www.isomorphiclabs.com/articles/a-glimpse-of-the-next-generation-of-alphafold">announced</a> <strong>AlphaFold 2.3</strong> — the latest iteration is significantly improving upon the baselines in 3 tasks: docking benchmarks (almost 2× better than DiffDock on the new <a href="https://arxiv.org/abs/2308.05777">PoseBusters</a> benchmark), protein-nucleic acid interactions, and antibody-antigen prediction.</p><p><strong><em>Chaitanya Joshi (Cambridge)</em></strong></p><p>💡There have been two emerging trends for biomolecular modeling and design that I am very excited about in 2023:</p><p>1️⃣ Going from protein structure prediction to conformational ensemble generation. There were several interesting approaches to the problem, including <a href="https://www.nature.com/articles/s41586-023-06832-9">AlphaFold with MSA clustering</a>, <a href="https://www.nature.com/articles/s41467-023-36443-x">idpGAN</a>, <a href="https://arxiv.org/abs/2306.05445">Distributional Graphormer</a> (a diffusion model), and <a href="https://www.mlsb.io/papers_2023/AlphaFold_Meets_Flow_Matching_for_Generating_Protein_Ensembles.pdf">AlphaFold Meets Flow Matching for Generating Protein Ensembles</a>.</p><p>2️⃣ Modelling of biomolecular complexes and design of biomolecular interactions among proteins + X: <a href="https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1.full">RFdiffusion all-atom</a> and <a href="https://www.biorxiv.org/content/10.1101/2023.12.22.573103v1.full">Ligand MPNN</a>, both from the Baker Lab, are representative examples of the trend towards designing interactions. 
The new in-development <a href="https://www.isomorphiclabs.com/articles/a-glimpse-of-the-next-generation-of-alphafold">AlphaFold report</a> claims that a unified structure prediction model can outperform or match specialised models across solo protein and protein complex structure prediction as well as protein-ligand and protein-nucleic acid co-folding.</p><blockquote>“However, for all the exciting methodology development in biomolecular modelling and design, perhaps the biggest lesson for the ML community this year should be to focus more on meaningful <strong>in-silico evaluation</strong> and, if possible, <strong>experimental validation</strong>.” — <strong>Chaitanya Joshi </strong>(Cambridge)</blockquote><p>1️⃣ In early 2023, Guolin Ke’s team at DP Technology released two excellent re-evaluation papers highlighting how we may have been largely overestimating the performance of prominent geometric deep learning-based methods for molecular <a href="https://arxiv.org/abs/2302.07061">conformation generation</a> and <a href="https://arxiv.org/abs/2302.07134">docking</a> w.r.t. traditional baselines.</p><p>2️⃣ <a href="https://arxiv.org/abs/2308.07413">PoseCheck</a> and <a href="https://arxiv.org/abs/2308.05777">PoseBusters</a> shed further light on the failure modes of current molecular generation and docking methods. Critically, generated molecules and their 3D poses are often ‘nonphysical’ and contain steric clashes, hydrogen placement issues, and high strain energies.</p><p>3️⃣ Very few papers attempt any experimental validation of new ML ideas. 
Perhaps collaborating with a wet lab is challenging for those focussed on new methodology development, but I hope that we ML-ers, as a community, will at least be a lot more cautious about the in-silico evaluation metrics we are constantly pushing as we create new models.</p><p><strong><em>Hannes Stärk (MIT)</em></strong></p><p>💡I am reading quite some hype here about Flow Matching, stochastic interpolants, and Rectified Flows (I will call them “Bridge Matching,” or “BM”). I do not think there is much value in just replacing diffusion models with BM in all the existing applications. For pure generative modeling, the main BM advantage is simplicity.</p><p>I think we should instead be excited about BM for the new capabilities it unlocks. For example, training bridges between arbitrary distributions in a simulation-free manner (what are the best applications for this? I basically only saw <a href="https://arxiv.org/abs/2308.16212">retrosynthesis</a> so far.) or solving OT problems as in <a href="https://arxiv.org/abs/2303.16852">DSBM</a>, which does so for fluid flow downscaling. A lot of tools emerged in 2023 (also let us mention <a href="https://arxiv.org/abs/2310.03695">BM with multiple marginals</a>); maybe in 2024, the community will make good use of them?</p><p><strong><em>Joey Bose (Mila &amp; Dreamfold)</em></strong></p><p>💡 This year we have really seen the rise of geometric generative models from theory to practice. A few standouts for me include <a href="https://arxiv.org/abs/2302.03660">Riemannian Flow Matching</a> — in general any paper by Ricky Chen and Yaron Lipman on these topics is a must-read — and FrameDiff from <a href="https://arxiv.org/abs/2302.02277">Yim et al.</a>, which introduced a lot of the important machinery for protein backbone generation. 
Of course, standing on the shoulders of both RFM and FrameDiff, we built <a href="https://arxiv.org/abs/2310.02391">FoldFlow</a>, a cooler flow-matching approach to protein generative models.</p><blockquote>“Looking ahead, I foresee a lot <strong>more flow matching</strong>-based approaches coming into use. They are better for proteins and longer sequences and can start from any source distribution.” — Joey Bose (Mila &amp; Dreamfold)</blockquote><p>🔮 Moreover, I suspect we will soon see <strong>multi-modal generative models</strong> in this space, such as discrete + continuous models and also conditional models in the same vein as text-conditioned diffusion models for images. Perhaps, we might even see <strong>latent generative models</strong> here given that they scale so well!</p><p><strong><em>Minkai Xu (Stanford)</em></strong></p><blockquote>“This year, the community has further pushed forward the geometric generative models for 3D molecular generation in many perspectives.” — Minkai Xu (Stanford)</blockquote><p><strong>Flow matching</strong>: Ricky and Yaron proposed the Flow Matching method as an alternative to the widely used diffusion models, and EquiFM (<a href="https://openreview.net/forum?id=hHUZ5V9XFu">Song et al</a> and <a href="https://arxiv.org/abs/2306.15030">Klein et al</a>) realizes the variant for 3D molecule generation by parameterizing the flow dynamics with equivariant GNNs. In the meantime, <a href="https://arxiv.org/pdf/2310.05297.pdf">FrameFlow</a> and <a href="https://arxiv.org/abs/2310.02391">FoldFlow</a> construct FM models for protein generation.</p><p>🔮 Moving forward, similar to the vision and text domains, people have begun to explore generation in the lower-dimensional latent space instead of the complex original data space (<strong>latent generative models</strong>). 
GeoLDM (<a href="https://arxiv.org/abs/2305.01140">Xu et al</a>) proposed the first latent diffusion model (like Stable Diffusion in CV) for 3D molecule generation, while <a href="https://arxiv.org/abs/2305.04120">Fu et al</a> adopt a similar formulation for large protein generation.</p><h3>A Structural Biologist’s Perspective</h3><p><em>Bruno Correia (EPFL)</em></p><blockquote>“Current generative models still create “garbage” outputs that violate many of the physical and chemical properties that molecules are known to have. The advantage of current generative models is, of course, their speed, which affords them the possibility of generating many samples and brings front and center the ability to filter the best generated samples, which in the case of protein design has benefited immensely from the transformative development of AlphaFold2.” — Bruno Correia (EPFL)</blockquote><p>➡️ The next challenge for the community will perhaps be how to infuse generative models with <strong>meaningful physical and chemical priors</strong> to enhance sampling performance and generalization. Interestingly, we have not seen the same remarkable advances (experimentally validated) in applications to small molecule design, which we hope to see during 2024.</p><p>➡️ <strong>The rise of multimodal models.</strong> Generally, in biology-related tasks data sparsity is a given, and as such, strategies to extract the most signal out of the data are essential. One way to try to overcome such limitations is to improve the expressiveness of the data representations and perhaps obtain more performant neural networks this way. Likely in the short term, we will be able to explore architectures that encompass several types of representations of the objects of interest and harness the best predictions for the ever more complex tasks we are facing as progressively more of the basic problems get solved.
This notion of multimodality is of course intimately related to the overall aim of having models with stronger priors that, in a generative context, honour fundamental constraints of the objects of interest.</p><p>➡️ <strong>The models that know everything</strong>. As the power of machine learning models improves, we clearly tend towards more multi-objective optimization when attempting to solve real-life problems. Taking small molecule generation as an example: from a biochemical perspective, the drug design problem starts with a target to which a small molecule binds; therefore, one of the first and most important constraints is that the generative process ought to be conditioned on the protein pocket. However, such a constraint may not be enough to create real small molecules, as many such chemicals are simply impossible or very hard to synthesize; therefore, a model that has notions of chemical synthesizability and can integrate such constraints into the search space would be much more useful.</p><p>➡️ <strong>From chemotype to phenotype</strong>. In terms of data representation, atomic graph structures together with vector embeddings have achieved remarkable results, particularly in the search for new antibiotics. Making accurate predictions of which chemical structures have antimicrobial activity is, broadly speaking, an exercise in phenotype prediction from chemical structure.
Due to the simplicity of the approaches used and the impressive results obtained, one would expect that more sophisticated data representations on the molecule side, perhaps combined with richer phenotype assignment, could make critical contributions to such an important problem in drug development.</p><h3>Industrial perspective</h3><p><strong><em>Luca Naef (VantAI)</em></strong></p><p>🔥 <em>What are the biggest advancements in the field you noticed in 2023?</em></p><p>1️⃣ <strong>Increasing multi-modality &amp; modularity </strong>— as shown by the emergence of initial co-folding methods for both proteins &amp; small molecules, diffusion and non-diffusion-based, extending AF2’s success: <a href="https://www.biorxiv.org/content/10.1101/2022.12.20.521309v1.full.pdf">DiffusionProteinLigand</a> in the last days of 2022 and <a href="https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1">RFDiffusion</a>, <a href="https://www.isomorphiclabs.com/articles/a-glimpse-of-the-next-generation-of-alphafold">AlphaFold2</a> and <a href="https://www.biorxiv.org/content/10.1101/2023.11.03.565471v1">Umol</a> by the end of 2023. We are also seeing models that have sequence &amp; structure co-trained: <a href="https://www.biorxiv.org/content/10.1101/2023.10.01.560349v2">SAProt</a>, <a href="https://www.biorxiv.org/content/10.1101/2023.07.23.550085v1">ProstT5</a>, and sequence, structure &amp; surface co-trained with <a href="https://www.mlsb.io/papers_2023/Pre-training_Sequence_Structure_and_Surface_Features_for_Comprehensive_Protein_Representation_Learning.pdf">ProteinINR</a>.
There is a general revival of surface-based methods after a quieter 2021 and 2022: <a href="https://www.mlsb.io/papers_2023/DiffMaSIF_Surface-based_Protein-Protein_Docking_with_Diffusion_Models.pdf">DiffMasif</a>, <a href="https://arxiv.org/abs/2311.17050">SurfDock</a>, and <a href="https://www.biorxiv.org/content/10.1101/2023.12.03.567710v1">ShapeProt</a>.</p><p>2️⃣ <strong>Datasets and benchmarks</strong>. Datasets, especially synthetic/computationally derived: <a href="https://academic.oup.com/nar/article/52/D1/D384/7438909">ATLAS</a> and the <a href="https://mddbr.eu/">MDDB</a> for protein dynamics. <a href="https://www.biorxiv.org/content/10.1101/2023.05.24.542082v1">MISATO</a>, <a href="https://www.nature.com/articles/s41597-022-01882-6">SPICE</a>, <a href="https://www.nature.com/articles/s41597-023-02443-1">Splinter</a> for protein-ligand complexes, <a href="https://arxiv.org/abs/2311.01135">QM1B</a> for molecular properties. PINDER: large protein-protein docking dataset with matched apo/predicted pairs and benchmark suite with retrained docking models. <a href="https://chanzuckerberg.github.io/cryoet-data-portal/index.html#">CryoET data portal</a> for CryoET. And a whole host of welcome benchmarks: PINDER, <a href="https://arxiv.org/abs/2308.05777">PoseBusters</a>, and <a href="https://arxiv.org/abs/2308.07413">PoseCheck</a>, with a focus on more rigorous and practically relevant settings.</p><p>3️⃣ <strong>Creative pre-training strategies</strong> to get around the sparsity of diverse protein-ligand complexes. Van-der-mers training (<a href="https://openreview.net/forum?id=UfBIxpTK10">DockGen</a>) &amp; sidechain training strategies in <a href="https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1">RF-AA</a> and pre-training on ligand-only complexes in CCD in <a href="https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1">RF-AA</a>. 
Multi-task pre-training as in <a href="https://openreview.net/forum?id=6K2RM6wVqKu">Unimol</a> and others.</p><p>🏋️ <em>What are the open challenges that researchers might overlook?</em></p><p>1️⃣ <strong>Generalization. </strong><a href="https://openreview.net/forum?id=UfBIxpTK10">DockGen</a><strong> </strong>showed that current state-of-the-art protein-ligand docking models completely lose predictive power when asked to generalise to novel protein domains. We see a similar phenomenon in the <a href="https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/a-glimpse-of-the-next-generation-of-alphafold/alphafold_latest_oct2023.pdf">AlphaFold-latest report</a>, where performance on novel proteins &amp; ligands drops heavily to below biophysics-based baselines (which have access to holo structures), despite very generous definitions of novel protein &amp; ligand. This indicates that existing approaches might still largely rely on memorization, an observation that has been extensively argued over the <a href="https://pubs.acs.org/doi/10.1021/acs.jmedchem.2c00487">years</a>.</p><p>2️⃣ <strong>The curse of (simple) baselines. </strong>A recurring topic over the years: 2023 has again shown what industry practitioners have long known. In many practical problems such as molecular generation, property prediction, docking, and conformer prediction, simple baselines or classical approaches often still outperform ML-based approaches in practice. This has been documented increasingly in 2023 by <a href="https://arxiv.org/abs/2310.09267">Tripp et al.</a>, <a href="https://arxiv.org/abs/2302.07134">Yu et al.</a>, and <a href="https://arxiv.org/abs/2302.07061">Zhou et al.</a></p><p>🔮 <em>Predictions for 2024!</em></p><blockquote>“In 2024, data sparsity will remain top of mind and we will see a lot of smart ways to use models to generate synthetic training data.
Self-distillation in AlphaFold2 served as a big inspiration, as does Confidence Bootstrapping in <a href="https://openreview.net/forum?id=UfBIxpTK10">DockGen</a>, which leverages the insight, first realised in <a href="https://www.biorxiv.org/content/10.1101/2022.03.11.484043v1">2022</a>, that we now have sufficiently powerful models that can score poses but not always generate them.” — Luca Naef (VantAI)</blockquote><p>2️⃣ We will see more biological/chemical assays purpose-built for ML or only making sense in a machine learning context (i.e., they might not lead to biological insight by themselves but be primarily useful for training models). An example from 2023 is the large-scale protein folding experiments by <a href="https://www.nature.com/articles/s41586-023-06328-6">Tsuboyama et al.</a> This move might be driven by techbio startups, where we have seen the first foundation models built on such ML-purpose-built assays for structural biology, e.g., <a href="https://www.biorxiv.org/content/10.1101/2023.12.13.571579v1">ATOM-1</a>.</p><p><strong><em>Andreas Loukas (Prescient Design, part of Genentech)</em></strong></p><p>🔥 <em>What are the biggest advancements in the field you noticed in 2023?</em></p><blockquote>“In 2023, we started to see some of the challenges of equivariant generation and representation for proteins being resolved through diffusion models.” — Andreas Loukas (Prescient Design)</blockquote><p>1️⃣ We also noticed a <strong>shift towards approaches that model and generate molecular systems at higher fidelity</strong>.
For instance, the most recent models adopt a fully end-to-end approach by generating backbone, sequence and side-chains jointly (<a href="https://openreview.net/pdf?id=7GyYpomkEa">AbDiffuser</a>, <a href="https://arxiv.org/pdf/2302.00203.pdf">dyMEAN</a>) or at least solve the problem in two steps but with a partially joint model (<a href="https://www.nature.com/articles/s41586-023-06728-8">Chroma</a>); as compared to backbone generation followed by inverse folding as in <a href="https://www.nature.com/articles/s41586-023-06415-8">RFDiffusion</a> and <a href="https://openreview.net/pdf?id=m8OUBymxwv">FrameDiff</a>. Other attempts to improve the modelling fidelity can be found in the latest updates to co-folding tools like <a href="https://www.isomorphiclabs.com/articles/a-glimpse-of-the-next-generation-of-alphafold">AlphaFold2</a> and <a href="https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1">RFDiffusion</a> which render them sensitive to non-protein components (ligands, prosthetic groups, cofactors); as well as in papers that attempt to account for conformational dynamics (see discussion above). In my view, this line of work is essential because the binding behaviour of molecular systems can be very sensitive to how atoms are placed, move, and interact.</p><p>2️⃣ In 2023, many works also attempted to get a handle on <strong>binding affinity</strong> by learning to predict the effect of mutations of a known crystal by pre-training on large corpora, such as computationally predicted mutations (<a href="https://github.com/oxpig/Graphinity">graphinity</a>), and on side-tasks, such as <a href="https://openreview.net/pdf?id=_X9Yl1K2mD">rotamer density estimation</a>. The obtained results are encouraging as they can significantly outperform semi-empirical baselines like Rosetta and FoldX. 
However, there is still significant work to be done to render these models reliable for binding affinity prediction.</p><p>3️⃣ I have further observed a growing recognition of <strong>protein Language Models (pLMs)</strong> and specifically <a href="https://www.science.org/doi/10.1126/science.ade2574">ESM</a> as valuable tools, even among those who primarily favour geometric deep learning. These embeddings are used to help docking models, allow the construction of simple yet competitive predictive models for binding affinity prediction (<a href="https://www.nature.com/articles/s41467-023-39022-2">Li et al 2023</a>), and can generally offer an efficient method to create residue representations for GNNs that are informed by the extensive proteome data without the need for extensive pretraining (<a href="https://www.mlsb.io/papers_2023/Evaluating_Representation_Learning_on_the_Protein_Structure_Universe.pdf">Jamasb et al 2023</a>). However, I do maintain a concern regarding the use of pLMs: it is unclear whether their effectiveness is due to data leakage or genuine generalisation. This is particularly pertinent when evaluating models on tasks like amino-acid recovery in inverse folding and conditional CDR design, where distinguishing between these two factors is crucial.</p><p>🏋️ <em>What are the open challenges that researchers might overlook?</em></p><p>1️⃣ Working with <strong>energetically relaxed crystal structures</strong> (and, even worse, folded structures) can significantly affect the performance of downstream predictive models. This is especially true for the prediction of protein-protein interactions (PPIs). In my experience, the performance of PPI predictors severely deteriorates when they are given a relaxed structure as opposed to the bound (holo) crystallised structure.</p><p>2️⃣ Though successful <em>in silico </em>antibody design has the capacity to revolutionise drug design, <strong>general protein models are not (yet?)
as good at folding, docking or generating antibodies as antibody-specific models are</strong>. This is perhaps due to the low conformational variability of the antibody fold and the distinct binding mode between antibodies and antigens (loop-mediated interactions that can involve a non-negligible entropic component). Perhaps for the same reasons, the <em>de novo</em> design of antibody binders (which I define as 0-shot generation of an antibody that binds to a previously unseen epitope) remains an open problem. Currently, experimentally confirmed cases of <em>de novo</em> binders involve mostly stable proteins, like <a href="https://www.nature.com/articles/s41586-023-06415-8">alpha-helical bundles</a>, that are common in the PDB and harbour interfaces that differ substantially from epitope-paratope interactions.</p><p>3️⃣ <strong>We are still lacking a general-purpose proxy for binding free energy</strong>. The main issue here is the lack of high-quality data of sufficient size and diversity (esp. co-crystal structures). We should therefore be cognizant of the limitations of any such learned proxy in model evaluation: though predicted binding scores that are out of distribution of known binders are a clear signal that something is off, we should avoid the typical pitfall of trying to demonstrate the superiority of our model in an empirical evaluation by showing how it leads to even higher scores.</p><p><strong><em>Dominique Beaini (Valence Labs, part of Recursion)</em></strong></p><blockquote>“I’m excited to see a very large community being built around the problem of drug discovery, and I feel we are on the brink of a new revolution in the speed and efficiency of discovering drugs.” — Dominique Beaini (Valence Labs)</blockquote><p><em>What work got me excited in 2023?</em></p><p>I am confident that machine learning will allow us to tackle rare diseases quickly, stop the next COVID-X pandemic before it can spread, and live longer and healthier.
But there’s a lot of work to be done and there are a lot of challenges ahead, some bumps in the road, and some canyons on the way. Speaking of communities, you can visit the <a href="https://portal.valencelabs.com/">Valence Portal</a> to keep up to date with the 🔥 news in ML for drug discovery.</p><p><em>What are the hard questions for 2024?</em></p><p>⚛️ <strong>A new generation of quantum mechanics.</strong> Machine learning force-fields, often based on equivariant and invariant GNNs, have been promising us a treasure: the precision of density functional theory, but thousands of times faster and at the scale of entire proteins. Although some steps were made in this direction with <a href="https://link.springer.com/chapter/10.1007/978-3-031-32041-5_12">Allegro</a> and <a href="https://arxiv.org/pdf/2401.00096.pdf">MACE-MP</a>, current models do not generalize well to unseen settings and very large molecules, and they are still too slow to be applicable on the timescale that is needed 🐢. For generalization, I believe that bigger and more diverse datasets are the most important stepping stones. For computation time, I believe we will see models that enforce equivariance less strictly, such as <a href="https://arxiv.org/pdf/2305.05577.pdf">FAENet</a>. But efficient sampling methods will play a bigger role: spatial sampling, such as using <a href="https://arxiv.org/abs/2210.01776">DiffDock</a> to get more interesting starting points, and time sampling, such as <a href="https://www.microsoft.com/en-us/research/publication/timewarp-transferable-acceleration-of-molecular-dynamics-by-learning-time-coarsened-dynamics/">TimeWarp</a> to avoid simulating every frame. I’m really excited by the big STEBS 👣 awaiting us in 2024: Spatio-temporal equivariant Boltzmann samplers.</p><p>🕸️ <strong>Everything is connected. Biology is inherently multimodal 🙋🐁 🧫🧬🧪.</strong> One cannot simply decouple the molecule from the rest of the biological system.
Of course, that’s how ML for drug discovery was done in the past: simply build a model of the molecular graph and fit it to experimental data. But we have reached a critical point 🛑, no matter how many trillion parameters the GNN model has, how much data is used to train it, or how many experts are mixtured together. It is time to bring biology into the mix, and the most straightforward way is with multi-modal models. One method is to condition the output of the GNNs on target protein sequences, as in <a href="https://www.biorxiv.org/content/10.1101/2023.09.13.557595v4.abstract">MocFormer</a>. Another is to use microscopy images or transcriptomics to better inform the model of the biological signature of molecules, as in <a href="https://www.biorxiv.org/content/10.1101/2023.11.12.566777v1.full">TranSiGen</a>. Yet another is to use LLMs to embed contextual information about the tasks, as in <a href="https://arxiv.org/pdf/2401.04478.pdf">TwinBooster</a>. Or even better, combining all of these together 🤯, but this could take years. The main issue for the broader community seems to be the availability of large amounts of quality and standardized data, but fortunately, this is not an issue for Valence.</p><p><strong>🔬 Relating biological knowledge and observables. </strong>Humans have been trying to map biology for a long time, building relational maps for genes 🧬, protein-protein interactions 🔄, metabolic pathways 🔀, etc. I invite you to read this <a href="https://academic.oup.com/bib/article/23/6/bbac404/6712301">review of knowledge graphs for drug discovery</a>. But all this knowledge often sits unused and ignored by the ML community. I feel that this is an area where GNNs for knowledge graphs could prove very useful, especially in 2024, and it could provide another modality for the 🕸️ point above. Considering that human knowledge is incomplete, we can instead recover relational maps from foundation models.
This is the route taken by <a href="https://arxiv.org/abs/2309.16064">Phenom1 </a>when trying to recall known genetic relationships. However, having to deal with various knowledge databases is an extremely complex task that we can’t expect most ML scientists to be able to tackle alone. But with the help of artificial assistants like <a href="https://www.valencelabs.com/lowe">LOWE</a>, this can be done in a matter of seconds.</p><p><strong>🏆 Benchmarks, benchmarks, benchmarks.</strong> I can’t repeat the word <strong><em>benchmark</em></strong> enough. Alas, benchmarks will stay the unloved kid on the ML block 🫥. But if the word benchmark is uncool, its cousin <strong><em>competition</em></strong> is way cooler 😎! Just as the <a href="https://ogb.stanford.edu/docs/lsc/">OGB-LSC</a> competition and <a href="https://opencatalystproject.org/challenge.html">Open Catalyst</a> challenge played a major role for the GNN community, it is now time for a new series of competitions 🥇. We even got the <a href="https://tgb.complexdatalab.com/">TGB (Temporal graph benchmark)</a> recently. If you were at NeurIPS’23, then you probably heard of Polaris coming up early 2024 ✨. Polaris is a consortium of multiple pharma and academic groups trying to improve the quality of available molecular benchmarks to better represent real drug discovery. Perhaps we’ll even see a benchmark suitable for molecular graph generation instead of optimizing QED and cLogP, but I wouldn’t hold my breath, I have been waiting for years. What kind of new, crazy competition will light up the GDL community this year 🤔?</p><h3>Systems Biology</h3><p><strong><em>Kexin Huang (Stanford)</em></strong></p><p>Biology is an interconnected, multi-scale, and multi-modal system. Effective modeling of this system can not only unravel fundamental biological questions but also significantly impact therapeutic discovery. 
The most natural data format for encapsulating this system is a relational database or a heterogeneous graph. This graph stores data from decades of wet lab experiments across various biological modalities, scaling up to billions of data points.</p><blockquote>“In 2023, we witnessed a range of innovative applications using GNNs on these biological system graphs. These applications have unlocked new biomedical capabilities and answered critical biological queries.” — Kexin Huang (Stanford)</blockquote><p>1️⃣ One particularly exciting field is <strong>perturbative biology</strong>. Understanding the outcomes of perturbations can lead to advancements in cell reprogramming, target discovery, and synthetic lethality, among others. In 2023, <a href="https://www.nature.com/articles/s41587-023-01905-6">GEARS</a> applied a GNN to gene perturbation relational graphs, predicting outcomes of genetic perturbations that have not been observed before.</p><p>2️⃣ Another cool application concerns <strong>protein representation</strong>. While current protein representations are fixed and static, we recognize that the same protein can exhibit different functions in varying cellular contexts. <a href="https://www.biorxiv.org/content/10.1101/2023.07.18.549602v1">PINNACLE</a> uses a GNN on protein interaction networks to contextualize protein embeddings. This approach has been shown to enhance 3D structure-based protein representations and outperform existing context-free models in identifying therapeutic targets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/804/0*LdQPRd76Wnb9BDeH" /><figcaption>PINNACLE has protein-, cell type-, and tissue-level attention mechanisms that enable the algorithm to generate contextualized representations of proteins, cell types, and tissues in a single unified embedding space.
Source: <a href="https://www.biorxiv.org/content/10.1101/2023.07.18.549602v1">Li et al</a></figcaption></figure><p>3️⃣ GNNs have also played a vital role in <strong>diagnosing rare diseases</strong>. <a href="https://www.medrxiv.org/content/10.1101/2022.12.07.22283238v1">SHEPHERD</a> utilizes a GNN over a massive knowledge graph to encode extensive biological knowledge into the ML model and is shown to facilitate causal gene discovery, identify ‘patients-like-me’ with similar genes or diseases, and provide interpretable insights into novel disease manifestations.</p><p>➡️ Moving beyond predictions, understanding the underlying mechanisms of biological phenomena is crucial. <strong>Graph XAI</strong> applied to system graphs is a natural fit for identifying mechanistic pathways. <a href="https://www.medrxiv.org/content/10.1101/2023.03.19.23287458v2">TxGNN</a>, for example, grounds drug-disease relation predictions in the biological system graph, generating multi-hop interpretable paths. These paths rationalize the potential of a drug in treating a specific disease. TxGNN designed <a href="http://txgnn.org/">visualizations</a> for these interpretations and conducted user studies, demonstrating their effectiveness for decision-making by clinicians and biomedical scientists.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5frNFjOVtUNiUyvE" /><figcaption>A web-based graphical user interface to support clinicians and scientists in exploring and analyzing the predictions and explanations generated by TxGNN. The ‘Control Panel’ allows users to select the disease of interest and view the top-ranked TxGNN predictions for the query disease. The ‘edge threshold’ module enables users to modify the sparsity of the explanation and thereby control the density of the multi-hop paths displayed. The ‘Drug Embedding’ panel allows users to compare the position of a selected drug relative to the entire repurposing candidate library.
The ‘Path Explanation’ panel displays the biological relations that have been identified as crucial for TxGNN’s predictions regarding therapeutic use. Source: <a href="https://www.medrxiv.org/content/10.1101/2023.03.19.23287458v2">Huang, Chandar, et al</a></figcaption></figure><p>➡️ Foundation models in biology have predominantly been unimodal (focused on proteins, molecules, diseases, etc.), primarily due to the scarcity of paired data. <strong>Bridging across modalities</strong> to answer multi-modal queries is an exciting frontier. For example, <a href="https://openreview.net/forum?id=jJCeMiwHdH">BioBridge</a> leverages biological knowledge graphs to learn transformations across unimodal foundation models, enabling multi-modal behaviors.</p><p>🔮 GNNs applied to system graphs have the potential to (1) encode vast biomedical knowledge, (2) bridge biological modalities, (3) provide mechanistic insights, and (4) contextualize biological entities. We anticipate even more groundbreaking applications of GNNs in biology in 2024, addressing some of the most pressing questions in the field.</p><h4><strong>Predictions from the 2023 post</strong></h4><p>(1) performance improvements of diffusion models such as faster sampling and more efficient solvers;<br>✅ yes, with flow matching</p><p>(2) more powerful conditional protein generation models;<br>❌ Chroma and RFDiffusion are still on top</p><p>(3) more successful applications of <a href="https://arxiv.org/abs/2111.09266">Generative Flow Networks</a> to molecules and proteins<br>❌ yet to be seen</p><h3>Materials Science (Crystals)</h3><p><em>Michael Galkin (Intel) and Santiago Miret (Intel)</em></p><p>In 2023, for a short period, the scientific news was all about <a href="https://en.wikipedia.org/wiki/LK-99">LK-99</a> — a supposed room-temperature superconductor created by a Korean team (spoiler: <a href="https://www.nature.com/articles/d41586-023-02585-7">it did not work, as of now</a>).</p><blockquote>This
highlights the huge potential ML has in materials science, where perhaps the biggest progress of the year has happened — we can now say that materials science and materials discovery are first-class citizens in the Geometric DL landscape.</blockquote><p>💡 Geometric DL applied to materials science and discovery saw significant advances across new modelling methods, the creation of new benchmarks and datasets, automated design with generative methods, and the identification of new research questions based on those advances.</p><p>1️⃣ Applications of geometric models as evaluation tools in automated discovery workflows. The <a href="https://github.com/IntelLabs/matsciml">Open MatSci ML Toolkit</a> consolidated all open-sourced crystal structure datasets, leading to 1.5 million data points for ground-state structure calculations that are now easily available for model development. The <a href="https://arxiv.org/abs/2309.05934">authors’ initial results</a> indicate that merging datasets improves performance if done carefully.</p><p>2️⃣ <a href="https://arxiv.org/abs/2308.14920">MatBench Discovery</a> is another good example of this integration of geometric models as evaluation tools for crystal stability: it tests models’ predictions of the <strong>energy above hull</strong> for various crystal structures. The energy above hull is the most reliable approximation of crystal structure stability and also represents an improvement over formation energy or raw energy prediction, which have practical limitations as stability metrics.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/978/1*ryQ3qBUoPRyR4JZ_vuylOQ.png" /><figcaption>Universal potentials are more reliable classifiers because they exit the red triangle earliest. These lines show the rolling MAE on the WBM test set as the energy to the convex hull of the MP training set is varied; lower is better.
The red-highlighted ‘triangle of peril’ shows where the models are most likely to misclassify structures. As long as a model’s rolling MAE remains inside the triangle, its mean error is larger than the distance to the convex hull. If the model’s error for a given prediction happens to point towards the stability threshold at 0 eV from the hull (the plot’s center), its average error will change the stability classification of a material from true positive/negative to false negative/positive. The width of the ‘rolling window’ box indicates the window over which hull distance prediction errors were averaged. Source: <a href="https://arxiv.org/abs/2308.14920">Riebesell et al</a></figcaption></figure><p>3️⃣ In terms of new geometric models for crystal structure prediction, the <strong>Crystal Hamiltonian Graph neural network</strong> (<a href="https://chgnet.lbl.gov/">CHGNet</a>, <a href="https://arxiv.org/abs/2302.14231">Deng et al</a>) is a new GNN trained on static and relaxation trajectories from the Materials Project that shows quite competitive performance compared to prior methods.
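<p>The stability-classification view behind these benchmarks can be made concrete: a structure is typically called stable when its energy above the convex hull is at or below a threshold, and a regression error only flips the label when it exceeds the structure's distance to the hull and points toward the threshold. A minimal sketch (the function name, threshold choice, and toy numbers are illustrative, not taken from MatBench Discovery's code):</p>

```python
import numpy as np

STABILITY_THRESHOLD = 0.0  # eV/atom above the convex hull

def classify_stability(e_hull_true, e_hull_pred, threshold=STABILITY_THRESHOLD):
    """Confusion counts for hull-distance-based stability screening.

    A structure counts as stable when its energy above the convex
    hull is at or below the threshold; predictions are compared to
    ground-truth hull distances to get TP/FP/FN/TN counts.
    """
    true_stable = e_hull_true <= threshold
    pred_stable = e_hull_pred <= threshold
    tp = int(np.sum(true_stable & pred_stable))
    fp = int(np.sum(~true_stable & pred_stable))
    fn = int(np.sum(true_stable & ~pred_stable))
    tn = int(np.sum(~true_stable & ~pred_stable))
    return tp, fp, fn, tn

# toy hull distances (eV/atom): errors on the 2nd and 4th entries are
# larger than those structures' distances to the hull, flipping labels
true_hull = np.array([-0.05, 0.02, 0.30, -0.20])
pred_hull = np.array([-0.01, -0.01, 0.25, 0.10])
tp, fp, fn, tn = classify_stability(true_hull, pred_hull)
print(tp, fp, fn, tn)  # 1 1 1 1
```

<p>This is why a small mean absolute error is not the whole story for discovery: what matters is how often the error crosses the hull, which is exactly the regime the 'triangle of peril' in the figure highlights.</p>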
The development of CHGNet suggests that finding better training objectives will be as important as (if not more important than) the development of new methods as the intersection of materials science and geometric deep learning continues to grow.</p><p>🔥 The other proof points of the further integration of Geometric DL and materials discovery are several massive works by big labs focused on crystal structure discovery with generative methods:</p><p>1️⃣ Google DeepMind released <a href="https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-deep-learning/"><strong>GNoME</strong></a> (Graph Networks for Materials Exploration by <a href="https://www.nature.com/articles/s41586-023-06735-9">Merchant et al</a>) as a successful example of an active learning pipeline for discovering new materials, and <a href="https://unified-materials.github.io/unimat/">UniMat</a> as an<em> ab initio</em> crystal generation model. Similar to the protein world, we see more examples of automated labs for materials science (“lab-in-the-loop”) such as the <a href="https://www.nature.com/articles/s41586-023-06734-w">A-Lab from UC Berkeley</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*AqWfkbgEvL_t02xy" /><figcaption>The active learning loop of GNoME. Source: <a href="https://www.nature.com/articles/s41586-023-06735-9">Merchant et al.</a></figcaption></figure><p>2️⃣ Microsoft Research released <a href="https://www.microsoft.com/en-us/research/blog/mattergen-property-guided-materials-design/">MatterGen</a>, a generative model for unconditional and property-guided materials design, and <a href="https://distributionalgraphormer.github.io/">Distributional Graphormer</a>, a generative model trained to recover the equilibrium energy distribution of a molecule/protein/crystal.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7Pq4uOFCOICQHbEg" /><figcaption>Unconditional and conditional generation of MatterGen.
Source: <a href="https://arxiv.org/abs/2312.03687">Zeni, Pinsler, Zügner, Fowler, Horton, et al.</a></figcaption></figure><p>3️⃣ Meta AI and CMU released the<a href="https://open-catalyst.metademolab.com/"> Open Catalyst Demo</a> where you can play around with relaxations (DFT approximations) of 11.5k catalyst materials on 86 adsorbates in 100 different configurations each (making it up to 100M combinations). The demo is powered by SOTA geometric models GemNet-OC and Equiformer-V2.</p><p><strong><em>Santiago Miret (Intel)</em></strong></p><p>While those works represent large-scale deployments of generative methods, there is also new work on using reinforcement learning (<a href="https://openreview.net/forum?id=VbjD8w2ctG">Govindarajan et al.</a>, <a href="https://openreview.net/forum?id=MNfVMjsL7S">Lacombe et al.</a>) and GFlowNets (<a href="https://openreview.net/forum?id=l167FjdPOv">Mistal et al.</a>, <a href="https://openreview.net/forum?id=dJuDv4MKLE">Nguyen et al.</a>) with geometric DL for crystal structure discovery as highlighted in the <a href="https://sites.google.com/view/ai4mat">AI for Accelerated Materials Design (AI4Mat)</a> workshop at NeurIPS’23. AI4Mat-2023 itself saw rapid expansion in participation with a 2× increase in the number of submitted and accepted papers and almost tripling in the number of attendees.</p><p>💡 Geometric DL and GNNs continue to be a major part of AI4Mat’s research content as we saw increased application of methods not only for property prediction but also for improving <strong>chemical synthesis</strong> and <strong>material characterization</strong>. 
One such promising example highlighted in the AI4Mat-2023 workshop is <strong>KREED</strong> (<a href="https://openreview.net/forum?id=jlZrTCccAb">Cheng, Lo, et al</a>), which uses equivariant diffusion to predict 3D structures of molecules based on incomplete information that can be obtained from real laboratory machines.</p><blockquote>“Given the importance of structural data in material characterization, the discussions at AI4Mat highlighted the opportunities for Geometric DL to enter the space of real-world materials modelling in addition to their continued successes in simulations including ML-based potentials.” — Santiago Miret (Intel)</blockquote><p>🔮 In 2024, I expect to see multiple developments:</p><p>1️⃣ More discovery architectures and workflows that directly integrate geometric models such as M3GNet, CHGNet, and MACE.</p><p>2️⃣ Geometric models might also see increased competition from text-based representations and LLMs as <a href="https://openreview.net/forum?id=0r5DE2ZSwJ">new methods are being proposed</a> that directly generate CIF files.</p><p>3️⃣ More deployment of geometric models and GNNs into real-world experimental data, likely in materials characterization such as KREED, which will run into regimes with less data than simulation-based modeling.</p><h3>Molecular Dynamics &amp; ML Potentials</h3><p><em>Michael Galkin (Intel), Leon Klein (FU Berlin), N M Anoop Krishnan (IIT Delhi), Santiago Miret (Intel)</em></p><blockquote>One of the pronounced trends of 2023 is the move towards foundation models for ML potentials that work on a variety of compounds, from small molecules to periodic crystals</blockquote><p>For example, <strong>JMP</strong> (<a href="https://arxiv.org/abs/2310.16802">Shoghi et al</a>) from FAIR and CMU, <strong>DPA-2</strong> (<a href="https://arxiv.org/abs/2312.15492">Zhang, Liu, et al</a>) from a large collaboration of Chinese institutions, and <strong>MACE-MP-0</strong> (<a href="https://arxiv.org/abs/2401.00096">Batatia
et al</a>) from a collaboration led by Cambridge. Practically, those are geometric GNNs pre-trained in a multi-task mode to predict the energy (or forces) of a certain atomic structure. Another notable mention goes to <strong>Equiformer V2</strong> (<a href="https://arxiv.org/abs/2306.12059">Liao et al</a>) as a strong equivariant transformer that holds SOTA in many tasks including the recent <a href="https://opencatalystproject.org/challenge.html">OpenCatalyst 2023 Challenge</a> and <a href="https://open-dac.github.io/index.html">OpenDAC</a> (Direct Air Capture) challenge.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pbsN3z6DZbJuFfKc5-eTrg.png" /><figcaption>A foundation model for materials modelling. Trained only on Materials Project data which consists primarily of inorganic crystals and is skewed heavily towards oxides, MACE-MP-0 is capable of molecular dynamics simulation across a wide variety of chemistries in the solid, liquid and gaseous phases. Source: <a href="https://arxiv.org/abs/2401.00096">Batatia et al</a></figcaption></figure><p>⚛️ A common use case for ML potentials is molecular dynamics (MD), which aims to simulate a certain structure over a span of nanoseconds (10⁻⁹ s) to seconds. The main problem is that the fundamental timestep in classical methods is a femtosecond (10⁻¹⁵ s), that is, you’d need at least 1 million steps to simulate a nanosecond, and that’s expensive. Modern ML-based methods for MD aim to speed it up by applying coarse-graining and other approximation tricks that accelerate simulations by large margins (30–1000x). <a href="https://openreview.net/forum?id=y8RZoPjEUl">Fu, Xie, et al</a> (TMLR’23) apply coarse-graining to atomic structures and run a GNN over smaller graphs to predict the next-step position. Experimentally, the method brings 1,000–10,000x speedups compared to classical methods.
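</p><p>Conceptually, all of these potentials map atomic positions to a scalar energy whose negative gradient supplies the forces for each MD step. A minimal pure-Python sketch with a made-up pairwise Gaussian energy standing in for the pre-trained GNN (every function and parameter here is hypothetical, not taken from JMP, DPA-2, or MACE):</p>

```python
import math

# Toy stand-in for a learned potential: a pairwise Gaussian energy.
# A real ML potential replaces this with a pre-trained geometric GNN.
def energy(positions, a=1.0, sigma=1.0):
    e = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r2 = sum((positions[i][k] - positions[j][k]) ** 2 for k in range(3))
            e += a * math.exp(-r2 / (2 * sigma ** 2))
    return e

# Forces are the negative gradient of the predicted energy; central finite
# differences here, automatic differentiation in practice.
def forces(positions, eps=1e-5):
    f = []
    for i in range(len(positions)):
        row = []
        for k in range(3):
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][k] += eps
            minus[i][k] -= eps
            row.append(-(energy(plus) - energy(minus)) / (2 * eps))
        f.append(row)
    return f

# One velocity-Verlet MD step (unit masses, arbitrary units).
def verlet_step(pos, vel, dt=0.01):
    f0 = forces(pos)
    new_pos = [[p[k] + dt * v[k] + 0.5 * dt * dt * f0[i][k] for k in range(3)]
               for i, (p, v) in enumerate(zip(pos, vel))]
    f1 = forces(new_pos)
    new_vel = [[v[k] + 0.5 * dt * (f0[i][k] + f1[i][k]) for k in range(3)]
               for i, v in enumerate(vel)]
    return new_pos, new_vel
```

<p>Swapping the toy <code>energy</code> for a pre-trained model is what turns such a loop into ML-accelerated MD; the million-step cost above is exactly why learned potentials and coarse-graining matter.</p><p>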
<strong>Timewarp</strong> (<a href="https://arxiv.org/abs/2302.01170">Klein, Foong, Fjelde, Mlodozeniec, et al</a>, NeurIPS’23) can simulate large timesteps (10⁵–10⁶ femtoseconds) in a single forward pass by using a conditional normalizing flow model that approximates a distribution of next-step positions. A trained model is used with MCMC sampling and delivers ~33x speedups.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KfzwdxDqrRUSMK5BIeKmqA.png" /><figcaption>(a) Initial state x(t) (Left) and accepted proposal state x(t+τ) (Right) sampled with Timewarp for the dipeptide HT (unseen during training). (b) TICA projections of simulation trajectories, showing transitions between metastable states, for a short MD simulation (Left) and Timewarp MCMC (Right), both run for 30 minutes of wall-clock time. Timewarp MCMC achieves a speed-up factor of ≈ 33 over MD in terms of effective sample size per second. Source: <a href="https://arxiv.org/abs/2302.01170">Klein, Foong, Fjelde, Mlodozeniec, et al</a></figcaption></figure><p><strong><em>Santiago Miret (Intel)</em></strong></p><p>💡 As the deployment of geometric models has seen greater success in property modelling, researchers have pushed the state-of-the-art by testing these models in real-world molecular dynamics simulations. The first work to highlight issues with training models on energy and forces alone was <a href="https://openreview.net/forum?id=A8pqQipwkt">Forces Are Not Enough</a>, published in TMLR in early 2023.
Nevertheless, advances in neighborhood-based methods such as <a href="https://arxiv.org/abs/2204.05249">Allegro</a> led to the successful deployment of large-scale simulations using geometric deep learning models, including a <a href="https://www.hpcwire.com/off-the-wire/sc23-spotlight-gordon-bell-prize-2023-finalists-showcase-diverse-supercomputing-applications/">nomination for the Gordon Bell Prize</a>.</p><blockquote>“Much work still remains in ensuring successful, generalised deployment of machine learning potentials across a variety of physical and chemical phenomena.” — Santiago Miret (Intel)</blockquote><p>➡️ <a href="https://arxiv.org/abs/2310.02428">EGraFFBench</a> highlights some new challenges, such as generalisation across temperatures and materials phase changes (e.g. <em>solid-to-liquid</em>), and proposes new metrics for evaluating the performance of machine learning potentials in real MD simulations. The AI4Mat-2023 workshop also showcased the development of new ML potentials for specialised use cases, such as <a href="https://openreview.net/forum?id=jtAXitX6dh">solid electrolytes for batteries</a>.</p><p><strong><em>Leon Klein (FU Berlin)</em></strong></p><p>💡 A notable constraint in the application of generative models to sample from the equilibrium Boltzmann distribution was the requirement for retraining with each new system, thereby limiting potential advantages over traditional MD simulations. However, recent advancements have seen the emergence of transferable models across various domains. Our contribution, <a href="https://arxiv.org/abs/2302.01170">Timewarp</a>, presents a transferable model capable of proposing large time steps for MD simulations focused on all-atom small-peptide systems.
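</p><p>Timewarp’s accept/reject scheme is, at its core, a Metropolis–Hastings step over the flow’s proposals. A toy sketch on a 1-D double-well energy (the target, proposal, and all constants here are made up; the real method additionally corrects for the asymmetric forward/backward proposal densities of the conditional flow):</p>

```python
import math
import random

random.seed(0)

def energy(x):
    # Toy double-well potential standing in for the molecular energy surface.
    return (x * x - 1.0) ** 2

def propose(x, step=0.8):
    # Stand-in for the conditional flow: a symmetric Gaussian jump,
    # so the Hastings proposal-density correction cancels out.
    return x + random.gauss(0.0, step)

def mh_step(x, kT=0.3):
    # Metropolis acceptance on the Boltzmann factor exp(-dE / kT).
    x_new = propose(x)
    log_alpha = -(energy(x_new) - energy(x)) / kT
    if math.log(random.random()) < log_alpha:
        return x_new, True
    return x, False

def sample(n=20000, x0=1.0):
    xs, x = [], x0
    for _ in range(n):
        x, _ = mh_step(x)
        xs.append(x)
    return xs
```

<p>Because every proposal is filtered through this acceptance test, the chain targets the exact Boltzmann distribution even when the learned proposal is imperfect; that is what makes large learned timesteps safe.</p><p>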
Similarly, <a href="https://arxiv.org/abs/2204.10348">Fu et al.</a> capture the time-coarsened dynamics of coarse-grained polymers, while <a href="https://arxiv.org/abs/2310.18278">Charron et al.</a> excel in learning a transferable force field for coarse-grained proteins.</p><blockquote>“Consequently, this year has demonstrated the feasibility of transferable generative models for MD simulations, showcasing their potential to speed up such simulations.” — Leon Klein (FU Berlin)</blockquote><p>🔮 In 2024, I expect that more tailored GNNs will be used to improve accuracy for the transferable models, with a potential focus on encoding more information about the system. For example, Timewarp, while lacking rotational symmetry in its model, employs data augmentation. Alternatively, rotational symmetry could be incorporated using the recently proposed <a href="https://arxiv.org/abs/2308.10364">SE(3) Equivariant Augmented Coupling Flows</a>. Similarly, <a href="https://arxiv.org/abs/2310.18278">Charron et al.</a> use a SchNet instead of a more complex GNN.</p><p><strong><em>N M Anoop Krishnan (IIT Delhi)</em></strong></p><blockquote>“One of the most exciting developments for the year in the realm of ML potentials is the development of “universal” interatomic potentials that can span almost all the elements of the periodic table.” — N M Anoop Krishnan (IIT Delhi)</blockquote><p>💡 Following M3GNet in 2022, this year witnessed the development of three such models based on CHGNet (<a href="https://www.nature.com/articles/s42256-023-00716-3">Deng et al</a>), NequIP (<a href="https://www.nature.com/articles/s41586-023-06735-9">Merchant et al</a>), and MACE (<a href="https://arxiv.org/abs/2401.00096">Batatia et al</a>).
These models have been used to demonstrate several challenging tasks, including materials discovery (<a href="https://www.nature.com/articles/s41586-023-06735-9">Merchant et al</a>) and a diverse set of MD simulations (<a href="https://arxiv.org/abs/2401.00096">Batatia et al</a>) such as phase transitions, amorphization, chemical reactions, 2D materials modeling, dissolution, defects, and combustion, to name a few. These approaches provide promising results towards the universality of these potentials, thereby allowing one to attack challenging problems including the discovery of crystals from their corresponding amorphous structure (<a href="https://arxiv.org/abs/2310.01117">Aykol et al</a>), a long-standing open problem in materials science.</p><p>🏋️ While these potentials do provide a handle to attack some outstanding problems, challenges remain in understanding the scenarios where these potentials can fail.</p><p><strong>1️⃣ </strong>Testing these potentials to their limits is an important aspect of understanding their capabilities and limitations. Modeling extreme environments such as <strong>high pressure</strong> and <strong>radiation conditions</strong>, simulating complex multicomponent systems such as <strong>glasses or high-entropy alloys</strong>, or simulating <strong>different phases</strong> of systems such as water or silica would all be interesting challenges.</p><p><strong>2️⃣ </strong>While some of these models have been termed “foundation” models, <strong>emergent behavior</strong> associated with FMs <strong>has not been demonstrated</strong> by them. Most of these models simply show extrapolation capability to potentially unseen regions in the phase space or to novel compositions. Developing truly foundational models in terms of emergent properties would be an interesting challenge.</p><p><strong>3️⃣ </strong>A third aspect that has received less attention is the ability of these models to <strong>simulate at scale</strong>.
While <a href="https://arxiv.org/abs/2204.05249">Allegro</a> has demonstrated some capability in terms of the length scales these potentials can achieve, simulating at larger time and length scales with stability, while respecting “universality”, remains an open challenge for these potentials.</p><p>🔮 <strong>What to expect in 2024?</strong></p><p><strong>1️⃣</strong> <strong>Benchmarking suite</strong>: While there exist several benchmarking studies on MD simulations, it is expected that 2024 will witness more formalized efforts in this direction, both in terms of datasets and tasks. A standard set of tasks that can automatically evaluate potentials and place them on leaderboards will enable easy ranking of potentials targeted for downstream tasks on different materials such as metals, polymers, or oxides.</p><p><strong>2️⃣ Model and dataset development</strong>: Further efforts will be made to make ML potentials more compact and efficient in terms of their architectures. Moreover, 2024 will also witness large-scale dataset development that will provide <em>ab initio</em> data for training these potentials.</p><p><strong>3️⃣ Differentiable MD/AIMD</strong>: Further, it is expected that developments in differentiable simulations will become a major avenue for fusing experiments and <em>ab initio</em> simulations towards the automated development of interatomic potentials for targeted applications. This year may also see advances in differentiable AIMD with machine-learned functionals that may allow economical simulations to scale beyond what has been achievable thus far.</p><p><strong>Predictions from the 2023 post</strong></p><p>We expect to see a lot more focus on computational efficiency and scalability of GNNs.
Current GNN-based force-fields are obtaining remarkable accuracy, but are still 2–3 orders of magnitude slower than classical force-fields and are typically only deployed on a few hundred atoms.</p><p>✅ Allegro for the Gordon Bell Prize, large-scale screening with GNoME</p><p>🔮 <strong>What to expect in 2024</strong>:</p><p><strong>1️⃣ </strong>More deployment of ML potentials into large-scale MD simulations that showcase new research opportunities and challenges and provide a better idea of what benefits ML potentials provide compared to traditional potentials.</p><p><strong>2️⃣ </strong>New datasets that outline previously unexplored challenges for ML potentials, such as new materials systems and new physical phenomena for those materials, such as phase changes at various temperatures and pressures.</p><p><strong>3️⃣ </strong>Exploration of multi-scale problems that might draw inspiration from classical techniques.</p><h3>Geometric Generative Models (Manifolds)</h3><p><em>Joey Bose (Mila &amp; Dreamfold) and Alex Tong (Mila &amp; Dreamfold)</em></p><p>While generative ML continued to dominate the field in 2023, an interesting trend of the year was the popularization of geometric generative models that incorporate geometric priors.</p><p><strong><em>Joey Bose (Mila &amp; Dreamfold)</em></strong></p><blockquote>“This year we saw the burgeoning subfield of geometric generative models really take a commanding step forward. With the success of diffusion models and flow matching in images, we saw more fundamental contributions to enable Generative AI for geometric data types.” — Joey Bose (Mila &amp; Dreamfold)</blockquote><p>While diffusion models for manifolds existed before, this year we really saw them being scaled up with <strong>Scaling Riemannian Diffusion Models</strong> by <a href="https://scholar.google.com/citations?view_op=view_citation&amp;hl=en&amp;user=54-actIAAAAJ&amp;sortby=pubdate&amp;citation_for_view=54-actIAAAAJ:_FxGoFyzp5QC">Lou et
al</a> and functional approaches in <strong>Manifold Diffusion Fields</strong> (<a href="https://arxiv.org/abs/2305.15586">Elhag et al.</a>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SRn5B9QlL59Gy86O" /><figcaption>(Left) Visual depiction of a training iteration for a field on the bunny manifold M. (Right) Visual depiction of the sampling process for a field on the bunny manifold. Figure source: <a href="https://arxiv.org/abs/2305.15586">Elhag et al.</a></figcaption></figure><p>For normalizing flow-based methods, <strong>Riemannian Flow Matching</strong> by <a href="https://arxiv.org/abs/2302.03660">Chen and Lipman</a> stands out from the sea of papers as the most general framework for flow matching.</p><p>In general, a large theme of geometric generative models involves handling symmetries. Equivariant approaches shone this year, from SE(3) models including <strong>EDGI</strong> (<a href="https://arxiv.org/abs/2303.12410">Brehmer, Bose et al</a>) and <strong>SE(3) augmented coupling flows</strong> (<a href="https://arxiv.org/abs/2308.10364">Midgley et al</a>), to cool theoretical work on <strong>Geometric neural diffusion processes</strong> (<a href="https://arxiv.org/abs/2307.05431">Mathieu et al</a>) and important physics-based applications in the paper by <a href="https://arxiv.org/abs/2305.02402">Abbott et al</a>.</p><p><strong><em>Alex Tong (Mila &amp; Dreamfold)</em></strong></p><blockquote>“In 2023 we saw advancement both in terms of modelling and the rise of a new application — protein backbone design.
Much work is still needed to understand the properties of the SE(3)<em>ᴺ</em>₀ type of product manifold, where it is still unclear how to best combine modalities” — Alex Tong (Mila &amp; Dreamfold)</blockquote><p>2023 saw new models such as <a href="https://www.biorxiv.org/content/10.1101/2022.12.09.519842v1">RFDiffusion</a>, <a href="https://arxiv.org/abs/2302.02277">FrameDiff</a>, and <a href="https://arxiv.org/abs/2310.02391">FoldFlow</a>, which operate over the SE(3)<em>ᴺ</em>₀ manifold of protein backbones. This presents a new challenge for geometric generative models, in which I think we will see significant progress in the coming year.</p><p>On the modelling side, generative modelling with flow and bridge matching models in Euclidean domains led to a quick succession of Riemannian and equivariant extensions, with Riemannian Flow Matching by <a href="https://arxiv.org/abs/2302.03660">Chen and Lipman</a> and Equivariant Flow Matching (<a href="https://arxiv.org/abs/2306.15030">Klein et al.</a>, <a href="https://arxiv.org/abs/2312.07168">Song et al.</a>) on molecule generation tasks.</p><p>🔮 <strong>What to expect in 2024</strong>:</p><p><strong>1️⃣ </strong>More exploration into modelling the SE(3)<em>ᴺ</em>₀ manifold following successes in protein backbone design.</p><p><strong>2️⃣ </strong>Further investigation and theory of how to train generative models on multimodal and product manifolds.</p><p><strong>3️⃣ </strong>Domain-specific models exploiting features of more specific manifold and equivariant structures.</p><h3>BIG Graphs, Scalability: When GNNs are too expensive</h3><p><strong><em>Anton Tsitsulin (Google)</em></strong></p><p>This year has been fruitful for large graph fans.</p><blockquote>“Learning on Very Large Graphs has always been a challenge due to the unstructured sparsity not being supported by modern accelerators, losing in the <a href="https://hardwarelottery.github.io/">hardware lottery</a>.
<a href="https://cloud.google.com/blog/topics/systems/tpu-v4-enables-performance-energy-and-co2e-efficiency-gains">Tensor Processing Units</a> — you can think about them as very fast GPUs with tons (multi-terabyte) of HBM memory — were the rescue of 2023.” — <strong>Anton Tsitsulin</strong> (Google)</blockquote><p>In a KDD paper (<a href="https://arxiv.org/abs/2307.14490">Mayer et al.</a>), we showed that TPUs can solve large-scale node embedding problems more efficiently than GPU and CPU systems at a fraction of the cost. Many industrial applications of graph machine learning are fully unsupervised; there, it is hard to evaluate embedding quality. We wrote a paper (<a href="https://arxiv.org/abs/2305.16562">Tsitsulin et al.</a>) that performs <strong>unsupervised embedding analysis</strong> at scale.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*d9VT0pu8UHL5gpeE" /><figcaption>Scale of TpuGraphs compared to other graph property prediction datasets. Source: <a href="https://arxiv.org/abs/2308.13490">Phothilimthana et al.</a></figcaption></figure><p>➡️ This year, TPUs helped graph machine learning, so it was time to give back. We released a new <strong>TpuGraphs</strong> dataset (<a href="https://arxiv.org/abs/2308.13490">Phothilimthana et al.</a>) and ran a <a href="https://www.kaggle.com/competitions/predict-ai-model-runtime">Kaggle competition</a> “Google — Fast or Slow? Predict AI Model Runtime” on it that showed <a href="https://blog.research.google/2023/12/advancements-in-machine-learning-for.html">how to improve</a> learning models running on TPUs with graph machine learning. It had 792 Competitors, 616 Teams, and 10,507 Entries. The dataset provides 25x more graphs than the largest graph property prediction dataset (with comparable graph sizes), and 770x larger graphs on average compared to existing performance prediction datasets on machine learning programs. 
This dataset is so large that a new algorithm for graph-level predictions on large-scale graphs had to be developed by <a href="https://arxiv.org/abs/2305.12322">Cao et al</a>.</p><p>➡️ Large-scale graph clustering has seen significant contributions this year. A new approximation algorithm (<a href="https://arxiv.org/abs/2309.17243">Cohen-Addad et al.</a>) was proposed for correlation clustering, improving the approximation factor from 1.994 to a whopping 1.73. <strong>TeraHAC</strong> (<a href="https://arxiv.org/abs/2308.03578">Dhulipala et al</a>) is a major improvement over last year’s <strong>ParHAC</strong> (which we covered in the <a href="https://medium.com/towards-data-science/graph-ml-in-2023-the-state-of-affairs-1ba920cb9232#ca19">2023 post</a>) — an approximate (1+𝝐) hierarchical agglomerative clustering algorithm for trillion-edge graphs. The largest graph used in the experiments is a massive Web-Query graph with 31B nodes and 8.6 trillion edges 👀. Notable mentions also go to the fastest (to date) algorithm for the Euclidean minimum spanning tree (<a href="https://arxiv.org/abs/2308.00503">Jayaram et al</a>) and a new near-linear time algorithm for approximating the Chamfer distance between point sets (<a href="https://arxiv.org/abs/2307.03043">Bakshi et al.</a>).</p><p>🔮 <strong>What to expect in 2024</strong>:</p><p><strong>1️⃣ </strong>Algorithmic advances will help scale other popular graph algorithms.</p><p><strong>2️⃣ </strong>Novel hardware usage will help scale up different graph models.</p><p><strong>Predictions from the 2023 post</strong></p><p>(1) further reduction in compute costs and inference time for very large graphs<br>✅ We observed order-of-magnitude speedups in clustering and node embedding.</p><p>(2) Perhaps models for OGB LSC graphs could run on commodity machines instead of huge clusters?<br>❌ solid no</p><h3>Algorithmic Reasoning &amp; Alignment</h3><p><em>Petar Veličković (Google DeepMind) and Liudmila Prokhorenkova (Yandex
Research)</em></p><p>Algorithmic reasoning, a class of ML techniques able to execute algorithmic computation, has continued to make stable progress during 2023.</p><p><strong><em>Petar Veličković (Google DeepMind)</em></strong></p><blockquote>“2023 has been a year of steady progress for neural algorithmic reasoning models — it indeed remains one of the areas where GNN development gets most creative — probably because it has to be.” — <strong>Petar Veličković </strong>(Google DeepMind)</blockquote><p>Aside from the already discussed <a href="https://openreview.net/forum?id=ba4bbZ4KoF">asynchronous algorithmic alignment</a> work, there are three results we achieved this year that I am personally proudest of:</p><p>1️⃣ <a href="https://openreview.net/forum?id=tRP0Ydz5nN">DAR</a> showed that pre-trained multi-task neural algorithmic reasoners can be scalably deployed to downstream graph problems — even if they are 180,000x larger than the synthetic training distribution of the NAR. What’s more, we set the state-of-the-art in modelling mouse brain vessels 🐁🧠🩸. NAR is <strong>not</strong> a victim of the bitter lesson! 📈</p><p>2️⃣ <a href="https://openreview.net/forum?id=kP2p67F4G7">Hint-ReLIC</a> 🗿 was our response to the rich body of research in <a href="https://openreview.net/forum?id=xkrtvHlp3P">no-hint models</a>. We move away from the issue-ridden <em>hint</em> <em>autoregression</em> and instead model <em>hint invariants</em> using causal reasoning. We obtain a potent hint-based NAR, which still holds state-of-the-art on broad patches of CLRS-30! <em>“Hints can take you a long way, if used in the right way.”</em></p><p>3️⃣ Last but not least, we took the plunge and made the first in-depth analysis of the <a href="https://openreview.net/forum?id=tRP0Ydz5nN">latent space representations of trained NAR models</a>.
What we found was not only immensely beautiful to look at 🌺 but it also taught us a great deal about how these models work.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*c8vWpt6GPQmDvYVvPE_11w.png" /><figcaption>Left: Trajectory-wise PCA of eight clusters of reweighted graphs showing that they all contain a single dominant direction. Different clusters have different colors. Middle: Many embedding clusters with dominant directions overlaid in red. Right: Step-wise PCA of random graphs with the dominant cluster directions overlaid in red. Source: <a href="https://openreview.net/forum?id=tRP0Ydz5nN">Mirjanić, Pascanu, Veličković</a></figcaption></figure><p>Beyond growing our vibrant community, I find it important to state that many of NAR’s foundational ideas are at the crux of important LLM methodologies; to name just one example, hint following is directly related to <a href="https://arxiv.org/abs/2201.11903">chain-of-thought</a> prompting.</p><p>💡 What I am most happy about is that in 2023, this link is getting explicit recognition, and ideas from NAR are now directly or indirectly influencing the most potent AI systems in use today. Indeed, NAR is listed as a key motivation for studying <a href="https://arxiv.org/abs/2310.16028">length generalisation</a>, and more broadly <a href="https://arxiv.org/abs/2301.13105">generalisation on the unseen</a><em> (ICML’23 Best Paper Award)</em>. CLRS-30, the flagship NAR benchmark, is directly used to evaluate capabilities of LLMs in <a href="https://arxiv.org/abs/2302.14838">neural architecture search</a> and <a href="https://arxiv.org/abs/2310.03302">general AI research</a>. And, as a final cherry on top, CLRS-30 is recognised as one of only seven reasoning evaluations used by <a href="https://arxiv.org/abs/2312.11805">Gemini</a>, a frontier large language model from Google DeepMind. 
I am hopeful that this is a beacon of things to come in 2024, and that we will see even more ideas from NAR break into the design of frontier scalable AI models.</p><p><strong><em>Liudmila Prokhorenkova (Yandex Research)</em></strong></p><p>Throughout the year, substantial progress has been achieved on the path towards endowing models with various algorithmic inductive biases: the use of dual problems (<a href="https://arxiv.org/abs/2302.04496">Numeroso et al</a>), contrastive learning techniques (<a href="https://arxiv.org/abs/2302.10258">Bevilacqua et al</a>; <a href="https://arxiv.org/abs/2306.13411">Rodionov et al</a>), augmentation of models with data structures (<a href="https://arxiv.org/abs/2307.00337">Jürß et al</a>; <a href="https://arxiv.org/abs/2307.09660">Jain et al</a>), and in-depth examination of computational models (<a href="https://arxiv.org/abs/2307.04049">Engelmayer et al</a>). Another important direction is evaluating existing models in terms of scalability and data diversity (<a href="https://arxiv.org/abs/2309.12253">Minder et al</a>).</p><blockquote>“In 2024 it would be great to see more comprehensive analysis and understanding of neural reasoners: which operations they learn, how sensitive they are to different shifts in data distributions, what types of mistakes they tend to make and why.” — <strong>Liudmila Prokhorenkova </strong>(Yandex Research)</blockquote><p>Gaining such insights may contribute to the development of even more robust and scalable models.
Furthermore, robust neural reasoners have the potential to positively impact combinatorial optimization models.</p><p><strong>Predictions from the 2023 post</strong></p><p>(1) Algorithmic reasoning tasks are likely to scale to graphs of thousands of nodes and practical applications like code analysis or databases<br>✅ yes, <a href="https://openreview.net/forum?id=tRP0Ydz5nN">DAR</a> scales to the OGB vessel size</p><p>(2) even more algorithms in the benchmark<br>✅ yes, <a href="https://arxiv.org/abs/2309.12253">SALSA-CLRS</a></p><p>(3) most unlikely — there will appear a model capable of solving quickselect<br>❌ still unsolved ;(</p><h3>Knowledge Graphs: Inductive Reasoning is Solved?</h3><p><em>Michael Galkin (Intel) and Zhaocheng Zhu (Mila &amp; Google)</em></p><p>Since its inception in 2011, the grand challenge of KG representation learning has been truly inductive reasoning: a <strong>single</strong> model able to run inference (e.g., missing link prediction) on any graph, without input features and without learning hard-coded entity/relation embedding matrices. <a href="https://arxiv.org/abs/1911.06962">GraIL</a> (ICML’20) and <a href="https://arxiv.org/abs/2106.06935">Neural Bellman-Ford Nets</a> (NeurIPS’21) were instrumental in extending inference to unseen entities, but generalization to both new entities and relation types at inference time remained an unsolved challenge due to the main question: what can be learned and transferred when the whole entity/relation vocabulary can change?</p><p>🔮 Our prediction for 2023 (an inductive model fully transferable to different KGs with new sets of entities and relations, e.g., training on Wikidata and running inference on DBpedia or Freebase) came true in several works:</p><ul><li><a href="https://arxiv.org/abs/2302.01313">Gao et al</a> introduced the concept of double equivariance that forces the neural net to be equivariant to permutations of both node IDs and relation IDs.
The proposed ISDEA++ model employs a <a href="https://arxiv.org/abs/2110.02910">DSS-GNN</a>-like aggregation of a relation-induced subgraph and a subgraph induced by all other relation types.</li><li><a href="https://github.com/DeepGraphLearning/ULTRA">ULTRA</a>, introduced by <a href="https://arxiv.org/abs/2310.04562">Galkin et al</a>, learns the invariance of relation interactions (captured by a graph of relations) and transfers to absolutely any multi-relational graph. ULTRA achieves SOTA results on dozens of transductive and inductive datasets, even in the zero-shot inference setup. Besides, it enables a foundation model-like approach for KG reasoning with generic pre-training, zero-shot inference, and task-specific fine-tuning.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*I7ppzlhNllzuqFmj" /><figcaption>Three main steps taken by ULTRA: (1) building a relation graph; (2) running conditional message passing over the relation graph to get relative relation representations; (3) using those representations for an inductive link predictor GNN on the entity level. Source: <a href="https://arxiv.org/abs/2310.04562">Galkin et al</a></figcaption></figure><p>Learn more about inductive reasoning in the recent blog post:</p><p><a href="https://towardsdatascience.com/ultra-foundation-models-for-knowledge-graph-reasoning-9f8f4a0d7f09">ULTRA: Foundation Models for Knowledge Graph Reasoning</a></p><p>As the grand challenge seems to be solved now, is there anything left for KG research, or should we call it a day, throw a party, and move on?</p><p><strong><em>Michael Galkin (Intel)</em></strong></p><blockquote>“Indeed, with the grand challenge solved, it feels a bit like an existential crisis — everything important is invented, and Graph ML has enabled things that looked impossible just 5 years ago. Perhaps the KG community should re-invent itself and focus on practical problems that can be tackled with graph foundation models.
Otherwise, the subfield would disappear from research radars like Semantic Web” — Michael Galkin (Intel)</blockquote><p>Transductive and shallow KG embeddings are dead, and nobody in 2024 should work on them; it is time to retire them for good. ULTRA-like foundation models can now run on any graph without training, which is a sweet spot for many closed enterprise KGs.</p><p>➡️ The last uncharted territory is inductive reasoning beyond simple link prediction (<a href="https://medium.com/towards-data-science/neural-graph-databases-cc35c9e1d04f">complex database-like logical queries</a>), and I think it will also be solved in 2024. Adding temporal aspects, LLM node features, or scaling GNNs to larger graphs is a question of time and presents more of an engineering task than a research question.</p><p><strong><em>Zhaocheng Zhu (Mila &amp; Google)</em></strong></p><blockquote>“With the rise of LLMs and numerous prompt-based reasoning techniques, it looks like <strong>KG reasoning is coming to an end</strong>. Texts are more expressive and flexible than KGs, and meanwhile they are more available in quantity. However, I don’t think the reasoning techniques that the KG community developed are in vain.” — Zhaocheng Zhu (Mila &amp; Google)</blockquote><p>➡️ We see that many LLM reasoning methods coincide with well-known ideas on KGs. For instance, the difference between direct prompting and chain-of-thought (CoT) shares much of its spirit with embedding methods and path-based methods on KGs, where the latter parameterize smaller steps and thereby generalize better to new combinations of steps. In fact, topics like inductive and multi-step generalization were explored on KGs several years earlier than on LLMs.</p><p>When we develop new techniques for LLMs, it is essential to take a glance at similar goals and solutions on KGs.
In brief, while the modality of KGs <em>may fade at some point</em>, the insights we learned from KG reasoning will continue to be illuminating in the era of LLMs.</p><h3>Temporal Graph Learning</h3><p>Shenyang Huang, Emanuele Rossi, Andrea Cini, Ingo Scholtes, and Michael Galkin prepared a separate overview post on temporal graph learning!</p><p><a href="https://towardsdatascience.com/temporal-graph-learning-in-2024-feaa9371b8e2">Temporal Graph Learning in 2024</a></p><h3>LLMs + Graphs for Scientific Discovery</h3><p><em>Michael Galkin (Intel)</em></p><p>💡LLMs were everywhere in 2023 and it’s hard to miss the 🐘 in the room.</p><blockquote>“We have seen a flurry of approaches trying to marry graphs with LLMs. The subfield is emerging and <strong>making its tiny baby steps</strong> which are important to acknowledge.” — Michael Galkin (Intel)</blockquote><p>We have seen a flurry of approaches trying to marry graphs with LLMs (sometimes literally verbalizing the edges in a text prompt). Straightforward prompting with an edge index does not really work for running graph algorithms with language models, so the crux is in the "text linearization" and proper prompting. Among the notable mentions, you might be interested in <strong>GraphText</strong> by <a href="https://arxiv.org/abs/2310.01089">Zhao et al</a>, which devises a <em>graph syntax tree</em> prompt constructed from features and labels in the ego-subgraph of a target node — GraphText works for node classification. In <strong>Talk Like a Graph</strong> by <a href="https://arxiv.org/abs/2310.04560">Fatemi et al</a>, the authors study graph linearization strategies and how they impact LLM performance on basic tasks like edge existence, node count, or cycle check.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fzM6yf61zDMpGqLE" /><figcaption>Standard GNNs (left) and GraphText (right). GraphText encodes the graph information into text sequences and uses an LLM to perform inference.
The graph-syntax tree contains both node attributes (e.g. feature and label) and relationships (e.g. center-node, 1st-hop, and 2nd-hop). Source: <a href="https://arxiv.org/abs/2310.01089">Zhao et al</a></figcaption></figure><p>➡️ Despite the early stage, three recent surveys already exist (<a href="https://arxiv.org/abs/2311.12399">Li et al</a>, <a href="https://arxiv.org/abs/2312.02783">Jin et al</a>, <a href="https://arxiv.org/abs/2311.16534">Sun et al</a>) covering dozens of prompting approaches for graphs. Generally, it is yet to be seen <strong>whether</strong> <strong>LLMs are an appropriate hammer</strong> 🔨 for a specific <em>graph</em> nail, given the limitations of autoregressive decoding, small context sizes, and the permutation-invariant nature of graph tasks. If you are broadly interested in LLM reasoning, check out <a href="https://towardsdatascience.com/solving-reasoning-problems-with-llms-in-2023-6643bdfd606d">our recent blog post</a> covering the main areas and progress made in 2023.</p><p>➡️ LLMs in applied scientific tasks exhibit more promising, sometimes quite unexpected results: <strong>ChemCrow</strong> 🐦‍⬛ by <a href="https://arxiv.org/abs/2304.05376">Bran, Cox, et al</a> is an LLM agent equipped with tools that can perform tasks in organic chemistry, synthesis, and material design right in natural language (without fancy equivariant GNNs).
For example, with a query “<em>Find and synthesize a thiourea organocatalyst which accelerates a Diels-Alder reaction</em>” ChemCrow devises a sequence of actions starting from a basic SMILES string and ending up with instructions to a synthesis platform.</p><p>Similarly, <a href="https://openreview.net/forum?id=0r5DE2ZSwJ">Gruver et al</a> fine-tuned LLaMA-2 to generate 3D crystal structures as a plain text file with lattice parameters, atomic composition, and 3D coordinates and it is surprisingly competitive with SOTA geometric diffusion models like CDVAE.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ruE6PRORCjm8roCFA5h8fw.png" /><figcaption>Experimental validation. a) Example of the script run by a user to initiate ChemCrow. b) Query and synthesis of a thiourea organocatalyst. c) The IBM Research RoboRXN synthesis platform on which the experiments were executed (pictures reprinted courtesy of International Business Machines Corporation). d) Experimentally validated compounds. Source: <a href="https://arxiv.org/abs/2304.05376">Bran, Cox, et al</a></figcaption></figure><p>🔮 In 2024, scientific applications of LLMs are likely to expand both breadth-wise and depth-wise:</p><p>1️⃣ Reaching out to more AI4Science areas;</p><p>2️⃣ Integration with geometric foundation models (since multi-modality is the main LLM focus for the coming year);</p><p>3️⃣ Hot take: LLMs will solve the <em>quickselect</em> task in the CLRS-30 benchmark before GNNs do 🔥</p><h3>Cool GNN Applications</h3><p><em>Petar Veličković (Google DeepMind)</em></p><p>In my standard deck motivating the use of GNNs to a broader audience, I rely on a usual “arsenal” slide of impactful GNN applications over the years. 
With 2023 being significantly marked by LLM developments, I was wondering — can I meaningfully update this slide, but only using models released this year?</p><blockquote>“It was the middle of the year back then, and already I was in for a nice surprise;<em> I did not have enough space to list all the awesome things done with GNNs!” — </em><strong>Petar Veličković </strong>(Google DeepMind)</blockquote><p>💡 While it might have gone comparatively under the radar, I confidently claim that 2023 was the <strong>most exciting year</strong> for cool GNN applications! The rise of LLMs just made it very clear where the limits of text-based autoregressive models are, and that for most scientific problems coming from Nature, their graph structure cannot be ignored.</p><p>Here’s a handful of my personal favourite landmark results — all published in top-tier venues:</p><ul><li><a href="https://www.science.org/doi/10.1126/science.adi2336">GraphCast</a> provided us a landmark model for medium-range global weather forecasting ⛈️ and with it, more accurate foreshadowing of extreme events such as hurricanes. A highly well-deserved cover of Science!</li><li>In an outstanding development in materials science, <a href="https://www.nature.com/articles/s41586-023-06735-9">GNoME</a> uses a GNN-based model to discover <em>millions </em>of novel crystal structures 💎 — an <em>“order-of-magnitude expansion in stable materials known to humanity”</em>. Published in Nature.</li><li>We’ve been treated to not just <a href="https://www.nature.com/articles/s41589-023-01349-8">one</a>, but <a href="https://www.nature.com/articles/s41586-023-06887-8">two</a> new breakthroughs in antibiotic discovery 💊 using message passing neural networks — the latter being published in Nature!</li><li><a href="https://www.science.org/doi/10.1126/science.ade4401">GNNs can smell</a> 👃 by observing the molecular structure emitting an odour — a result that may well revolutionise many industries, including perfumes! 
Published in Science.</li><li>On the cover of Nature Machine Intelligence, <a href="https://www.nature.com/articles/s42256-023-00684-8">HYFA</a> 🍄 shows how to use hypergraph factorisation to make significant progress in gene expression imputation 🧬!</li><li>Last but not least, particle physics ⚛️ remains a natural stronghold of GNN applications. In this year’s Nature Physics Review, we have been treated to a <a href="https://www.nature.com/articles/s42254-023-00569-0">fascinating survey</a> elucidating the myriad of ways how graph neural networks are deployed for various data analysis tasks at the Large Hadron Collider ⚡.</li></ul><p>⚽ My own humble contribution to the space of GNN applications this year was <a href="https://arxiv.org/abs/2310.10553">TacticAI</a>, the <em>first full AI system giving useful tactical suggestions to (association) football coaches</em>, developed in partnership with our collaborators at Liverpool FC 🔴. TacticAI is capable of both predictive modelling (<em>“what will happen in this tactical scenario?”</em>), retrieving similar tactics, and conditional generative modelling (<em>“how to modify player positions to make a particular outcome happen?”</em>). In my opinion, the most satisfying part of this very fun collaboration was our user study with some of LFC’s top coaching staff — directly illustrating that the outputs of our model will be of use to coaches in their work 🏃.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SjqNMqxXmASGyOnnG7HHtw.png" /><figcaption>A “bird’s eye” overview of TacticAI. (A), how corner kick situations are converted to a graph representation. Each player is treated as a node in a graph, with node, edge and graph features extracted as detailed in the main text. Then, a graph neural network operates over this graph by performing message passing; each node’s representation is updated using the messages sent to it from its neighbouring nodes. (B), how TacticAI processes a given corner kick. 
To ensure that TacticAI’s answers are robust in the face of horizontal or vertical reflections, all possible combinations of reflections are applied to the input corner, and these four views are then fed to the core TacticAI model, where they are able to interact with each other to compute the final player representations — each “internal blue arrow” corresponds to a single message passing layer from (A). Once player representations are computed, they can be used to predict the corner’s receiver, whether a shot has been taken, as well as assistive adjustments to player positions and velocities, which increase or decrease the probability of a shot being taken. Source: <a href="https://arxiv.org/abs/2310.10553">Wang, Veličković, Hennes et al.</a></figcaption></figure><p>This is what I’m all about — AI systems that significantly augment human abilities. I can only hope that, in my home country, Partizan catches on to these methods before Red Star does! 😅</p><p>🔮 What will we see in 2024? Probably more of the same, just accelerated! ⏩</p><h3>Geometric Wall Street Bulletin 💸</h3><p><em>Nathan Benaich (AirStreet Capital)</em><strong><em>, </em></strong><em>Michael Bronstein (Oxford), and Luca Naef (VantAI)</em></p><p>2023 started with BioNTech (mostly known to the broad public for developing mRNA SARS-CoV-2 vaccines) <a href="https://www.instadeep.com/2023/01/biontech-to-acquire-instadeep-to-strengthen-pioneering-position-in-the-field-of-ai-powered-drug-discovery-design-and-development/">announcing the acquisition of InstaDeep</a>, a decade-old British company focused on AI-powered drug discovery, design and development. In May 2023, Recursion <a href="https://ir.recursion.com/news-releases/news-release-details/recursion-enters-agreements-acquire-cyclica-and-valence-bolster">acquired two startups</a>, Cyclica and Valence “to bolster chemistry and generative AI capabilities”. 
Valence ML team is well-known for multiple works in the geometric and graph ML and hosting the <strong>Graphs &amp; Geometry and Molecular Modeling</strong> &amp; <strong>Drug Discovery seminars</strong> on <a href="https://www.youtube.com/@valence_labs">YouTube</a>.</p><p><a href="https://apps.timwhitlock.info/emoji/tables/unicode#emoji-modal">💰</a>Isomorphic Labs started 2024 by announcing small molecule-focused <a href="https://www.isomorphiclabs.com/articles/isomorphic-labs-kicks-off-2024-with-two-pharmaceutical-collaborations">collaborations</a> with Eli Lilly and Novartis with upfront payments of $45M and $37.5M, respectively, with the potential worth of <strong>$3 billion</strong>.</p><p><a href="https://apps.timwhitlock.info/emoji/tables/unicode#emoji-modal">💰</a><a href="https://www.businesswire.com/news/home/20240108659035/en/VantAI-Secures-Renewed-Support-from-Blueprint-Medicines-to-Chart-New-Frontiers-in-Induced-Proximity-Drug-Discovery">VantAI partnered with Blueprint Medicines</a> on innovative proximity modulating therapeutics, including molecular glue and hetero-bifunctional candidates. The deal’s potential worth is $1.25 billion.</p><p><a href="https://apps.timwhitlock.info/emoji/tables/unicode#emoji-modal">💰</a>CHARM Therapeutics raised more funding <a href="https://www.businesswire.com/news/home/20230515005172/en/CHARM-Therapeutics-Receives-Investment-for-Deep-Learning-Enabled-Drug-Discovery-Research-from-NVIDIA">from NVIDIA</a> and <a href="https://www.businesswire.com/news/home/20230320005101/en/CHARM-Therapeutics-Announces-Collaboration-with-Bristol-Myers-Squibb-to-Enable-and-Accelerate-Small-Molecule-Drug-Discovery-Programs">from Bristol Myers Squibb</a> totalling the initial funding round to $70M. 
The company has developed DragonFold, its proprietary algorithm for protein-ligand co-folding.</p><p>💊 Monte Rosa <a href="https://ir.monterosatx.com/news-releases/news-release-details/monte-rosa-therapeutics-announces-interim-pkpd-and-clinical-data">announced a successful</a> Phase 1 study of MRT-2359 (orally bioavailable investigational molecular glue degrader) against MYC-driven tumors like lung cancer and neuroendocrine cancer. Monte Rosa is known to <a href="https://ir.monterosatx.com/static-files/8806793a-99fb-4df8-8eb7-3785b39cf210">use geometric deep learning </a>for proteins (<a href="https://www.nature.com/articles/s41592-019-0666-6">MaSIF</a>).</p><p><strong><em>Nathan Benaich (AirStreet Capital, author of </em></strong><a href="https://www.stateof.ai/"><strong><em>the State of AI Report</em></strong></a><strong><em>)</em></strong></p><blockquote>“I have long been optimistic about the potential of AI-first approaches to design problems in medicine, biotech, and materials science. Graph-based models had a great year in techbio in 2023.” — Nathan Benaich (AirStreet Capital)</blockquote><p><a href="https://www.nature.com/articles/s41586-023-06415-8">RFdiffusion</a> combines diffusion techniques with GNNs to predict protein structures. It denoises blurry or corrupted structures from the Protein Data Bank, while tapping into RoseTTAFold’s prediction capabilities. DeepMind have continued to further develop AlphaFold and build on top of it. Their <a href="https://www.science.org/doi/10.1126/science.adg7492">AlphaMissense </a>uses weak labels, language modeling, and AlphaFold to predict the pathogenicity of 71 million human variants. This is an important achievement, as most amino acid changes from genetic variation have unknown effects.</p><p>Beyond proteins, graph-based models have been improving our understanding of genetics. 
Stanford’s <a href="https://www.nature.com/articles/s41587-023-01905-6.pdf">GEARS</a> system integrates deep learning with a gene interaction knowledge graph to predict gene expression changes from combinatorial perturbations. By leveraging prior data on single and double perturbations, GEARS can predict outcomes for thousands of gene pairs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*A2Ftafm8dTxwfxFV" /><figcaption>GEARS can predict new biologically meaningful phenotypes. (a) Workflow for predicting all pairwise combinatorial perturbation outcomes of a set of genes. (b) Low-dimensional representation of postperturbation gene expression for 102 one-gene perturbations and 128 two-gene perturbations used to train GEARS. A random selection is labeled. (c) GEARS predicts postperturbation gene expression for all 5,151 pairwise combinations of the 102 single genes seen experimentally perturbed. Predicted postperturbation phenotypes (non-black symbols) are often different from phenotypes seen experimentally (black symbols). Colors indicate Leiden clusters labeled using marker gene expression. Source: <a href="https://www.nature.com/articles/s41587-023-01905-6">Roohani et al</a></figcaption></figure><p>🔮 In 2024, I put hope in two different developments.</p><p><strong>1️⃣</strong> We have seen the first two CRISPR-Cas9 therapies approved in the US and the UK. These genome editors were discovered through sequencing and random experimentation. 
I am excited about the use of AI models to design and create bespoke editors on demand.</p><p><strong>2️⃣ </strong>We have started to see multimodality come to the AI bio world — combining DNA, RNA, protein, cellular, and imaging data to give us a more holistic understanding of biology.</p><p><strong>Companies to watch in 2024</strong></p><ul><li><a href="https://www.profluent.bio/">Profluent</a> — LLMs for protein design</li><li><a href="https://inceptive.life/">Inceptive.bio</a> — founded by one of the authors of the Transformers paper.</li><li><a href="https://www.envedabio.com/">Enveda Biosciences</a></li><li><a href="https://orbitalmaterials.com/">Orbital Materials</a></li><li><a href="https://kumo.ai/">Kumo.AI</a></li><li><a href="https://www.vant.ai/">VantAI</a> — we are biased (Michael Bronstein is Vant’s Chief Scientist and Luca Naef is a founder and CTO), but this is a cool company focused on the rational design of molecular glues using a combination of ML and proprietary experimental technology, which we believe to be the right combination for success.</li><li><a href="https://www.futurehouse.org/articles/announcing-future-house">Future House</a> — a new Silicon Valley-based non-profit company in the AI4Science space funded by ex-Google CEO Eric Schmidt. Head of Science is Andrew White, known for his works on LLMs for chemistry. 
The self-described mission of the company is a “moonshot to build an AI scientist.”</li></ul><p><em>For additional articles about geometric and graph deep learning, see </em><a href="https://medium.com/@mgalkin"><em>Michael Galkin</em></a><em>’s and </em><a href="https://medium.com/@michael-bronstein"><em>Michael Bronstein</em></a><em>’s Medium posts and follow the two Michaels (</em><a href="https://twitter.com/michael_galkin"><em>Galkin</em></a><em> and </em><a href="https://twitter.com/mmbronstein"><em>Bronstein</em></a><em>) on Twitter.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1ed786f7bf63" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/graph-geometric-ml-in-2024-where-we-are-and-whats-next-part-ii-applications-1ed786f7bf63">Graph &amp; Geometric ML in 2024: Where We Are and What’s Next (Part II — Applications)</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph & Geometric ML in 2024: Where We Are and What’s Next (Part I — Theory & Architectures)]]></title>
            <link>https://medium.com/data-science/graph-geometric-ml-in-2024-where-we-are-and-whats-next-part-i-theory-architectures-3af5d38376e1?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/3af5d38376e1</guid>
            <category><![CDATA[graph-machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[geometric-deep-learning]]></category>
            <category><![CDATA[deep-dives]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Tue, 16 Jan 2024 00:02:09 GMT</pubDate>
            <atom:updated>2024-01-16T09:26:24.527Z</atom:updated>
            <content:encoded><![CDATA[<h4>State-of-the-Art Digest</h4><h3>Graph &amp; Geometric ML in 2024: Where We Are and What’s Next (Part I — Theory &amp; Architectures)</h3><h4>Following the tradition from previous years, we interviewed a cohort of distinguished and prolific academic and industrial experts in an attempt to summarise the highlights of the past year and predict what is in store for 2024. Past 2023 was so ripe with results that we had to break this post into two parts. This is Part I focusing on theory &amp; new architectures, see also <a href="https://medium.com/towards-data-science/graph-geometric-ml-in-2024-where-we-are-and-whats-next-part-ii-applications-1ed786f7bf63">Part II</a> on applications.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Lz_A1l6i036AtJ-FBFOe2w.png" /><figcaption>Image by Authors with some help from DALL-E 3.</figcaption></figure><p><em>The post is written and edited by </em><a href="https://twitter.com/michael_galkin"><em>Michael Galkin</em></a><em> and </em><a href="https://twitter.com/mmbronstein"><em>Michael Bronstein</em></a><em> with significant contributions from </em><a href="https://twitter.com/jo_brandstetter"><em>Johannes Brandstetter</em></a><em>, </em><a href="https://twitter.com/ismaililkanc/"><em>İsmail İlkan Ceylan</em></a><em>, </em><a href="https://twitter.com/Francesco_dgv"><em>Francesco Di Giovanni</em></a><em>, </em><a href="https://twitter.com/benfinkelshtein"><em>Ben Finkelshtein</em></a><em>, </em><a href="https://twitter.com/KexinHuang5"><em>Kexin Huang</em></a><em>, </em><a href="https://twitter.com/chaitjo"><em>Chaitanya Joshi</em></a><em>, </em><a href="https://twitter.com/WillLin1028"><em>Chen Lin</em></a><em>, </em><a href="https://twitter.com/chrsmrrs"><em>Christopher Morris</em></a><em>, </em><a href="https://twitter.com/mathildepapillo"><em>Mathilde Papillon</em></a><em>, </em><a href="https://twitter.com/LProkhorenkova"><em>Liudmila Prokhorenkova</em></a><em>, 
</em><a href="https://twitter.com/Pseudomanifold"><em>Bastian Rieck</em></a><em>, </em><a href="https://twitter.com/djjruhe"><em>David Ruhe</em></a><em>, </em><a href="https://twitter.com/HannesStaerk"><em>Hannes Stärk</em></a><em>, and </em><a href="https://twitter.com/PetarV_93"><em>Petar Veličković</em></a><em>.</em></p><ol><li><a href="#79aa">Theory of Graph Neural Networks</a><br>1. <a href="#5903">Message passing neural networks and Graph Transformers</a><br>2. <a href="#a6d7">Graph components, biconnectivity &amp; planarity</a><br>3. <a href="#27e6">Aggregation functions &amp; uniform expressivity</a> <br>4. <a href="#645f">Convergence &amp; zero-one laws of GNNs</a><br>5.<a href="#c8ac"> Descriptive complexity of GNNs</a><br>6. <a href="#9b59">Fine-grained expressivity of GNNs</a><br>7. <a href="#06c2">Expressivity results for Subgraph GNNs</a><br>8. <a href="#ab19">Expressivity for Link Prediction and Knowledge Graphs</a><br>9. <a href="#c284">Over-squashing &amp; Expressivity</a><br>10. <a href="#4a32">Generalization and Extrapolation capabilities of GNNs</a><br>11. <a href="#4f30">Predictions time!</a></li><li><a href="#b09f">New and Exotic Message Passing</a></li><li><a href="#9a3b">Beyond Graphs</a><br>1. <a href="#efa6">Topology</a><br>2. <a href="#a368">Geometric Algebras</a><br>3. 
<a href="#5b67">PDEs</a></li><li><a href="#8171">Robustness &amp; Explainability</a></li><li><a href="#e7b4">Graph Transformers</a></li><li><a href="#cf16">New Datasets &amp; Benchmarks</a></li><li><a href="#926c">Conferences, Courses &amp; Community</a></li><li><a href="#f1d3">Memes of 2023</a></li></ol><p>The legend we will be using throughout the text:<br>💡 - year’s highlight<br>🏋️ - challenges <br> ➡️ - current/next developments<br>🔮- predictions/speculations</p><h3>Theory of Graph Neural Networks</h3><p><em>Michael Bronstein (Oxford), Francesco Di Giovanni (Oxford), İsmail İlkan Ceylan (Oxford), Chris Morris (RWTH Aachen)</em></p><h4><strong>Message Passing Neural Networks &amp; Graph Transformers</strong></h4><p>Graph Transformers are a relatively recent trend in graph ML, trying to extend the successes of Transformers from sequences to graphs. As far as traditional expressivity results go, these architectures do not offer any particular advantages. In fact, it is arguable that most of their benefits in terms of expressivity (see e.g. <a href="https://arxiv.org/abs/2106.03893">Kreuzer et al.</a>) come from powerful structural encodings rather than the architecture itself and such encodings can in principle be used with MPNNs.</p><p>In a recent paper, <a href="https://arxiv.org/abs/2301.11956">Cai et al. </a>investigate the connection between MPNNs and (graph) Transformers showing that an MPNN with a virtual node — an auxiliary node that is connected to all other nodes in a specific way — can simulate a (graph) Transformer. This architecture is<em> non-uniform</em>, i.e., the size and structure of the neural networks may depend on the size of the input graphs. 
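The virtual-node trick itself is, in practice, a one-line graph augmentation. Here is a minimal sketch in plain Python (the adjacency-list format and the toy path graph are our illustrative choices, not from the paper):

```python
def add_virtual_node(adj):
    """Append a virtual node wired to every existing node.

    adj: adjacency list (list of neighbour lists) of an undirected graph.
    After augmentation, every pair of original nodes is at most two hops
    apart, so one message-passing round through the virtual node acts as
    a cheap global-communication (attention-like) step.
    """
    n = len(adj)
    out = [list(nbrs) + [n] for nbrs in adj]  # each node also talks to node n
    out.append(list(range(n)))                # the virtual node talks to everyone
    return out

# A 4-node path graph: nodes 0 and 3 are 3 hops apart before augmentation,
# and only 2 hops apart (via the virtual node) afterwards.
path = [[1], [0, 2], [1, 3], [2]]
aug = add_virtual_node(path)
assert aug[0] == [1, 4] and aug[4] == [0, 1, 2, 3]
```

Any standard MPNN layer can then be run on `aug` unchanged; the virtual node simply participates in message passing like every other node.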
Interestingly, once we restrict our attention to linear Transformers (e.g., Performer), there is a <em>uniform</em> result: there exists a single MPNN using a virtual node that can approximate a linear transformer such as Performer on any input of any size.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZL9GXms6ewqatSrD" /><figcaption>Figure from <a href="https://arxiv.org/abs/2301.11956">Cai et al.</a>: (a) MPNN with a virtual node, (b) a Transformer.</figcaption></figure><p>This is related to the discussions on whether graph transformer architectures present advantages for capturing long-range dependencies when compared to MPNNs. Graph transformers are compared to MPNNs that include a global computation component through the use of virtual nodes, which is a common practice. <a href="https://arxiv.org/abs/2301.11956">Cai et al.</a> empirically show that MPNNs with virtual nodes can surpass the performance of graph transformers on the Long-Range Graph Benchmark (LRGB, <a href="https://arxiv.org/abs/2206.08164">Dwivedi et al.</a>). Moreover, <a href="https://arxiv.org/abs/2309.00367">Tönshoff et al.</a> re-evaluate MPNN baselines on the LRGB benchmark and find that the earlier reported performance gap in favor of graph transformers was overestimated due to suboptimal hyperparameter choices, essentially closing the gap between MPNNs and graph Transformers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kpj2Y7L_oohN34LX" /><figcaption>Figure from <a href="https://arxiv.org/abs/2202.13013">Lim et al.</a>: SignNet pipeline.</figcaption></figure><p>It is also well-known that common Laplacian positional encodings (e.g., LapPE) are not invariant to changes of sign and basis of the eigenvectors. The lack of invariance makes it easier to obtain (non-uniform) universality results, but as a consequence these models do not compute graph invariants.
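The sign ambiguity is easy to see concretely. In this hand-worked sketch (plain Python; the 3-node path graph is an illustrative choice), both v and -v satisfy the eigenvector equation, so an eigensolver may legitimately return either one as a positional encoding:

```python
# Unnormalized graph Laplacian L = D - A of the path graph 0 - 1 - 2.
L = [[ 1, -1,  0],
     [-1,  2, -1],
     [ 0, -1,  1]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

# v is a unit-norm eigenvector of L with eigenvalue 1, but so is -v:
# if L v = lam * v, then L (-v) = lam * (-v) by linearity.
lam = 1.0
v = [2 ** -0.5, 0.0, -(2 ** -0.5)]
for cand in (v, [-x for x in v]):
    Lc = matvec(L, cand)
    assert all(abs(Lc[i] - lam * cand[i]) < 1e-12 for i in range(3))
```

A model f consuming such raw LapPE features is therefore only well-defined if f(v) equals f(-v), which is exactly the invariance SignNet enforces by construction (roughly, by using f(v) + f(-v)).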
This has motivated a body of work this year, including the study of sign and basis invariant networks (<a href="https://arxiv.org/abs/2202.13013">Lim et al., 2023a</a>) and sign equivariant networks (<a href="https://arxiv.org/abs/2312.02339">Lim et al., 2023b</a>). These findings suggest that more research is necessary to theoretically ground the claims commonly found in the literature regarding the comparisons of MPNNs and graph transformers.</p><h4><strong>Graph components, biconnectivity, and planarity</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*m6EY4_Z792AC0Gyc" /><figcaption>Figure originally by Zyqqh at <a href="https://commons.wikimedia.org/w/index.php?curid=19053091">Wikipedia</a>.</figcaption></figure><p><a href="https://arxiv.org/abs/2301.09505">Zhang et al. (2023a)</a> bring the study of graph biconnectivity to the attention of the graph ML community, presenting many results related to different biconnectivity metrics. It has been shown that standard MPNNs cannot detect graph biconnectivity, unlike many existing higher-order models (i.e., those that can match the power of 2-FWL). On the other hand, Graphormers with certain distance encodings and subgraph GNNs such as ESAN can detect graph biconnectivity.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*870VMUW2Vna8nacy" /><figcaption>Figure from <a href="https://arxiv.org/abs/2307.01180">Dimitrov et al. (2023)</a>: LHS shows the graph decompositions (A-C) and RHS shows the associated encoders (D-F) and the update equation (G).</figcaption></figure><p><a href="https://arxiv.org/abs/2307.01180">Dimitrov et al. (2023)</a> rely on graph decompositions to develop dedicated architectures for learning with planar graphs.
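The cut nodes and biconnected components that such decompositions build on are classical, linear-time graph theory. A self-contained sketch of Tarjan-style articulation-point (cut-node) detection, with an illustrative toy graph of two triangles glued at one node:

```python
def articulation_points(adj):
    """Cut nodes of an undirected simple graph via Tarjan's low-link DFS.

    adj: dict mapping node -> list of neighbours. A node u is a cut node
    iff removing it disconnects the graph, i.e. some DFS child's subtree
    has no back-edge climbing above u.
    """
    disc, low, cuts = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])   # back-edge
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= disc[u]:
                    cuts.add(u)                 # u separates v's subtree
        if parent is None and children > 1:
            cuts.add(u)                         # root with >1 DFS children

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return cuts

# Two triangles glued at node 2: removing node 2 disconnects the graph,
# so the biconnected components are {0, 1, 2} and {2, 3, 4}.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
assert articulation_points(adj) == {2}
```

Grouping edges by the components these cut nodes separate yields the Block-Cut tree used in the decompositions above.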
The idea is to align with a variation of the classical <a href="https://www.sciencedirect.com/science/article/pii/0020019071900196">Hopcroft &amp; Tarjan</a> algorithm for planar isomorphism testing. <a href="https://arxiv.org/abs/2307.01180">Dimitrov et al. (2023)</a> first decompose the graph into its biconnected and triconnected components, and afterwards learn representations for nodes, cut nodes, biconnected components, and triconnected components. This is achieved using the classical structures of Block-Cut Trees and SPQR Trees which can be computed in linear time. The resulting framework is called <a href="https://arxiv.org/abs/2307.01180">PlanE</a> and contains architectures such as <a href="https://arxiv.org/abs/2307.01180">BasePlanE</a>. BasePlanE computes <em>isomorphism-complete graph invariants</em> and hence it can distinguish any pair of planar graphs. The key contribution of this work is to design architectures for efficiently learning complete invariants of planar graphs while remaining practically scalable. It is worth noting that 3-FWL is known to be complete on planar graphs (<a href="https://dl.acm.org/doi/10.1145/3333003">Kiefer et al., 2019</a>), but this algorithm is not scalable.</p><h4><strong>Aggregation functions: A uniform expressiveness study</strong></h4><p>It was broadly argued that different aggregation functions have their place, but this had not been rigorously proven. In fact, in the non-uniform setup, sum aggregation with MLPs yields an injective mapping and as a result subsumes other aggregation functions (<a href="https://arxiv.org/abs/1810.00826">Xu et al., 2020</a>), which builds on earlier results (<a href="https://arxiv.org/abs/1703.06114">Zaheer et al., 2017</a>). The situation is different in the uniform setup, where one fixed model is required to work on <em>all</em> graphs. <a href="https://arxiv.org/abs/2302.11603">Rosenbluth et al. 
(2023)</a> show that sum aggregation does not always subsume other aggregations in the uniform setup. If, for example, we consider an unbounded feature domain, sum aggregation networks cannot even approximate mean aggregation networks. Interestingly, even for the positive results, where sum aggregation is shown to approximate other aggregations, the presented constructions generally require a large number of layers (growing with the inverse of the approximation error).</p><h4><strong>Convergence and zero-one laws of GNNs on random graphs</strong></h4><p>GNNs can in principle be applied to graphs of any size following training. This makes an asymptotic analysis in the size of the input graphs very appealing. Previous studies of the asymptotic behaviour of GNNs have focused on convergence to theoretical limit networks (<a href="https://arxiv.org/abs/2006.01868">Keriven et al., 2020</a>) and their stability under the perturbation of large graphs (<a href="https://arxiv.org/abs/1907.12972">Levie et al., 2021</a>).</p><p>In a recent study, <a href="https://arxiv.org/abs/2301.13060">Adam-Day et al. (2023)</a> proved a <em>zero-one law</em> for binary GNN classifiers. The question being tackled is the following: how do binary GNN classifiers behave as we draw Erdős–Rényi graphs of increasing size with random node features? The main finding is that the probability that such graphs are mapped to a particular output by a class of GNN classifiers tends either to zero or to one. That is, the model eventually maps either <em>all</em> graphs to zero or <em>all</em> graphs to one. This result applies to GCNs as well as to GNNs with sum and mean aggregation.</p><p>The principal import of this result is that it establishes a novel <em>uniform</em> upper bound on the expressive power of GNNs: any property of graphs which can be uniformly expressed by these GNN architectures must obey a zero-one law. 
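The zero-one behaviour is easy to observe numerically. Below is a small sketch (our illustration, not the paper's construction): one fixed, untrained mean-aggregation GNN applied to Erdős–Rényi graphs of growing size, whose graph-level outputs concentrate on a single value as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d)) / np.sqrt(d)    # one fixed, untrained classifier

def gnn_logit(n, p=0.5):
    """Mean-aggregation GNN layer + mean readout on one G(n, p) sample."""
    A = np.triu(rng.random((n, n)) < p, 1)
    A = (A | A.T).astype(float)                 # symmetric ER adjacency
    X = rng.random((n, d))                      # random node features ~ U[0, 1]
    deg = np.maximum(A.sum(1, keepdims=True), 1)
    H = np.tanh((A @ X) / deg @ W)              # mean aggregation over neighbours
    return H.mean()                             # scalar graph-level logit

# how spread out are the logits across random graphs, as n grows?
spread = {n: np.std([gnn_logit(n) for _ in range(30)]) for n in (10, 100, 1000)}
print(spread)   # the spread collapses with n: the fixed classifier eventually
                # maps (almost) all large graphs to the same side of a threshold
```

Because the logit concentrates around a single value, any fixed decision threshold eventually assigns the same class to almost every large random graph, which is exactly the zero-one behaviour.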
An example of a simple property which does not asymptotically tend to zero or one is that of having an even number of nodes.</p><h4><strong>The descriptive complexity of GNNs</strong></h4><p><a href="https://arxiv.org/abs/2303.04613">Grohe (2023)</a> recently analysed the descriptive complexity of GNNs in terms of Boolean circuit complexity. The specific circuit complexity class of interest is TC0. This class contains all languages which are decided by Boolean circuits with constant depth and polynomial size, using only AND, OR, NOT, and threshold (or <a href="https://en.wikipedia.org/wiki/Majority_gate">majority</a>) gates. <a href="https://arxiv.org/abs/2303.04613">Grohe (2023)</a> proves that the graph functions that can be computed by a polynomial-size, bounded-depth family of GNNs lie in the circuit complexity class TC0. Furthermore, if the class of GNNs is allowed to use random node initialization and global readout as in <a href="https://arxiv.org/abs/2010.01179">Abboud et al. (2020)</a>, then there is a matching lower bound in that they can compute exactly the same functions that can be expressed in TC0. This establishes an upper bound on the power of GNNs with random node features by requiring the class of models to be of bounded depth (a fixed number of layers) and of polynomial size. While this result is still non-uniform, it improves the result of <a href="https://arxiv.org/abs/2010.01179">Abboud et al. (2020)</a>, where the construction can be worst-case exponential.</p><h4><strong>A fine-grained expressivity study of GNNs</strong></h4><p>Numerous recent works have analyzed the expressive power of MPNNs, primarily utilizing combinatorial techniques such as the 1-WL for the graph isomorphism problem. However, the graph isomorphism objective is inherently binary, not giving insights into the degree of similarity between two given graphs. <a href="https://arxiv.org/abs/2306.03698">Böker et al. 
(2023)</a> resolve this issue by deriving continuous extensions of both 1-WL and MPNNs to graphons. Concretely, they show that the continuous variant of 1-WL delivers an accurate topological characterization of the expressive power of MPNNs on graphons, revealing which graphs these networks can distinguish and the difficulty level in separating them. They provide a theoretical framework for graph and graphon similarity, combining various topological variants of classical characterizations of the 1-WL. In particular, they characterize the expressive power of MPNNs in terms of the tree distance, which is a graph distance based on the concept of fractional isomorphisms, and substructure counts via tree homomorphisms, showing that these concepts have the same expressive power as the 1-WL and MPNNs on graphons. Interestingly, they also validate their theoretical findings by showing that randomly initialized MPNNs, without training, exhibit competitive performance compared to their trained counterparts.</p><h4><strong>Expressiveness results for Subgraph GNNs</strong></h4><p>Subgraph-based GNNs were already a big trend in 2022 (<a href="https://arxiv.org/abs/2110.02910">Bevilacqua et al., 2022</a>, <a href="https://arxiv.org/abs/2206.11168">Qian et al., 2022</a>). This year, <a href="https://arxiv.org/abs/2302.07090">Zhang et al. (2023b)</a> established more fine-grained expressivity results for such architectures. The paper investigates subgraph GNNs via the so-called Subgraph Weisfeiler-Leman Tests (SWL). Through this, they show a complete hierarchy of SWL with strictly growing expressivity. Concretely, they define equivalence classes for SWL-type algorithms and show that almost all existing subgraph GNNs fall in one of them. Moreover, the so-called SSWL achieves the maximal expressive power. Interestingly, they also relate SWL to several existing expressive GNN architectures. 
For example, they show that SWL has the same expressivity as the local versions of 2-WL (<a href="https://arxiv.org/abs/1904.01543">Morris et al., 2020</a>). In addition to theory, they also show that SWL-type architectures achieve good empirical results.</p><h4><strong>Expressive power of architectures for link prediction on KGs</strong></h4><p>The expressive power of architectures such as RGCN and CompGCN for link prediction on knowledge graphs has been studied by <a href="https://arxiv.org/abs/2211.17113">Barceló et al. (2022)</a>. This year, <a href="https://arxiv.org/abs/2302.02209">Huang et al. (2023)</a> generalized these results to characterize the expressive power of various other model architectures.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*zk0cTSL838STb9rx" /><figcaption>Figure from <a href="https://arxiv.org/abs/2302.02209">Huang et al. (2023)</a>: The figure compares the respective modes of operation in R-MPNNs and C-MPNNs.</figcaption></figure><p><a href="https://arxiv.org/abs/2302.02209">Huang et al. (2023)</a> introduced the framework of conditional message passing networks (<a href="https://arxiv.org/abs/2302.02209">C-MPNNs</a>) which includes architectures such as <a href="https://arxiv.org/abs/2106.06935">NBFNets</a>. Classical relational message passing networks (R-MPNNs) are unary encoders (i.e., encoding graph nodes) and rely on a binary decoder for the task of link prediction (<a href="https://arxiv.org/abs/2010.16103">Zhang, 2021</a>). On the other hand, C-MPNNs serve as binary encoders (i.e., encoding pairs of graph nodes) and, as a result, are more suitable for the binary task of link prediction. C-MPNNs are shown to align with a relational Weisfeiler-Leman algorithm that can be seen as a local approximation of 2-WL. These findings explain the superior performance of NBFNets and the like over, e.g., RGCNs. <a href="https://arxiv.org/abs/2302.02209">Huang et al. 
(2023)</a> also present uniform expressiveness results in terms of precise logical characterizations for the class of binary functions captured by C-MPNNs.</p><h4><strong>Over-squashing and expressivity</strong></h4><p>Over-squashing is a phenomenon originally described by <a href="https://arxiv.org/abs/2006.05205">Alon &amp; Yahav</a> in 2021 as the compression of exponentially-growing receptive fields into fixed-size vectors. Subsequent research (<a href="https://arxiv.org/abs/2111.14522">Topping et al., 2022</a>, <a href="https://arxiv.org/abs/2302.02941">Di Giovanni et al., 2023</a>, <a href="https://arxiv.org/abs/2302.06835">Black et al., 2023</a>, <a href="https://arxiv.org/abs/2211.15779">Nguyen et al., 2023</a>) has characterised over-squashing through sensitivity analysis, proving that the dependence of the output features on hidden representations from earlier layers is impaired by topological properties such as negative curvature or large commute time. Since the graph topology plays a crucial role in the formation of bottlenecks, <em>graph rewiring</em>, a paradigm shift elevating the graph connectivity to a design factor in GNNs, has been proposed as a key strategy for alleviating over-squashing (if you are interested, see the Section on <strong>Exotic Message Passing</strong> below).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/739/0*gACxaunkfXsbPBQ1" /><figcaption>For the given graph, the MPNN learns stronger mixing (tight springs) for nodes (v, u) and (u, w) since their commute time is small, while nodes (u, q) and (u, z), with high commute time, have weak mixing (loose springs). Source: <a href="https://arxiv.org/abs/2306.03589">Di Giovanni et al., 2023</a></figcaption></figure><p>Over-squashing is an obstruction to expressive power, for it causes GNNs to falter in tasks with long-range interactions. 
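The gap between commute time and shortest-path distance is easy to see numerically. In this small sketch (ours), the commute time C(u, v) = 2|E| · R_eff(u, v) is computed from the pseudoinverse of the graph Laplacian of a "barbell" graph, where it dwarfs the shortest-path distance across the bottleneck:

```python
import numpy as np

# barbell graph: two 5-cliques (nodes 0-4 and 5-9) joined by the edge (4, 5)
n = 10
A = np.zeros((n, n))
A[:5, :5] = 1
A[5:, 5:] = 1
np.fill_diagonal(A, 0)
A[4, 5] = A[5, 4] = 1

L = np.diag(A.sum(1)) - A     # graph Laplacian
Lp = np.linalg.pinv(L)        # its Moore-Penrose pseudoinverse
m = A.sum() / 2               # number of edges

def commute_time(u, v):
    # C(u, v) = 2|E| * effective resistance between u and v
    return 2 * m * (Lp[u, u] + Lp[v, v] - 2 * Lp[u, v])

print(commute_time(0, 1))   # within a clique: small
print(commute_time(0, 9))   # across the bottleneck: several times larger,
                            # even though the shortest path is only 3 hops
```

Nodes 0 and 9 sit three hops apart, yet their commute time is more than four times that of the adjacent pair (0, 1): exactly the kind of topology-induced bottleneck the sensitivity analyses above point to.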
To formally study this, <a href="https://arxiv.org/abs/2306.03589">Di Giovanni et al. (2023)</a> introduce a new metric of expressivity, referred to as “mixing”, which encodes the joint and nonlinear dependence of a graph function on pairs of nodes’ features: for a GNN to approximate a function with large mixing, a necessary condition is allowing “strong” message exchange between the relevant nodes. Hence, they propose to measure over-squashing through the mixing of a GNN prediction, and prove that the depth required by a GNN to induce enough mixing, <em>as required by the task</em>, grows with the commute time — typically much worse than the shortest-path distance. The results show how over-squashing hinders the expressivity of GNNs of “practical” size, and validate that it arises from the misalignment between the task (requiring strong mixing between nodes i and j) and the topology (inducing large commute time between i and j).</p><p>The “mixing” of a function pertains to the exchange of information between nodes, whatever this information is, and not to its capacity to separate node representations. In fact, these results also hold for GNNs more powerful than the 1-WL test. The analysis in <a href="https://arxiv.org/abs/2306.03589">Di Giovanni et al. (2023)</a> offers an alternative approach for studying the expressivity of GNNs, which easily extends to equivariant GNNs in 3D space and their ability to model interactions between nodes.</p><h4><strong>Generalization and extrapolation capabilities of GNNs</strong></h4><p>The expressive power of MPNNs has received a lot of attention in recent years through its connection to the WL test. 
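As a refresher on that connection, here is a minimal 1-WL colour refinement sketch (our illustration): the classic failure case of a 6-cycle versus two disjoint triangles, both 2-regular, which 1-WL, and hence any standard MPNN, cannot distinguish.

```python
from collections import Counter

def wl_histogram(adj, rounds=3):
    """1-WL colour refinement; returns the final multiset of node colours."""
    colors = {v: len(nbrs) for v, nbrs in adj.items()}   # initialise with degrees
    for _ in range(rounds):
        # new colour = (own colour, sorted multiset of neighbour colours)
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        relabel = {sig: i for i, sig in enumerate(sorted(set(sigs.values())))}
        colors = {v: relabel[sigs[v]] for v in adj}
    return Counter(colors.values())

cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}   # C6
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],                # two disjoint C3
             3: [4, 5], 4: [3, 5], 5: [3, 4]}
path6 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

print(wl_histogram(cycle6) == wl_histogram(triangles))  # True: indistinguishable
print(wl_histogram(cycle6) == wl_histogram(path6))      # False: degrees already differ
```

Subgraph GNNs sidestep exactly this failure mode: running the same refinement on node-deleted subgraphs of the two graphs already yields different histograms.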
While this connection has led to significant advances in understanding and enhancing MPNNs’ expressive power (<a href="https://arxiv.org/abs/2301.11039">Morris et al., 2023a</a>), it does not provide insights into their generalization performance, i.e., their ability to make meaningful predictions beyond the training set. Surprisingly, only a few notable contributions study MPNNs’ generalization behaviors, e.g., <a href="https://arxiv.org/abs/2002.06157">Garg et al. (2020)</a>, <a href="https://www.ijcai.org/proceedings/2018/0325.pdf">Kriege et al. (2018)</a>, <a href="https://arxiv.org/abs/2012.07690">Liao et al. (2021)</a>, <a href="https://arxiv.org/abs/2202.00645">Maskey et al. (2022)</a>, <a href="https://pubmed.ncbi.nlm.nih.gov/30219742/">Scarselli et al. (2018)</a>. However, these approaches express MPNNs’ generalization ability using only classical graph parameters, e.g., maximum degree, number of vertices, or edges, which cannot fully capture the complex structure of real-world graphs. Further, most approaches study generalization in the non-uniform regime, i.e., assuming that the MPNNs operate on graphs of a pre-specified order.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*HbXOxwtBNm3256l3" /><figcaption>Figure from <a href="https://arxiv.org/abs/2301.11039">Morris et al. (2023b)</a>: Overview of the generalization capabilities of MPNNs and their link to the 1-WL.</figcaption></figure><p>Hence, <a href="https://arxiv.org/abs/2301.11039">Morris et al. (2023b)</a> showed a tight connection between the expressive power of the 1-WL and generalization performance. They investigate the influence of graph structure and the parameters’ encoding lengths on MPNNs’ generalization by tightly connecting 1-WL’s expressivity and MPNNs’ Vapnik–Chervonenkis (VC) dimension. 
To that end, they show several results.</p><p>1️⃣ First, in the non-uniform regime, they show that MPNNs’ VC dimension depends tightly on the number of equivalence classes computed by the 1-WL over a set of graphs. In addition, their results easily extend to the k-WL and many recent expressive MPNN extensions.</p><p>2️⃣ In the uniform regime, i.e., when graphs can have arbitrary order, they show that MPNNs’ VC dimension is lower and upper bounded by the largest bitlength of their weights. In both the uniform and non-uniform regimes, MPNNs’ VC dimension depends logarithmically on the number of colors computed by the 1-WL and polynomially on the number of parameters. Moreover, they also empirically show that their theoretical findings hold in practice to some extent.</p><h4>🔮 Predictions time!</h4><p><strong><em>Christopher Morris (RWTH Aachen)</em></strong></p><blockquote>“I believe that there is a pressing need for a better and more practical theory of generalization of GNNs.” — <strong>Christopher Morris</strong> (RWTH Aachen)</blockquote><p>➡️ For example, we need to understand how graph structure and various architectural parameters influence generalization. Moreover, the dynamics of SGD for training GNNs are currently understudied and not well understood, and more works will study this.</p><p><strong><em>İsmail İlkan Ceylan (Oxford)</em></strong></p><blockquote>“I hope to see more expressivity results in the uniform setting, where we fix the parameters of a neural network and examine its capabilities.” — <strong>İsmail İlkan Ceylan</strong> (Oxford)</blockquote><p>➡️ In this case, we can identify a better connection to generalization, because if a property cannot be expressed uniformly then the model cannot generalise to larger graph sizes.</p><p>➡️ This year, we may also see expressiveness studies that target graph regression or graph generation, which remain under-explored. 
There are good reasons to hope for learning algorithms which are isomorphism-complete on larger graph classes, strictly generalizing the results for planar graphs.</p><p>➡️ It is also time to develop a theory for learning with fully relational data (i.e., knowledge hypergraphs), which will unlock applications in relational databases!</p><p><strong><em>Francesco Di Giovanni (Oxford)</em></strong></p><p>In terms of future theoretical developments of GNNs, I can see two directions that deserve attention.</p><blockquote>“There is very little understanding of the dynamics of the weights of a GNN under gradient flow (or SGD); assessing the impact of the graph topology on the evolution of the weights is key to addressing questions about generalisation and hardness of a task.” — Francesco Di Giovanni (Oxford)</blockquote><p>➡️ Second, I believe it would be valuable to develop alternative paradigms of expressivity, which more directly focus on approximation power (of graph functions and their derivatives) and identify precisely the tasks which are hard to learn. 
The latter direction could also be particularly meaningful for characterising the power of equivariant GNNs in 3D space, where measurements of expressivity might need to be decoupled from the 2D case in order to be better aligned with tasks coming from the scientific domain.</p><p>At the end: a fun fact about where WL went in 2023</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*a4IHR94Y8g8YZXIT" /><figcaption>Portraits: Ihor Gorsky</figcaption></figure><h4><strong>Predictions from the 2023 post</strong></h4><p>(1) More efforts on creating time- and memory-efficient subgraph GNNs.<br>❌ not really</p><p>(2) Better understanding of generalization of GNNs<br>✅ yes, see the subsections on oversquashing and generalization</p><p>(3) Weisfeiler and Leman visit 10 new places!<br>❌ (4 so far) <a href="https://openreview.net/forum?id=eZneJ55mRO">Grammatical</a>, <a href="https://arxiv.org/abs/2311.01205">indifferent</a>, <a href="https://arxiv.org/abs/2307.05775">measurement modeling</a>, <a href="https://arxiv.org/abs/2308.06838">paths</a></p><h3>New and exotic message passing</h3><p><em>Ben Finkelshtein (Oxford), Francesco Di Giovanni (Oxford), Petar Veličković (Google DeepMind)</em></p><p><strong><em>Petar Veličković (Google DeepMind)</em></strong></p><p>Over the years, it has become part of common folklore that the development of message passing operators has saturated. What I find particularly exciting about the progress made in 2023 is that, from several independent research groups (including our own), a unified novel direction has emerged: let’s start considering the impact of <strong><em>time</em></strong> in the GNN ⏳.</p><blockquote>“I forecast that, in 2024, time will assume a central role in the development of novel GNN architectures.” — Petar Veličković (Google DeepMind)</blockquote><p>💡 Time has already been leveraged in GNN design when it is explicitly provided in the input (in spatiotemporal or fully dynamic graphs). 
This year, it has started to feature in research of GNN operators on <em>static</em> graph inputs. Several works are dropping the assumption of a unified, synchronised clock ⏱️ which forces all messages in a layer to be sent and received at once.</p><p>1️⃣ The first such work, <a href="https://openreview.net/forum?id=zffXH0sEJP">GwAC</a> 🥑, only played with rudimentary randomised message scheduling, but provided <strong>proofs</strong> for why such processing might yield significant improvements in expressive power. <a href="https://arxiv.org/abs/2310.01267">Co-GNNs</a> 🤝 carry the torch further, demonstrating a more elaborate and fine-tuned message scheduling mechanism which is node-centric, allowing each node to choose when to send 📨 or receive 📬 messages. Co-GNNs also provide a practical method for training such schedulers by gradient descent. While the development of such asynchronous GNN models is highly desirable, we must also acknowledge the associated scalability issues — our present frontier hardware is not designed to efficiently scale such sequential systems.</p><p>2️⃣ In our own work on <a href="https://openreview.net/forum?id=ba4bbZ4KoF">asynchronous algorithmic alignment</a>, we instead opt to design a <em>synchronous</em> GNN, but <strong>constrain</strong> its message, aggregation, and update functions such that the GNN would yield identical embeddings even if parts of its dataflow were made asynchronous. This led us to an exciting journey through monoids, 1-cocycles, and category theory, resulting in a scalable GNN model that achieves superior performance on many CLRS-30 tasks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t6r827_csyRdovaiPRXNSg.png" /><figcaption>A possible execution trace of an asynchronous GNN. 
While traditional GNNs send and receive all messages synchronously, under our framework, at any step the GNN may choose to execute any number of possible operations (depicted here with a collection on the right side of the graph). Source: <a href="https://openreview.net/forum?id=ba4bbZ4KoF">Dudzik et al.</a></figcaption></figure><p>➡️ Lastly, it is worth noting that for certain special choices of message scheduling, we do not need to make modifications to synchronous GNNs’ architecture — and may instead resort to dynamic graph rewiring. <a href="https://arxiv.org/abs/2305.08018">DREW</a> and <a href="https://openreview.net/forum?id=lXczFIwQkv">Half-Hop</a> are two concurrently published papers at ICML’23 which embody the principle of using graph rewiring to <em>slow down</em> message passing 🐌. In DREW, a message from each node is actually sent to every other node, but it takes <em>k</em> layers before a message will reach a neighbour that is <em>k</em> hops away! Half-Hop, on the other hand, takes a more lenient approach, and just randomly decides whether or not to introduce a “slow node” which extends the path between any two nodes connected by an edge. Both approaches naturally alleviate the oversmoothing problem, as messages travelling longer distances will oversmooth less.</p><p>Whether it is used for message passing design, GNN dataflow or graph rewiring, in 2023 we have just started to grasp the importance of <em>time</em> — even when time variation is not explicitly present in our dataset.</p><p><strong><em>Ben Finkelshtein (Oxford)</em></strong></p><p>The time-dependent message passing paradigm presented in <a href="https://arxiv.org/abs/2310.01267">Co-GNNs</a> is a learnable generalisation of message passing, which allows each node to decide how to propagate information from or to its neighbours, thus enabling a more flexible flow of information. 
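A rough sketch of this action-gated propagation step may help (our simplification with hard, hand-set actions; Co-GNNs learn per-node action distributions end-to-end): a node receives messages only if it listens, and only from neighbours that broadcast.

```python
import numpy as np

STANDARD, LISTEN, BROADCAST, ISOLATE = range(4)

def cognn_step(A, X, actions):
    """One mean-aggregation step where per-node actions gate the message flow."""
    sends = np.isin(actions, [STANDARD, BROADCAST]).astype(float)
    receives = np.isin(actions, [STANDARD, LISTEN]).astype(float)
    M = A * sends[None, :] * receives[:, None]   # effective *directed* adjacency
    deg = M.sum(1, keepdims=True)
    # nodes with no incoming messages simply keep their own state
    return np.divide(M @ X, deg, out=X.copy(), where=deg > 0)

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)  # a triangle
X = np.eye(3)

all_standard = cognn_step(A, X, np.full(3, STANDARD))  # classical mean aggregation
all_isolate = cognn_step(A, X, np.full(3, ISOLATE))    # DeepSets-like: no messages
print(all_isolate)                                     # identical to X
```

With every node playing the standard action this collapses to classical mean aggregation; with every node isolating it collapses to a DeepSets-style update, matching the spectrum in the figure below.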
The nodes are regarded as players that can either broadcast to neighbors that listen <em>and</em> listen to neighbors that broadcast (like in classical message-passing), only broadcast to neighbors that listen, only listen to neighbors that broadcast, or isolate (neither listen nor broadcast).</p><p>The interplay between these actions and the ability to change them locally and dynamically allows CoGNNs to determine a <strong>task-specific</strong> computational graph (which can be considered as a form of <strong>dynamic</strong> and <strong>directed rewiring</strong>) and to learn different action distributions for nodes with different node features (both <strong>feature-</strong> and <strong>structure-based</strong>). CoGNNs allow <strong>asynchronous</strong> updates across nodes and also yield unique node identifiers with high probability, which allows them to distinguish any pair of graphs (<strong>more expressive than 1-WL</strong>, at the expense of equivariance holding only in expectation).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0Omx4ZqhSKJB9_ok" /><figcaption>Left to right: classical MPNNs (all nodes broadcast &amp; listen), DeepSets (all nodes isolate), and generic CoGNNs. 
Figure from <a href="https://towardsdatascience.com/co-operative-graph-neural-networks-34c59bf6805e?gi=98ca39c38e41">blog post</a>.</figcaption></figure><p>Check the Medium post for more details:</p><p><a href="https://towardsdatascience.com/co-operative-graph-neural-networks-34c59bf6805e">Co-operative Graph Neural Networks</a></p><p><strong><em>Francesco Di Giovanni (Oxford)</em></strong></p><blockquote>“The understanding of over-squashing, arising when the task depends on the interaction between nodes with large commute time, acted as a catalyst for the emergence of graph rewiring as a valid approach for designing new GNNs.” — <strong>Francesco Di Giovanni</strong> (Oxford)</blockquote><p>💡 <em>Graph rewiring</em> broadly entails altering the connectivity of the input graph to facilitate the solution of the downstream task. Recently, this has often targeted bottlenecks in the graph, thereby adding (and removing) edges to improve the flow of information. While the emphasis has been on <strong>where</strong> messages are exchanged, recent works (discussed above) have shed light on the relevance of <strong>when</strong> messages should be exchanged as well. One rationale behind these approaches, albeit often implicit, is that the hidden representations built by the layers of a GNN provide the graph with an (artificially) <em>dynamic</em> component, even though the graph and input features are static. This perspective can be leveraged in several ways.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0b7jbYmoZGeh1_a4OywUSA.png" /><figcaption>In the classical MPNN setting, at every layer information only travels from a node to its immediate neighbours. In DRew, the graph changes based on the layer, with newly added edges connecting nodes at distance r from layer r − 1 onward. Finally, in νDRew, we also introduce a delay mechanism equivalent to skip-connections between different nodes based on their mutual distance. 
Source: <a href="https://arxiv.org/abs/2305.08018">Gutteridge et al.</a></figcaption></figure><p>➡️ One framework that has particularly embraced such an angle is <a href="https://arxiv.org/abs/2305.08018"><strong>DRew</strong></a>, which extends any message-passing model in two ways: (i) it connects nodes at distance <em>r</em> directly, but only from layer <em>r</em> onwards; (ii) when nodes are connected, a delay is applied to their message exchange, based on their mutual distance. As the figure above illustrates, (i) allows the network to better retain the inductive bias, as nodes that are closer, interact <em>earlier;</em> (ii) instead acts as <em>distance-aware</em> <em>skip connections, </em>thereby facilitating the propagation of gradients for the loss. Most likely, it is for this reason, and not prevention of over-smoothing (which hardly has an impact for graph-level tasks), that the framework significantly enhances the performance of standard GNNs at larger depths (more details can be found in this <a href="https://towardsdatascience.com/dynamically-rewired-delayed-message-passing-gnns-2d5ff18687c2">blog post</a>).</p><p><strong>🔮 Predictions: </strong>I believe that the deep implications of extending message-passing over the “time” component would start to emerge in the coming year. 
Works like DRew have only scratched the surface of why rewiring over time (beyond space) might benefit the training of GNNs, drastically affecting their accuracy response across different depth regimes.</p><p>➡️ More broadly, I hope that theoretical and practical developments of graph rewiring could be translated into scientific domains, where equivariant GNNs are often applied to 3D problems which either do not have a natural graph structure (making the question of “where” messages should be exchanged ever more relevant) or (and) exhibit natural temporal (multi-scale) properties (making the question of “when” messages should be exchanged likely to be key for reducing memory constraints and retaining the right inductive bias).</p><h3>Geometry, Topology, Geometric Algebras &amp; PDEs</h3><p><em>Johannes Brandstetter (JKU Linz), Michael Galkin (Intel), Mathilde Papillon (UC Santa Barbara), Bastian Rieck (Helmholtz &amp; TUM), and David Ruhe (U Amsterdam)</em></p><p>2023 brought the most comprehensive introduction to (and a survey of) Geometric GNNs covering the most basic and necessary concepts with a handful of examples: <strong>A Hitchhiker’s Guide to Geometric GNNs for 3D Atomic Systems </strong>(<a href="https://arxiv.org/abs/2312.07511">Duval, Mathis, Joshi, Schmidt, et al.</a>). If you ever wanted to learn from scratch the core architectures powering recent breakthroughs of graph ML in protein design, material discovery, molecular simulations, and more — this is what you need!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AYsGjZhbdr701OndCvnfng.png" /><figcaption>Timeline of key Geometric GNNs for 3D atomic systems, characterised by the type of intermediate representations within layers. 
Source: <a href="https://arxiv.org/abs/2312.07511">Duval, Mathis, Joshi, Schmidt, et al.</a></figcaption></figure><h4><strong>Topology</strong></h4><p>💡 Working with topological structures in 2023 has become much easier for both researchers and practitioners thanks to the amazing efforts of the <a href="https://github.com/pyt-team">PyT team</a> and their suite of resources: <strong>TopoNetX</strong>, <strong>TopoModelX</strong>, and <strong>TopoEmbedX</strong>. <a href="https://github.com/pyt-team/TopoNetX">TopoNetX</a> is pretty much the networkx for topological data. TopoNetX supports standard structures like cellular complexes, simplicial complexes, and combinatorial complexes. <a href="https://github.com/pyt-team/TopoModelX">TopoModelX</a> is a PyG-like library for deep learning on topological data and implements famous models like <a href="https://arxiv.org/abs/2103.03212">MPSN</a> and <a href="https://arxiv.org/abs/2106.12575">CIN</a> with a neat unified interface (the original PyG implementations are quite tangled). <a href="https://github.com/pyt-team/TopoEmbedX">TopoEmbedX</a> helps to train embedding models on topological data and supports core algorithms like <a href="https://arxiv.org/abs/2010.00743">Cell2Vec</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jrYDXng6bSsL-vstRWaT-A.png" /><figcaption>Domains: Nodes in blue, (hyper)edges in pink, and faces in dark red. Source: <a href="https://github.com/pyt-team/TopoNetX">TopoNetX</a>, <a href="https://arxiv.org/abs/2304.10031">Papillon et al</a></figcaption></figure><p>💡 A great headstart to the field and basic building blocks of those topological networks are the papers by <a href="https://arxiv.org/abs/2206.00606">Hajij et al</a> and by <a href="https://arxiv.org/abs/2304.10031">Papillon et al</a>. 
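To make the basic building blocks concrete, here is a small self-contained sketch in plain NumPy (ours; it does not use the TopoNetX/TopoModelX APIs) of the boundary matrices of a simplicial complex and the lower and upper adjacencies over which simplicial message passing schemes aggregate:

```python
import numpy as np

# a tiny simplicial complex: 4 vertices, 5 edges, and one filled triangle (0,1,2)
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
triangles = [(0, 1, 2)]

# B1: node-to-edge boundary matrix (signed incidence)
B1 = np.zeros((4, len(edges)))
for j, (u, v) in enumerate(edges):
    B1[u, j], B1[v, j] = -1, 1

# B2: edge-to-triangle boundary matrix; boundary of (a,b,c) = (b,c) - (a,c) + (a,b)
B2 = np.zeros((len(edges), len(triangles)))
for k, (a, b, c) in enumerate(triangles):
    B2[edges.index((a, b)), k] = 1
    B2[edges.index((b, c)), k] = 1
    B2[edges.index((a, c)), k] = -1

# edge-to-edge adjacency via shared nodes (lower) and shared triangles (upper)
A_low = B1.T @ B1
A_up = B2 @ B2.T

# one message passing step on edge features, mixing both neighbourhood types
X = np.eye(len(edges))                   # one-hot edge features
W_low, W_up = np.eye(5), np.eye(5)       # identity "weights" for illustration
H = np.tanh(A_low @ X @ W_low + A_up @ X @ W_up)
print(H.shape)                           # (5, 5): updated features per edge
```

Note how the edge (1, 3) has an all-zero row in A_up: it belongs to no triangle, so it receives no upper-adjacent messages, which is exactly the extra structure such models exploit beyond graph message passing.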
A notable chunk of models was implemented by the members of the <a href="https://www.tagds.com/home">Topology, Algebra, and Geometry in Data Science</a> (TAG) community that regularly organizes topological workshops at ML conferences.</p><p><strong><em>Mathilde Papillon (UCSB)</em></strong></p><blockquote>“Until 2023, the field of topological deep learning featured a fractured landscape of enriched representations for relational data.” — Mathilde Papillon (UC Santa Barbara)</blockquote><p>➡️ Message-passing models were only built upon and benchmarked against other models of the same domain, e.g., the simplicial complex community remained insular to the hypergraph community. To make matters worse, most models adopted a unique mathematical notation. Deciding which model would be best suited to a given application seemed like a monumental task. A unification theory proposed by <a href="https://arxiv.org/abs/2206.00606">Hajij et al</a> offered a general scheme under which all models could be systematically described and classified. We applied this theory to the literature to produce a comprehensive yet concise <a href="https://arxiv.org/abs/2304.10031">survey of message passing in topological deep learning</a> that also serves as an accessible introduction to the field. We additionally provide a <a href="https://github.com/awesome-tnns/awesome-tnns">dictionary listing all the model architectures</a> in one unifying notation.</p><p>➡️ To further unify the field, we organized the first <a href="https://pyt-team.github.io/topomodelx/challenge/index.html">Topological Deep Learning Challenge</a>, hosted at the <a href="https://www.tagds.com/events/conference-workshops/tag-ml23">2023 ICML TAG workshop</a> and recorded via this white paper by <a href="https://proceedings.mlr.press/v221/papillon23a.html">Papillon et al</a>. The goal was to foster reproducible research by crowdsourcing the open-source implementation of neural networks on topological domains. 
As part of the challenge, participants from around the world contributed implementations of pre-existing topological deep learning models in <a href="https://github.com/pyt-team/TopoModelX">TopoModelX</a>. Each submission was rigorously unit-tested and included benchmark training on datasets loaded from <a href="https://github.com/pyt-team/TopoNetX">TopoNetX</a>. It is our hope that this one-stop-shop suite of consistently implemented models will help practitioners test-drive topological methods for new applications and developments in 2024.</p><p><strong><em>Bastian Rieck (Helmholtz &amp; TUM)</em></strong></p><p>2023 was an exciting year for topology-driven machine learning methods. On the one hand, we saw more integrations with geometrical concepts like curvature, thus demonstrating the versatility of hybrid geometrical-topological models. For instance, in <a href="https://arxiv.org/abs/2301.12906">‘Curvature Filtrations for Graph Generative Model Evaluation,’</a> we showed how to employ curvature as a way to select suitable graph generative models. Here, curvature serves as a ‘lens’ that we use to extract graph structure information, while we employ persistent homology, a topological method, to compare this information in a consistent fashion.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yAl2YK92aFGLIlXhz4yAtQ.png" /><figcaption>An overview of the pipeline for evaluating graph generative models using discrete curvature. The ordering on edges gives rise to a curvature filtration, followed by a corresponding persistence diagram and landscape. For graph generative models, we select a curvature, apply this framework element-wise, and evaluate the similarity of the generated and reference distributions by comparing their average landscapes. 
Source: <a href="https://arxiv.org/abs/2301.12906">Southern, Wayland, Bronstein, and Rieck.</a></figcaption></figure><p>➡️ Another direction that serves to underscore that topology-driven methods are becoming a staple in graph learning research uses topology to assess the expressivity of graph neural network models. Sometimes, as in a very fascinating work from NeurIPS 2023 by <a href="https://openreview.net/pdf?id=27TdrEvqLD">Immonen et al.</a>, this even leads to novel models that leverage both geometrical and topological aspects of graphs in tandem! My own research also aims to contribute to this facet by specifically analyzing the <a href="https://arxiv.org/abs/2302.09826">expressivity of persistent homology in graph learning</a>.</p><blockquote>“2023 also was the cusp of moving away — or beyond — persistent homology. Despite being rightfully seen as the paradigmatic algorithm for topology-driven machine learning, algebraic topology and differential topology offer an even richer fabric that can be used to analyse data.” — Bastian Rieck (Helmholtz &amp; TUM)</blockquote><p>➡️ With my great collaborators, we started looking at some alternatives very recently and came up with the concept of <a href="https://arxiv.org/abs/2312.08515">neural differential forms</a>. Differential forms permit us to elegantly build a bridge between geometry and topology by means of the <a href="https://en.wikipedia.org/wiki/De_Rham_cohomology">de Rham cohomology</a> — a way to link the integration of certain objects (differential forms), i.e. a fundamentally <em>geometric</em> operation, to topological characteristics of input data. With some additional constructions, the de Rham cohomology permits us to learn geometric descriptions of graphs (or higher-order combinatorial complexes) and solve learning tasks without having to rely on message passing. The upshot is models with fewer parameters that are potentially more effective at solving such tasks. 
There’s more to come here, since we have just started scratching the surface!</p><p>🔮My hopeful predictions for 2024 are that we will:</p><p>1️⃣ see many more diverse tools from algebraic and differential topology applied to graphs and combinatorial complexes,</p><p>2️⃣ better understand message passing on higher-order input data, and</p><p>3️⃣ finally obtain better parallel algorithms for persistent homology to truly unleash its power in a deep learning setting. A <a href="https://link.springer.com/article/10.1007/s00454-023-00549-2">recent paper on spectral sequences</a> by Torras-Casas reports some very exciting results that show the great prospects of this technique.</p><h4><strong>Geometric Algebras</strong></h4><p><em>Johannes Brandstetter (JKU Linz) and David Ruhe (U Amsterdam)</em></p><blockquote>“In 2023, we saw the subfield of deep learning on geometric algebras (also known as <strong>Clifford algebras</strong>) take off. Previously, neural network layers formulated as operations on Clifford algebra <em>multivectors</em> were introduced by <a href="https://arxiv.org/abs/2209.04934">Brandstetter et al.</a> This year, the ‘geometric’ in ‘geometric algebra’ was clearly put into action.” — Johannes Brandstetter (JKU Linz) and David Ruhe (U Amsterdam)</blockquote><p>➡️ First, <a href="https://arxiv.org/abs/2302.06594">Ruhe et al.</a> applied the quintessence of modern (plane-based) geometric algebra by introducing <strong>Geometric Clifford Algebra Networks (GCAN)</strong>, neural network templates that model symmetry transformations described by various geometric algebras. We saw an intriguing application thereof by <a href="https://openaccess.thecvf.com/content/WACV2024/papers/Pepe_CGAPoseNetGCAN_A_Geometric_Clifford_Algebra_Network_for_Geometry-Aware_Camera_Pose_WACV_2024_paper.pdf">Pepe et al.</a> in <strong>CGAPoseNet</strong>, building a geometry-aware pipeline for camera pose regression. 
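To make the multivector machinery concrete: in the 2D Clifford algebra Cl(2,0), a rotor R = cos(θ/2) − sin(θ/2)·e₁e₂ rotates a vector v via the sandwich product R v R̃. A minimal hand-rolled sketch (purely illustrative, not code from any of the papers above):

```python
import math

# Basis order: [1, e1, e2, e12];  e1^2 = e2^2 = 1, e12^2 = -1.
# TABLE[i][j] = (sign, index) of the geometric product of basis elements i and j.
TABLE = [
    [(1, 0), (1, 1), (1, 2), (1, 3)],
    [(1, 1), (1, 0), (1, 3), (1, 2)],
    [(1, 2), (-1, 3), (1, 0), (-1, 1)],
    [(1, 3), (-1, 2), (1, 1), (-1, 0)],
]

def gp(a, b):
    """Geometric product of two multivectors in Cl(2,0)."""
    out = [0.0] * 4
    for i in range(4):
        for j in range(4):
            sign, k = TABLE[i][j]
            out[k] += sign * a[i] * b[j]
    return out

def reverse(a):
    """Reversion: flips the sign of the bivector part."""
    return [a[0], a[1], a[2], -a[3]]

def rotate(v, theta):
    """Rotate a vector v = [0, x, y, 0] via the rotor sandwich R v R~."""
    r = [math.cos(theta / 2), 0.0, 0.0, -math.sin(theta / 2)]
    return gp(gp(r, v), reverse(r))

e1 = [0.0, 1.0, 0.0, 0.0]
print(rotate(e1, math.pi / 2))  # ~[0, 0, 1, 0]: e1 rotated by 90 degrees is e2
```

The libraries discussed here generalize exactly this structure: network layers act on (batches of) multivectors while respecting the algebra's grades, which is what makes them equivariant by construction.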
Next, <a href="https://arxiv.org/abs/2305.11141">Ruhe et al.</a> introduced <strong>Clifford Group Equivariant Neural Networks (CGENN)</strong>, building steerable O(n)- and E(n)-equivariant (graph) neural networks of any dimension via the Clifford group. <a href="https://openreview.net/forum?id=JNfpsiGS5E">Pepe et al.</a> apply CGENNs to a Protein Structure Prediction (PSP) pipeline, increasing prediction accuracies by up to 2.1%.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ce_jLcT0amTIaoAybZroHw.png" /><figcaption>CGENNs (represented with ϕ) are able to operate on multivectors (elements of the Clifford algebra) in an O(n)- or E(n)-equivariant way. Specifically, when an action ρ(w) of the Clifford group, representing an orthogonal transformation such as a rotation, is applied to the data, the model’s representations corotate. Multivectors can be decomposed into scalar, vector, bivector, trivector, and even higher-order components. These elements can represent geometric quantities such as (oriented) areas or volumes. The action ρ(w) is designed to respect these structures when acting on them. Source: <a href="https://arxiv.org/abs/2305.11141">Ruhe et al.</a></figcaption></figure><p>➡️ Coincidentally, <a href="https://arxiv.org/abs/2305.18415">Brehmer et al.</a> formulated the <strong>Geometric Algebra Transformer (GATr)</strong>, a scalable Transformer architecture that harnesses the benefits of representations provided by the projective geometric algebra and the scalability of Transformers to build E(3)-equivariant architectures. The GATr architecture was extended to other algebras by <a href="https://arxiv.org/abs/2311.04744">Haan et al.</a>, who also examine which flavor of geometric algebra is best suited for your E(3)-equivariant machine learning problem.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DMWj0RcgzHZxam5jzd7e9A.png" /><figcaption>Overview of the GATr architecture. 
Boxes with solid lines are learnable components, those with dashed lines are fixed. Source: <a href="https://arxiv.org/abs/2305.18415">Brehmer et al.</a></figcaption></figure><p>🔮 In 2024, we can expect exciting new applications from these advancements. Some examples include the following.</p><p>1️⃣ We can expect explorations of their applicability to molecular data, drug design, neural physics emulations, crystals, etc. Other geometry-aware applications include 3D rendering, pose estimation, and planning for, e.g., robot arms.</p><p>2️⃣ We can expect the extension of geometric algebra-based networks to other neural network architectures, such as convolutional neural networks.</p><p>3️⃣ Next, the generality of the CGENN allows for explorations in other dimensions, e.g., 2D, but also in settings where data of various dimensionalities should be processed together. Further, they enable non-Euclidean geometries, which have several use cases in relativistic physics.</p><p>4️⃣ Finally, GATr and CGENN can be extended and applied to projective, conformal, hyperbolic, or elliptic geometries.</p><h4><strong>PDEs</strong></h4><p><em>Johannes Brandstetter (JKU Linz)</em></p><p>Concerning the landscape of neural PDE modelling, what topics have surfaced or gathered momentum through 2023?</p><p>1️⃣ To begin, there is a noticeable trend towards modelling PDEs on and within intricate geometries, necessitating a mesh-based discretization of space. This aligns with the overarching goal to address increasingly realistic real-world problems. For example, <a href="https://arxiv.org/abs/2309.00583">Li et al</a>. have introduced the <strong>Geometry-Informed Neural Operator (GINO)</strong> for large-scale 3D PDEs.</p><p>2️⃣ Secondly, the development of neural network surrogates for Lagrangian-based simulations is becoming increasingly intriguing. The Lagrangian discretization of space uses finite material points which are tracked as fluid parcels through space and time. 
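At the core of such particle methods is kernel density summation: each particle's density is a kernel-weighted sum over nearby particles, ρᵢ = Σⱼ mⱼ W(|xᵢ − xⱼ|, h). A toy 1D sketch with a normalized Gaussian kernel (purely illustrative, not LagrangeBench code; production SPH typically uses compactly supported kernels such as the cubic spline):

```python
import math

def gaussian_kernel(r, h):
    """Normalized 1D Gaussian smoothing kernel: integrates to 1 over the line."""
    return math.exp(-(r / h) ** 2) / (h * math.sqrt(math.pi))

def sph_density(positions, masses, h):
    """SPH density estimate rho_i = sum_j m_j * W(|x_i - x_j|, h)."""
    return [
        sum(m * gaussian_kernel(abs(xi - xj), h)
            for xj, m in zip(positions, masses))
        for xi in positions
    ]

# A cluster of particles near x=0 and one isolated particle at x=10.
xs = [-0.1, 0.0, 0.1, 10.0]
rho = sph_density(xs, [1.0] * 4, h=0.5)
print(rho)  # density is much higher inside the cluster than at x=10
```

GNN-based surrogates learn to predict the dynamics of exactly such particle systems, with the kernel neighborhood playing the role of the message-passing neighborhood.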
The most prominent Lagrangian discretization scheme is called smoothed particle hydrodynamics (SPH), which is the numerical baseline in the <strong>LagrangeBench</strong> benchmark dataset provided by <a href="https://arxiv.org/abs/2309.16342">Toshev et al.</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7l_3FzymQF6stilNiTNIBQ.png" /><figcaption>Time snapshots of our datasets, at the initial time (top), 40% (middle), and 95% (bottom) of the trajectory. Color temperature represents velocity magnitude. (a) Taylor Green vortex (2D and 3D), (b) Reverse Poiseuille flow (2D and 3D), (c) Lid-driven cavity (2D and 3D), (d) Dam break (2D). Source: LagrangeBench by <a href="https://arxiv.org/abs/2309.16342">Toshev et al.</a></figcaption></figure><p>3️⃣ Thirdly, diffusion-based modelling is not stopping for PDEs either. We roughly see two directions. The first direction recasts the iterative nature of the diffusion process into a refinement of a candidate state initialised from noise and conditioned on previous timesteps. This iterative refinement was introduced in <strong>PDE-Refiner</strong> (<a href="https://arxiv.org/abs/2308.05732">Lippe et al.</a>) and a variant thereof was already applied in <strong>GenCast</strong> (<a href="https://arxiv.org/abs/2312.15796">Price et al.</a>). The second direction exploits the probabilistic nature of diffusion models to model chaotic phenomena such as 3D turbulence. Examples of this can be found in <strong>Turbulent Flow Simulation</strong> (<a href="https://arxiv.org/abs/2309.01745">Kohl et al.</a>) and in <strong>From Zero To Turbulence</strong> (<a href="https://arxiv.org/abs/2306.01776">Lienen et al.</a>). Especially for 3D turbulence, there are a lot of interesting things that will happen in the near future.</p><blockquote>“Weather modelling has become a great success story over the last months. 
There is potentially much more exciting stuff to come, especially regarding weather forecasting directly from observational data or when building weather foundation models.” — Johannes Brandstetter (JKU Linz)</blockquote><p>🔮 <strong>What to expect in 2024</strong>:</p><p>1️⃣ More work regarding 3D turbulence modelling.</p><p>2️⃣ Multi-modality aspects of PDEs might emerge. This could include combining different PDEs, different resolutions, or different discretization schemes. We are already seeing a glimpse thereof in e.g. <a href="https://arxiv.org/abs/2310.02994">Multiple Physics Pretraining for Physical Surrogate Models</a> by McCabe et al.</p><p><strong>Predictions from the 2023 post</strong></p><p>(1) Neural PDEs and their applications are likely to expand to more physics-related AI4Science subfields; computational fluid dynamics (CFD) will potentially be influenced by GNNs.</p><p>✅ We are seeing 3D turbulence modelling, geometry-aware neural operators, particle-based neural surrogates, and a huge impact in e.g. weather forecasting.</p><p>(2) GNN-based surrogates might augment/replace traditional well-tried techniques.</p><p>✅ Weather forecasting has become a great success story. Neural network-based weather forecasts overtake traditional forecasts (medium-range and local forecasts), e.g., <a href="https://www.science.org/doi/full/10.1126/science.adi2336">GraphCast</a> by Lam et al. and <a href="https://arxiv.org/abs/2306.06079">MetNet-3</a> by Andrychowicz et al.</p><h3>Robustness and Explainability</h3><p><em>Kexin Huang (Stanford)</em></p><blockquote>“As GNNs are getting deployed in various domains, their reliability and robustness have become increasingly important, especially in safety-critical applications (e.g. 
scientific discovery) where the cost of errors is significant.” — Kexin Huang (Stanford)</blockquote><p>1️⃣ When discussing the reliability of GNNs, a key criterion is <strong>uncertainty quantification</strong> — quantifying how much the model knows about the prediction. There are numerous works on estimating and calibrating uncertainty, also designed specifically for GNNs (e.g. <a href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/5975754c7650dfee0682e06e1fec0522-Abstract-Conference.html">GATS</a>). However, they fall short of achieving pre-defined target coverage (i.e. % of points falling into the prediction set) both theoretically and empirically. I want to emphasize that this notion of having a coverage guarantee is <strong>critical</strong> especially in ML deployment for scientific discovery since practitioners often trust a model with statistical guarantees.</p><p><strong>2️⃣ Conformal prediction</strong> is an exciting direction in statistics that provides finite-sample coverage guarantees and has been applied in many domains such as <a href="https://arxiv.org/abs/2107.07511">vision and NLP</a>. However, it was unclear whether it could be applied to graphs, since it is not obvious that the exchangeability assumption holds in graph settings. In 2023, we saw conformal prediction extended to graphs. Notably, <a href="https://arxiv.org/abs/2305.14535">CF-GNN</a> and <a href="https://proceedings.mlr.press/v202/h-zargarbashi23a/h-zargarbashi23a.pdf">DAPS</a> have derived theoretical conditions for conformal validity in the transductive node-level prediction setting and also developed methods to reduce the prediction set size for efficient downstream usage. 
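The split conformal recipe underlying these graph extensions fits in a few lines: compute nonconformity scores 1 − p(y) on a held-out calibration set, take the ⌈(n+1)(1−α)⌉-th smallest score as a threshold, and include every label whose score is within it. A generic sketch (not CF-GNN itself, which additionally learns a topology-aware correction):

```python
import math

def conformal_threshold(cal_scores, alpha):
    """q_hat = the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, q_hat):
    """All labels whose nonconformity 1 - p(y) is within the threshold."""
    return [y for y, p in enumerate(probs) if 1 - p <= q_hat]

# Nonconformity scores 1 - p(true label) collected on a calibration set.
cal = [0.1, 0.3, 0.2, 0.4]
q = conformal_threshold(cal, alpha=0.2)    # ceil(5 * 0.8) = 4th smallest -> 0.4
print(prediction_set([0.7, 0.2, 0.1], q))  # [0]
```

Under exchangeability, the resulting sets contain the true label with probability at least 1 − α, which is exactly the guarantee the graph works above establish for node-level prediction.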
More recently, we have also seen conformal prediction extensions to <a href="https://arxiv.org/pdf/2306.14693v1.pdf">link prediction</a>, <a href="https://arxiv.org/abs/2306.07252">non-uniform split</a>, <a href="https://openreview.net/forum?id=homn1jOKI5">edge exchangeability</a>, and also adaptations for settings where exchangeability does not hold (such as the <a href="https://arxiv.org/abs/2211.14555">inductive setting</a>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*obPWmb-uDWytb9bVPX5zeQ.png" /><figcaption>Conformal prediction for graph-structured data. (1) A base GNN model (GNN) that produces prediction scores µ for node i. (2) Conformal correction. Since the training step is not aware of the conformal calibration step, the size/length of prediction sets/intervals (i.e. efficiency) is not optimized. We use a topology-aware correction model that takes µ as the input node feature and aggregates information from its local subgraph to produce an updated prediction µ˜. (3) Conformal prediction. We prove that in a transductive random split setting, graph exchangeability holds given permutation invariance. Thus, standard CP can be used to produce a prediction set/interval based on µ˜ that includes the true label with a pre-specified coverage rate 1-α. Source: <a href="https://arxiv.org/abs/2305.14535">Huang et al.</a></figcaption></figure><p>🔮 Looking ahead, we expect more extensions to cover a wide range of GNN deployment use cases. Overall, I think statistical guarantees for GNNs are valuable because they give practitioners the confidence to act on GNN predictions.</p><h3>Graph Transformers</h3><p><em>Chen Lin (Oxford)</em></p><p>💡 In 2023, we have seen the continued rise of Graph Transformers. 
It has become a <strong>common GNN design</strong>: e.g., in <a href="https://arxiv.org/abs/2305.18415">GATr</a>, the authors attribute its popularity to its <em>“favorable scaling properties, expressiveness, trainability, and versatility”</em>.</p><p>1️⃣ <strong>Expressiveness of GTs. </strong>As mentioned in the GNN Theory section, recent work from <a href="https://arxiv.org/abs/2301.11956">Cai et al. (2023)</a> shows the equivalence between MPNNs with a Virtual Node and GTs under a <em>non-uniform setting. </em>This raises the question of how powerful GTs are and where their representational power comes from. <a href="https://arxiv.org/abs/2301.09505">Zhang et al. (2023)</a> successfully combine their GTs with a new, powerful positional embedding (PE) to improve expressiveness, achieving expressiveness for the biconnectivity problem. This gives evidence of the importance of PEs to the expressiveness of GTs. A recent submission, <a href="https://openreview.net/pdf?id=JfjduOxrTY">GPNN</a>, provides a clearer view of the central role of the positional encoding: one can generalize the proof in <a href="https://arxiv.org/abs/2301.09505">Zhang et al. (2023)</a> to show how GTs’ expressiveness is determined by various positional encodings.</p><p><strong>2️⃣</strong> <strong>Positional (Structural) Encoding. </strong>Given the importance of PE/SE to GTs, we now turn to the design of these expressive features, usually derived from existing graph invariants. In 2022, <a href="https://arxiv.org/abs/2205.12454">GraphGPS</a> achieved huge empirical success by combining GTs with various (or even multiple) PE/SEs. In 2023, more powerful PE/SEs became available.</p><p><strong>Relative Random Walk PE (RRWP)</strong> proposed by <a href="https://arxiv.org/abs/2305.17589">Ma et al</a> generalizes the random walk structural encoding with a relational part. 
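The plain random-walk structural encoding that RRWP generalizes is simply the diagonal of successive powers of the random-walk matrix M = D⁻¹A: node i's k-th feature is the probability that a k-step walk from i returns to i. A toy numpy sketch (our own illustration, not the GRIT implementation):

```python
import numpy as np

def rw_structural_encoding(A, K):
    """(n, K) matrix whose (i, k) entry is the return probability of a
    (k+1)-step random walk starting at node i."""
    M = A / A.sum(axis=1, keepdims=True)  # random-walk matrix D^-1 A
    feats, P = [], np.eye(len(A))
    for _ in range(K):
        P = P @ M
        feats.append(np.diag(P))
    return np.stack(feats, axis=1)

# Triangle graph: return probabilities are 0 (1 step), 1/2 (2 steps), 1/4 (3 steps).
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
enc = rw_structural_encoding(A, K=3)
print(enc)
```

RRWP keeps the full off-diagonal entries of these powers as well, turning them into relative (pairwise) encodings fed to the attention mechanism.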
Together with a new variant of the attention mechanism, <strong>GRIT</strong> achieves strong empirical performance compared with existing PE/SEs on property prediction benchmarks (SOTA on ZINC). Theoretically, RRWP can approximate the shortest-path distance, personalized PageRank, and heat kernel with a specific choice of parameters. With RRWP, GRIT is more expressive than SPD-WL.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wDd37zUf83fvCZ3LyVhqhA.png" /><figcaption>RRWP visualization for the fluorescein molecule, up to the 4th power. Thicker and darker edges indicate higher edge weight. Probabilities for longer random walks reveal higher-order structures (e.g., the cliques evident in 3-RW and the star patterns in 4-RW). Source: <a href="https://arxiv.org/abs/2305.17589">Ma et al</a>.</figcaption></figure><p><a href="https://arxiv.org/abs/2302.11556">Puny et al</a> proposed a new theoretical framework for expressivity based on <strong>Equivariant Polynomials</strong>, where the expressivity of common GNNs can be improved by using polynomial features, computed with tensor contractions based on the equivariant basis, as positional encodings. The empirical results are surprising: GatedGCN improves from a test MAE of 0.265 to 0.106 with the d-expressive polynomials. It will be very interesting to see if someone combines this with GTs in the future.</p><p><strong>3️⃣ Efficient GTs. </strong>It remains challenging to apply GTs to large graphs due to their O(N²) complexity. In 2023, we saw more works trying to overcome this difficulty by lowering the computational complexity of GTs. 
<a href="https://arxiv.org/abs/2210.02997">Deac et al</a> used <a href="https://en.wikipedia.org/wiki/Expander_graph">expander graphs</a> (sparse yet well-connected graphs) for propagation.<strong> </strong><a href="https://arxiv.org/abs/2303.06147">Exphormer</a> extended this idea to GTs by combining expander graphs with local neighborhood aggregation and a virtual node. Exphormer allows graph transformers to scale to larger graphs (as large as <em>ogbn-arxiv</em> with 169K nodes). It also achieved strong empirical results and ranked top on several <a href="https://github.com/vijaydwivedi75/lrgb">Long-Range Graph Benchmark</a> tasks.</p><p>🔮 <strong>Moving forward to 2024:</strong></p><ol><li>A better understanding of self-attention’s benefits beyond expressiveness.</li><li>Big open-source pre-trained equivariant GTs in 2024!</li><li>More powerful positional encodings.</li></ol><h3>New Datasets &amp; Benchmarks</h3><p><strong>Structural biology:</strong> Pinder from VantAI, <a href="https://arxiv.org/abs/2308.05777">PoseBusters</a> from Oxford, <a href="https://arxiv.org/abs/2308.07413">PoseCheck</a> from The Other Place, <a href="https://openreview.net/forum?id=UfBIxpTK10">DockGen</a>, and the LargeMix and UltraLarge datasets <a href="https://arxiv.org/abs/2310.04292">from Valence Labs</a></p><p><a href="http://tgb.mila.quebec/"><strong>Temporal Graph Benchmark</strong></a> (TGB): Until now, progress in temporal graph learning has been held back by the lack of large, high-quality datasets, as well as the lack of proper evaluation, leading to over-optimistic performance. TGB addresses this by introducing a collection of seven realistic, large-scale and diverse benchmarks for learning on temporal graphs, including both node-wise and link-wise tasks. 
Inspired by the success of OGB, TGB automates dataset downloading and processing as well as evaluation protocols, and allows users to compare model performance on a <a href="https://tgb-website.pages.dev/docs/leader_linkprop/">leaderboard</a>. Check out the <a href="https://towardsdatascience.com/temporal-graph-benchmark-bb5cc26fcf11">associated blog post</a> for more details.</p><p><a href="https://github.com/google-research-datasets/tpu_graphs"><strong>TpuGraphs</strong></a> from Google Research: a graph property prediction dataset of TPU computational graphs. The dataset provides 25x more graphs than the largest graph property prediction dataset (with comparable graph sizes), and 770x larger graphs on average compared to existing performance prediction datasets on machine learning programs. Google ran a <a href="https://www.kaggle.com/competitions/predict-ai-model-runtime">Kaggle competition</a> based on TpuGraphs!</p><p><a href="https://github.com/tumaer/lagrangebench"><strong>LagrangeBench</strong></a>: A Lagrangian Fluid Mechanics Benchmarking Suite — where you can evaluate your favorite GNN-based simulator in a JAX-based environment (for JAX aficionados).</p><p><a href="https://relbench.stanford.edu/"><strong>RelBench</strong></a>: the Relational Deep Learning Benchmark from Stanford and Kumo.AI: make time-based predictions over relational databases (which you can model as graphs or hypergraphs).</p><p><a href="https://github.com/google-deepmind/materials_discovery?tab=readme-ov-file#dataset"><strong>The GNoME dataset</strong></a> from Google DeepMind: 381k novel stable materials for your materials discovery and ML potential models!</p><h3>Conferences, Courses &amp; Community</h3><p>The main events in the graph and geometric learning world (apart from big ML conferences) grow larger and more mature: <a href="https://logconference.org/">The Learning on Graphs Conference (LoG)</a>, <a href="https://www.moml.mit.edu/">Molecular ML</a> (MoML), and the <a 
href="https://snap.stanford.edu/graphlearning-workshop-2023/">Stanford Graph Learning Workshop</a>. The LoG conference features a cool format with the remote-first conference and dozens of local meetups organized by community members spanning the whole globe from China to UK &amp; Europe to the US West Coast 🌏🌍🌎 .</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*e-XqMpOmfsAXPErb" /><figcaption>The LoG meetups in Amsterdam, Paris, Tromsø, and Shanghai. Source: Slack of the LoG community</figcaption></figure><h4>Courses, books, and educational resources</h4><ul><li><a href="https://github.com/chaitjo/geometric-gnn-dojo">Geometric GNN Dojo</a> — a pedagogical resource for beginners and experts to explore the design space of GNNs for geometric graphs (pairs best with the recent Hitchhiker’s Guide to Geometric GNNs).</li><li><a href="https://github.com/atong01/conditional-flow-matching">TorchCFM</a> — the main entrypoint to the world of flow matching.</li><li>The <a href="https://github.com/pyt-team">PyT team</a> maintains TopoNetX, TopoModelX, and TopoEmbedX — the most hands-on libraries to jump into topological deep learning.</li><li>The book on <a href="https://maurice-weiler.gitlab.io/#cnn_book">Equivariant and Coordinate Independent Convolutional Networks: A Gauge Field Theory of Neural Networks</a> by Maurice Weiler, Patrick Forré, Erik Verlinde, and Max Welling — brings together the findings on the representation theory and differential geometry of equivariant CNNs</li></ul><h4>Surveys</h4><ul><li><strong>ML for Science in Quantum, Atomistic, and Continuum systems</strong> by well over 60 authors from 23 institutions (<a href="https://arxiv.org/abs/2307.08423">Zhang, Wang, Helwig, Luo, Fu, Xie et al.</a>)</li><li><strong>Scientific discovery in the age of artificial intelligence</strong> by <a href="https://www.nature.com/articles/s41586-023-06221-2">Wang et al</a> published in Nature.</li></ul><h4>Prominent seminar series</h4><ul><li><a 
href="https://portal.valencelabs.com/logg">Learning on Graphs &amp; Geometry</a></li><li><a href="https://portal.valencelabs.com/m2d2">Molecular Modeling and Drug Discovery (M2D2)</a></li><li><a href="https://www.youtube.com/@Vant_AI">VantAI reading group</a></li><li><a href="https://log-2.github.io/">Oxford LoG2 seminar series</a></li></ul><h4>Slack communities</h4><ul><li><a href="https://join.slack.com/t/logag/shared_invite/zt-22y7n3k7a-FHwX31gc85yZCa0uF8BU7w">LoGaG</a></li><li><a href="https://join.slack.com/t/logconference/shared_invite/zt-27nv8ba1y-pXspnAzgLOMdDzfKgpOafg">LOG conference</a></li><li><a href="https://data.pyg.org/slack.html">PyG</a></li></ul><h3>Memes of 2023</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jpxlRK1BMu_kEXh5GZJpxA.png" /><figcaption>Commemorating the successes of flow matching in 2023 in the meme and unique t-shirts brought to NeurIPS’23. Right: Hannes Stärk and Michael Galkin are making a statement at NeurIPS’23. Images by Michael Galkin</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/495/1*dPvq5YYXyIWMeu0BVFv2vA.jpeg" /><figcaption>GNN aggregation functions are actually portals to category theory (Created by Petar Veličković)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iKgI0-kYQPGTw3EaF33-Pw.png" /><figcaption>Michael Bronstein continues to harass Google by demanding his <a href="https://www.cs.ox.ac.uk/news/1996-full.html">DeepMind chair</a> at every ML conference, but so far, he has only been offered stools (photo credits: Jelani Nelson and Thomas Kipf).</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qr84KeOKOpcW2pm3" /><figcaption>The authors of this blog post congratulate you upon completing the long read. 
Michael Galkin and Michael Bronstein with the Meme of 2022 at ICML 2023 in Hawaii (Photo credit: Ben Finkelshtein)</figcaption></figure><p><em>For additional articles about geometric and graph deep learning, see </em><a href="https://medium.com/@mgalkin"><em>Michael Galkin</em></a><em>’s and </em><a href="https://medium.com/@michael-bronstein"><em>Michael Bronstein</em></a><em>’s Medium posts and follow the two Michaels (</em><a href="https://twitter.com/michael_galkin"><em>Galkin</em></a><em> and </em><a href="https://twitter.com/mmbronstein"><em>Bronstein</em></a><em>) on Twitter.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3af5d38376e1" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/graph-geometric-ml-in-2024-where-we-are-and-whats-next-part-i-theory-architectures-3af5d38376e1">Graph &amp; Geometric ML in 2024: Where We Are and What’s Next (Part I — Theory &amp; Architectures)</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[ULTRA: Foundation Models for Knowledge Graph Reasoning]]></title>
            <link>https://medium.com/data-science/ultra-foundation-models-for-knowledge-graph-reasoning-9f8f4a0d7f09?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/9f8f4a0d7f09</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[thoughts-and-theory]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[graph-machine-learning]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Fri, 03 Nov 2023 17:03:49 GMT</pubDate>
            <atom:updated>2023-11-03T17:19:37.114Z</atom:updated>
<content:encoded><![CDATA[<h4>What’s new in Graph ML?</h4><h4>One model to rule them all</h4><p>Training a single generic model for solving arbitrary datasets has long been a dream of ML researchers, especially in the era of foundation models. While such dreams have been realized in perception domains like images or natural language, whether they can be reproduced in reasoning domains (like graphs) remains an open challenge.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*o1UKqk6Eb8HnNZEBVckGJw.png" /><figcaption>Image by Authors edited from the output of DALL-E 3.</figcaption></figure><p>In this blog post, we prove that such a generic reasoning model exists, at least for knowledge graphs (KGs). We create <strong>ULTRA</strong>, a single pre-trained reasoning model that generalizes to new KGs with arbitrary entity and relation vocabularies, and serves as a default solution for any KG reasoning problem.</p><p><em>This post is based on our recent paper (</em><a href="https://arxiv.org/abs/2310.04562"><em>preprint</em></a><em>) and was written together with </em><a href="https://github.com/KatarinaYuan"><em>Xinyu Yuan</em></a><em> (Mila), </em><a href="https://kiddozhu.github.io/"><em>Zhaocheng Zhu</em></a><em> (Mila), and </em><a href="https://www.cs.purdue.edu/homes/ribeirob/"><em>Bruno Ribeiro</em></a><em> (Purdue / Stanford). 
Follow </em><a href="https://twitter.com/michael_galkin"><em>Michael</em></a><em>, </em><a href="https://twitter.com/XinyuYuan402"><em>Xinyu</em></a><em>, </em><a href="https://twitter.com/zhu_zhaocheng"><em>Zhaocheng</em></a><em>, and </em><a href="https://twitter.com/brunofmr"><em>Bruno</em></a><em> on Twitter for more Graph ML content.</em></p><h3>Outline</h3><ol><li><a href="#974c">Why KG representation learning is stuck in 2018</a></li><li><a href="#062b">Theory: What makes a model inductive and transferable?</a></li><li><a href="#fbb0">Theory: Equivariance in multi-relational graphs</a></li><li><a href="#86f8">ULTRA: A Foundation Model for KG Reasoning</a></li><li><a href="#2517">Experiments: Best even in the zero-shot inference, Scaling behavior</a></li><li><a href="#71ab">Code, Data, Checkpoints</a></li></ol><h3>Why KG representation learning is stuck in 2018</h3><p>The pretrain-finetune paradigm has been with us since 2018, when <a href="https://arxiv.org/abs/1802.05365">ELMo</a> and <a href="https://arxiv.org/abs/1801.06146">ULMFit</a> showed the first promising results, later cemented by <a href="https://arxiv.org/abs/1810.04805">BERT</a> and <a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">GPT</a>.</p><p>In the era of <em>large language models</em> (LLM) and more general <em>foundation models</em> (FMs), we often have a single model (like GPT-4 or Llama-2) pre-trained on enormous amounts of data and capable of performing a wide variety of language tasks in a zero-shot manner (or at least capable of being fine-tuned on a specific dataset). These days, multimodal FMs even support language, vision, audio, and other modalities in one and the same model.</p><p>Things work a little differently in Graph ML. 
Particularly, <strong>what’s up with representation learning on KGs at the end of 2023?</strong> The main tasks here are edge-level:</p><ul><li>Entity prediction (or knowledge graph completion) (h,r,?): given a head node and relation, rank all nodes in the graph that can potentially be true tails.</li><li>Relation prediction (h,?,t): given two nodes, predict a relation type between them.</li></ul><p>It turns out that, until now, the field has been stuck somewhere pre-2018. The key problem is:</p><blockquote>Each KG has its own set of entities and relations; there is no single pre-trained model that would transfer to any graph.</blockquote><p>For example, if we look at Freebase (the KG behind the Google Knowledge Graph) and Wikidata (the largest open-source KG), they have absolutely different sets of entities (86M vs 100M) and relations (1500 vs 6000). Is there any hope for current KG representation learning methods to be trained on one graph and transfer to another?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pb3UPyOR9Shbu5_-" /><figcaption>Different vocabularies of Freebase and Wikidata. Image by Authors.</figcaption></figure><p>❌ Classical transductive methods like TransE, ComplEx, RotatE, and hundreds of other embedding-based methods learn a <strong>fixed set of entities and relation types</strong> from the training graph and cannot even support new nodes added to the same graph. Shallow embedding-based methods do not transfer (in fact, we believe there is no point in developing such methods anymore except for some student project exercises).</p><p>🟡 Inductive entity methods like <a href="https://openreview.net/forum?id=xMJWUKJnFSw">NodePiece</a> and <a href="https://arxiv.org/pdf/2106.06935.pdf">Neural Bellman-Ford Nets</a> do not learn entity embeddings. Instead, they parameterize training (seen) and new inference (unseen) nodes as a function of fixed relations. 
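For contrast, the shallow transductive paradigm above can be sketched with TransE, which scores a triple (h, r, t) as −‖h + r − t‖ over per-entity embedding vectors. A toy sketch with hand-set embeddings (real models learn them by gradient descent):

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score: higher (closer to 0) is better."""
    return -np.linalg.norm(h + r - t)

# Hand-set toy embeddings. Every key below is tied to THIS graph's vocabulary,
# which is exactly why such a model cannot transfer to a new KG.
ent = {"montreal": np.array([0.0, 0.0]),
       "canada":   np.array([1.0, 1.0]),
       "paris":    np.array([3.0, 0.0])}
rel = {"located_in": np.array([1.0, 1.0])}

# Entity prediction for the query (montreal, located_in, ?): rank all tails.
scores = {e: transe_score(ent["montreal"], rel["located_in"], ent[e])
          for e in ent}
best = max(scores, key=scores.get)
print(best)  # canada: montreal + located_in lands exactly on canada
```

A graph with a different entity or relation vocabulary has no rows in these lookup tables at all, which is the transfer barrier the rest of this post is about.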
Since they <strong>learn only relation embeddings</strong>, they can transfer to graphs with new nodes, but transfer to new graphs with different relations (like Freebase to Wikidata) is still beyond reach.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1021/0*AulBL23t9Xskv_Qt" /><figcaption>Relative entity representations enable inductive GNNs. Image by Authors.</figcaption></figure><p>What to do if you have <strong>both</strong> new entities and relations at inference time (a completely new graph)? If you don’t learn entity or relation embeddings, is the transfer theoretically possible? Let’s look into the theory then.</p><h3>Theory: What makes a model inductive and transferable?</h3><p>Let’s define the setup more formally:</p><ul><li>KGs are directed, multi-relational graphs with arbitrary sets of nodes and relation types.</li><li>Graphs arrive <strong>without features</strong>, that is, we don’t assume the existence of textual descriptions (nor pre-computed feature vectors) of entities and relations.</li><li>Given a query (head, relation, ?), we want to rank all nodes in the underlying graph (inference graph) and maximize the probability of returning a true tail.</li><li><em>Transductive</em> setup: the set of entities and relations is the same at training and inference time.</li><li><em>Inductive</em> (entity) setup: the set of relations has to be fixed at training time, but nodes can be different at training and inference.</li><li><em>Inductive</em> (entity and relation) setup: both new unseen entities and relations are allowed at inference.</li></ul><p>What do neural networks learn to be able to generalize to new data? The primary reference — the book on <a href="https://geometricdeeplearning.com/">Geometric Deep Learning by Bronstein, Bruna, Cohen, and Veličković</a> — posits that it is a question of <em>symmetries and invariances</em>.</p><p>What are the learnable invariances in foundation models? 
LLMs are trained on a fixed vocabulary of tokens (sub-word units, bytes, or even randomly initialized vectors as in <a href="https://arxiv.org/abs/2305.16349">Lexinvariant LLMs</a>), vision models learn functions to project image patches, and audio models learn to project audio patches.</p><blockquote>What are the learnable invariances for multi-relational graphs?</blockquote><p>First, we will introduce the invariances (equivariances) in standard <strong>homogeneous</strong> graphs.</p><p><em>Standard (single) permutation equivariant graph models:</em> A great leap in graph ML came when early GNN work (<a href="https://ro.uow.edu.au/cgi/viewcontent.cgi?article=10501&amp;context=infopapers">Scarselli et al. 2008</a>, <a href="https://arxiv.org/abs/1810.00826">Xu et al. 2018</a>, <a href="https://ojs.aaai.org/index.php/AAAI/article/view/4384">Morris et al. 2018</a>) showed that inductive tasks on graphs benefit enormously from assuming that vertex IDs are arbitrary, such that the predictions of a graph model should not change if we reassign vertex IDs. This is known as <em>permutation equivariance</em> of the neural network on node IDs. This realization has created great excitement and a profusion of novel graph representation methods since: as long as the neural network is equivariant to node ID permutations, we can call it a graph model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*A3zyGKK779PYm6tn" /><figcaption><em>Single-relational graphs. GNNs are equivariant to node permutations: Michael Jackson’s node vector will have the same value even after re-labeling node IDs. Image by Authors.</em></figcaption></figure><p>The permutation equivariance on node IDs allows GNNs to inductively (zero-shot) transfer the patterns learned from a training graph to another (different) test graph. This is a consequence of the equivariance: since the neural network cannot use node IDs to produce embeddings, it must use the graph structure. 
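This property is easy to check numerically; below is a minimal sketch (plain Python, sum aggregation) showing that one message passing step commutes with a relabeling of node IDs:

```python
# Quick numeric check of permutation equivariance: one round of
# sum-aggregation message passing, before and after relabeling node IDs.
# The graph is a 3-node path 0-1-2 with arbitrary scalar features.
def message_pass(edges, feat):
    out = {v: 0.0 for v in feat}
    for u, v in edges:
        out[u] += feat[v]  # each endpoint sums its neighbors' features
        out[v] += feat[u]
    return out

edges = [(0, 1), (1, 2)]
feat = {0: 1.0, 1: 2.0, 2: 3.0}
h = message_pass(edges, feat)

# relabel node IDs with a permutation pi and rerun
pi = {0: 2, 1: 0, 2: 1}
h_perm = message_pass([(pi[u], pi[v]) for u, v in edges],
                      {pi[v]: x for v, x in feat.items()})

# outputs are the same up to the same permutation of node IDs
assert all(h_perm[pi[v]] == h[v] for v in feat)
```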
This creates what we know as <em>structural representations</em> in graphs (see <a href="https://iclr.cc/virtual_2020/poster_SJxzFySKwH.html">Srinivasan &amp; Ribeiro (ICLR 2020)</a>).</p><h3>Equivariance in multi-relational graphs</h3><p>Now edges in the graphs might have different relation types — is there any GNN theory for such graphs?</p><p>1️⃣ In our previous work, <a href="https://arxiv.org/abs/2211.17113">Weisfeiler and Leman Go Relational</a> (with Pablo Barceló, Christopher Morris, and Miguel Romero Orth, LoG 2022), we derived Relational WL — a WL expressiveness hierarchy for multi-relational graphs focusing more on node-level tasks. The great<a href="https://arxiv.org/abs/2302.02209"> follow-up work by Huang et al (NeurIPS 2023)</a> extended the theory to link prediction, formalized <em>conditional message passing,</em> and logical expressiveness using Relational WL. ✍️ Let’s remember <strong>conditional message passing</strong> — we’ll need it later — it provably improves link prediction performance.</p><p>The proposed addition of a global readout vector induced by incoming/outgoing edge direction resembles the <a href="https://arxiv.org/abs/2305.10498">recent work of Emanuele Rossi et al</a> on studying directionality in homogeneous MPNNs (read <a href="https://towardsdatascience.com/direction-improves-graph-learning-170e797e94fe">the blog post on Medium</a> for more details). Still, those works do not envision the case when even relations at test time are unseen.</p><p><em>2️⃣ Double permutation equivariant (multi-relational) graph models:</em> Recently, <a href="https://arxiv.org/abs/2302.01313">Gao et al. 2023</a> proposed the concept of <strong>double equivariance</strong> for multi-relational graphs. Double equivariance forces the neural network to be equivariant to the joint permutations of both node IDs and relation IDs. 
This ensures the neural network learns structural patterns between nodes and relations, which allows it to inductively (zero-shot) transfer the learned patterns to another graph with new nodes and new relations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OlG4XP85UM_CtLFc" /><figcaption><em>Double equivariance in multi-relational graphs. Permuting both node IDs and relation IDs does not change the relational structure. Hence, the output node states should be the same (but permuted). Image by Authors.</em></figcaption></figure><p>➡️ In our work, we find<em> the invariance of relation interactions</em>, that is, even if relation identities are different, their fundamental interactions remain the same, and those fundamental interactions can be captured by a <strong>graph of relations. </strong>In the graph of relations, each node is a relation type from the original graph. Two nodes in this graph will be connected if edges with those relation types in the original graph are incident (that is, they share a head or tail node). Depending on the incidence, we distinguish<strong> 4 edge types</strong> in the graph of relations:</p><ul><li><em>Head-to-head (h2h)</em> — two relations can start from the same head entity;</li><li><em>Tail-to-head (t2h)</em> — tail entity of one relation can be a head of another relation;</li><li><em>Head-to-tail (h2t)</em> — head entity of one relation can be a tail of another relation;</li><li><em>Tail-to-tail (t2t)</em> — two relations can have the same tail entity.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZkU03Xlbt7EkK_CV" /><figcaption><em>Different incidence patterns in the original graph produce different interactions in the graph of relations. The right-most: the example relation graph (inverse edges are omitted for clarity). 
Image by Authors</em></figcaption></figure><p>A few nice properties of the relation graph:</p><ul><li>It can be built from absolutely any multi-relational graph (with simple sparse matrix multiplications).</li><li>The 4 fundamental interactions never change because they just encode the basic topology — in directed graphs there will always be head and tail nodes, and relations will always exhibit those incidence patterns.</li></ul><blockquote>Essentially, learning representations over the relation graph can transfer to any multi-relational graph! This is the <em>learnable invariance</em>.</blockquote><p>In fact, it can be shown (we are already working on the formal proofs, which will be available in an upcoming work 😉) that representing relations via their interactions in a graph of relations yields a double equivariant model! This means that learned relational representations do not depend on relation identities but rather on the joint interactions between relations, nodes, and nodes &amp; relations.</p><h3>ULTRA: A Foundation Model for KG Reasoning</h3><p>With all the theoretical foundations backing us up, we are now ready to introduce ULTRA.</p><p>ULTRA is a method for unified, learnable, and transferable graph representations. ULTRA leverages the invariances (and equivariances) of the <strong>graph of relations</strong> with its fundamental interactions and applies <strong>conditional message passing</strong> to get relative relational representations. Perhaps the coolest fact is that</p><blockquote>a single pre-trained ULTRA model can run 0-shot inference on any possible multi-relational graph and be fine-tuned on any graph.</blockquote><p>In other words, ULTRA is pretty much a foundation model that can run inference on any graph input (with already good performance) and be fine-tuned on any target graph of interest.</p><p>The crucial component of ULTRA is its <em>relative</em> relation representations constructed from the graph of relations. 
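The graph-of-relations construction described above can be sketched in a few lines. This toy version uses Python sets instead of the sparse matrix multiplications an efficient implementation would use:

```python
from collections import defaultdict

# Build the graph of relations from (head, relation, tail) triples.
# Relations r1, r2 are linked by one of the 4 fundamental edge types
# depending on how their edges touch in the original graph.
def relation_graph(triples):
    heads, tails = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        heads[r].add(h)
        tails[r].add(t)
    rels = set(heads) | set(tails)
    edges = set()
    for r1 in rels:
        for r2 in rels:
            if heads[r1] & heads[r2]:
                edges.add((r1, "h2h", r2))  # shared head entity
            if tails[r1] & heads[r2]:
                edges.add((r1, "t2h", r2))  # a tail of r1 is a head of r2
            if heads[r1] & tails[r2]:
                edges.add((r1, "h2t", r2))  # a head of r1 is a tail of r2
            if tails[r1] & tails[r2]:
                edges.add((r1, "t2t", r2))  # shared tail entity
    return edges

triples = [("MichaelJackson", "genre", "Pop"),
           ("MichaelJackson", "authored", "Thriller"),
           ("Thriller", "genre", "Pop")]
rg = relation_graph(triples)
print(("genre", "h2h", "authored") in rg)  # both can start at the same head
```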
Given a query (Michael Jackson, genre, ?), we first initialize the genre node in the graph of relations with the all-ones vector (all other nodes are initialized with zeros). After running a GNN, the resulting node embeddings of the relation graph are conditioned on the genre node — this means that each query relation gets its own matrix of relational features, which is very helpful from both theoretical and practical standpoints!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RszqzszKboQw995B" /><figcaption><em>ULTRA employs relative relation representations (a labeling trick over the graph of relations) such that each relation (e.g., “genre”) has its own unique matrix of all relation representations. Image by Authors.</em></figcaption></figure><p>Practically, given an input KG and a (h, r, ?) query, ULTRA executes the following actions:</p><ol><li>Construct the graph of relations;</li><li>Get relation features from the conditional message passing GNN on the graph of relations (conditioned on the initialized query relation r);</li><li>Use the obtained relational representations for the inductive link predictor GNN conditioned on the initialized head node h.</li></ol><p>Steps 2 and 3 are implemented via slightly different modifications of the <a href="https://arxiv.org/pdf/2106.06935.pdf">Neural Bellman-Ford net (NBFNet)</a>. ULTRA only learns embeddings of the 4 fundamental interactions (h2t, t2t, t2h, h2h) and GNN weights — pretty small overall. The main model we experimented with has only 177k parameters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BPXu-a57qFKky6Lq" /><figcaption><em>Three main steps taken by ULTRA: (1) building a relation graph; (2) running conditional message passing over the relation graph to get relative relation representations; (3) using those representations for the inductive link predictor GNN on the entity level. 
Image by Authors.</em></figcaption></figure><h3>Experiments: Best even in zero-shot inference, and Fine-tuning</h3><p>We pre-trained ULTRA on 3 standard KGs based on Freebase, Wikidata, and WordNet, and ran 0-shot link prediction on 50+ other KGs of various sizes, ranging from 1k to 120k nodes and from 2k to 1.1M edges.</p><p>Averaged across the datasets with known SOTA, a single pre-trained ULTRA model is <strong>better in the 0-shot inference mode</strong> than existing SOTA models trained specifically on each graph. 🚀 Fine-tuning improves the performance by a further 10%. It’s particularly amazing that a single trained ULTRA model can scale to graphs of such different sizes (a 100x difference in the number of nodes and 500x in the number of edges) whereas GNNs are known to suffer from size generalization issues (see the prominent works by <a href="https://arxiv.org/abs/2010.08853">Yehudai et al, ICML 2021</a> and <a href="https://arxiv.org/abs/2205.15117">Zhou et al, NeurIPS 2022</a>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RFmXmyupmLpMWxi_" /><figcaption>A single pre-trained ULTRA is better even in the 0-shot inference mode than supervised SOTA models trained end-to-end on specific graphs (look at the Average column). Fine-tuning improves the performance even further. Image by Authors</figcaption></figure><p>🙃 In fact, with 57 tested graphs, we almost ran out of KGs to test ULTRA on. 
So if you have a fresh new benchmark hidden somewhere — let us know!</p><h3>Scaling Behavior</h3><p>We can bump the zero-shot performance even more by adding more graphs to the pre-training mixture, although we do observe some performance saturation after training on 4+ graphs.</p><p>The church of <a href="https://arxiv.org/abs/2001.08361">Scaling Laws</a> predicts even better performance with bigger models trained on more high-quality data, so it’s definitely on our agenda.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xTouwTJjPII4XM0r" /><figcaption>Zero-shot performance increases with more diverse graphs in the pre-training mix. Image by Authors.</figcaption></figure><h3>Conclusion: Code, Data, Checkpoints</h3><p>So foundation models for KG reasoning are finally here; we are past that 2018 threshold! A single pre-trained ULTRA model can perform link prediction on any KG (multi-relational graph) from any domain. You really just need a graph with more than 1 edge type to get going.</p><p>📈 Practically, ULTRA demonstrates very promising performance on a variety of KG benchmarks already in the 0-shot mode, but you can bump the performance even further with a short fine-tuning.</p><p>We make all the code, training data, and pre-trained model checkpoints available on GitHub so you can start running ULTRA on your data right away!</p><p>📜 Preprint: <a href="https://arxiv.org/abs/2310.04562">arXiv</a></p><p>🛠️ Code, data: <a href="https://github.com/DeepGraphLearning/ULTRA">GitHub repo</a></p><p>🍪 Checkpoints: 2 checkpoints (2 MB each) in the <a href="https://github.com/DeepGraphLearning/ULTRA">GitHub repo</a></p><p>🌎 Project website: <a href="https://deepgraphlearning.github.io/project/ultra">here</a></p><p>As a closing remark, KG reasoning represents just a fraction of the many interesting problems in the reasoning domain, and the majority still don’t have a generic solution. 
We believe the success of KG reasoning will bring more breakthroughs in other reasoning domains (for example, we recently found that <a href="https://arxiv.org/abs/2310.07064">LLMs can actually learn and employ textual rules</a>). Let’s stay optimistic about the future of reasoning!</p><hr><p><a href="https://medium.com/data-science/ultra-foundation-models-for-knowledge-graph-reasoning-9f8f4a0d7f09">ULTRA: Foundation Models for Knowledge Graph Reasoning</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph Machine Learning @ ICML 2023]]></title>
            <link>https://medium.com/data-science/graph-machine-learning-icml-2023-9b5e4306a1cc?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/9b5e4306a1cc</guid>
            <category><![CDATA[graph-machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[editors-pick]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Sun, 06 Aug 2023 02:07:28 GMT</pubDate>
            <atom:updated>2023-08-06T14:15:32.522Z</atom:updated>
            <content:encoded><![CDATA[<h4>What’s new in Graph ML?</h4><h4>Recent advancements and hot trends, August 2023 edition</h4><p>Magnificent beaches and tropical Hawaiian landscapes 🌴 did not turn brave scientists away from attending the <a href="https://icml.cc/Conferences/2023">International Conference on Machine Learning</a> in Honolulu and presenting their recent work! Let’s see what’s new in our favorite Graph Machine Learning area.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*38Um1nDjooXaqvaxVeHpow.jpeg" /><figcaption>Image By Author.</figcaption></figure><p><em>Thanks to Santiago Miret for proofreading the post.</em></p><p>To make a post about papers less boring, I threw in some photos taken around Honolulu 📷</p><h3>Table of contents (clickable):</h3><ol><li><a href="#8d41">Graph Transformers: Sparser, Faster, and Directed</a></li><li><a href="#0d40">Theory: VC dimension of GNNs, deep dive into over-squashing</a></li><li><a href="#c5be">New GNN architectures: delays and half-hops</a></li><li><a href="#7e7c">Generative Models — Stable Diffusion for Molecules, Discrete diffusion</a></li><li><a href="#b0d0">Geometric Learning: Geometric WL, Clifford Algebras</a></li><li><a href="#32a5">Molecules: 2D-3D pretraining, Uncertainty Estimation in MD</a></li><li><a href="#1ff6">Materials &amp; Proteins: CLIP for proteins, Ewald Message Passing, Equivariant Augmentations</a></li><li><a href="#1891">Cool Applications: Algorithmic reasoning, Inductive KG completion, GNNs for mass spectra</a></li><li><a href="#5eb2">The Concluding Meme Part</a></li></ol><h3><strong>Graph Transformers: Sparser, Faster, and Directed</strong></h3><p>We <a href="https://towardsdatascience.com/graphgps-navigating-graph-transformers-c2cc223a051c">presented</a> <strong>GraphGPS</strong> about a year ago, and it is pleasing to see many ICML papers building upon our framework and expanding GT capabilities even further.</p><p><strong>➡️ Exphormer</strong> by <a 
href="https://openreview.net/forum?id=3Ge74dgjjU">Shirzad, Velingker, Venkatachalam et al</a> adds a missing piece of graph-motivated sparse attention to GTs: instead of BigBird or Performer (originally designed for sequences), Exphormer’s attention builds upon 1-hop edges, virtual nodes (connected to all nodes in a graph), and a neat idea of <a href="https://en.wikipedia.org/wiki/Expander_graph">expander edges</a>. Expander graphs have a constant degree and are shown to approximate fully-connected graphs. All components combined, attention costs <em>O(V+E)</em> instead of <em>O(V²)</em>. This allows Exphormer to outperform GraphGPS almost everywhere and scale to really large graphs of up to 160k nodes. Amazing work, with every chance of making Exphormer the standard sparse attention mechanism in GTs 👏.</p><p><strong>➡️ </strong>Concurrently with graph transformers, expander graphs can already be used to enhance the performance of any MPNN architecture, as shown in <a href="https://arxiv.org/abs/2210.02997">Expander Graph Propagation</a> by <em>Deac, Lackenby, and Veličković</em>.</p><p>In a similar vein, <a href="https://openreview.net/forum?id=1EuHYKFPgA">Cai et al</a> show that MPNNs with virtual nodes can approximate linear Performer-like attention, such that even classic GCN and GatedGCN imbued with virtual nodes show pretty much SOTA performance in long-range graph tasks (we <a href="https://towardsdatascience.com/lrgb-long-range-graph-benchmark-909a6818f02c">released</a> the <a href="https://github.com/vijaydwivedi75/lrgb">LRGB benchmark</a> last year exactly for measuring the long-range capabilities of GNNs and GTs).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/970/0*ou8TLbw-oV5Zt4wr" /><figcaption>Source: <a href="https://openreview.net/forum?id=3Ge74dgjjU">Shirzad, Velingker, Venkatachalam et al</a></figcaption></figure><p><strong>➡️ </strong>A few <strong>patch-based</strong> subsampling approaches for GTs inspired by vision models: <a 
href="https://openreview.net/forum?id=l7yTbEWuOQ"><strong>“A Generalization of ViT/MLP-Mixer to Graphs”</strong></a> by <em>He et al</em> split the input into several patches, encode each patch with a GNN into a token, and run a transformer over those tokens.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*H9xtpfGa7ot0CJTt" /><figcaption>Source: <a href="https://openreview.net/forum?id=l7yTbEWuOQ">“A Generalization of ViT/MLP-Mixer to Graphs”</a> by He et al</figcaption></figure><p>In <strong>GOAT</strong> by <a href="https://openreview.net/forum?id=Le2dVIoQun">Kong et al</a>, node features are projected into a codebook of K clusters with K-Means, and a sampled 3-hop neighborhood of each node attends to the codebook. GOAT is a 1-layer model and scales to graphs of millions of nodes.</p><p><strong>➡️ Directed graphs</strong> got some transformer love as well 💗. <a href="https://openreview.net/forum?id=a7PVyayyfp"><strong>“Transformers Meet Directed Graphs”</strong></a> by <em>Geisler et al </em>introduces Magnetic Laplacian — a generalization of a Laplacian for directed graphs with a non-symmetric adjacency matrix. Eigenvectors of the Magnetic Laplacian paired with directed random walks are strong input features for the transformer that enable setting a new SOTA on the <a href="https://ogb.stanford.edu/docs/leader_graphprop/#ogbg-code2">OGB Code2</a> graph property prediction dataset by a good margin!</p><p>🏅 Last but not least, we have a new SOTA GT on the community standard ZINC dataset — <strong>GRIT</strong> by <a href="https://openreview.net/forum?id=HjMdlNgybR">Ma, Lin, et al</a> incorporates the full <em>d</em>-dimensional random walk matrix, coined as relative random walk probabilities (RRWP), as edge features to the attention computation (for comparison, popular <a href="https://openreview.net/forum?id=wTTjnvGphYj">RWSE</a> features are just the diagonal elements of this matrix). 
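To build intuition for these features, here is a small toy sketch (my own pure-Python illustration) computing d-dimensional RWSE as the diagonals of powers of the random walk matrix, the matrices whose full entries RRWP keeps:

```python
# RWSE positional encodings: diagonals of powers of the random-walk matrix
# M = D^-1 A. RRWP (GRIT) keeps the full matrices; RWSE only their diagonals.
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rwse(adj, d):
    n = len(adj)
    deg = [max(1, sum(row)) for row in adj]
    M = [[adj[i][j] / deg[i] for j in range(n)] for i in range(n)]
    P, diag = M, []
    for _ in range(d):
        # diag[k][i] = probability that a (k+1)-step walk from i returns to i
        diag.append([P[i][i] for i in range(n)])
        P = matmul(P, M)
    return [[diag[k][i] for k in range(d)] for i in range(n)]

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(rwse(triangle, 2))  # by symmetry, every node of a triangle looks alike
```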
RRWP features are provably more powerful than shortest path distance features and set a record-low 0.059 MAE on ZINC (down from 0.070 by GraphGPS). GRIT often outperforms GPS in other benchmarks as well 👏. In a similar vein, <a href="https://openreview.net/forum?id=1Nx2n1lk5T">Eliasof et al</a> propose a neat idea to combine random and spectral features as positional encodings that outperform RWSE but were not tried with GTs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0IfBXldqPt2lEvRyGYnX2A.jpeg" /><figcaption>Image by Author.</figcaption></figure><h3><strong>Theory: VC dimension of GNNs, deep dive into over-squashing</strong></h3><p><strong>➡️ </strong><a href="https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension">VC dimension</a> measures model capacity and expressiveness. It is well studied for classical ML algorithms but, surprisingly, has never been applied to study GNNs. In <a href="https://openreview.net/forum?id=rZN3mc5m3C"><strong>“WL meet VC”</strong></a> by <em>Morris et al</em>, the connection between the WL test and the VC dimension is finally uncovered — it turns out the VC dimension can be bounded by the bitlength of GNN weights, i.e., float32 weights would imply a VC dimension of 32. Furthermore, the VC dimension depends logarithmically on the number of unique WL colors in the given task and polynomially on the depth and number of layers. This is a great theoretical result and I’d encourage you to have a look!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*D-gLnhp30lQJwBav" /><figcaption>Source: <a href="https://openreview.net/forum?id=rZN3mc5m3C">“WL meet VC”</a> by <em>Morris et al</em></figcaption></figure><p>🍊🖐️ The over-squashing effect — information loss when you try to stuff messages from too many neighboring nodes into a single fixed-size vector — is another common problem of MPNNs, and we don’t fully understand how to properly deal with it. This year, there were 3 papers dedicated to this topic. 
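Since WL colors are central to the bound in “WL meet VC”, here is a minimal sketch of 1-WL color refinement, the procedure that produces those colors, on a 3-node path:

```python
# 1-WL color refinement: repeatedly combine each node's color with the
# multiset of its neighbors' colors, then compress into a fresh palette.
def wl_colors(adj, rounds=3):
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
               for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colors = {v: palette[sig[v]] for v in adj}
    return colors

# path graph 0-1-2: the two endpoints are indistinguishable, the middle is not
path = {0: [1], 1: [0, 2], 2: [1]}
print(wl_colors(path))
```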
Perhaps the most foundational is the work by <a href="https://openreview.net/forum?id=t2tTfWwAEl"><strong>Di Giovanni et al</strong></a> that explains how MPNN width, depth, and graph topology affect over-squashing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/806/0*V_c6Qe1rq60ifMZz" /><figcaption>Source: <a href="https://openreview.net/forum?id=t2tTfWwAEl"><strong>Di Giovanni et al</strong></a></figcaption></figure><p>It turns out that <strong>width</strong> might help (but with generalization issues), <strong>depth</strong> does <strong>not</strong> really help, and <strong>graph topology</strong> (characterized by the commute time between nodes) plays the most important role. We can reduce the commute time by various <em>graph rewiring</em> strategies (adding and removing edges based on spatial or spectral properties), and there are many of them (you might have heard about the <a href="https://openreview.net/forum?id=7UmjRGzp-A">Ricci flow-based rewiring</a> that took home the Outstanding Paper award at ICLR 2022). In fact, there is a <a href="https://arxiv.org/abs/2306.03589">follow-up work</a> to this study that goes even deeper and derives some impossibility statements w.r.t. over-squashing and some MPNN properties — I’d highly encourage you to read it as well!</p><p><strong>➡️ </strong>Effective resistance is one example of spatial rewiring strategies, and <a href="https://openreview.net/forum?id=50SO1LwcYU"><strong>Black et al</strong></a> study it in great detail. 
The Ricci flow-based rewiring works with graph curvature and is studied further in the work by <a href="https://openreview.net/forum?id=eWAvwKajx2">Nguyen et al</a>.</p><p><strong>➡️ </strong>Subgraph GNNs continue to be in the spotlight: two works (<a href="https://openreview.net/forum?id=2Hp7U3k5Ph"><strong>Zhang, Feng, Du, et al</strong></a> and <a href="https://openreview.net/forum?id=K07XAlzh5i"><strong>Zhou, Wang, Zhang</strong></a>) concurrently derive expressiveness hierarchies of the recently proposed subgraph GNNs and their relationship to the 1- and higher-order WL tests.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fImGk-McVtr5SBfl-K1EsQ.jpeg" /><figcaption>Image By Author.</figcaption></figure><h3><strong>New GNN architectures: Delays and Half-hops</strong></h3><p>If you are tired of yet another variation of GCN or GAT, here are some fresh ideas that can work with any GNN of your choice:</p><p>⏳ As we know from the <strong>Theory</strong> section, rewiring helps combat over-squashing. <a href="https://openreview.net/forum?id=WEgjbJ6IDN"><strong>Gutteridge et al</strong></a> introduce <em>“DRew: Dynamically Rewired Message Passing with Delay”</em>, which gradually densifies the graph in later GNN layers so that long-distance nodes see the original states of earlier nodes (the original DRew), or adds those skip-connections with a <em>delay</em> that depends on the distance between two nodes (the vDRew version). For example (🖼️👇), in vDRew delayed message passing, a starting node from layer 0 will show its state to 2-hop neighbors at layer 1, and to a 3-hop neighbor at layer 2. 
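Here is my toy reading of the gradual densification idea (an illustration of the principle, not the authors' implementation): at layer k, a node additionally receives messages from all nodes within k + 1 hops, so distant pairs get connected only in later layers:

```python
from collections import deque

def bfs_dist(adj, s):
    # shortest-path distances from s via breadth-first search
    dist, q = {s: 0}, deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def drew_edges(adj, layer):
    # edges available at a given layer: ordered pairs within distance layer + 1
    return {(u, v) for u in adj for v, d in bfs_dist(adj, u).items()
            if 0 < d <= layer + 1}

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print((0, 2) in drew_edges(path, 0))  # layer 0: only 1-hop edges
print((0, 2) in drew_edges(path, 1))  # layer 1: 2-hop pairs appear
```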
<strong>DRew</strong> significantly improves the ability of vanilla GNNs to perform long-range tasks — in fact, a DRew-enabled GCN is the current <a href="https://github.com/vijaydwivedi75/lrgb">SOTA</a> on the Peptides-func dataset from the <a href="https://github.com/vijaydwivedi75/lrgb">Long Range Graph Benchmark</a> 👀</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*hRwGmN3SxlSWvwrw" /><figcaption>Source: <a href="https://openreview.net/forum?id=WEgjbJ6IDN"><strong>Gutteridge et al</strong></a></figcaption></figure><p>🦘 Another neat idea by <a href="https://openreview.net/forum?id=lXczFIwQkv"><strong>Azabou et al</strong></a> is to slow down message passing by inserting new, <em>slow nodes</em> at each edge with a special connectivity pattern — only an incoming connection from the starting node and a symmetric edge with the destination node. Slow nodes improve the performance of vanilla GNNs on heterophilic benchmarks by a large margin, and it is also possible to use slow nodes for self-supervised learning by creating views with different locations of slow nodes for the same original graph. <strong>HalfHop</strong> is a no-brainer-to-include SSL component that boosts performance and should be in a standard suite of many GNN libraries 👍.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*sFksdRzaPbfCTGcv" /><figcaption>Source: <a href="https://openreview.net/forum?id=lXczFIwQkv"><strong>Azabou et al</strong></a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kT7uap0DkYhJcVDVld1Mcw.jpeg" /><figcaption>Image By Author.</figcaption></figure><h3><strong>Generative Models — Stable Diffusion for Molecules, Discrete Diffusion</strong></h3><p><strong>➡️ </strong>Diffusion models might work in the <strong>feature</strong> space (e.g., pixel space in image generation like the original DDPM) or in the <strong>latent</strong> space (like Stable Diffusion). 
In the feature space, you have to design the noising process to respect the symmetries and equivariances of your feature space. In the latent space, you can just add Gaussian noise to the features produced by a (pre-trained) encoder. Most 3D molecule generation models work in the feature space (like the pioneering <a href="https://arxiv.org/abs/2203.17003">EDM</a>), and the new <strong>GeoLDM </strong>model by <a href="https://openreview.net/forum?id=sLfHWWrfe2">Xu et al</a> (authors of the prominent <a href="https://arxiv.org/abs/2203.02923">GeoDiff</a>) is the first to define <strong>latent</strong> diffusion for 3D molecule generation. That is, after training an EGNN autoencoder, GeoLDM is trained on the denoising objective where noise is sampled from a standard Gaussian. GeoLDM brings significant improvements over EDM and other non-latent diffusion approaches 👏.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*__ao71NOClVDCVNX" /><figcaption>GeoLDM. Source: <a href="https://openreview.net/forum?id=sLfHWWrfe2">Xu et al</a></figcaption></figure><p><strong>➡️ </strong>In the realm of non-geometric graphs (just with an adjacency matrix and perhaps categorical node features), discrete graph diffusion pioneered by <a href="https://openreview.net/forum?id=UaAD-Nu86WX">DiGress</a> (ICLR’23) seems the most applicable option. <a href="https://openreview.net/forum?id=vn9O1N5ZOw">Chen et al</a> propose <strong>EDGE, </strong>a discrete diffusion model guided by the node degree distribution. In contrast to DiGress, the final target graph in EDGE is a disconnected graph without edges, the forward noising model removes edges through a Bernoulli distribution, and the reverse process adds edges to the most recent <em>active</em> nodes (active nodes are those whose degrees changed in the previous step). 
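The forward direction of such a discrete, edge-removing process can be sketched as follows (an illustrative toy in the spirit of EDGE, not the paper's exact parameterization):

```python
import random

# Forward noising that only removes edges: at every step, each surviving
# edge is independently dropped with probability p, so the trajectory moves
# toward the all-disconnected target graph.
def forward_noising(edges, p, steps, seed=0):
    rng = random.Random(seed)
    traj = [set(edges)]
    for _ in range(steps):
        traj.append({e for e in traj[-1] if rng.random() > p})
    return traj

traj = forward_noising({(0, 1), (1, 2), (2, 3)}, p=0.5, steps=8)
print([len(s) for s in traj])  # edge counts only ever shrink
```

A reverse (generative) model would learn to run this trajectory backwards, adding edges step by step.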
Thanks to the sparsity introduced by the degree guidance, EDGE can generate pretty large graphs of up to 4k nodes and 40k edges!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BfeJQ_1dEXVjHFUI" /><figcaption>Graph Generation with EDGE. Source: <a href="https://openreview.net/forum?id=vn9O1N5ZOw">Chen et al</a></figcaption></figure><p><strong>➡️</strong> Finally, <a href="https://openreview.net/forum?id=24wzmwrldX"><strong>“Graphically Structured Diffusion Models”</strong></a> by <em>Weilbach et al</em> bridges the gap between continuous generative models and probabilistic graphical models that induce a certain structure in the problem of interest — often such problems have a combinatorial nature. The central idea is to encode the problem’s structure as an attention mask that respects permutation invariances and to use this mask in the attention computation of the Transformer encoder (which by definition is equivariant to input token permutations unless you use positional embeddings). <strong>GSDM</strong> can tackle binary continuous matrix factorization and boolean circuits, generate sudokus, and perform sorting. Particularly enjoyable is the pinch of irony with which the paper is written 🙃.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fF-J_X5GIwsNzvBn" /><figcaption>GSDM task-to-attention-bias. Source: <a href="https://openreview.net/forum?id=24wzmwrldX"><strong>“Graphically Structured Diffusion Models”</strong></a> by <em>Weilbach et al</em></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*y0Z_QtNemxQYlc8c7Rw_lQ.jpeg" /><figcaption>Image By Author</figcaption></figure><h3><strong>Geometric Learning: Geometric WL, Clifford Algebras</strong></h3><p>Geometric Deep Learning thrives! 
So many interesting papers were presented that covering them all would take pretty much the whole post, so I’ll highlight only a few.</p><p><strong>➡️ Geometric WL</strong> has finally arrived in the work by <a href="https://openreview.net/forum?id=6Ed3gchl9L">Joshi, Bodnar, et al</a>. Geometric WL extends the notion of the WL test with geometric features (e.g., coordinates or velocity) and derives the expressiveness hierarchy up to k-order GWL. Key takeaways: 1️⃣ <strong>equivariant</strong> models are more expressive than <strong>invariant </strong>ones (with a note that in fully connected graphs the difference disappears), 2️⃣ <strong>tensor order</strong> of features improves expressiveness, 3️⃣ <strong>body order</strong> of features improves expressiveness (see the image 👇). That is, <em>spherical &gt; cartesian &gt; scalars</em>, and <em>many-body interactions &gt; just distances</em>. The paper also features the amazing learning resource <a href="https://github.com/chaitjo/geometric-gnn-dojo">Geometric GNN Dojo </a>where you can derive and implement most SOTA models from first principles!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/812/0*n-hGGDsinOYN6XJ2" /><figcaption>Source: <a href="https://openreview.net/forum?id=6Ed3gchl9L">Joshi, Bodnar, et al</a></figcaption></figure><p><strong>➡️ </strong>Going beyond vectors to Clifford algebras, <a href="https://openreview.net/forum?id=DNAJdkHPQ5">Ruhe et al</a> derive <strong>Geometric Clifford Algebra Networks </strong>(GCANs). Clifford algebras naturally support higher-order interactions by means of bivectors, trivectors, and (in general) multivectors. The key idea is the <a href="https://en.wikipedia.org/wiki/Cartan%E2%80%93Dieudonn%C3%A9_theorem">Cartan-Dieudonné theorem</a>, which states that every orthogonal transformation can be decomposed into <em>reflections</em> in hyperplanes; geometric algebras accordingly represent data as elements of the <em>Pin(p,q,r)</em> group.
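The Cartan-Dieudonné statement is easy to sanity-check numerically. A minimal sketch (a toy example of mine, not from the paper): composing two Householder reflections yields a proper rotation, in 2D by twice the angle between the reflection hyperplanes:

```python
import numpy as np

def reflection(n):
    """Householder reflection across the hyperplane with unit normal n."""
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    return np.eye(len(n)) - 2.0 * np.outer(n, n)

# Compose reflections with normals at 0 and 45 degrees: the result is a
# proper rotation (det = +1) by twice the angle between the hyperplanes.
R = reflection([np.cos(np.pi / 4), np.sin(np.pi / 4)]) @ reflection([1.0, 0.0])
assert np.allclose(R, [[0.0, -1.0], [1.0, 0.0]])  # rotation by 90 degrees
assert np.isclose(np.linalg.det(R), 1.0)          # orientation-preserving
```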
GCANs introduce notions of linear layers, normalizations, and non-linearities, and show how they can be parameterized with neural networks. Experiments include modeling fluid dynamics and the Navier-Stokes equations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/650/0*kyUqSgPTIVUUm8uw" /><figcaption>Source: <a href="https://openreview.net/forum?id=DNAJdkHPQ5">Ruhe et al</a></figcaption></figure><p>In fact, there is already a <a href="https://arxiv.org/abs/2305.11141">follow-up work</a> introducing equivariant Clifford NNs — you can learn more about the foundations of Clifford algebras and the most recent papers on <a href="https://microsoft.github.io/cliffordlayers/">CliffordLayers</a>, supported by Microsoft Research.</p><p>💊 <a href="http://proceedings.mlr.press/v139/satorras21a/satorras21a.pdf">Equivariant GNN</a> (EGNN) is the Aspirin of Geometric DL: it gets applied to almost every task and has seen quite a number of improvements. <a href="https://openreview.net/forum?id=hF65aKF8Bf"><strong>Eijkelboom et al</strong></a> marry EGNN with <a href="https://arxiv.org/abs/2103.03212">Simplicial networks</a> that operate on higher-order structures (namely, simplicial complexes), yielding <strong>EMPSN</strong>. This is one of the first examples of combining geometric and topological features and has great improvement potential! Finally, <a href="https://openreview.net/forum?id=QIejMwU0r9"><strong>Passaro and Zitnick</strong></a> derive a neat trick to reduce SO(3) convolutions to SO(2), bringing the complexity down from O(L⁶) to O(L³) with mathematical equivalence guarantees 👀.
This finding makes it possible to scale up geometric models to larger datasets like OpenCatalyst, and it has already made it into <a href="https://arxiv.org/abs/2306.12059">Equiformer V2</a> — expect it soon in many other libraries for geometric models 😉</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2_bpmLIX0tw2-Vdc1ae99A.jpeg" /><figcaption>Image By Author.</figcaption></figure><h3><strong>Molecules: 2D-3D pretraining, Uncertainty Estimation in MD</strong></h3><p><strong>➡️ </strong><a href="https://openreview.net/forum?id=mPEVwu50th">Liu, Du, et al</a> propose <strong>MoleculeSDE</strong>, a new framework for joint 2D-3D pretraining on molecular data. In addition to a standard contrastive loss, the authors add two <strong>generative</strong> components: reconstructing 2D -&gt; 3D and 3D -&gt; 2D inputs through score-based diffusion generation. Using standard GIN and SchNet as 2D and 3D models, MoleculeSDE is pre-trained on PCQM4M v2 and performs well on downstream fine-tuning tasks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*u3joJ4vJRpFYvsyE" /><figcaption>Source: <a href="https://github.com/chao1224/MoleculeSDE">MoleculeSDE Github repo</a></figcaption></figure><p><strong>➡️ </strong><a href="https://openreview.net/forum?id=DjwMRloMCO">Wollschläger et al</a> perform a comprehensive study of Uncertainty Estimation in GNNs for molecular dynamics and force fields. Identifying key physics-informed and application-focused principles, the authors propose a <strong>Localized Neural Kernel</strong>, a Gaussian Process-based extension to any geometric GNN that works on invariant and equivariant quantities (tried on SchNet, DimeNet, and NequIP).
In many cases, LNK’s estimates from a single model are on par with or better than costly ensembling, which would require training several models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*aivuPyYtR7cISIgU" /><figcaption>Source: <a href="https://openreview.net/forum?id=DjwMRloMCO">Wollschläger et al</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-W3DZTEpB90bBJA9iWHCJw.jpeg" /><figcaption>Image By Author.</figcaption></figure><h3><strong>Materials &amp; Proteins: CLIP for proteins, Ewald Message Passing, Equivariant Augmentations</strong></h3><p>CLIP and its descendants have become a staple of text-to-image models. Can we do the same but for text-to-protein? Yes!</p><p><strong>➡️</strong> <a href="https://openreview.net/forum?id=ZOOwHgxfR4">Xu, Yuan, et al</a> present <strong>ProtST</strong>, a framework for learning joint representations of textual protein descriptions (via PubMedBERT) and protein sequences (via ESM). In addition to a contrastive loss, ProtST has a multimodal mask prediction objective (masking 15% of the tokens in the text and the protein sequence and predicting them jointly from the latent representations) as well as mask prediction losses based on the sequence or the language modality alone. Additionally, the authors design a novel <strong>ProtDescribe</strong> dataset with 550K aligned protein sequence-description pairs. <strong>ProtST</strong> excels across many protein modeling tasks in the <a href="https://github.com/DeepGraphLearning/PEER_Benchmark"><strong>PEER</strong></a> benchmark, including protein function annotation and localization, but also allows for zero-shot protein retrieval right from the textual description (see an example below).
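The contrastive part of this text-protein alignment is conceptually the CLIP objective. Here is a hedged NumPy sketch of a symmetric InfoNCE loss (a toy illustration, not the actual ProtST code), where row i of each matrix is assumed to embed the i-th aligned text-protein pair:

```python
import numpy as np

def clip_style_loss(text_emb, prot_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning two modalities: matching pairs sit on
    the diagonal of the similarity matrix and should outscore all mismatches."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    p = prot_emb / np.linalg.norm(prot_emb, axis=1, keepdims=True)
    logits = t @ p.T / temperature
    idx = np.arange(len(t))

    def xent(z):  # cross-entropy with the diagonal as the correct class
        z = z - z.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # both directions: text -> protein and protein -> text
    return 0.5 * (xent(logits) + xent(logits.T))

aligned = clip_style_loss(np.eye(3), np.eye(3))
shuffled = clip_style_loss(np.eye(3), np.eye(3)[[1, 2, 0]])
assert aligned < shuffled  # mismatched pairs are penalized
```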
Looks like <strong>ProtST</strong> has a bright future as a backbone behind many protein generative models 😉</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pgmc_mGxaf9DPudO" /><figcaption>Source: <a href="https://openreview.net/forum?id=ZOOwHgxfR4">Xu, Yuan, et al</a></figcaption></figure><p>Actually, ICML features several protein generation works like <strong>GENIE</strong> by <a href="https://openreview.net/forum?id=4Kw5hKY8u8">Lin and AlQuraishi</a> and <strong>FrameDiff</strong> by <a href="https://openreview.net/forum?id=m8OUBymxwv">Yim, Trippe, De Bortoli, Mathieu, et al</a> — those are not yet conditioned on textual descriptions, so incorporating ProtST there looks like a no-brainer performance boost 📈.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/655/0*5PaTCWvaqmnO_lOM" /><figcaption>Gif Source: <a href="https://github.com/jasonkyuyim/se3_diffusion">SE(3) Diffusion Github</a></figcaption></figure><p>⚛️ MPNNs on molecules have a strict locality bias that inhibits modeling long-range interactions. <a href="https://openreview.net/forum?id=vd5JYAml0A">Kosmala et al</a> derive <strong>Ewald Message Passing</strong>, applying the idea of <a href="https://en.wikipedia.org/wiki/Ewald_summation">Ewald summation</a>, which breaks down the interaction potential into short-range and long-range terms. The short-range interaction is modeled by any GNN, while the novel long-range part is modeled with a <strong>3D Fourier transform</strong> and message passing over Fourier frequencies. Turns out this long-range term is pretty flexible and can be applied to any network modeling periodic or aperiodic systems (like crystals or molecules), such as SchNet, DimeNet, or GemNet. The model was evaluated on the OC20 and OE62 datasets.
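The core Ewald trick is easy to verify numerically. A minimal sketch of the standard decomposition (textbook Ewald math, not the authors' code): the 1/r interaction splits exactly into a fast-decaying short-range term and a smooth long-range term via the error function:

```python
import math

def ewald_split(r, alpha=1.0):
    """Ewald-style decomposition 1/r = erfc(alpha*r)/r + erf(alpha*r)/r.
    The short-range term decays fast (local message passing territory), while
    the smooth long-range term is what Ewald MP treats with a 3D Fourier transform."""
    short = math.erfc(alpha * r) / r
    long_range = math.erf(alpha * r) / r
    return short, long_range

s, l = ewald_split(2.5)
assert abs((s + l) - 1 / 2.5) < 1e-12  # the decomposition is exact
assert s < 1e-3                        # short-range part is negligible at large r
```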
If you are interested in more details, check out the <a href="https://www.youtube.com/watch?v=Ip8EGde5SUQ">1-hour talk by Arthur Kosmala</a> at the LOG2 Reading Group!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/964/0*L4cdBaxl24Pmf01A" /><figcaption>Source: <a href="https://openreview.net/forum?id=vd5JYAml0A">Kosmala et al</a></figcaption></figure><p>A similar idea, applying Ewald summation to 3D crystals, is used in <strong>PotNet</strong> by <a href="https://openreview.net/forum?id=jxI4CulNr1">Lin et al</a>, where the long-range interaction is modeled with incomplete Bessel functions. PotNet was evaluated on the Materials Project dataset and JARVIS — so after reading those two papers you will have a good understanding of the benefits Ewald summation brings to many crystal-related tasks 😉</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*KEUAd7BKENXAR4s9" /><figcaption>Source: <a href="https://openreview.net/forum?id=jxI4CulNr1">Lin et al</a></figcaption></figure><p><strong>➡️ </strong>Another look at imbuing <em>any</em> GNN with equivariance for crystals and molecules is given by <a href="https://openreview.net/forum?id=HRDRZNxQXc">Duval, Schmidt, et al</a> in <strong>FAENet</strong>. A standard way is to bake certain symmetries and equivariances right into the GNN architecture (as in EGNN, GemNet, and Ewald Message Passing) — this is safe but computationally expensive (especially when it comes to spherical harmonics and tensor products). Another option, often used in vision, is to show the model many augmentations of the same input so that it eventually learns the invariances from the augmentations. The authors go for the second path and design a rigorous way to sample invariant or equivariant augmentations of 2D / 3D data (e.g., for energies or forces, respectively), all with fancy proofs ✍️.
For that, the data augmentation pipeline includes projecting 2D / 3D inputs to a canonical representation (based on PCA of the covariance matrix of distances) from which we can uniformly sample rotations.</p><p>The proposed FAENet is a simple model that uses only distances but shows very good performance with the stochastic frame averaging data augmentation while being 6–20 times faster. Works for crystal structures as well!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YbuNYIce-QQQIxr1" /><figcaption>Augmentations and Stochastic Frame Averaging. Source: <a href="https://openreview.net/forum?id=HRDRZNxQXc">Duval, Schmidt, et al</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OfI1perhAGhr8Ds2SQXObg.jpeg" /><figcaption>Image By Author.</figcaption></figure><h3><strong>Cool Applications: Algorithmic Reasoning, Inductive KG Completion, GNNs for Mass Spectra</strong></h3><p>A few papers in this section do not belong to any of the categories above but are still worthy of your attention.</p><p><strong>➡️ </strong><a href="https://openreview.net/forum?id=kP2p67F4G7"><strong>”Neural Algorithmic Reasoning with Causal Regularisation”</strong></a> by <em>Bevilacqua et al</em> tackles a common issue in graph learning — OOD generalization to larger inputs at test time. Studying OOD generalization in algorithmic reasoning problems, the authors observe that many different inputs lead to identical computations at a certain step. In other words, some subset of the input does not (and should not) affect the prediction result. This observation allows the authors to design a self-supervised objective (termed <strong>Hint-ReLIC</strong>) that prefers a “meaningful” step over a bunch of steps that do not affect the prediction result. The new objective significantly bumps the performance on many CLRS-30 tasks to 90+% micro-F1.
It is an interesting question whether we could leverage the same principle in general message passing and improve OOD transfer in other graph learning tasks 🤔</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*M5SClb3OOdATfJfg" /><figcaption>Source: <a href="https://openreview.net/forum?id=kP2p67F4G7"><strong>”Neural Algorithmic Reasoning with Causal Regularisation”</strong></a> by <em>Bevilacqua et al</em></figcaption></figure><p>If you are further interested in neural algorithmic reasoning, check out the proceedings of the <a href="https://klr-icml2023.github.io/papers.html">Knowledge and Logical Reasoning workshop</a> which has even more works on that topic.</p><p><strong>➡️</strong> <a href="https://openreview.net/forum?id=OoOpO0u4Xd"><strong>“InGram: Inductive Knowledge Graph Embedding via Relation Graphs”</strong></a> by <em>Lee et al</em> seems to be one of the very few knowledge graph papers at ICML’23 (to the best of my search). <strong>InGram</strong> is one of the first approaches that can inductively generalize to both unseen entities and <strong>unseen relations</strong> at test time. Previously, inductive KG models needed to learn at least relation embeddings in some form to generalize to new nodes, and in this paradigm, new unseen relations are non-trivial to model. InGram builds a relation graph on top of the original multi-relational graph, that is, a graph of relation types, and learns representations of relations based on this graph by running a GAT. Entity representations are obtained from the random initialization and a GNN encoder. Having both entity and relation representations, a DistMult decoder is applied as a scoring function. 
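For reference, the DistMult decoder mentioned above is just a trilinear dot product. A minimal sketch with toy vectors (in InGram, the entity and relation vectors would come from the GNN encoders):

```python
import numpy as np

def distmult_score(h, r, t):
    """DistMult scoring <h, r, t> = sum_i h_i * r_i * t_i.
    Higher scores mean the triple (head, relation, tail) is more plausible."""
    return float(np.sum(h * r * t))

# Toy check: with a relation vector of ones the score reduces to <h, t>,
# so a tail aligned with the head outscores an orthogonal one.
h, r = np.array([1.0, 0.0, 1.0]), np.ones(3)
assert distmult_score(h, r, np.array([1.0, 0.0, 1.0])) > distmult_score(h, r, np.array([0.0, 1.0, 0.0]))
```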
There is a good chance that InGram will be as influential for unseen relations as <a href="http://proceedings.mlr.press/v119/teru20a/teru20a.pdf">GraIL (ICML 2020)</a> has been for unseen entities 😉.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ULJqTgk7Lny16lyx" /><figcaption>Source: <a href="https://openreview.net/forum?id=OoOpO0u4Xd"><strong>“InGram: Inductive Knowledge Graph Embedding via Relation Graphs”</strong></a> by <em>Lee et al</em></figcaption></figure><p>🌈 <a href="https://openreview.net/forum?id=81RIPI742h"><strong>”Efficiently predicting high resolution mass spectra with graph neural networks”</strong></a> by <em>Murphy et al</em> is a cool application of GNNs to the real physics problem of predicting mass spectra. The main finding is that most of the signal in mass spectra is explained by a small number of components (product ion and neutral loss <em>formulas</em>), and it is possible to mine a vocabulary of those <em>formulas</em> from the training data. The problem can thus be framed as graph classification (or graph property prediction) where, given a molecular graph, we predict tokens from a vocabulary that correspond to certain mass spectrum values. The approach, <strong>GRAFF-MS</strong>, builds a molecular graph representation through GIN with edge features and Laplacian features (via SignNet), pooled with covariate features.
GRAFF-MS performs inference in ~19 minutes versus 126 hours for the baseline CFM-ID, while reaching much higher performance 👀.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/808/0*FU9DgD5kmDyuQGF_" /><figcaption>Source: <a href="https://openreview.net/forum?id=81RIPI742h"><strong>”Efficiently predicting high resolution mass spectra with graph neural networks”</strong></a> by <em>Murphy et al</em></figcaption></figure><h3>The Concluding Meme Part</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nS0x2VdufT_dkY1Xesrrtw.jpeg" /><figcaption>Four Michaels (+ epsilon in the background) on the same photo!</figcaption></figure><p>The meme of 2022 has finally converged to <a href="https://michael-bronstein.medium.com/">Michael Bronstein</a>!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9b5e4306a1cc" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/graph-machine-learning-icml-2023-9b5e4306a1cc">Graph Machine Learning @ ICML 2023</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Neural Graph Databases]]></title>
            <link>https://medium.com/data-science/neural-graph-databases-cc35c9e1d04f?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/cc35c9e1d04f</guid>
            <category><![CDATA[graph-machine-learning]]></category>
            <category><![CDATA[editors-pick]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Tue, 28 Mar 2023 03:18:26 GMT</pubDate>
            <atom:updated>2023-03-28T14:08:15.825Z</atom:updated>
            <content:encoded><![CDATA[<h4>What’s New in Graph ML?</h4><h4>A new milestone in graph data management</h4><p>We introduce the concept of Neural Graph Databases as the next step in the evolution of graph databases. Tailored for large incomplete graphs and on-the-fly inference of missing edges using graph representation learning, neural reasoning maintains high expressiveness and supports complex logical queries similar to standard graph query languages.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PCbnNVWsqdVx-XWdeAphZA.png" /><figcaption>Image by Authors, assisted by Stable Diffusion.</figcaption></figure><p><em>This post was written together with </em><a href="http://hyren.me/"><em>Hongyu Ren</em></a><em>, </em><a href="https://www.cochez.nl/"><em>Michael Cochez</em></a><em>, and </em><a href="https://kiddozhu.github.io/"><em>Zhaocheng Zhu</em></a><em> based on our newest paper </em><a href="https://arxiv.org/abs/2303.14617"><em>Neural Graph Reasoning: Complex Logical Query Answering Meets Graph Databases</em></a><em>. You can also follow </em><a href="https://twitter.com/michael_galkin"><em>me</em></a><em>, </em><a href="https://twitter.com/ren_hongyu"><em>Hongyu</em></a><em>, </em><a href="https://twitter.com/michaelcochez"><em>Michael</em></a><em>, and </em><a href="https://twitter.com/zhu_zhaocheng"><em>Zhaocheng</em></a><em> on Twitter. 
Check our </em><a href="https://www.ngdb.org/"><em>project website</em></a><em> for more materials.</em></p><h3><strong>Outline</strong>:</h3><ol><li>Neural Graph Databases: What and Why?</li><li>The blueprint of NGDBs</li><li>Neural Graph Storage</li><li>Neural Query Engine</li><li>Neural Graph Reasoning for Query Engines</li><li>Open Challenges for NGDBs</li><li>Learn More</li></ol><h3>Neural Graph Databases: What and Why?</h3><p>🍨Vanilla graph databases are pretty much everywhere thanks to the ever-growing graphs in production, flexible graph data models, and expressive query languages. Classical, symbolic graph DBs are fast and cool under one important assumption:</p><blockquote>Completeness. Query engines assume that graphs in classical graph DBs are complete.</blockquote><p>Under the completeness assumption, we can build indexes, store the graphs in a variety of read/write-optimized formats and expect the DB would return <strong>what is there</strong>.</p><p>But this assumption does not often hold in practice (we’d say, doesn’t hold way too often). If we look at some prominent knowledge graphs (KGs): in Freebase, 93.8% of people have no place of birth and <a href="https://aclanthology.org/P09-1113.pdf">78.5% have no nationality</a>, about 68% of people <a href="https://dl.acm.org/doi/abs/10.1145/2566486.2568032">do not have any profession</a>, while in Wikidata, about <a href="https://arxiv.org/abs/2207.00143">50% of artists have no date of birth</a>, and only <a href="https://dl.acm.org/doi/abs/10.1145/3485447.3511932">0.4% of known buildings have information about height</a>. And that’s for the largest KG openly curated by hundreds of enthusiasts. 
Surely, 100M nodes and 1B statements are not the largest ever graph in the industry, so you can imagine the degree of incompleteness there.</p><p>Clearly, to account for incompleteness, in addition to <strong>“what is there?”</strong> we have to also ask <strong>“what is missing?” </strong>(or “what can be there?”). Let’s look at the example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QWL4YTqmNdpalZlq" /><figcaption>(a) - input query; (b) — incomplete graph with predicted edges (dashed lines); (c) — a SPARQL query returning one answer (UofT) via graph traversal; (d) — neural execution that recovers missing edges and returns two new answers (UdeM, NYU). Image by Authors.</figcaption></figure><p>Here, given an incomplete graph (edges (Turing Award, win, Bengio) and (Deep Learning, field, LeCun) are missing) and a query <em>“At what universities do the Turing Award winners in the field of Deep Learning work?”</em> (expressed in a logical form or in some language like SPARQL), a symbolic graph DB would return only one answer <strong>UofT</strong> reachable by graph traversal. We refer to such answers as <em>easy</em> answers, or existing answers. Accounting for missing edges, we would recover two more answers <strong>UdeM</strong> and <strong>NYU</strong> (<em>hard</em> answers, or inferred answers).</p><p>How to infer missing edges?</p><ul><li>In classical DBs, we don’t have much choice. RDF-based databases have some formal semantics and can be backed by hefty OWL ontologies but, depending on graph size and complexity of inference, it might take an infinite amount of time to complete the inference in <a href="https://www.w3.org/TR/sparql11-entailment/">SPARQL entailment regimes</a>. Labeled Property Graph (LPG) graph databases do not have built-in means for inferring missing edges at all.</li><li>Thanks to the advances in Graph Machine Learning, we can often perform link prediction in a latent (embedding) space in linear time! 
We can then extend this mechanism to executing complex, database-like queries right in the embedding space.</li></ul><blockquote>Neural Graph Databases combine the advantages of traditional graph DBs with modern graph machine learning.</blockquote><p>That is, DB principles like (1) graphs as a first-class citizen, (2) efficient storage, and (3) a uniform querying interface are now backed by Graph ML techniques such as (1) geometric representations, (2) robustness to noisy inputs, and (3) large-scale pretraining and fine-tuning in order to bridge the incompleteness gap and enable neural graph reasoning and inference.</p><p>In general, the design principles for NGDBs are:</p><ul><li>The <strong>data incompleteness assumption</strong> — the underlying data might have missing information on the node, link, and graph levels, which we would like to infer and leverage in query answering;</li><li><strong>Inductiveness and updatability</strong> — similar to traditional databases that allow updates and instant querying, representation learning algorithms for building graph latents have to be inductive and generalize to unseen data (new entities and relations at inference time) in a zero-shot (or few-shot) manner to prevent costly re-training (for instance, of shallow node embeddings);</li><li><strong>Expressiveness</strong> — the ability of latent representations to encode logical and semantic relations in the data akin to FOL (or its fragments) and leverage them in query answering.
Practically, the set of supported logical operators for neural reasoning should be close or equivalent to that of standard graph query languages like SPARQL or Cypher;</li><li><strong>Multimodality</strong> beyond knowledge graphs — any graph-structured data that can be stored as a node or record in classical databases (consisting, for example, of images, texts, molecular graphs, or timestamped sequences) and can be imbued with a vector representation is a valid source for the Neural Graph Storage and Neural Query Engine.</li></ul><p>The key methods to address the NGDB principles are:</p><ul><li><strong>Vector representation as the atomic element</strong> — while traditional graph DBs hash the adjacency matrix (or edge list) in many indexes, the incompleteness assumption implies that both given edges <strong>and</strong> graph latents (vector representations) become the <em>sources of truth</em> in the <em>Neural Graph Storage</em>;</li><li><strong>Neural query execution in the latent space</strong> — basic operations such as edge traversal cannot be performed solely symbolically due to the incompleteness assumption. Instead, the <em>Neural Query Engine</em> operates on both the adjacency and graph latents to incorporate possibly missing data into query answering.</li></ul><p>In fact, by answering queries in the latent space (and not sacrificing traversal performance) we can ditch symbolic database indexes altogether.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7xl5Q_4mlU0ApsWH" /><figcaption>The main difference between symbolic graph DBs and neural graph DBs: traditional DBs answer the question “What is there?” by edge traversal while neural graph DBs also answer “What is missing?”. Image by Authors.</figcaption></figure><h3>The Blueprint of NGDBs</h3><p>Before diving into NGDBs, let’s take a look at <strong>neural databases</strong> in general — it turns out they have been around for a while, and you might have noticed that.
Many machine learning systems already operate in this paradigm when data is encoded into model parameters and querying is equivalent to a forward pass that can output a new representation or prediction for a downstream task.</p><h4><strong>Neural Databases: Overview</strong></h4><p>What is the current state of neural databases? What are the differences between its kinds and what’s special about NGDBs?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8p7deom_uieSxICU" /><figcaption>Differences between Vector DBs, natural language DBs, and neural graph DBs. Image by Authors</figcaption></figure><ol><li><strong>Vector databases</strong> belong to the family of storage-oriented systems commonly built around approximate nearest neighbor libraries (ANN) like <a href="https://github.com/facebookresearch/faiss">Faiss</a> or <a href="https://github.com/google-research/google-research/tree/master/scann">ScaNN</a> (or custom solutions) to answer distance-based queries using Maximum Inner-Product Search (MIPS), L1, L2, or other distances. Being encoder-independent (that is, any encoder yielding vector representations can be a source like a ResNet or BERT), vector databases are fast but lack complex query answering capabilities.</li><li>With the recent rise of large-scale pretrained models — or, <a href="https://en.wikipedia.org/wiki/Foundation_models">foundation models</a> — we have witnessed their huge success in natural language processing and computer vision tasks. We argue that such foundation models are also a prominent example of neural databases. There, the <em>storage module</em> might be presented directly with model parameters or outsourced to an external index often used in <a href="https://arxiv.org/abs/2002.08909">retrieval-augmented models</a> since encoding all world knowledge even into billions of model parameters is hard. 
The <em>query module</em> performs in-context learning either via filling in the blanks in encoder models (BERT or T5 style) or via prompts in decoder-only models (GPT-style) that can span multiple modalities, e.g., <a href="https://arxiv.org/abs/2205.10337">learnable tokens for vision applications</a> or even <a href="https://arxiv.org/abs/2302.07842">calling external tools</a>.</li><li><strong>Natural Language Databases (NLDB)</strong>, introduced by <a href="https://arxiv.org/abs/2106.01074">Thorne et al</a>, model atomic elements as textual facts encoded to a vector via a pre-trained language model (LM). Queries to an NLDB are sent as natural language utterances that get encoded to vectors, and query processing employs the <em>retriever-reader</em> approach.</li></ol><p>“Neural Graph Databases” is not a novel term — many graph ML approaches have tried to combine graph embeddings with database indexes; <a href="http://rdf2vec.org/">RDF2Vec</a> and <a href="https://openreview.net/forum?id=p0sMj8oH2O">LPG2Vec</a> are perhaps the most prominent examples of how embeddings can be plugged into <strong>existing</strong> graph DBs and run on top of symbolic indexes.</p><p>In contrast, we posit that NGDBs can <strong>work without symbolic indexes</strong> right in the latent space. As we show below, there exist ML algorithms that can simulate exact edge traversal-like behavior in embedding space to retrieve “<strong>what is there</strong>” as well as perform neural reasoning to answer “<strong>what is missing</strong>”.</p><h4><strong>Neural Graph Databases: Architecture</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GUH8w_Djovcv-32OHLpoJQ.png" /><figcaption>A conceptual scheme of Neural Graph Databases. An input query is processed by the Neural Query Engine, where the Planner derives a computation graph of the query and the Executor executes the query in the latent space.
The Neural Graph Storage employs the Graph Store and Feature Store to obtain latent representations in the Embedding Store. The Executor communicates with the embedding store to retrieve and return results. Image by Authors</figcaption></figure><p>On a higher level, an NGDB contains two main components: the <strong>Neural Graph Storage</strong> and the <strong>Neural Query Engine</strong>. The query answering pipeline starts with the query sent by some application or downstream task already in a structured format (obtained, for example, via <a href="https://arxiv.org/abs/2209.15003">semantic parsing</a> if the initial query is in natural language).</p><p>The query first arrives at the Neural Query Engine, and, in particular, at the <em>Query Planner </em>module. The task of the Query Planner is to derive an efficient computation graph of atomic operations (projections and logical operations) with respect to the query complexity, prediction tasks, and underlying data storage such as possible graph partitioning.</p><p>The derived plan is then sent to the <em>Query Executor</em>, which encodes the query in a latent space, executes the atomic operations over the underlying graph and its latent representations, and aggregates their results into a final answer set.
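To make the Planner/Executor split concrete, here is a deliberately tiny sketch of executing a two-step query plan in latent space. Everything here is a toy assumption: 2D embeddings, a TransE-style translation as the projection operator, and a mean as the intersection operator; real executors use trained neural operators:

```python
import numpy as np

# Toy embeddings; in a real NGDB they come from a trained (ideally inductive) encoder.
entities = {"Turing Award": np.array([0.0, 0.0]),
            "Bengio":       np.array([1.0, 0.0]),
            "UdeM":         np.array([1.0, 1.0])}
relations = {"win":  np.array([1.0, 0.0]),
             "work": np.array([0.0, 1.0])}

def project(q, rel):
    """Atomic relation projection as translation in latent space (TransE-style)."""
    return q + relations[rel]

def intersect(*qs):
    """Atomic intersection as a permutation-invariant aggregation (here: mean)."""
    return np.mean(qs, axis=0)

def answers(q, k=1):
    """Final retrieval step: the nearest entity embeddings to the query embedding."""
    return sorted(entities, key=lambda e: np.linalg.norm(entities[e] - q))[:k]

# Executing the plan for "where do Turing Award winners work?":
q = project(project(entities["Turing Award"], "win"), "work")
print(answers(q))  # -> ['UdeM']
```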
The execution is done via the <em>Retrieval</em> module, which communicates with the <em>Neural Graph Storage</em>.</p><p>The storage layer consists of:</p><p>1️⃣ the <em>Graph Store</em> for keeping the multi-relational adjacency matrix in a space- and time-efficient manner (e.g., in sparse formats like COO and CSR);</p><p>2️⃣ the <em>Feature Store</em> for keeping node- and edge-level multimodal features associated with the underlying graph;</p><p>3️⃣ the <em>Embedding Store</em>, which leverages an Encoder module to produce graph representations in a latent space based on the underlying adjacency and associated features.</p><p>The Retrieval module queries the encoded graph representations to build a distribution of potential answers to atomic operations.</p><h3><strong>Neural Graph Storage</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hyFody8Wft9I5097NGAfQQ.png" /><figcaption>In traditional graph DBs (right), queries are optimized into a plan (often, a tree of join operators) and executed against the storage of DB indexes. In Neural Graph DBs (left), we encode the query (or its steps) in a latent space and execute it against the latent space of the underlying graph. Image by Authors.</figcaption></figure><p>In traditional graph DBs, storage design often depends on the graph modeling paradigm.</p><p>The two most popular paradigms are Resource Description Framework (RDF) graphs and Labeled Property Graphs (LPG). We posit, however, that the new <a href="https://w3c.github.io/rdf-star/cg-spec/editors_draft.html">RDF-star</a> (and the accompanying SPARQL-star) is going to unify the two, merging the logical expressiveness of RDF graphs with the attributed nature of LPGs.
Many existing KGs already follow an RDF-star(-like) paradigm, e.g., <a href="https://towardsdatascience.com/representation-learning-on-rdf-and-lpg-knowledge-graphs-6a92f2660241">hyper-relational KGs</a> and the <a href="https://www.wikidata.org/wiki/Help:Statements">Wikidata Statement Model</a>.</p><blockquote>If we are to envision the backbone graph modeling paradigm of the next years, we’d go for RDF-star.</blockquote><p>In the Neural Graph Storage, both the input graph and its vector representations are sources of truth. For answering queries in the latent space, we need:</p><ul><li>Query Encoder</li><li>Graph Encoder</li><li>Retrieval mechanism to match the query representation against the graph representation</li></ul><p>The graph encoding (embedding) process can be viewed as a compression step that preserves the semantic and structural similarity of entities/relations. The distance between entities/relations in the embedding space should be positively correlated with their semantic/structural similarity. There are many options for the architecture of the encoder — and we recommend sticking to <strong>inductive</strong> ones to adhere to the NGDB design principles. In our recent <a href="https://arxiv.org/abs/2210.08008">NeurIPS 2022 work</a>, we presented two such inductive models.</p><p>Query encoding is usually matched to the nature of the graph encoding so that both live in the same latent space. 
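Once query and graph encodings share a space, retrieval reduces to nearest-neighbor search with distance-based scores. A minimal sketch (the entity names, 2-D vectors, and the softmax-style confidence are all illustrative choices, not the exact formulation of any particular model):

```python
import math

# Toy Embedding Store: entity -> 2-D vector (made-up values).
entity_emb = {
    "paris":  [0.9, 0.1],
    "berlin": [0.8, 0.2],
    "banana": [0.0, 1.0],
}

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve(query_vec, k=2):
    """Nearest-neighbor search in the embedding space; exponentiated
    negative distances are normalized into per-answer confidence scores."""
    ranked = sorted(entity_emb, key=lambda e: l2(query_vec, entity_emb[e]))[:k]
    weights = [math.exp(-l2(query_vec, entity_emb[e])) for e in ranked]
    z = sum(weights)
    return [(e, w / z) for e, w in zip(ranked, weights)]

answers = retrieve([1.0, 0.0])  # query vector produced by the query encoder
```

Swapping the distance function (e.g., to a hyperbolic one) changes only `l2` here, which is exactly the flexibility the retrieval benefits below refer to.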
Once we have latent representations, the Retrieval module kicks in to extract relevant answers.</p><p>The retrieval process can be seen as a nearest neighbor search of the input vector in the embedding space and has 3 direct benefits:</p><ol><li>Confidence scores for each retrieved item — thanks to a predefined distance function in the embedding space</li><li>Different definitions of the latent space and the distance function — catering for different graphs, e.g., tree-like graphs are easier to work with in hyperbolic spaces</li><li>Efficiency and scalability — retrieval scales to extremely large graphs with billions of nodes and edges</li></ol><h3><strong>Neural Query Engine</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GE2qmJuYDPt1E8wz85xHdQ.png" /><figcaption>Query planning in NGDBs (left) and traditional graph DBs (right). The NGDB planning (assuming incomplete graphs) can be performed autoregressively step-by-step (1) or generated entirely in one step (2). Traditional DB planning is cost-based and relies on metadata extracted from the graph (assumed to be complete), such as the number of intermediate answers, to build a tree of join operators. Image by Authors</figcaption></figure><p>In traditional DBs, a typical query engine performs three major operations: (1) <strong>Query parsing</strong> to verify syntax correctness (often enriched with a deeper semantic analysis of query terms); (2) <strong>Query planning</strong> and optimization to derive an efficient query plan (usually, a tree of relational operators) that minimizes computational costs; (3) <strong>Query execution</strong> that scans the storage and processes intermediate results according to the query plan.</p><p>It is rather straightforward to extend those operations to NGDBs.</p><p>1️⃣ Query Parsing can be achieved via semantic parsing to a structured query format. 
We intentionally leave the discussion on a query language for NGDBs to future work and heated public discussions 😉</p><p>2️⃣ The Query Planner derives an efficient query plan of atomic operations (projections and logical operators) maximizing completeness (all answers over existing edges must be returned) and inference (of missing edges predicted on the fly), taking into account the query complexity and the underlying graph.</p><p>3️⃣ Once the query plan is finalized, the Query Executor encodes the query (or its parts) into a latent space, communicates with the Graph Storage and its Retrieval module, and aggregates intermediate results into the final answer set. There exist two common mechanisms for query execution:</p><ul><li><em>Atomic</em>, resembling traditional DBs, when a query plan is executed sequentially by encoding atomic patterns, retrieving their answers, and executing logical operators as intermediate steps;</li><li><em>Global</em>, when the entire query graph is encoded and executed in a latent space in one step.</li></ul><p>The main challenge for neural query execution is matching the expressiveness of symbolic languages like SPARQL or Cypher — so far, neural methods can execute queries close to First-Order Logic expressiveness, but we are only about halfway to full symbolic query languages.</p><h3>A Taxonomy of Neural Graph Reasoning for Query Engines</h3><p>The literature on neural methods for complex logical query answering (aka <em>query embedding</em>) has been growing since 2018 and the seminal NeurIPS work of <a href="https://proceedings.neurips.cc/paper/7473-embedding-logical-queries-on-knowledge-graphs">Hamilton et al.</a> on <strong>Graph Query Embedding</strong> (GQE). 
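The atomic execution mode described above can be sketched end-to-end for one small conjunctive query: encode each atomic pattern as a membership-score vector over nodes, then apply a logical operator as an intermediate step (here, intersection as an element-wise `min`, the Goedel t-norm). The graph, relation, and crisp 0/1 scores are toy stand-ins for what would be learned, soft scores in a real neuro-symbolic processor:

```python
NODES = ["a", "b", "c", "d"]
EDGES = {  # relation -> list of (head, tail) pairs, invented for illustration
    "cites": [("a", "c"), ("b", "c"), ("b", "d")],
}

def project(scores, relation):
    """Atomic projection: push membership scores from heads to tails."""
    out = {n: 0.0 for n in NODES}
    for h, t in EDGES[relation]:
        out[t] = max(out[t], scores.get(h, 0.0))
    return out

def intersect(s1, s2):
    """Logical AND as an element-wise minimum (Goedel t-norm)."""
    return {n: min(s1[n], s2[n]) for n in NODES}

# Query: find ?x such that (a, cites, ?x) AND (b, cites, ?x).
answers = intersect(project({"a": 1.0}, "cites"),
                    project({"b": 1.0}, "cites"))
```

In the global mode, by contrast, the whole query graph would be encoded into a single vector and matched against the storage in one retrieval step.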
GQE was able to answer conjunctive queries with intersections and predict missing links on the fly.</p><blockquote>GQE can be considered as the first take on Neural Query Engines for NGDBs.</blockquote><p>GQE started a whole subfield of Graph Machine Learning, followed by prominent examples like <a href="https://openreview.net/forum?id=BJgr4kSFDS">Query2Box (ICLR 2020)</a> and <a href="https://openreview.net/forum?id=Mos9F9kDwkz">Continuous Query Decomposition (ICLR 2021)</a>. We undertook a major effort categorizing all those (about 50) works along 3 main directions:</p><p>⚛️ <strong>Graphs</strong> — what is the underlying structure against which we answer queries;<br>🛠️ <strong>Modeling</strong> — how we answer queries and which inductive biases are employed;<br>🗣️ <strong>Queries</strong> — what we answer, what the query structures are, and what the expected answers are.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*C-u4BxyrHFnq_GQhZTtH2w.png" /><figcaption>The taxonomy of neural approaches for complex logical query answering. See the paper for more details. Image by Authors</figcaption></figure><p>⚛️ Talking about <strong>Graphs</strong>, we further break them down into <strong>Modality</strong> (classic triple-only graphs, hyper-relational graphs, hypergraphs, and more), <strong>Reasoning Domain</strong> (discrete entities or including continuous outputs), and <strong>Semantics</strong> (how neural encoders capture higher-order relationships like OWL ontologies).</p><p>🛠️ In <strong>Modeling</strong>, we follow the Encoder-Processor-Decoder paradigm, classifying inductive biases of existing models, e.g., transductive or inductive encoders with neural or neuro-symbolic processors.</p><p>🗣️ In <strong>Queries</strong>, we aim at mapping the set of queries answerable by neural methods to that of symbolic graph query languages. 
We talk about <strong>Query Operators</strong> (going beyond standard And/Or/Not), <strong>Query Patterns</strong> (from chain-like queries to DAGs and cyclic patterns), and <strong>Projected Variables</strong> (your favorite relational algebra).</p><h3>Open Challenges for NGDBs</h3><p>Analyzing the taxonomy, we find that there is no silver bullet at the moment, e.g., most processors can only work in discrete mode with tree-based queries. But it also means there is a lot of room for future work — possibly your contribution!</p><p>To be more precise, here are the main NGDB challenges for the following years.</p><p>Along the <strong>Graph</strong> branch:</p><ul><li><strong>Modality</strong>: Supporting more graph modalities: from classic triple-only graphs to hyper-relational graphs, hypergraphs, and multimodal sources combining graphs, texts, images, and more.</li><li><strong>Reasoning Domain</strong>: Supporting logical reasoning and neural query answering over temporal and continuous (textual and numerical) data — literals constitute a major portion of graphs, as do relevant queries over literals.</li><li><strong>Background Semantics</strong>: Supporting complex axioms and formal semantics that encode higher-order relationships between (latent) classes of entities and their hierarchies, e.g., enabling neural reasoning over description logics and OWL fragments.</li></ul><p>In the <strong>Modeling</strong> branch:</p><ul><li><strong>Encoder</strong>: Inductive encoders supporting unseen relations at inference time — this is key for (1) <em>updatability</em> of neural databases without the need for retraining; (2) enabling the <em>pretrain-finetune</em> strategy generalizing query answering to custom graphs with a custom relational schema.</li><li><strong>Processor</strong>: Expressive processor networks able to effectively and efficiently execute complex query operators akin to SPARQL and Cypher operators. 
Improving the sample efficiency of neural processors is crucial for the <em>training time vs quality</em> tradeoff — reducing training time while maintaining high predictive quality.</li><li><strong>Decoder</strong>: So far, all neural query answering decoders operate exclusively on discrete nodes. Extending the range of answers to continuous outputs is crucial for answering real-world queries.</li><li><strong>Complexity</strong>: As the main computational bottleneck of processor networks is the dimensionality of the embedding space (for purely neural models) and/or the number of nodes (for neuro-symbolic ones), new efficient algorithms for neural logical operators and retrieval methods are the key to scaling NGDBs to billions of nodes and trillions of edges.</li></ul><p>In <strong>Queries</strong>:</p><ul><li><strong>Operators</strong>: Neuralizing more complex query operators matching the expressiveness of declarative graph query languages, e.g., supporting Kleene plus and star, property paths, and filters.</li><li><strong>Patterns</strong>: Answering more complex patterns beyond tree-like queries. 
This includes DAGs and cyclic graphs.</li><li><strong>Projected Variables</strong>: Projecting more than the final leaf node entity, that is, returning intermediate variables, relations, and multiple variables organized in tuples (bindings).</li><li><strong>Expressiveness</strong>: Answering queries outside the simple EPFO and EFO fragments and aiming for the expressiveness of database languages.</li></ul><p>Finally, in <strong>Datasets</strong> and <strong>Evaluation</strong>:</p><ul><li>The need for larger and more <strong>diverse benchmarks</strong> covering more graph modalities, more expressive query semantics, more query operators, and more query patterns.</li><li>As the existing evaluation protocol appears to be limited (focusing only on inferring <em>hard</em> answers), there is a need for a more <strong>principled evaluation framework and metrics</strong> covering various aspects of the query answering workflow.</li></ul><p>Pertaining to the Neural Graph Storage and NGDBs in general, we identify the following challenges:</p><ul><li>The need for a <strong>scalable retrieval</strong> mechanism to scale neural reasoning to graphs of billions of nodes. Retrieval is tightly connected to the Query Processor and its modeling priors. Existing scalable ANN libraries can only work with basic L1, L2, and cosine distances, which limits the space of possible processors in the neural query engine.</li><li>Currently, all complex query datasets provide a hardcoded query execution plan that might not be optimal. 
There is a need for a <strong>neural query planner </strong>that would transform an input query into an optimal execution sequence, taking into account prediction tasks, query complexity, the type of the neural processor, and the configuration of the Storage layer.</li></ul><p>Since encoders are inductive and the database is updatable without retraining, there is also a need to support <strong>continual learning</strong>, alleviate <strong>catastrophic forgetting</strong>, and handle <strong>size generalization</strong> when running inference on graphs much larger than the training ones.</p><h3>Learn More</h3><p>NGDB is still an emerging concept with many open challenges for future research. If you want to learn more about NGDBs, feel free to check out</p><ul><li>📜 our paper (<a href="https://arxiv.org/abs/2303.14617">arxiv</a>),</li><li>🌐 <a href="https://www.ngdb.org/">our website</a>,</li><li>🔧 our <a href="https://github.com/neuralgraphdatabases/awesome-logical-query">GitHub repo</a> with the most up-to-date list of relevant papers, datasets, and categorization; feel free to open issues and PRs.</li></ul><p>We will also be organizing workshops; stay tuned for updates!</p><hr><p><a href="https://medium.com/data-science/neural-graph-databases-cc35c9e1d04f">Neural Graph Databases</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph ML in 2023: The State of Affairs]]></title>
            <link>https://medium.com/data-science/graph-ml-in-2023-the-state-of-affairs-1ba920cb9232?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/1ba920cb9232</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[editors-pick]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[graph-machine-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Sun, 01 Jan 2023 17:58:15 GMT</pubDate>
            <atom:updated>2023-01-02T15:00:16.979Z</atom:updated>
            <content:encoded><![CDATA[<h4>STATE OF THE ART DIGEST</h4><h4>Hot trends and major advancements</h4><p>2022 comes to an end and it is about time to sit down and reflect upon the achievements made in Graph ML as well as to hypothesize about possible breakthroughs in 2023. Tune in 🎄☕</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*84YLhhHBsT6blINloyhyww.png" /><figcaption>Background image generated by <a href="https://openai.com/dall-e-2/">DALL-E 2</a>, text added by Author.</figcaption></figure><p><em>The article is written together with </em><a href="http://hyren.me/"><em>Hongyu Ren</em></a><em> (Stanford University), </em><a href="https://kiddozhu.github.io/"><em>Zhaocheng Zhu</em></a><em> (Mila &amp; University of Montreal). We thank </em><a href="https://chrsmrrs.github.io/"><em>Christopher Morris</em></a><em> and </em><a href="https://www.microsoft.com/en-us/research/people/johannesb/"><em>Johannes Brandstetter</em></a><em> for the feedback and helping with the Theory and PDE sections, respectively. 
Follow </em><a href="https://twitter.com/michael_galkin"><em>Michael</em></a><em>, </em><a href="https://twitter.com/ren_hongyu"><em>Hongyu</em></a><em>, </em><a href="https://twitter.com/zhu_zhaocheng"><em>Zhaocheng</em></a>, <a href="https://twitter.com/chrsmrrs"><em>Christopher</em></a><em>, and </em><a href="https://twitter.com/jo_brandstetter"><em>Johannes</em></a> <em>here on Medium and Twitter for more graph ml-related discussions.</em></p><p><strong>Table of Contents:</strong></p><ol><li><a href="#48f6">Generative Models: Denoising Diffusion for Molecules and Proteins</a></li><li><a href="#d2e8">DFTs, ML Force Fields, Materials, and Weather Simulations</a></li><li><a href="#6d20">Geometry &amp; Topology &amp; PDEs</a></li><li><a href="#8e6c">Graph Transformers</a></li><li><a href="#ca19">BIG Graphs</a></li><li><a href="#7986">GNN Theory: Weisfeiler and Leman Go Places, Subgraph GNNs</a></li><li><a href="#e5e6">Knowledge Graphs: Inductive Reasoning Takes Over</a></li><li><a href="#b2f5">Algorithmic Reasoning and Alignment</a></li><li><a href="#0de4">Cool GNN Applications</a></li><li><a href="#b813">Hardware: IPUs and Graphcore win OGB LSC 2022</a></li><li><a href="#9b59">New Conferences: LoG and Molecular ML</a></li><li><a href="#41dc">Courses and Educational Materials</a></li><li><a href="#3e6d">New Datasets, Benchmarks, and Challenges</a></li><li><a href="#463f">Software Libraries and Open Source</a></li><li><a href="#1b30">Join the Community</a></li><li><a href="#7593">The Meme of 2022</a></li></ol><h3>Generative Models: Denoising Diffusion for Molecules and Proteins</h3><p>Generative diffusion models in the vision-language domain were the headline topic in the Deep Learning world in 2022. 
While generating images and videos is definitely a cool playground to try out different models and sampling techniques, we’d argue that</p><blockquote>the most <em>useful</em> applications of diffusion models in 2022 were actually created in the Geometric Deep Learning area focusing on molecules and proteins</blockquote><p>In our recent article, we were pondering whether <a href="https://towardsdatascience.com/denoising-diffusion-generative-models-in-graph-ml-c496af5811c5">“Denoising Diffusion Is All You Need?”</a>.</p><p><a href="https://towardsdatascience.com/denoising-diffusion-generative-models-in-graph-ml-c496af5811c5">Denoising Diffusion Generative Models in Graph ML</a></p><p>There, we reviewed the newest generative models for <em>graph generation </em>(DiGress), <em>molecular conformer generation</em> (EDM, GeoDiff, Torsional Diffusion), <em>molecular docking</em> (DiffDock), <em>molecular linking</em> (DiffLinker), and <em>ligand generation</em> (DiffSBDD). As soon as the post went public, several amazing protein generation models were released:</p><p><a href="https://www.generatebiomedicines.com/chroma"><strong>Chroma</strong></a> from Generate Biomedicines allows users to impose functional and geometric constraints, and even to use natural language queries like “Generate a protein with CHAD domain” thanks to a small GPT-Neo trained on protein captioning;</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*i89iIQ_WEmvxzrWk" /><figcaption><em>Chroma protein generation. 
Source: </em><a href="https://www.generatebiomedicines.com/chroma"><em>Generate Biomedicines</em></a></figcaption></figure><p><a href="https://www.bakerlab.org/2022/11/30/diffusion-model-for-protein-design/"><strong>RoseTTaFold Diffusion</strong></a> (RF Diffusion) from the Baker Lab and MIT is packed with similar functionality, also allowing for text prompts like “Generate a protein that binds to X” as well as being capable of functional motif scaffolding, scaffolding enzyme active sites, and <em>de novo</em> protein design. Strong point: 1000 designs generated with RF Diffusion were experimentally <a href="https://twitter.com/DaveJuergens/status/1601675072175239170">synthesized and tested</a> in the lab!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GlqRk3ixfMQbByvJ" /><figcaption><em>RF Diffusion. Source: </em><a href="https://www.biorxiv.org/content/10.1101/2022.12.09.519842v1"><em>Watson et al.</em></a><em> BakerLab</em></figcaption></figure><p>The Meta AI FAIR team made amazing progress in protein design purely with language models: mid-2022, <a href="https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1"><strong>ESM-2</strong></a> was released, a protein LM trained solely on protein sequences that outperforms ESM-1 and other baselines by a huge margin. Moreover, it was then shown that encoded LM representations are a very good starting point for obtaining the actual geometric configuration of a protein without the need for Multiple Sequence Alignments (MSAs) — this is done via <a href="https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1"><strong>ESMFold</strong></a>. 
A big shoutout to Meta AI and FAIR for publishing the model and the weights: it is available in the <a href="https://github.com/facebookresearch/esm">official GitHub repo</a> and <a href="https://huggingface.co/models?other=esm">on HuggingFace</a> as well!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*t13I9JF7RwJFQnP6" /><figcaption>Scaling ESM-2 leads to better folding prediction. Source: <a href="https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1">Lin, Akin, Rao, Hie et al</a></figcaption></figure><p>🍭 Later on, even more goodies arrived from the ESM team: <a href="https://www.biorxiv.org/content/10.1101/2022.12.21.521521v1">Verkuil et al.</a> find that ESM-2 can generate <em>de novo</em> protein sequences that can actually be synthesized in the lab and, more importantly, do not have any match among known natural proteins. <a href="https://www.biorxiv.org/content/10.1101/2022.12.21.521526v1">Hie et al.</a> propose pretty much a new programming language for protein designers (think of it as a query language for ESMFold) — production rules organized in a syntax tree with constraint functions. Then, each program is “compiled” into an energy function that governs the generative process. Meta AI also released the biggest <a href="https://esmatlas.com/">Metagenomic Atlas</a>, but more on that in the <strong>Datasets</strong> section of this article.</p><p>In the antibody design area, a similar LM-based approach is taken by <strong>IgLM</strong> by <a href="https://www.biorxiv.org/content/10.1101/2021.12.13.472419v2">Shuai, Ruffolo, and Gray</a>. IGLM generates antibody sequences conditioned on chain and species id tags.</p><p>Finally, we’d highlight a few works from Jian Tang’s lab at Mila. <strong>MoleculeSTM</strong> by <a href="https://arxiv.org/abs/2212.10789">Liu et al.</a> is a CLIP-like text-to-molecule model (plus a new large pre-training dataset). 
MoleculeSTM can do 2 impressive things: (1) retrieve molecules by text description like “triazole derivatives” and retrieve text description from a given molecule in SMILES, (2) molecule editing from text prompts like “make the molecule soluble in water with low permeability” — and the model edits the molecular graph according to the description, mindblowing 🤯</p><p>Then, <strong>ProtSEED</strong> by <a href="https://arxiv.org/abs/2210.08761">Shi et al.</a> is a generative model for protein sequence <em>and</em> structure simultaneously (for example, most existing diffusion models for proteins can do only one of those at a time). ProtSEED can be conditioned on residue features or pairs of residues. Model-wise, it is an equivariant iterative model with improved triangular attention. ProtSEED was evaluated on Antibody CDR co-design, Protein sequence-structure co-design, and Fixed backbone sequence design.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*KF8K16TQmLpgoCM0" /><figcaption>Molecule editing from text inputs. Source: <a href="https://arxiv.org/abs/2212.10789">Liu et al.</a></figcaption></figure><p>Besides generating the protein structures, there are also some works for generating protein sequences from structures, known as inverse folding. 
Don’t forget to check out the <a href="https://www.biorxiv.org/content/10.1101/2022.04.10.487779v2">ESM-IF1</a> from Meta and the <a href="https://www.science.org/doi/full/10.1126/science.add2187">ProteinMPNN</a> from the Baker Lab.</p><blockquote><strong>What to expect in 2023</strong>: (1) performance improvements of diffusion models such as faster sampling and more efficient solvers; (2) more powerful conditional protein generation models; (3) more successful applications of <a href="https://arxiv.org/abs/2111.09266">Generative Flow Networks</a> (GFlowNets, check out the <a href="https://milayb.notion.site/The-GFlowNet-Tutorial-95434ef0e2d94c24aab90e69b30be9b3">tutorial</a>) to molecules and proteins.</blockquote><h3><strong>DFTs, ML Force Fields, Materials, and Weather Simulations</strong></h3><p>AI4Science becomes the frontier of equivariant GNN research and its applications. Pairing GNNs with PDEs, we can now tackle much more complex prediction tasks.</p><blockquote>In 2022, this frontier expanded to ML-based <strong>Density Functional Theory</strong> (DFT) and <strong>Force fields</strong> approximations used for <strong>molecular dynamics</strong> and <strong>material discovery.</strong> The other growing field is <strong>Weather simulations</strong>.</blockquote><p>We would recommend the <a href="https://www.youtube.com/watch?v=t7q_ZNrBghY">talk</a> by Max Welling for a broader overview of AI4Science and what is now enabled by using Deep Learning in science.</p><p>Starting with models, 2022 has seen a surge in equivariant GNNs for molecular dynamics and simulations, e.g., building upon <a href="https://arxiv.org/abs/2101.03164">NequIP</a>, <strong>Allegro</strong> by <a href="https://arxiv.org/abs/2204.05249">Musaelian, Batzner, et al.</a> or <strong>MACE</strong> by <a href="https://arxiv.org/abs/2206.07697">Batatia et al.</a> The design space for such models is very large, so refer to the recent survey by <a 
href="https://arxiv.org/abs/2205.06643">Batatia, Batzner, et al.</a> for an overview. A crucial component for most of them is the <a href="https://github.com/e3nn/e3nn"><strong>e3nn</strong></a> library (paper by <a href="https://arxiv.org/abs/2207.09453">Geiger and Smidt</a>) and the notion of tensor product. We highly recommend a great <a href="https://uvagedl.github.io/">new course</a> by Erik Bekkers on Group Equivariant Deep Learning to understand the mathematical foundations and catch up with the recent papers.</p><p>⚛️ <strong>Density Functional Theory</strong> (DFT) calculations are one of the main workhorses of molecular dynamics (and account for a great deal of computing time in big clusters). DFT scales as O(n³) with the input size, though, so can ML help here? In <em>Learned Force Fields Are Ready For Ground State Catalyst Discovery,</em> <a href="https://arxiv.org/abs/2209.12466">Schaarschmidt et al.</a> present an experimental study of learned potential models — it turns out GNNs can do a very good job in linear O(n) time! The <strong>Easy Potentials</strong> approach (trained on Open Catalyst data) turns out to be quite a good predictor, especially when paired with a postprocessing step. Model-wise, it is an MPNN with the <a href="https://arxiv.org/abs/2106.07971">Noisy Nodes</a> self-supervised objective.</p><p>In <strong>Forces are not Enough</strong>, <a href="https://arxiv.org/abs/2210.07237">Fu et al.</a> introduce a new benchmark for molecular dynamics — in addition to MD17, the authors add datasets on modeling liquids (Water), peptides (Alanine dipeptide), and solid-state materials (LiPS). More importantly, the authors consider a wide range of physical properties like stability of simulations, diffusivity, and radial distribution functions. 
Most SOTA molecular dynamics models were probed, including SchNet, ForceNet, DimeNet, GemNet (-T and -dT), and NequIP.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qnxr3F6pBTefNkOD" /><figcaption>Source: <a href="https://arxiv.org/abs/2210.07237">Fu et al.</a></figcaption></figure><p>In crystal structure modeling, we’d highlight <strong>Equivariant Crystal Networks</strong> by <a href="https://openreview.net/forum?id=0Dh8dz4snu">Kaba and Ravanbakhsh</a> — a neat way to build representations of periodic structures with crystalline symmetries. Crystals can be described with <em>lattices</em> and <em>unit cells</em> with basis vectors that are subject to group transformations. Conceptually, ECN creates edge index masks corresponding to symmetry groups, performs message passing over this masked index, and aggregates the results of many symmetry groups.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*yaL_yc3yHKGTh4NS" /><figcaption>Source: <a href="https://openreview.net/forum?id=0Dh8dz4snu">Kaba and Ravanbakhsh</a></figcaption></figure><p>Even more news on material discovery can be found in the proceedings of the recent <a href="https://sites.google.com/view/ai4mat">AI4Mat NeurIPS workshop</a>!</p><p>☂️ ML-based weather forecasting made huge progress as well. In particular, <a href="https://arxiv.org/abs/2212.12794"><strong>GraphCast</strong></a> by DeepMind and <a href="https://arxiv.org/abs/2211.02556"><strong>Pangu-Weather</strong></a> by Huawei demonstrated exceptionally good results outperforming traditional models by a large margin. While Pangu-Weather leverages 3D/visual inputs and Visual Transformers, GraphCast employs a mesh MPNN where Earth is split into several hierarchy levels of meshes. The deepest level has about 40K nodes with 474 input features and the model outputs 227 predicted variables. The MPNN follows the “encoder-processor-decoder” scheme and has 16 layers. GraphCast is an autoregressive model w.r.t. 
the next timestep prediction, that is, it takes the previous two states and predicts the next one. GraphCast can build a 10-day forecast in &lt;60 seconds on a single TPUv4 and is much more accurate than non-ML forecasting models. 👏</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xwM2nBv3OGjw-OhY" /><figcaption>Encoder-Processor-Decoder mesh MPNN in GraphCast. Source: <a href="https://arxiv.org/abs/2212.12794">Lam, Sanchez-Gonzalez, Willson, Wirnsberger, Fortunato, Pritzel, et al.</a></figcaption></figure><blockquote><strong>What to expect in 2023</strong>: We expect to see a lot more focus on computational efficiency and scalability of GNNs. Current GNN-based force-fields are obtaining remarkable accuracy, but are still 2–3 orders of magnitude slower than classical force-fields and are typically only deployed on a few hundred atoms. For GNNs to truly have a transformative impact on materials science and drug discovery, we will see many folks tackling this issue, be it through architectural advances or smarter sampling.</blockquote><h3>Geometry &amp; Topology &amp; PDEs</h3><p>In 2022, 1️⃣ we got a better understanding of oversmoothing and oversquashing phenomena in GNNs and their connections to algebraic topology; 2️⃣ using GNNs for PDE modeling is now mainstream.</p><p>1️⃣ Michael Bronstein’s lab made huge contributions to this problem — check out these excellent posts on Neural Sheaf Diffusion and on framing GNNs as gradient flows:</p><p><a href="https://towardsdatascience.com/neural-sheaf-diffusion-for-deep-learning-on-graphs-bfa200e6afa6">Neural Sheaf Diffusion for deep learning on graphs</a></p><p>And on GNNs as gradient flows:</p><p><a href="https://towardsdatascience.com/graph-neural-networks-as-gradient-flows-4dae41fb2e8a">Graph Neural Networks as gradient flows</a></p><p>2️⃣ Using GNNs for PDE modeling became a mainstream topic. 
Some papers require the 🤯 <strong>math alert</strong> 🤯 warning, but if you are familiar with the basics of ODEs and PDEs it should be much easier.</p><p><em>Message Passing Neural PDE Solvers</em> by <a href="https://openreview.net/forum?id=vSix3HPYKSU">Brandstetter, Worrall, and Welling</a> describes how message passing can help solve PDEs, generalize better, and get rid of manual heuristics. Furthermore, MP-PDEs representationally contain classic solvers like finite differences.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FtAIW7ScbxGnUVyJ" /><figcaption>Source: <a href="https://openreview.net/forum?id=vSix3HPYKSU">Brandstetter, Worrall, and Welling</a></figcaption></figure><p>The topic was developed further by many recent works including continuous forecasting with implicit neural representations (<a href="https://arxiv.org/abs/2209.14855">Yin et al.</a>), supporting mixed boundary conditions (<a href="https://openreview.net/forum?id=B3TOg-YCtzo">Horie and Mitsume</a>), or latent evolution of PDEs (<a href="https://arxiv.org/abs/2206.07681">Wu et al.</a>).</p><blockquote><strong>What to expect in 2023</strong>: Neural PDEs and their applications are likely to expand to more physics-related AI4Science subfields, where computational fluid dynamics (CFD) in particular will likely be influenced by GNN-based surrogates in the coming months. Classical CFD is applied to a wide range of research and engineering problems in many fields of study, including aerodynamics, hypersonic and environmental engineering, fluid flows, visual effects in video games, or weather simulations as discussed above. 
GNN-based surrogates might augment or replace traditional, well-tried techniques such as finite element methods (<a href="https://arxiv.org/abs/2203.08852">Lienen et al.</a>), remeshing algorithms (<a href="https://arxiv.org/abs/2204.11188">Song et al.</a>), boundary value problems (<a href="https://arxiv.org/abs/2206.14092">Loetsch et al.</a>), or interactions with triangularized boundary geometries (<a href="https://arxiv.org/abs/2106.11299">Mayr et al.</a>).</blockquote><blockquote>The neural PDE community is starting to build strong and commonly used baselines and frameworks, which will in turn help accelerate progress, e.g. <strong>PDEBench</strong> (<a href="https://arxiv.org/abs/2210.07182">Takamoto et al.</a>) or <strong>PDEArena</strong> (<a href="https://arxiv.org/abs/2209.15616">Gupta et al.</a>).</blockquote><h3>Graph Transformers</h3><p>Definitely one of the main community drivers in 2022, <strong>graph transformers</strong> (GTs) evolved a lot towards higher effectiveness and better scalability. Several outstanding models were published in 2022:</p><p><strong>👑 GraphGPS</strong> by <a href="https://arxiv.org/abs/2205.12454">Rampášek et al.</a> takes the title of <strong>“GT of 2022”</strong> thanks to combining local message passing, global attention (optionally linear for higher efficiency), and positional encodings, which led to setting a new SOTA on ZINC and many other benchmarks. Check out a dedicated article on GraphGPS:</p><p><a href="https://towardsdatascience.com/graphgps-navigating-graph-transformers-c2cc223a051c">GraphGPS: Navigating Graph Transformers</a></p><p>GraphGPS served as the backbone of <strong>GPS++,</strong> the <a href="https://ogb.stanford.edu/neurips2022/results/#winners_pcqm4mv2">winning</a> OGB Large Scale Challenge 2022 model on PCQM4M v2 (graph regression).
<strong>GPS++</strong>, <a href="https://arxiv.org/abs/2212.02229">created by</a> Graphcore, Valence Discovery, and Mila, incorporates more features, including 3D coordinates, and leverages sparse-optimized IPU hardware (more on that in the following section). GPS++ weights are already <a href="https://github.com/graphcore/ogb-lsc-pcqm4mv2">available</a> on GitHub!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jdDh-eGvvW8trjJP" /><figcaption>GraphGPS intuition. Source: <a href="https://arxiv.org/abs/2205.12454">Rampášek et al</a></figcaption></figure><p><strong>Transformer-M</strong> by <a href="https://arxiv.org/abs/2210.01765">Luo et al.</a> inspired many top OGB LSC models as well. Transformer-M adds 3D coordinates to a neat joint 2D-3D pre-training mix. At inference time, when 3D info is not known, the model infers a glimpse of 3D knowledge, which improves the performance on PCQM4Mv2 by a good margin. Code is <a href="https://github.com/lsj2408/Transformer-M">available</a> as well.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*H-WW5aJ5kcQkgHzH" /><figcaption><em>Transformer-M joint 2D-3D pre-training scheme. Source: </em><a href="https://arxiv.org/abs/2210.01765"><em>Luo et al.</em></a></figcaption></figure><p><strong>TokenGT </strong>by <a href="https://arxiv.org/abs/2207.02505">Kim et al</a> takes an even more explicit route and adds all edges of the input graph (in addition to all nodes) to the sequence fed to the Transformer. With those inputs, the encoder needs additional token types to distinguish nodes from edges. The authors prove several nice theoretical properties (although at the cost of higher computational complexity O((V+E)²) that can get to the 4th power in the worst case of a fully-connected graph).
Code is <a href="https://github.com/jw9730/tokengt">available</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*zBtdEqj9_J67pzZv" /><figcaption>TokenGT adds both nodes and edges to the input sequence. Source: <a href="https://arxiv.org/abs/2207.02505">Kim et al</a></figcaption></figure><blockquote><strong>What to expect in 2023</strong>: for the coming year, we’d expect 1️⃣ GTs to scale up along the axes of both data and model parameters, from molecules of &lt;50 nodes to graphs of millions of nodes, in order to witness the emergent behavior as in text &amp; vision foundation models 2️⃣ similar to <a href="https://huggingface.co/bigscience/bloom">BLOOM</a> by the BigScience Initiative, a big open-source pre-trained equivariant GT for molecular data, perhaps within the <a href="https://m2d2.io/opendrugdiscovery/">Open Drug Discovery</a> project.</blockquote><h3>BIG Graphs</h3><p>🔥 One of our favorites in 2022 is <em>“Graph Neural Networks for Link Prediction with Subgraph Sketching</em>” by <a href="https://arxiv.org/abs/2209.15486">Chamberlain, Shirobokov et al.</a> — this is a neat combination of algorithms + ML techniques. It is known that <a href="https://arxiv.org/pdf/2010.16103.pdf">SEAL</a>-like labeling tricks dramatically improve link prediction performance compared to standard GNN encoders but suffer from big computation/memory overhead. In this work, the authors find that obtaining distances from two nodes of a query edge can be efficiently done with hashing (<a href="https://en.wikipedia.org/wiki/MinHash">MinHashing</a>) and cardinality estimation (<a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog</a>) algorithms. Essentially, message passing is done over <em>minhashing</em> and <em>hyperloglog</em> initial sketches of single nodes (<em>min</em> aggregation for minhash, <em>max</em> for hyperloglog sketches) — this is the core of the <strong>ELPH</strong> link prediction model (with a simple MLP decoder). 
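To make the sketching idea concrete, here is a toy MinHash example (an illustration of the underlying algorithm, not the authors’ implementation): signatures of two node neighbourhoods let us estimate their Jaccard overlap without ever materializing the intersection.

```python
import random

def minhash_signature(items, num_hashes=256, seed=0):
    """MinHash sketch: for each of num_hashes random hash functions,
    keep the minimum hash value over the set's items."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large Mersenne prime for h(x) = (a*x + b) mod p
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in params]

def jaccard_estimate(sig1, sig2):
    """Fraction of hash functions on which the two sketches agree."""
    return sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / len(sig1)

# two overlapping node neighbourhoods
n_u = set(range(0, 20))    # neighbours of node u
n_v = set(range(10, 30))   # neighbours of node v: true Jaccard = 10/30
est = jaccard_estimate(minhash_signature(n_u), minhash_signature(n_v))
assert abs(est - 10 / 30) < 0.2   # the estimate is close to the true overlap
```

ELPH's trick is then to message-pass over such per-node sketches (min-aggregation for MinHash, max for HyperLogLog) instead of explicit subgraphs.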
The authors then design a more scalable <strong>BUDDY</strong> model where k-hop hash propagation can be precomputed before training. Experimentally, ELPH and BUDDY scale to large graphs that were previously too large or too resource-hungry for labeling-trick approaches. Great work and definitely a solid baseline for all future link prediction models! 👏</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*97QlEb5fjGMQX_y3" /><figcaption>The motivation behind computing subgraph hashes to estimate cardinalities of neighborhoods and intersections. Source: <a href="https://arxiv.org/abs/2209.15486">Chamberlain, Shirobokov et al.</a></figcaption></figure><p>On the graph sampling and minibatching side, <a href="https://openreview.net/forum?id=b9g0vxzYa_">Gasteiger, Qian, and Günnemann</a> design <a href="https://github.com/tum-daml/ibmb"><strong>Influence-based Mini-Batching (IBMB)</strong></a>, a good example of how Personalized PageRank (PPR) can solve even graph batching! IBMB aims at creating the smallest minibatches whose nodes have the maximum influence on the node classification task. In fact, the influence score is equivalent to PPR. Practically, given a set of target nodes, IBMB (1) partitions the graph into permanent clusters and (2) runs PPR within each batch to select the top-PPR nodes that form the final subgraph minibatch. The resulting minibatches can be sent to any GNN encoder. IBMB is essentially <strong>constant</strong>, O(1), with respect to the graph size, since partitioning and PPRs can be precomputed at the pre-processing stage.</p><p>Although the resulting batches are fixed and do not change over training (not stochastic enough), the authors design momentum-like optimization terms to mitigate this non-stochasticity.
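The influence-based selection can be sketched in a few lines (a toy version under simplifying assumptions, not the IBMB code): compute PPR with the batch's target nodes as the teleport set and keep the top-k scoring nodes.

```python
import numpy as np

def personalized_pagerank(adj, seeds, alpha=0.15, iters=50):
    """Power iteration for PPR: teleport back to the seed nodes with
    probability alpha, otherwise follow a random out-edge."""
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    trans = adj / np.maximum(deg, 1)          # row-stochastic transition matrix
    restart = np.zeros(n)
    restart[list(seeds)] = 1 / len(seeds)
    ppr = restart.copy()
    for _ in range(iters):
        ppr = alpha * restart + (1 - alpha) * ppr @ trans
    return ppr

def ppr_minibatch(adj, seeds, k):
    """IBMB-style selection (sketch): the minibatch is the k nodes with
    the highest PPR / influence score w.r.t. the target nodes."""
    ppr = personalized_pagerank(adj, seeds)
    return set(np.argsort(-ppr)[:k])

# a path graph 0-1-2-3-4-5: nodes closest to the seed are most influential
A = np.zeros((6, 6))
for i in range(5):
    A[i, i + 1] = A[i + 1, i] = 1
batch = ppr_minibatch(A, seeds=[0], k=3)
assert 0 in batch and 1 in batch   # the seed and its nearest neighbour get picked
```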
IBMB can be used both in training and inference with massive speedups — up to 17x and 130x, respectively 🚀</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*o8nHPFJ7qlbHG0T7" /><figcaption>Influence-based mini-batching. Source: <a href="https://openreview.net/forum?id=b9g0vxzYa_">Gasteiger, Qian, and Günnemann</a></figcaption></figure><p>The subtitle of this subsection could be “<em>brought to you by Google</em>” since the majority of the papers have authors from Google ;)</p><p><a href="https://openreview.net/pdf?id=q5h7Ywx-sS">Carey et al.</a> created <strong><em>Stars</em></strong>, a method for building sparse similarity graphs at the scale of <strong>tens of trillions</strong> of edges 🤯. Pairwise N² comparisons would obviously not work here — Stars employs two-hop <a href="https://en.wikipedia.org/wiki/Geometric_spanner">spanner graphs</a> (graphs where similar points are connected by at most two hops) and <a href="http://infolab.stanford.edu/~bawa/Pub/similarity.pdf">SortingLSH</a>, which together enable almost linear time complexity and high sparsity.</p><p><a href="https://openreview.net/pdf?id=LpgG0C6Y75">Dhulipala et al.</a> created <strong>ParHAC</strong>, an approximate (1+𝝐) parallel algorithm for hierarchical agglomerative clustering (HAC) on very large graphs, together with extensive theoretical foundations of the algorithm. ParHAC has O(V+E) complexity and poly-log depth, and runs up to 60x faster than baselines on graphs with <strong>hundreds of billions</strong> of edges (here it is the Hyperlink graph with 1.7B nodes and 125B edges).</p><p><a href="https://openreview.net/pdf?id=ldl2V3vLZ5">Devvrit et al.</a> created <strong>S³GC</strong>, a scalable self-supervised graph clustering algorithm with a one-layer GNN and a contrastive training objective.
S³GC uses both graph structure and node features and scales to graphs of up to 1.6B edges.</p><p>Finally, <a href="https://openreview.net/forum?id=Fhty8PgFkDo">Epasto et al.</a> created a differentially-private modification of PageRank!</p><p>LoG 2022 featured two tutorials on large-scale GNNs: <a href="https://www.youtube.com/watch?v=HRC4hZKiUWU">Scaling GNNs in Production</a> by Da Zheng, Vassilis N. Ioannidis, and Soji Adeshina and <a href="https://www.youtube.com/watch?v=e2jJU7u7si0">Parallel and Distributed GNNs</a> by Torsten Hoefler and Maciej Besta.</p><blockquote><strong>What to expect in 2023</strong>: further reduction in compute costs and inference time for very large graphs. Perhaps models for OGB LSC graphs could run on commodity machines instead of huge clusters?</blockquote><h3>GNN Theory: Weisfeiler and Leman Go Places, Subgraph GNNs</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nETwu-WX5ejpbS4q" /><figcaption>Tourists of the year! Source of the original portraits: <a href="https://towardsdatascience.com/towards-geometric-deep-learning-iv-chemical-precursors-of-gnns-11273d74125">Towards Geometric Deep Learning IV: Chemical Precursors of GNNs</a> by Michael Bronstein</figcaption></figure><p>🏖 🌄 Weisfeiler and Leman, grandfathers of Graph ML and GNN theory, had a very prolific traveling year! 
After visiting <a href="https://ojs.aaai.org/index.php/AAAI/article/view/4384">Neural</a>, <a href="https://proceedings.neurips.cc/paper/2020/file/f81dee42585b3814de199b2e88757f5c-Paper.pdf">Sparse</a>, <a href="http://proceedings.mlr.press/v139/bodnar21a/bodnar21a.pdf">Topological</a>, and <a href="https://proceedings.neurips.cc/paper/2021/file/157792e4abb490f99dbd738483e0d2d4-Paper.pdf">Cellular</a> places in previous years, in 2022 we saw them in several new places:</p><ul><li>WL Go <strong>Machine Learning</strong> — a comprehensive survey by <a href="https://arxiv.org/abs/2112.09992">Morris et al</a> on the basics of the WL test, terminology, and various applications;</li><li>WL Go <strong>Relational</strong> — the first attempt by <a href="https://arxiv.org/abs/2211.17113">Barcelo et al</a> to study the expressiveness of relational GNNs used in multi-relational graphs and KGs. It turns out that R-GCN and CompGCN are equally expressive and are bounded by the Relational 1-WL test, and that the most expressive message function (aggregating entity-relation representations) is the Hadamard product;</li><li><a href="https://arxiv.org/abs/2205.10914">WL Go Walking by Niels M. Kriege</a> studies the expressiveness of random walk kernels and finds that the RW kernel (with a small modification) is as expressive as the WL subtree kernel;</li><li>WL Go <strong>Geometric</strong>: <a href="https://openreview.net/forum?id=kXe4Y0c4VqT">Joshi, Bodnar et al</a> propose the Geometric WL test (GWL) to study the expressiveness of equivariant and invariant GNNs (to certain symmetries: translation, rotation, reflection, permutation).
Turns out, equivariant GNNs (such as <a href="https://arxiv.org/abs/2102.09844">E-GNN</a>, <a href="https://arxiv.org/abs/2101.03164">NequIP</a> or <a href="https://arxiv.org/abs/2206.07697">MACE</a>) are provably more powerful than invariant GNNs (such as <a href="https://proceedings.neurips.cc/paper/2017/file/303ed4c69846ab36c2904d3ba8573050-Paper.pdf">SchNet</a> or <a href="https://arxiv.org/abs/2011.14115">DimeNet</a>);</li><li>WL Go <strong>Temporal</strong>: <a href="https://openreview.net/pdf?id=MwSXgQSxL5s">Souza et al</a> propose the Temporal WL test to study the expressiveness of temporal GNNs. The authors then propose a novel injective aggregation function (and the PINT model) that should be the most expressive;</li><li>WL Go <strong>Gradual</strong>: <a href="https://openreview.net/forum?id=fe1DEN1nds">Bause and Kriege</a> propose to modify the original WL color refinement with a non-injective function where different multi-sets <em>might</em> get assigned the same color (under certain conditions). This enables a more gradual color refinement and slower convergence to a stable coloring that eventually retains the expressiveness of 1-WL but gains a few distinguishing properties along the way.</li><li>WL Go <strong>Infinite</strong>: <a href="https://arxiv.org/abs/2201.13410">Feldman et al</a> propose to change the initial node coloring to spectral features derived from the heat kernel of the Laplacian or from the k smallest eigenvectors of the Laplacian (for large graphs), which is quite close to Laplacian Positional Encodings (LPEs).</li><li>WL Go <strong>Hyperbolic</strong>: <a href="https://arxiv.org/abs/2211.02501">Nikolentzos et al</a> note that the color refinement procedure of the WL test produces a tree hierarchy of colors. In order to preserve the relative distances of nodes encoded by those colors, the authors propose to map the output states of each layer/iteration into a hyperbolic space and update them after each subsequent layer.
The final embeddings are supposed to retain the notion of node distances.</li></ul><p>📈 In the realm of more expressive (than 1-WL) architectures, subgraph GNNs are the biggest trend. Among those, four approaches stand out: 1️⃣ <strong>Subgraph Union Networks</strong> (SUN) by <a href="https://arxiv.org/abs/2206.11140">Frasca, Bevilacqua, et al.</a> provide a comprehensive analysis of the subgraph GNN design space and expressiveness, showing they are bounded by 3-WL; 2️⃣ <strong>Ordered Subgraph Aggregation Networks </strong>(OSAN) by <a href="https://arxiv.org/abs/2206.11168">Qian, Rattan, et al</a> devise a hierarchy of subgraph-enhanced GNNs (k-OSAN) and find that k-OSAN are incomparable to k-WL but are strictly limited by (k+1)-WL. One particularly cool part of OSAN is using <a href="https://arxiv.org/abs/2106.01798">Implicit MLE</a> (NeurIPS’21), a differentiable discrete sampling technique, for sampling ordered subgraphs. <strong>️3️⃣ SpeqNets </strong>by <a href="https://arxiv.org/abs/2203.13913">Morris et al.</a> devise a permutation-equivariant hierarchy of graph networks that balances between scalability and expressivity. 4️⃣<strong> GraphSNN</strong> by <a href="https://openreview.net/pdf?id=uxgg9o7bI_3">Wijesinghe and Wang</a> derives expressive models based on the overlap of <em>subgraph</em> isomorphisms and <em>subtree</em> isomorphisms.</p><p>🤔 A few works rethink the WL framework as a general means for analyzing GNN expressiveness. <a href="https://openreview.net/pdf?id=wIzUeM3TAU">Geerts and Reutter</a> define <strong>k-order MPNNs</strong> that can be characterized with Tensor Languages (with a mapping between WL and <strong>Tensor Languages</strong>).
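Since almost every result in this section is stated relative to the 1-WL test, here is a compact sketch of colour refinement (our own illustration, not from any of the papers): it cannot distinguish a 6-cycle from two disjoint triangles, because both graphs are 2-regular and every node always ends up with the same colour.

```python
from collections import Counter

def wl_colors(adj_list, rounds=3):
    """1-WL colour refinement: a node's new colour is a hash of its current
    colour together with the multiset of its neighbours' colours."""
    colors = {v: 0 for v in adj_list}          # uniform initial colouring
    for _ in range(rounds):
        colors = {v: hash((colors[v], tuple(sorted(colors[u] for u in adj_list[v]))))
                  for v in adj_list}
    return Counter(colors.values())            # final colour histogram

cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
# both graphs are 2-regular, so 1-WL produces identical colour histograms
# even though the graphs are not isomorphic
assert wl_colors(cycle6) == wl_colors(two_triangles)
```

Subgraph GNNs and the other hierarchies above exist precisely to break such ties that plain 1-WL message passing cannot.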
A new <a href="https://openreview.net/forum?id=r9hNv76KoT3">anonymous ICLR’23 submission</a> proposes to leverage <a href="https://en.wikipedia.org/wiki/Biconnected_component">graph biconnectivity</a> and defines a <strong>Generalized Distance WL</strong> algorithm.</p><p>If you’d like to study the topic even deeper, check out a wonderful <a href="https://www.youtube.com/watch?v=ASQYjbUBYzs&amp;list=PL2iNJC54likoqgKwpFnbBik8Im1sZ27Hm&amp;index=7">LOG 2022 tutorial</a> by Fabrizio Frasca, Beatrice Bevilacqua, and Haggai Maron with practical examples!</p><blockquote><strong>What to expect in 2023</strong>: <em>1️⃣</em> More efforts on creating time- and memory-efficient subgraph GNNs. <em>2️⃣</em> Better understanding of the generalization of GNNs. <em>3️⃣</em> Weisfeiler and Leman visit 10 new places!</blockquote><h3>Knowledge Graphs: Inductive Reasoning Takes Over</h3><p>Last year, we observed a major shift in KG representation learning: transductive-only approaches are being actively retired in favor of inductive models that can build meaningful representations for new, unseen nodes and perform node classification and link prediction.</p><p>In 2022, the field was expanding along two main axes: 1️⃣ inductive link prediction (LP) and 2️⃣ inductive (multi-hop) query answering, which extends link prediction to much more complex prediction tasks.</p><p>1️⃣ In link prediction, the majority of inductive models (like <a href="https://arxiv.org/abs/2106.06935"><strong>NBFNet</strong></a> or <a href="https://arxiv.org/abs/2106.12144"><strong>NodePiece</strong></a>) transfer to unseen nodes at inference time by assuming that the set of relation types is fixed during training and does not change over time, so they can learn relation embeddings. What happens when the set of relations changes as well?
In the hardest case, we’d want to transfer to KGs with completely different nodes <strong>and</strong> relation types.</p><p>So far, all such models supporting unseen relations resort to meta-learning, which is slow and resource-hungry. In 2022, for the first time, <a href="https://openreview.net/forum?id=LvW71lgly25">Huang, Ren, and Leskovec</a> proposed the Connected Subgraph Reasoner (<strong>CSR</strong>) framework that is inductive along <strong>both</strong> entities and relation types <strong>and</strong> does not need any meta-learning! 👀 Generally, for new relations at inference time, models see at least <em>k</em> example triples with this relation (hence, a k-shot learning scenario). Conceptually, CSR extracts subgraphs around each example, trying to learn common relational patterns (i.e., optimizing edge masks), and then applies the mask to the query subgraph (with the missing target link to predict).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jfJWjM8iNDjX_1GP" /><figcaption>Inductive CSR that supports KGs with unseen entities and relation types. Source: <a href="https://openreview.net/forum?id=LvW71lgly25">Huang, Ren, and Leskovec</a></figcaption></figure><p><strong>ReFactor GNNs </strong>by <a href="https://openreview.net/forum?id=81LQV4k7a7X">Chen et al.</a> is another insightful work on the inductive qualities of shallow KG embedding models — in particular, the authors find that shallow factorization models like DistMult resemble infinitely deep GNNs when viewed through the lens of backpropagation, i.e., how nodes update their representations from neighboring and non-neighboring nodes. It turns out that, theoretically, any factorization model can be turned into an inductive model!</p><p>2️⃣ Inductive representation learning arrived in the area of complex logical query answering as well.
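A quick primer before the details below: neural query executors typically replace the set intersection and union of query answers with differentiable fuzzy-logic operations over per-entity scores. A minimal sketch with product t-norms (illustrative scores, not any particular model's code):

```python
import numpy as np

def t_norm(a, b):
    """Product t-norm: fuzzy conjunction (intersection of answer sets)."""
    return a * b

def t_conorm(a, b):
    """Dual t-conorm: fuzzy disjunction (union of answer sets)."""
    return a + b - a * b

def negation(a):
    """Fuzzy negation."""
    return 1 - a

# fuzzy answer scores for two subqueries over a 5-entity vocabulary
q1 = np.array([0.9, 0.8, 0.1, 0.0, 0.5])   # scores from one relation projection
q2 = np.array([0.7, 0.2, 0.9, 0.0, 0.5])   # scores from another projection
assert np.allclose(t_norm(q1, q2), [0.63, 0.16, 0.09, 0.0, 0.25])
assert np.allclose(t_conorm(q1, q2), [0.97, 0.84, 0.91, 0.0, 0.75])
```

Because every operation is a smooth function of the scores, the whole logical query becomes end-to-end differentiable.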
(shameless plug) In fact, it was one of the focuses of our team this year 😊 First, in <a href="https://arxiv.org/abs/2205.10128">Zhu et al.</a>, we found that Neural Bellman-Ford nets generalize well from simple link prediction to complex query answering tasks in a new <a href="https://github.com/DeepGraphLearning/GNN-QE"><strong>GNN Query Executor</strong></a> (GNN-QE) model where a GNN based on NBF-Net performs relation projections while the other logical operators are performed via fuzzy-logic <a href="https://en.wikipedia.org/wiki/T-norm">t-norms</a>. Then, in <a href="https://openreview.net/forum?id=-vXEN5rIABY">Inductive Logical Query Answering in Knowledge Graphs</a> we studied ⚗️ <em>the essence of inductiveness</em> ⚗️ and proposed two ways to answer logical queries over unseen entities at inference time: via (1) inductive node representations obtained with a NodePiece encoder paired with an inference-only decoder (less performant but scalable), or via (2) inductive relational structure representations akin to the one in GNN-QE (better quality but more resource-hungry and hard to scale). Overall, we are able to scale to an inductive query setting on graphs <strong>with millions of nodes and 500k unseen nodes and 5m unseen edges</strong> during inference.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*XaCPV1Io68go-we8" /><figcaption>Inductive logical query answering approaches: via node representations (NodePiece-QE) and relational structure representations (GNN-QE). Source: <a href="https://arxiv.org/abs/2210.08008">Galkin et al.</a></figcaption></figure><p>Another cool work in the area is <a href="https://github.com/google-research/smore"><strong>SMORE</strong></a><strong> </strong>by <a href="https://arxiv.org/abs/2110.14890">Ren, Dai, et al.</a> — a large-scale (so far transductive-only) system for complex query answering over very large graphs, scaling up to the full Freebase with about 90M nodes and 300M edges 👀.
In addition to CUDA, training, and pipeline optimizations, SMORE implements a bidirectional query sampler such that training queries can be generated on-the-fly right in the data loader instead of creating and storing huge datasets. Don’t forget to check out a <a href="https://www.youtube.com/watch?v=kzWV57qJmiA&amp;list=PL2iNJC54likoqgKwpFnbBik8Im1sZ27Hm&amp;index=1">fresh hands-on tutorial</a> on large-scale graph reasoning from LOG 2022!</p><p>Last but not least, <a href="https://arxiv.org/pdf/2209.08858.pdf">Yang, Lin and Zhang</a> presented an interesting paper rethinking the evaluation of knowledge graph completion. They point out that knowledge graphs tend to be open-world (i.e., there are facts not encoded by the knowledge graph) rather than closed-world, as assumed by most works. As a result, metrics observed under the closed-world assumption exhibit a log trend w.r.t. the true metric — this means that if you get 0.4 MRR for your model, chances are that the test knowledge graph is incomplete and your model has already done a good job 👍. Maybe we can design new datasets and evaluation protocols to mitigate this issue?</p><blockquote><strong>What to expect in 2023</strong>: an inductive model fully transferable to different KGs with new sets of entities and relations, e.g., training on Wikidata and running inference on DBpedia or Freebase.</blockquote><h3>Algorithmic Reasoning and Alignment</h3><p>2022 was a year of major breakthroughs and milestones for algorithmic reasoning.</p><p>1️⃣ First, the <a href="https://github.com/deepmind/clrs"><strong>CLRS benchmark</strong></a> by <a href="https://arxiv.org/abs/2205.15659">Veličković et al.</a> is now available as the main playground to design and benchmark algorithmic reasoning models and tasks.
CLRS already includes 30 tasks (such as classical sorting algorithms, string algorithms, and graph algorithms) but still allows you to bring your own formulations or modify existing ones.</p><p>2️⃣ Then, the <strong>Generalist Neural Algorithmic Learner</strong> by <a href="https://openreview.net/forum?id=FebadKZf6Gd">Ibarz et al.</a> and DeepMind has shown that it is possible to train a <em>single</em> processor network in multi-task mode on different algorithms — previously, you’d train a single model for a single task, repeating that for all 30 CLRS problems. The paper also describes several modifications and tricks to the model architecture and training procedure that let the model generalize better and prevent forgetting, e.g., triplet reasoning similar to triangular attention (common for molecular models) and <a href="https://arxiv.org/abs/2112.00578">edge transformers</a>. Overall, the new model brings a massive 25% absolute gain over baselines and solves 24 out of 30 CLRS tasks with 60%+ micro-F1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5HX099G5aCmAYlK5" /><figcaption>Source: <a href="https://openreview.net/forum?id=FebadKZf6Gd">Ibarz et al.</a></figcaption></figure><p>3️⃣ Last year, we <a href="https://towardsdatascience.com/graph-ml-in-2022-where-are-we-now-f7f8242599e0#72d1">discussed</a> the works on algorithmic alignment and saw the signs that GNNs can probably align well with dynamic programming. In 2022, <a href="https://openreview.net/forum?id=wu1Za9dY1GY">Dudzik and Veličković</a> prove that <strong>GNNs are Dynamic Programmers</strong> using category theory, abstract algebra, and the notions of <em>pushforward</em> and <em>pullback</em> operations. This is a wonderful example of applying category theory, which many people consider “abstract nonsense” 😉.
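The textbook instance of this alignment (a sketch for illustration, not code from the paper): Bellman-Ford shortest paths is literally a min-aggregation message-passing loop over the edge list, with each node's distance estimate playing the role of its hidden state.

```python
import math

def bellman_ford_mp(n, edges, source):
    """Bellman-Ford as message passing: in every synchronous round, each node
    min-aggregates messages dist[u] + w arriving along edges (u, v, w)."""
    dist = [math.inf] * n
    dist[source] = 0.0
    for _ in range(n - 1):                    # n-1 rounds suffice
        new = list(dist)
        for u, v, w in edges:                 # message along edge u -> v
            new[v] = min(new[v], dist[u] + w)
        dist = new
    return dist

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
assert bellman_ford_mp(4, edges, source=0) == [0.0, 3.0, 1.0, 4.0]
```

Swap the fixed `min` and `dist[u] + w` for a learned aggregator and message function, and you get the GNN view of dynamic programming.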
Category theory is likely to have more impact in GNN theory and Graph ML in general, so check out the fresh <a href="https://cats.for.ai/">Cats4AI</a> course for a gentle introduction to the field.</p><p>4️⃣ Finally, the work of <a href="https://openreview.net/forum?id=AiY6XvomZV4">Beurer-Kellner et al.</a> is one of the first practical applications of the neural algorithmic reasoning framework: here it is applied to configuring computer networks, i.e., routing protocols like BGP that are at the core of the internet. The authors show that representing a routing config as a graph makes it possible to frame the routing problem as node property prediction. This approach brings a whopping 👀 <strong>490x</strong> 👀 speedup compared to traditional rule-based routing methods while still maintaining 90+% specification consistency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Kyy7HdkpCjs8RI6k" /><figcaption>Source: <a href="https://openreview.net/forum?id=AiY6XvomZV4">Beurer-Kellner et al.</a></figcaption></figure><p>If you want to follow algorithmic reasoning more closely, don’t miss a fresh <a href="https://algo-reasoning.github.io/">LoG 2022 tutorial</a> by Petar Veličković, Andreea Deac and Andrew Dudzik.</p><blockquote><strong>What to expect in 2023: <em>1️⃣</em> </strong>Algorithmic reasoning tasks are likely to scale to graphs of thousands of nodes and to practical applications in code analysis or databases, <em>2️⃣</em> even more algorithms in the benchmark, <em>3️⃣</em> most unlikely — a model capable of solving quickselect will appear <em>😅</em></blockquote><h3>Cool GNN Applications</h3><p>👃<strong> Learning to Smell with GNNs.</strong> Back in 2019, Google AI started a <a href="https://ai.googleblog.com/2019/10/learning-to-smell-using-deep-learning.html">project</a> on learning representations of smells. From basic chemistry we know that aromaticity depends on the molecular structure, e.g., cyclic compounds.
In fact, the whole group of “aromatic hydrocarbons” was named <em>aromatic</em> because they actually have a smell (compared to many non-organic molecules). If we have a molecular structure, we can employ a GNN on top of it and learn some representations!</p><p>Recently, Google AI released <a href="https://ai.googleblog.com/2022/09/digitizing-smell-using-molecular-maps.html">a new blogpost</a> and a paper by <a href="https://www.biorxiv.org/content/10.1101/2022.07.21.500995v3">Qian et al.</a> describing the next phase of the project — the <strong>Principal Odor Map</strong>, which is able to group molecules into “odor clusters”. The authors conducted 3 cool experiments: classifying 400 new molecules never smelled before and comparing the predictions to the averaged rating of a group of human panelists; linking odor quality to fundamental biology; and probing aromatic molecules for their mosquito-repelling qualities. The GNN-based model shows very good results — now we can finally claim that GNNs can smell! Looking forward to GNNs transforming the perfume industry.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Mlteassl5G0tMz4M" /><figcaption>Embedding of odors. Source: <a href="https://ai.googleblog.com/2022/09/digitizing-smell-using-molecular-maps.html">Google AI blog</a></figcaption></figure><p>⚽<strong> GNNs + Football.</strong> If you thought that sophisticated GNNs for modelling trajectories are only used for molecular dynamics and arcane quantum simulations, fear not! Here is a cool practical application with very high potential outreach: <strong>Graph Imputer</strong> by <a href="https://www.nature.com/articles/s41598-022-12547-0.epdf?sharing_token=HmyoHCAtNdoDfjlObtCiltRgN0jAjWel9jnR3ZoTv0NzQifNnvllGA8o7uZB3n1gdCaC-3jfBQwxpTCJNR7isTeW2uWhYUL8hz8MmWvyYQLogAFNcVp5ZZuTr_O-slFsi4f4-5pz3J2Th9rSxCJV-s63f-q5fojV0FBGNWKYlRQ%3D">Omidshafiei et al.</a>, DeepMind, and FC Liverpool predicts the trajectories of football players (and the ball).
Each game graph consists of 23 nodes and gets updated with a standard message passing encoder and a special time-dependent LSTM. The dataset is quite novel, too — it consists of 105 English Premier League matches (avg 90 min each), all players and the ball were tracked at 25 fps, and the resulting training trajectory sequences encode about 9.6 seconds of gameplay.</p><p>The paper is easy to read and has numerous football illustrations; check it out! Sports tech is actively growing these days, and football analysts can now go even deeper in studying their competitors. Will EPL clubs compete for GNN researchers in the upcoming transfer windows? Time to create a transfermarkt for GNN researchers 😉</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/0*o3UD2Ff2J9_g0RU1" /><figcaption>Football match simulation is like molecular dynamics simulation! Source: <a href="https://twitter.com/deepmind/status/1529444212864843777?lang=en">DeepMind</a></figcaption></figure><p>🪐 <strong>Galaxies and Astrophysics. </strong>For astrophysics aficionados: <strong>Mangrove</strong> by <a href="https://arxiv.org/abs/2210.13473">Jespersen et al.</a> applies GraphSAGE to dark matter merger trees to predict a variety of galactic properties like stellar mass, cold gas mass, star formation rate, and even black hole mass. The paper is a bit heavy on astrophysics terminology but pretty easy in terms of GNN parameterization and training. Mangrove works 4–9 orders of magnitude faster than standard models. The experimental charts are pieces of art that you could hang on a wall 🖼️.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7I0fOAqTI9jDFjyk" /><figcaption>Mangrove approach to present dark matter halos as merger trees and graphs. Source: <a href="https://arxiv.org/abs/2210.13473">Jespersen et al.</a></figcaption></figure><p>🤖 <strong>GNNs for code</strong>. Code generation models like AlphaCode and Codex have mind-blowing capabilities.
Although LLMs are at the core of those models, GNNs do help in a few neat ways: <strong>Instruction Pointer Attention GNNs</strong> (IPA-GNNs), first proposed by <a href="https://arxiv.org/abs/2010.12621">Bieber et al</a>, have been used to <a href="https://arxiv.org/abs/2203.03771">predict runtime errors</a> in competitive programming tasks — so it is almost like a virtual code interpreter! <strong>CodeTrek</strong> by <a href="https://openreview.net/forum?id=WQc075jmBmf">Pashakhanloo et al.</a> proposes to model a program as a relational graph and embed it via random walks and a Transformer encoder. Downstream applications include predicting variable misuse, exceptions, and shadowed variables.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RQGdLHQX9avnnyjV" /><figcaption>Source: <a href="https://openreview.net/forum?id=WQc075jmBmf">Pashakhanloo et al.</a></figcaption></figure><h3>Hardware: IPUs and Graphcore Win OGB Large-Scale Challenge 2022</h3><p>🥇 2022 brought huge success to <a href="https://www.graphcore.ai/">Graphcore</a> and <a href="https://www.graphcore.ai/bow-processors">IPUs</a> — the hardware optimized for the sparse operations that are so needed when working with graphs.
The first success story was optimizing Temporal Graph Nets (TGN) for IPUs with massive performance gains (check the <a href="https://towardsdatascience.com/accelerating-and-scaling-temporal-graph-networks-on-the-graphcore-ipu-c15ac309b765">article</a> in Michael Bronstein’s blog).</p><p><a href="https://towardsdatascience.com/accelerating-and-scaling-temporal-graph-networks-on-the-graphcore-ipu-c15ac309b765">Accelerating and scaling Temporal Graph Networks on the Graphcore IPU</a></p><p>Later on, Graphcore <a href="https://www.graphcore.ai/posts/graphcore-claims-double-win-in-open-graph-benchmark-challenge">stormed the leaderboards</a> of OGB LSC’22 by winning 2 out of 3 tracks: link prediction on the <strong>WikiKG90M v2</strong> knowledge graph and graph regression on the <strong>PCQM4M v2</strong> molecular dataset. In addition to the sheer compute power, the authors made several clever modeling decisions: for link prediction it was <a href="https://arxiv.org/abs/2211.12281">Balanced Entity Sampling and Sharing (BESS)</a> for training an ensemble of shallow LP models (check the <a href="https://towardsdatascience.com/large-scale-knowledge-graph-completion-on-ipu-4cf386dfa826">blog post</a> by Daniel Justus for more details), and GPS++ for the graph regression task (we covered GPS++ above in the GT section). You can <a href="https://ipu.dev/3FwVoLD">try out</a> the pre-trained models using IPU-powered virtual machines on Paperspace. Congratulations to Graphcore and their team! 👏</p><p>PyG partnered with NVIDIA (<a href="https://pyg.org/ns-newsarticle-accelerating-pyg-on-nvidia-gpus">post</a>) and Intel (<a href="https://pyg.org/news/accelerating-pyg-on-intel-cpus">post</a>) to increase the performance of core operations on GPUs and CPUs, respectively. Similarly, DGL <a href="https://www.dgl.ai/release/2022/07/25/release.html">incorporated</a> new GPU optimizations in the recent 0.9 version.
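</p><p>To make concrete the kind of operation these releases speed up: one round of GNN neighborhood aggregation is essentially a sparse-dense matrix product. A toy sketch of ours (SciPy standing in for the libraries’ fused GPU/CPU kernels; the graph and feature sizes are made up):</p>

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy graph: 4 nodes, directed edges src -> dst, unit edge weights.
src = np.array([0, 1, 2, 3, 0])
dst = np.array([1, 2, 3, 0, 2])
n = 4
A = coo_matrix((np.ones_like(src, dtype=float), (dst, src)), shape=(n, n))

X = np.arange(n * 3, dtype=float).reshape(n, 3)  # node features, d = 3

# Sum-aggregation over in-neighbors: H = A @ X. GNN libraries dispatch
# exactly this sparse matmul to their optimized backends.
H = A @ X
print(H.shape)  # (4, 3)
```

<p>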
These bring massive gains for sparse matmuls and sampling procedures, so we’d encourage you to update your environments to the most recent versions!</p><blockquote><strong>What to expect in 2023</strong>: major GNN libraries are likely to increase the breadth of supported hardware backends such as IPUs or upcoming Intel Max Series GPUs.</blockquote><h3>New Conferences: Learning on Graphs (LoG) and Molecular ML (MoML)</h3><p>This year we witnessed the inauguration of two graph and geometric ML conferences: the <a href="https://logconference.org/#hero">Learning on Graphs Conference (LoG)</a> and the <a href="https://www.moml22.mit.edu/">Molecular ML Conference</a> (MoML).</p><p>LoG is a more general all-around GraphML venue (held virtually this year) while MoML (held at MIT) has a broader mission and influence over the AI4Science community, where graphs and geometry still play a major role. Both conferences were received extremely well. MoML attracted 7 top speakers and 38 posters; LoG had ~3000 registrations, 266 submissions, 71 posters, 12 orals, and 7 awesome tutorials (all recordings of oral talks and tutorials are <a href="https://www.youtube.com/@learningongraphs">already on YouTube</a>). Besides, LoG introduced a great monetary incentive for reviewers, resulting in a well-recognized improvement of the review quality!
From our point of view, the quality of LoG reviews was often better than at NeurIPS or ICML.</p><p>This is a huge win and a celebration for the graph ML community; congrats to everyone working in graph and geometric machine learning on a new “home” venue!</p><blockquote><strong>What to expect in 2023:</strong> LoG and MoML become the main Graph ML venues to include in your submission calendar along with ICLR / NeurIPS / ICML</blockquote><h3>Courses and Educational Materials</h3><ul><li>Geometric Deep Learning Course — <a href="https://www.youtube.com/playlist?list=PLn2-dEmQeTfSLXW8yXP4q_Ii58wFdxb3C">Second Edition</a> (2022) is already on YouTube. The main entry point to the field.</li><li><a href="https://uvagedl.github.io/">An Introduction to Group Equivariant Deep Learning</a> by Erik Bekkers — one of the best new courses about equivariance and equivariant models!</li><li><a href="https://cats.for.ai/">Cats4AI</a> — a new course by Andrew Dudzik, Bruno Gavranović, João Guilherme Araújo, Petar Veličković, and Pim de Haan is the best place to learn about category theory and its connections to Geometric DL.</li><li>Summer School proceedings: <a href="https://www.sci.unich.it/geodeep2022/#home">Italian Summer School on Geometric DL</a>, London Geometry and Machine Learning (<a href="https://www.logml.ai/home-2022">LOGML</a>) Summer School, <a href="https://www.birs.ca/events/2022/5-day-workshops/22w5125">BIRS Workshop on Topological Representation Learning</a>.</li><li><a href="https://snap.stanford.edu/graphlearning-workshop-2022/">Stanford Graph Learning Workshop 2022</a> — latest news from PyG developers and partners and Stanford researchers.</li></ul><h3>New Datasets, Benchmarks, and Challenges</h3><ul><li><a href="https://ogb.stanford.edu/neurips2022/">OGB Large-Scale Challenge 2022</a>: the second large-scale challenge, held at NeurIPS 2022, with large and realistic graph ML tasks covering node-, edge-, and graph-level predictions.</li><li><a
href="https://opencatalystproject.org/challenge.html">Open Catalyst 2022 Challenge</a>: the second edition of the challenge, held at NeurIPS 2022, with the task of designing new machine learning models to predict the outcome of catalyst simulations used to understand catalytic activity</li><li><a href="https://predictioncenter.org/casp15/index.cgi">CASP 15</a>: the protein structure prediction challenge disrupted by AlphaFold a few years ago at CASP 14. Detailed analysis is yet to come, but it seems that MSAs strike back and the best-performing models still rely on MSAs.</li><li><a href="https://arxiv.org/abs/2206.08164">Long Range Graph Benchmark</a>: for measuring the capabilities of GNNs and GTs to capture long-range interactions in graphs.</li><li><a href="https://arxiv.org/abs/2206.07729">Taxonomy of Graph Benchmarks</a>, <a href="https://github.com/Graph-Learning-Benchmarks/gli">Graph Learning Indexer</a>: deeper studies of the dataset landscape in Graph ML outlining open challenges in benchmarking and trustworthiness of results.</li><li><a href="https://ai.googleblog.com/2022/05/graphworld-advances-in-graph.html">GraphWorld</a>: a framework for analyzing the performance of GNN architectures on millions of synthetic benchmark datasets</li><li><a href="https://openreview.net/forum?id=10iA3OowAV3">Chartalist</a> — a collection of blockchain graph datasets</li><li><a href="https://github.com/DeepGraphLearning/PEER_Benchmark">PEER protein learning benchmark</a>: a multi-task benchmark for protein sequence understanding with 17 tasks across 5 categories.</li><li><a href="https://esmatlas.com/">ESM Metagenomic Atlas</a>: a comprehensive database of over 600 million predicted protein structures with nice visualizations and a search UI.</li></ul><h3>Software Libraries and Open Source</h3><ul><li>Mainstream graph ML libraries: <a href="https://www.pyg.org/">PyG 2.2</a> (PyTorch), <a href="https://www.dgl.ai/">DGL 0.9</a> (PyTorch, TensorFlow, MXNet), <a
href="https://github.com/tensorflow/gnn">TF GNN</a> (TensorFlow) and <a href="https://github.com/deepmind/jraph">Jraph</a> (Jax)</li><li><a href="https://torchdrug.ai/">TorchDrug</a> and <a href="https://torchprotein.ai/">TorchProtein</a>: machine learning libraries for drug discovery and protein science</li><li><a href="https://github.com/pykeen/pykeen">PyKEEN</a>: the best platform for training and evaluating knowledge graph embeddings</li><li><a href="https://graphein.ai/">Graphein</a>: a package providing a variety of graph-based representations of proteins</li><li><a href="https://github.com/AnacletoLAB/grape">GRAPE</a> and <a href="https://marius-project.org/">Marius</a>: scalable graph processing and embedding libraries for billion-scale graphs</li><li><a href="https://github.com/IntelLabs/matsciml">MatSci ML Toolkit</a>: a flexible framework for deep learning on the Open Catalyst dataset</li><li><a href="https://github.com/e3nn/e3nn">E3nn</a>: the go-to library for E(3) equivariant neural networks</li></ul><h3>Join the Community</h3><ul><li>Reading Groups: <a href="https://m2d2.io/talks/log2/about/">Learning on Graphs and Geometry</a> (LOG2) reading group, <a href="https://m2d2.io/talks/m2d2/about/">Molecular Modeling &amp; Drug Discovery</a> (M2D2) reading group, and their Slack communities</li><li>Learning on Graphs (LoG) <a href="https://logconference.org/">Slack community</a></li><li><a href="https://michael-bronstein.medium.com/">Michael Bronstein’s blog on Medium</a></li><li><a href="https://medium.com/@pytorch_geometric">PyG medium</a>, <a href="https://pyg.org/blogs-and-tutorials">blog posts</a>, and newsletter</li><li><a href="https://t.me/graphML">GraphML Telegram channel</a></li></ul><h3>The Meme of 2022 🪓</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/462/0*p4pvs3dlrcsd7MOQ" /><figcaption>Created by Michael Galkin and Michael Bronstein</figcaption></figure><img
src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1ba920cb9232" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/graph-ml-in-2023-the-state-of-affairs-1ba920cb9232">Graph ML in 2023: The State of Affairs</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Denoising Diffusion Generative Models in Graph ML]]></title>
            <link>https://medium.com/data-science/denoising-diffusion-generative-models-in-graph-ml-c496af5811c5?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/c496af5811c5</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[editors-pick]]></category>
            <category><![CDATA[geometric-deep-learning]]></category>
            <category><![CDATA[drug-discovery]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Sat, 26 Nov 2022 21:42:17 GMT</pubDate>
            <atom:updated>2022-11-28T14:21:05.058Z</atom:updated>
<content:encoded><![CDATA[<h4>What’s new in Graph ML?</h4><h4>Is Denoising Diffusion all you need?</h4><p>The breakthrough in <a href="https://arxiv.org/abs/2006.11239">Denoising Diffusion Probabilistic Models</a> (DDPM) happened about 2 years ago. Since then, we have observed dramatic improvements in generation tasks: <a href="https://arxiv.org/abs/2112.10741">GLIDE</a>, <a href="https://openai.com/dall-e-2/">DALL-E 2</a>, <a href="https://gweb-research-imagen.appspot.com/paper.pdf">Imagen</a>, <a href="https://github.com/Stability-AI/stablediffusion">Stable Diffusion</a> for images, <a href="https://arxiv.org/pdf/2205.14217.pdf">Diffusion-LM</a> in language modeling, diffusion for <a href="https://arxiv.org/pdf/2205.09853.pdf">video sequences</a>, and even <a href="https://arxiv.org/pdf/2205.09991.pdf">diffusion for reinforcement learning</a>.</p><p>Diffusion might be the biggest trend in GraphML in 2022 — particularly when applied to drug discovery, molecule and conformer generation, and quantum chemistry in general. Often, diffusion models are paired with the latest advancements in equivariant GNNs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*mS6STisUN8L_xdRyvJy7CA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*KmLTqvHwaWMWs5rkEOrjUg.png" /><figcaption>Molecule generation. Generated with <a href="https://huggingface.co/spaces/stabilityai/stable-diffusion">Stable Diffusion 2</a></figcaption></figure><h3>The Basics: Diffusion and Diffusion on Graphs</h3><p>Let’s recapitulate the basics of diffusion models with the example of the Equivariant Diffusion paper by <a href="https://arxiv.org/abs/2203.17003">Hoogeboom et al.</a>, using as few equations as possible 😅</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1jVaq8gtLlu8s9Qe" /><figcaption>Forward and backward diffusion processes. Forward process q(z|x,h) gradually adds noise to the graph until it becomes Gaussian noise.
Backward process p(x,h|z) starts from the Gaussian noise and gradually denoises the graph until it becomes a valid graph. Source: <a href="https://arxiv.org/pdf/2203.17003.pdf"><strong>Hoogeboom, Satorras, Vignac, and Welling</strong></a>.</figcaption></figure><ul><li>Input: a graph (<em>N,E</em>) with <em>N</em> nodes and <em>E</em> edges</li><li>Node features often have two parts: <em>z=[x,h]</em> where <em>x</em> ∈ R³ are 3D coordinates and <em>h</em> ∈ R^d are categorical features like atom types</li><li>(Optional) Edge features are bond types</li><li>Output: a graph (<em>N,E</em>) with nodes, edges, and corresponding features</li><li><strong>Forward diffusion</strong> process <em>q(z_t | x,h)</em>: at each time step <em>t</em>, inject noise into the features so that at the final step <em>T</em> they become white noise</li><li><strong>Reverse diffusion</strong> process <em>p(z_{t-1} | z_t)</em>: at each time step <em>t-1,</em> ask the model to predict the noise and <em>“subtract”</em> it from the input such that at the final step <em>t=0</em> we have a new valid generated graph</li><li>A <strong>denoising</strong> neural network learns to predict the injected noise</li><li>Denoising diffusion is known to be equivalent to <em>score matching</em> [<a href="https://arxiv.org/abs/1907.05600"><strong>Song and Ermon (2019</strong></a><strong>)</strong> and <a href="https://arxiv.org/abs/2011.13456"><strong>Song et al. (2021</strong></a><strong>)</strong>] where a neural network learns to predict the score <em>∇_x log p_t(x)</em> of the diffused data. The score-based perspective describes the forward/reverse processes with <a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">Stochastic Differential Equations</a> (SDEs) driven by the <a href="https://en.wikipedia.org/wiki/Wiener_process">Wiener process</a></li></ul><blockquote>Emiel Hoogeboom, Victor Garcia Satorras, Clément Vignac, Max Welling.
<a href="https://arxiv.org/pdf/2203.17003.pdf">Equivariant Diffusion for Molecule Generation in 3D</a>. ICML 2022. <a href="https://github.com/ehoogeboom/e3_diffusion_for_molecules">GitHub</a></blockquote><p>The work introduces an equivariant diffusion model (<strong>EDM</strong>) for molecule generation that has to maintain E(3) equivariance over atom coordinates <em>x</em> (with respect to <em>rotation</em>, <em>translation</em>, and <em>reflection</em>) while node features <em>h</em> (such as atom types) remain invariant. Importantly, atoms have different feature modalities: atom charge is an ordinal integer, atom types are one-hot categorical features, and atom coordinates are continuous features, so the authors design feature-specific noising processes and loss functions, and scale input features for training stability.</p><p>EDM employs an equivariant <a href="https://arxiv.org/pdf/2102.09844.pdf">E(n) GNN</a> as a neural network that predicts noise based on input features and time step. At inference time, we first sample the desired number of atoms <em>M</em>, then we can condition EDM on a desired property <em>c</em>, and ask EDM to generate molecules (defined by features <em>x</em> and <em>h</em>) as <em>x, h ~ p(x,h | c, M)</em>.</p><p>Experimentally, EDM outperforms normalizing flow- and VAE-based approaches by a large margin in terms of achieved negative log-likelihood, molecule stability, and uniqueness. Ablations demonstrate that an equivariant GNN encoder is crucial, as replacing it with a standard MPNN leads to significant performance drops.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/0*fYHw-wE0C9ZKWgdL.gif" /><figcaption>Diffusion-based generation visualization. Source: <a href="https://twitter.com/emiel_hoogeboom/status/1509838163375706112">Twitter</a></figcaption></figure><h3>DiGress: Diffusion for Graph Generation</h3><blockquote>Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, Pascal Frossard.
<a href="https://arxiv.org/abs/2209.14734">DiGress: Discrete Denoising diffusion for graph generation</a>. <a href="https://github.com/cvignac/DiGress">GitHub</a></blockquote><p><a href="https://arxiv.org/abs/2209.14734">DiGress</a> by Clément Vignac, Igor Krawczuk, and the EPFL team is an unconditional <strong>graph generation</strong> model (with the option to incorporate a score-based function for conditioning on graph-level features like energy MAE). DiGress is a discrete diffusion model, that is, it operates on discrete node types (like atom types C, N, O) and edge types (like single / double / triple bond) where adding noise to a graph corresponds to multiplication with a transition matrix (from one type to another) whose entries are estimated from marginal probabilities in the training set. The denoising neural net is a modified Graph Transformer. DiGress works for many graph families (planar, SBMs, and molecules); <a href="https://github.com/cvignac/DiGress">code</a> is available, and check the <a href="https://www.youtube.com/watch?v=k2saMtP-Fn8">video</a> from the LoGaG reading group presentation!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kBtwYwodtGNjOah6Pf8cGA.png" /><figcaption>DiGress diffusion process. Source: <a href="https://arxiv.org/pdf/2209.14734.pdf"><strong>Vignac, Krawczuk, et al.</strong></a></figcaption></figure><h3>GeoDiff and Torsional Diffusion: Molecular Conformer Generation</h3><p>Given a molecule with 3D coordinates of its atoms, <strong>conformer generation</strong> is the task of generating another set of <strong>valid</strong> 3D coordinates in which the molecule can exist. Recently, GeoDiff and Torsional Diffusion have applied the diffusion framework to this task.</p><blockquote>Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, Jian Tang. <a href="https://arxiv.org/abs/2203.02923">GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation</a>. ICLR 2022.
<a href="https://github.com/MinkaiXu/GeoDiff">GitHub</a></blockquote><p><a href="https://arxiv.org/abs/2203.02923">GeoDiff</a> is the SE(3)-equivariant diffusion model for generating conformers of given molecules. Diffusion is applied to 3D coordinates that gradually get transformed to Gaussian noise (forward process). The reverse process denoises a random sample to a valid set of atomic coordinates. GeoDiff defines an equivariant diffusion framework in the Euclidean space (that postulates which kind of noise can be added) and applies an equivariant GNN as the denoising model. The denoising GNN, a <em>Graph Field Network</em>, is an extension of rather standard EGNNs<em>. </em>For the first time, GeoDiff showed how<em> much better</em> the diffusion models are compared to normalizing flows and VAE-based models 💪</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YEDDRX_VzDaG5u3gKa6JEA.png" /><figcaption>GeoDiff. Source: <a href="https://arxiv.org/pdf/2203.02923.pdf"><strong>Xu et al.</strong></a></figcaption></figure><blockquote>Bowen Jing, Gabriele Corso, Jeffrey Chang, Regina Barzilay, Tommi Jaakkola. <a href="https://arxiv.org/abs/2206.01729">Torsional Diffusion for Molecular Conformer Generation</a>. NeurIPS 2022. <a href="https://github.com/gcorso/torsional-diffusion">GitHub</a></blockquote><p>While GeoDiff diffuses 3D coordinates of atoms in the Euclidean space, <a href="https://arxiv.org/pdf/2206.01729.pdf">Torsional Diffusion</a> proposes a neat way to perturb torsion angles in freely rotatable bonds of molecules. Since the number of such rotatable bonds is always much smaller than the number of atoms (on average in GEOM-DRUGS, 44 atoms vs 8 torsion angles per molecule), generation can potentially be much faster. 
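</p><p>As a toy sketch of ours (not the paper’s code): a perturbation step then only needs to touch the handful of torsion angles, and, since angles are periodic, the noised values must be wrapped back onto the circle:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_torsions(angles, sigma):
    """Add Gaussian noise to torsion angles and wrap to [-pi, pi).

    Toy stand-in for one perturbation step: the actual Torsional
    Diffusion kernel is the wrapped normal distribution, whose density
    sums the Gaussian over all 2*pi shifts; sampling it amounts to
    adding Gaussian noise and wrapping, as below.
    """
    noised = angles + sigma * rng.standard_normal(angles.shape)
    return (noised + np.pi) % (2 * np.pi) - np.pi

torsions = np.array([3.0, -3.0, 0.5])  # ~8 angles per GEOM-DRUGS molecule
print(perturb_torsions(torsions, sigma=0.1))
```

<p>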
The tricky part is that torsion angles do not form a Euclidean space, but rather a <a href="https://en.wikipedia.org/wiki/Torus">hypertorus</a> (a donut 🍩), so adding some Gaussian noise to coordinates won’t work — instead, the authors design a novel perturbation kernel as the <em>wrapped normal distribution </em>(a normal distribution on the real line wrapped modulo <em>2π</em>)<em>.</em> Torsional Diffusion applies the score-based perspective to training and generation where the score model has to be SE(3)-<strong>invariant</strong> and sign-<strong>equivariant</strong>. The score model is a variation of the <a href="https://arxiv.org/abs/1802.08219">Tensor Field Network</a>.</p><p>Experimentally, Torsional Diffusion indeed works faster — it only needs 5–20 steps compared to 5000 steps of GeoDiff, and is currently the SOTA in conformer generation 🚀</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yP6I3nWbI8dunn_s3ecVrQ.png" /><figcaption>Torsional Diffusion. Source: <a href="https://arxiv.org/pdf/2206.01729.pdf"><strong>Jing, Corso, et al.</strong></a></figcaption></figure><h3>DiffDock: Diffusion for Molecular Docking</h3><blockquote>Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, Tommi Jaakkola. <a href="https://arxiv.org/abs/2210.01776">DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking</a>. <a href="https://github.com/gcorso/DiffDock">GitHub</a></blockquote><p><a href="https://arxiv.org/abs/2210.01776">DiffDock</a> is a score-based generative model for <strong>molecular docking</strong>, i.e., given a ligand and a protein, <strong>predicting how the ligand binds to the target protein</strong>.
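</p><p>Two of the degrees of freedom of a docking pose, rotating and translating the ligand as a rigid body, are easy to picture (a toy SciPy sketch of ours, not DiffDock’s code; sizes are made up):</p>

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)

ligand = rng.standard_normal((12, 3))  # toy ligand: 12 atoms in 3D

# One "pose move": a random rotation (SO(3)) plus a translation (T(3)).
R = Rotation.random(random_state=0)
t = rng.standard_normal(3)
moved = R.apply(ligand) + t

# Rigid moves change the pose but preserve the internal geometry:
d0 = np.linalg.norm(ligand[:, None] - ligand[None, :], axis=-1)
d1 = np.linalg.norm(moved[:, None] - moved[None, :], axis=-1)
print(np.allclose(d0, d1))  # True
```

<p>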
DiffDock runs the diffusion process over translations T(3), rotations SO(3), and torsion angles SO(2)^m in the product space: (1) positioning the ligand with respect to the protein (the binding pocket; since the pocket is unknown in advance, this is <em>blind docking</em>); (2) defining the rotational orientation of the ligand; and (3) defining the torsion angles of the conformation (see Torsional Diffusion above for reference).</p><p>DiffDock trains two models: the score model for predicting actual coordinates and the confidence model for estimating the likelihood of the generated prediction. Both models are SE(3)-equivariant networks over point clouds, but the bigger score model (in terms of parameter count) works on protein residues from alpha-carbons (initialized from the <a href="https://github.com/facebookresearch/esm">now-famous ESM2</a> protein LM) while the confidence model uses fine-grained atom representations. Initial ligand structures are generated by RDKit. DiffDock dramatically improves prediction quality, and you can even upload your own proteins (PDB) and ligands (SMILES) in the <a href="https://huggingface.co/spaces/simonduerr/diffdock">online demo on HuggingFace spaces</a> to test it out!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*V3IMIRAFgNeaYj5ZQeKlOg.png" /><figcaption>DiffDock intuition. Source: <a href="https://arxiv.org/pdf/2210.01776.pdf"><strong>Corso, Stärk, Jing, et al.</strong></a></figcaption></figure><h3>DiffSBDD: Diffusion for Generating Novel Ligands</h3><blockquote>Arne Schneuing, Yuanqi Du, Charles Harris, Arian Jamasb, Ilia Igashov, Weitao Du, Tom Blundell, Pietro Lió, Carla Gomes, Max Welling, Michael Bronstein, Bruno Correia. <a href="https://arxiv.org/abs/2210.13695">Structure-based Drug Design with Equivariant Diffusion Models</a>.
<a href="https://github.com/arneschneuing/DiffSBDD">GitHub</a></blockquote><p><a href="https://arxiv.org/abs/2210.13695">DiffSBDD</a> is a diffusion model for <strong>generating novel ligands conditioned on the protein pocket.</strong> DiffSBDD can be implemented in two ways: (1) pocket-conditioned ligand generation when the pocket is fixed; (2) inpainting-like generation that approximates the joint distribution of pocket-ligand pairs. In both approaches, DiffSBDD relies on a tuned equivariant diffusion model (<a href="https://towardsdatascience.com/graph-machine-learning-icml-2022-252f39865c70#7cf5">EDM, ICML 2022</a>) with an equivariant <a href="https://arxiv.org/pdf/2102.09844.pdf">EGNN</a> as the denoising model. Practically, ligands and proteins are represented as point clouds with categorical features and 3D coordinates (proteins can be represented as alpha-carbon residues or full atoms, with one-hot encoding of residues; ESM2 could be used here in the future), so diffusion is performed over the 3D coordinates, ensuring equivariance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lVTpyuid7EKF9mScNOZg2w.png" /><figcaption>DiffSBDD. Source: <a href="https://arxiv.org/pdf/2210.13695.pdf"><strong>Schneuing, Du, et al.</strong></a></figcaption></figure><h3>DiffLinker: Diffusion for Generating Molecular Linkers</h3><blockquote>Ilya Igashov, Hannes Stärk, Clément Vignac, Victor Garcia Satorras, Pascal Frossard, Max Welling, Michael Bronstein, Bruno Correia. <a href="https://arxiv.org/abs/2210.05274">Equivariant 3D-Conditional Diffusion Models for Molecular Linker Design</a>. <a href="https://github.com/igashov/DiffLinker">GitHub</a></blockquote><p><a href="https://arxiv.org/abs/2210.05274">DiffLinker</a> is a diffusion model for <strong>generating molecular linkers</strong> conditioned on 3D fragments.
While previous models are autoregressive (hence not permutation equivariant) and can only link two fragments, DiffLinker generates the whole structure and can link 2+ fragments. In DiffLinker, each point cloud is conditioned on the context (all other known fragments and/or the protein pocket); the context is usually fixed. The diffusion framework is similar to EDM but is now conditioned on the 3D data rather than on scalars. The denoising model is the same equivariant EGNN. Interestingly, DiffLinker has an additional module to predict the linker size (the number of atoms) so you don’t have to specify it beforehand.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kMXcUamd7tcICNypQpkVwQ.png" /><figcaption>DiffLinker. Source: <a href="https://arxiv.org/pdf/2210.05274.pdf"><strong>Igashov et al.</strong></a></figcaption></figure><h3><strong>Learn More</strong></h3><ul><li><a href="https://arxiv.org/abs/2206.04119">SMCDiff</a> for generating protein scaffolds conditioned on the desired motif (also with EGNN).</li><li>Generally, in graph and molecule generation we’d like to support some discreteness, so any improvements to discrete diffusion are very welcome, e.g., <a href="https://arxiv.org/abs/2210.14784">Richemond, Dieleman, and Doucet propose</a> a new simplex diffusion for categorical data with the Cox-Ingersoll-Ross SDE (rare find!).</li><li>Discrete diffusion is also studied for text generation in the recent <a href="https://arxiv.org/abs/2210.16886">DiffusER</a>.</li><li>Hugging Face maintains the 🧨 <a href="https://github.com/huggingface/diffusers">Diffusers</a> library and has started the <a href="https://github.com/huggingface/diffusion-models-class">open course on Diffusion Models</a> — check them out for practical implementation tips</li><li>Check the recordings of the <a href="https://cvpr2022-tutorial-diffusion-models.github.io/">CVPR 2022 tutorial on diffusion models</a> by Karsten Kreis, Ruiqi Gao, and Arash Vahdat</li></ul><p>We’ll spare your
browser tabs for now 📚 but do expect more diffusion models in Geometric DL!</p><p><em>A special thanks goes to </em><a href="https://hannes-stark.com/"><em>Hannes Stärk</em></a><em> and </em><a href="https://rampasek.github.io/"><em>Ladislav Rampášek</em></a><em> for proofreading the post! Follow </em><a href="https://twitter.com/HannesStaerk"><em>Hannes</em></a><em>, </em><a href="https://twitter.com/rampasek"><em>Ladislav</em></a><em>, and </em><a href="https://twitter.com/michael_galkin"><em>me</em></a><em> on Twitter, or subscribe to the </em><a href="https://t.me/graphML"><em>GraphML</em></a><em> channel in Telegram.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*QEHR5PGpu0LWK-jFuyiQYQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*Y8eeMA5aES2mSEAdN5fg6Q.png" /><figcaption>Molecule generation. Generated with <a href="https://huggingface.co/spaces/stabilityai/stable-diffusion">Stable Diffusion 2</a></figcaption></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c496af5811c5" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/denoising-diffusion-generative-models-in-graph-ml-c496af5811c5">Denoising Diffusion Generative Models in Graph ML</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph Machine Learning @ ICML 2022]]></title>
            <link>https://medium.com/data-science/graph-machine-learning-icml-2022-252f39865c70?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/252f39865c70</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[graph]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Mon, 25 Jul 2022 06:40:43 GMT</pubDate>
            <atom:updated>2022-07-25T13:42:42.505Z</atom:updated>
            <content:encoded><![CDATA[<h4>What’s New in GraphML?</h4><h4>Recent advancements and hot trends, July 2022 edition</h4><p><a href="https://icml.cc/Conferences/2022/">International Conference on Machine Learning (ICML)</a> is one of the premier venues where researchers publish their best work. ICML 2022 was packed with hundreds of papers and <a href="https://icml.cc/Conferences/2022/Schedule?type=Workshop">numerous workshops</a> dedicated to graphs. We share the overview of the hottest research areas 🔥 in Graph ML.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WM79_JkL45YRuDGiJ4S7pQ.png" /></figure><p><em>This post was written by </em><a href="https://twitter.com/michael_galkin"><em>Michael Galkin</em></a><em> (Mila) and </em><a href="https://twitter.com/zhu_zhaocheng"><em>Zhaocheng Zhu</em></a><em> (Mila).</em></p><p>We did our best to highlight the major advances in Graph ML at ICML and cover 2–4 papers per topic. Still, due to the sheer volume of accepted papers, we might have missed some works - let us know in comments or on social media.</p><h3>Table of Contents (clickable):</h3><ol><li><a href="#7cf5">Generation: Denoising Diffusion Is All You Need</a></li><li><a href="#b96e">Graph Transformers</a></li><li><a href="#165d">Theory and Expressive GNNs</a></li><li><a href="#5145">Spectral GNNs</a></li><li><a href="#be73">Explainable GNNs</a></li><li><a href="#fe10">Graph Augmentation: Beyond Edge Dropout</a></li><li><a href="#2f59">Algorithmic Reasoning and Graph Algorithms</a></li><li><a href="#bbee">Knowledge Graph Reasoning</a></li><li><a href="#774c">Computational Biology: Molecular Linking, Protein Binding, Property Prediction</a></li><li><a href="#f4b7">Cool Graph Applications</a></li></ol><h3>Generation: Denoising Diffusion Is All You Need</h3><p><strong>Denoising diffusion probabilistic models</strong> (<a href="https://arxiv.org/abs/2006.11239">DDPMs</a>) are taking over the field of Deep Learning in 2022 in pretty 
much all domains with stunning generation quality and better theoretical properties than GANs and VAEs, e.g., in image generation (<a href="https://arxiv.org/abs/2112.10741">GLIDE</a>, <a href="https://openai.com/dall-e-2/">DALL-E 2</a>, <a href="https://gweb-research-imagen.appspot.com/paper.pdf">Imagen</a>), <a href="https://arxiv.org/pdf/2205.09853.pdf">video generation</a>, text generation (<a href="https://arxiv.org/pdf/2205.14217.pdf">Diffusion-LM</a>), and even <a href="https://arxiv.org/pdf/2205.09991.pdf">diffusion for reinforcement learning</a>. Conceptually, diffusion models gradually add noise to an input object (until it becomes pure Gaussian noise) and learn to predict the added level of noise such that we can subtract it from the object (denoise).</p><p>Diffusion might be <strong>the biggest trend</strong> in GraphML in 2022 — particularly when applied to drug discovery, molecule and conformer generation, and quantum chemistry in general. Often, diffusion models are paired with the latest advancements in equivariant GNNs. ICML features several cool implementations of denoising diffusion for graph generation.</p><p>➡️ In “<a href="https://arxiv.org/pdf/2203.17003.pdf"><em>Equivariant Diffusion for Molecule Generation in 3D</em></a><em>”</em> by <strong>Hoogeboom, Satorras, Vignac, and Welling</strong>, the authors define an equivariant diffusion model (<strong>EDM</strong>) for molecule generation that has to maintain E(3) equivariance over atom coordinates <em>x</em> (with respect to <em>rotation</em>, <em>translation</em>, and <em>reflection</em>) and invariance to group transformations over node features <em>h</em>. Importantly, molecules have different feature modalities: atom charge is an ordinal integer, atom types are one-hot categorical features, and atom coordinates are continuous features, so, for instance, you can’t just add Gaussian noise to one-hot features and expect the model to work.
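</p><p>For reference, the vanilla DDPM forward process on continuous features has a one-line closed form (a generic sketch of ours with an assumed linear β schedule, not the paper’s exact parameterization):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative signal fraction

def q_sample(x0, t):
    """Closed-form forward noising: x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps, eps

x0 = rng.standard_normal((5, 3))        # e.g. 5 atoms, 3D coordinates
x_t, eps = q_sample(x0, t=999)          # at t = T-1, x_t is almost pure noise
print(np.sqrt(alpha_bar[999]))          # tiny: the signal is almost gone
```

<p>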
Instead, the authors design feature-specific noising processes and loss functions, and scale input features for training stability.</p><p>EDM employs a <a href="https://arxiv.org/pdf/2102.09844.pdf">state-of-the-art E(n) GNN</a> as a neural network that predicts noise based on input features and time step. At inference time, we first sample the desired number of atoms <em>M</em>, then we can condition EDM on a desired property <em>c</em>, and ask EDM to generate molecules (defined by features <em>x</em> and <em>h</em>) as <em>x, h ~ p(x,h | c, M)</em>.</p><p>Experimentally, EDM outperforms normalizing flow- and VAE-based approaches by a large margin in terms of achieved negative log-likelihood, molecule stability, and uniqueness. Ablations demonstrate that an equivariant GNN encoder is crucial as replacing it with a standard MPNN leads to significant performance drops. Code is already <a href="https://github.com/ehoogeboom/e3_diffusion_for_molecules">available on GitHub</a>, try it!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9boZOQmoy-tN-hBI" /><figcaption>Forward and backward diffusion. Source: <a href="https://arxiv.org/pdf/2203.17003.pdf">Hoogeboom, Satorras, Vignac, and Welling</a>.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*OXFrmspWBs0EBJRRuUg8GQ.gif" /><figcaption>Diffusion-based generation visualization. Source: <a href="https://twitter.com/emiel_hoogeboom/status/1509838163375706112">Twitter</a></figcaption></figure><p>➡️ For 2D graphs, <a href="https://arxiv.org/pdf/2202.02514.pdf">Jo, Lee, and Hwang</a> propose <strong>Graph Diffusion via the System of Stochastic Differential Equations</strong> (<strong>GDSS</strong>). While the previous EDM is an instance of denoising diffusion probabilistic model (DDPM), <strong>GDSS</strong> belongs to a sister branch of DDPMs, namely, <strong>score-based models</strong>. 
In fact, it was <a href="https://openreview.net/pdf?id=PxTIG12RRHS">recently shown (ICLR’21)</a> that DDPMs and score-based models can be unified into the same framework if we describe the forward diffusion process with stochastic differential equations (SDEs).</p><p>SDEs allow modeling diffusion in continuous time as a <a href="https://en.wikipedia.org/wiki/Wiener_process">Wiener process</a> (for simplicity, a fancy term for the process of adding noise), while DDPMs usually discretize it into 1000 steps (with a learnable time embedding); SDEs, however, require specific solvers. Compared to previous score-based graph generators, <strong>GDSS</strong> takes as input (and predicts) both adjacency <em>A</em> and node features <em>X</em>. The forward and backward diffusion expressed as SDEs require computing <em>scores </em>— here, gradients of the joint log-density of (X, A). For obtaining those scores, we need a <em>score-based model</em>, and here the authors use a <a href="https://openreview.net/pdf?id=JHcqXGaqiGn">GNN with attention pooling</a> (graph multi-head attention).</p><p>At training time, we solve a <strong>forward SDE</strong> and train a score model, while at inference we use the trained score model and solve the <strong>reverse-time SDE</strong>. Usually, you’d employ something like <a href="https://en.wikipedia.org/wiki/Langevin_dynamics">Langevin dynamics</a> here, e.g., Langevin MCMC, but higher-order <a href="https://en.wikipedia.org/wiki/Runge%E2%80%93Kutta_methods">Runge-Kutta</a> solvers should, in principle, work here, too. Experimentally, GDSS outperforms autoregressive generative models and one-shot VAEs by a large margin in 2D graph generation tasks, although sampling speed might still be a bit of a bottleneck due to integrating reverse-time SDEs.
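</p><p>To make the reverse-time SDE concrete, here is a toy Euler-Maruyama sampler for the VP-SDE. This is a hand-rolled sketch that uses the analytic score of a standard Gaussian; real models plug a learned GNN score into <em>score_fn</em>:</p>

```python
import numpy as np

def reverse_sde_sample(score_fn, shape, n_steps=500, rng=None):
    """Euler-Maruyama integration of the reverse-time VP-SDE from t=1 down to t=0.
    score_fn(x, t) approximates the score (gradient of the log-density) of the
    noised data distribution at time t."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = rng.standard_normal(shape)  # start from the Gaussian prior
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        b = 0.1 + 19.9 * t  # linear beta(t) noise schedule
        drift = -0.5 * b * x - b * score_fn(x, t)  # reverse-time drift
        x = x - drift * dt + np.sqrt(b * dt) * rng.standard_normal(shape)
    return x

# Toy check: for N(0, I) data the VP-SDE keeps the marginal at N(0, I), whose
# score is simply -x, so generated samples should also look like N(0, I).
sample = reverse_sde_sample(lambda x, t: -x, shape=(4, 2))
```

<p>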
<a href="https://github.com/harryjo97/GDSS">GDSS code</a> is already available!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*4xi6x7S6bDoQz4UV" /><figcaption>GDSS intuition. Source: <a href="https://arxiv.org/pdf/2202.02514.pdf">Jo, Lee, and Hwang</a></figcaption></figure><p>👀 Looking at arxiv these days, we’d expect many more diffusion models to be released this year — DDPMs in graphs deserve their own big blog post, stay tuned!</p><p>➡️ Finally, an example of non-diffusion generation is the work by <a href="https://arxiv.org/pdf/2204.01613.pdf">Martinkus et al</a>, who design <a href="https://github.com/KarolisMart/SPECTRE"><strong>SPECTRE</strong></a>, a GAN for one-shot graph generation. Unlike other GANs, which often generate an adjacency matrix right away, the idea of <strong>SPECTRE</strong> is to condition graph generation on top-k (lowest) eigenvalues and eigenvectors of a Laplacian that already give some notion of clusters and connectivity. 1️⃣ <strong>SPECTRE</strong> generates <em>k</em> eigenvalues. 2️⃣ The authors use a clever trick of sampling eigenvectors from the <a href="https://en.wikipedia.org/wiki/Stiefel_manifold">Stiefel manifold</a> induced by top-k eigenvectors. The Stiefel manifold provides a bank of orthonormal matrices from which we can sample one <em>n x k</em> matrix. 3️⃣ Finally, having obtained a Laplacian, the authors use a <a href="https://papers.nips.cc/paper/2019/file/bb04af0f7ecaee4aae62035497da1387-Paper.pdf">Provably Powerful Graph Net</a> to generate the final adjacency.</p><p>Experimentally, <strong>SPECTRE</strong> is orders of magnitude better than other GANs and up to 30x faster than autoregressive graph generators 🚀.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*oT6-JIrrLCAoUu_-" /><figcaption>SPECTRE: a 3-step process to generate eigenvalues -&gt; eigenvectors -&gt; adjacency.
Source: <a href="https://arxiv.org/pdf/2204.01613.pdf">Martinkus et al</a></figcaption></figure><h3>Graph Transformers</h3><p>We have two papers on improving Graph Transformers at this year’s ICML.</p><p>➡️ First, <a href="https://arxiv.org/pdf/2202.03036.pdf">Chen, O’Bray, and Borgwardt</a> propose a <strong>Structure-Aware Transformer (SAT)</strong>. They notice that self-attention can be rewritten as kernel smoothing where the query-key product is an exponential kernel. It then boils down to finding a more generalized kernel — the authors propose using functions of a node and the graph to add structure awareness, namely, <strong>k-subtree</strong> and <strong>k-subgraph</strong> features. <em>K-subtrees</em> are essentially k-hop neighborhoods and can be mined relatively fast, but are eventually limited to the expressiveness of 1-WL. On the other hand, <em>k-subgraphs</em> are more expensive to compute (and hardly scale) but provide a provably better distinguishing power.</p><p>Whatever featurization you select, those subtrees or subgraphs (extracted for each node) are then encoded through any GNN encoder (e.g., PNA), pooled (sum/mean/virtual node), and used as queries and keys in the self-attention computation (see the illustration 👇).</p><p>Experimentally, k of 3 or 4 is enough, and k-subgraph features expectedly work better on graphs where we can afford their computation.
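</p><p>Extracting a k-subtree is just a depth-bounded BFS; a quick plain-Python sketch (our illustration, not the SAT codebase):</p>

```python
from collections import deque

def k_subtree(adj, root, k):
    """Nodes reachable from `root` within k hops (the k-subtree feature support)."""
    seen, frontier = {root}, deque([(root, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return seen

# A 6-cycle: node 0's 2-subtree covers itself plus 2 hops in each direction.
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(sorted(k_subtree(adj, 0, 2)))  # -> [0, 1, 2, 4, 5]
```

<p>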
Interestingly, positional features like Laplacian eigenvectors and Random Walk features are only helpful for the <em>k-subtree SAT</em>, being rather useless for the <em>k-subgraph SAT</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*dyM1WMxaooNLzv58" /><figcaption>Source: <a href="https://arxiv.org/pdf/2202.03036.pdf">Chen, O’Bray, and Borgwardt</a></figcaption></figure><p>➡️ Second, <a href="https://arxiv.org/pdf/2107.07999.pdf">Choromanski, Lin, Chen, et al</a> (the team overlaps a lot with the authors of the famous <a href="https://arxiv.org/abs/2009.14794">Performer</a> with linear attention) study principled mechanisms to enable sub-quadratic attention. In particular, they consider relative positional encodings (RPEs) and their variations for different data modalities like images, sounds, video, and graphs. Considering graphs, we know from <a href="https://github.com/microsoft/Graphormer">Graphormer</a> that infusing shortest path distances into attention works well, but requires materialization of the full attention matrix (hence, not scalable). Can we approximate the softmax attention without full materialization but still incorporate useful graph inductive biases? 🤔</p><p>Yes! And the authors propose 2 such mechanisms. (1) Turns out, we can use <strong>Graph Diffusion Kernels (GDK)</strong> — a.k.a. heat kernels — that model a diffusion process of heat propagation and serve as a soft version of shortest paths. Diffusion, however, requires calling solvers for computing matrix exponentials, so the authors design another way. (2) The Random Walks Graph-Nodes Kernel (RWGNK), whose value for two nodes is the dot product of their frequency vectors obtained from random walks starting at those two nodes.</p><p>Random walks are great, we love random walks 😍 Check out the illustration below for a visual description of diffusion and RW kernel results.
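</p><p>A simplified version of such a kernel is easy to sketch: estimate visit-frequency vectors by simulating walks and take their dot product (a toy approximation of the idea, not the paper’s implementation):</p>

```python
import random
from collections import Counter

def walk_frequencies(adj, start, n_walks=200, walk_len=4, seed=0):
    """Empirical node-visit frequency vector of random walks starting at `start`."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_walks):
        node = start
        for _ in range(walk_len):
            node = rng.choice(adj[node])
            counts[node] += 1
    total = n_walks * walk_len
    return {n: c / total for n, c in counts.items()}

def rw_kernel(adj, u, v):
    """Kernel value for (u, v): dot product of their visit-frequency vectors."""
    fu, fv = walk_frequencies(adj, u), walk_frequencies(adj, v, seed=1)
    return sum(fu.get(n, 0.0) * fv.get(n, 0.0) for n in fu)

# Two disconnected triangles: walks from different components never overlap.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
```

<p>Nodes in different connected components never share visited nodes, so their kernel value is exactly zero; the kernel encodes graph structure for free.</p><p>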
The final transformer with the RWGNK kernel is called <strong>Graph Kernel Attention Transformer</strong> <strong>(GKAT)</strong> and is probed on several tasks, from synthetic identification of topological structures in ER graphs to small compbio and social network datasets. <strong>GKAT</strong> shows much better results on synthetic tasks and performs pretty much on par with GNNs on other graphs. It would be great to see a real scalability study pushing the Transformer to the limits of input set size!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7NXxc31VcjKTc15V" /><figcaption>Source: <a href="https://arxiv.org/pdf/2107.07999.pdf">Choromanski, Lin, Chen, et al</a></figcaption></figure><h3>Theory and Expressive GNNs</h3><p>The GNN community continues to study ways of breaking through the ceiling of 1-WL expressiveness while retaining at least polynomial time complexity.</p><p>➡️ <a href="https://proceedings.mlr.press/v162/papp22a/papp22a.pdf">Papp and Wattenhofer</a> start with an accurate description of current theoretical studies:</p><blockquote>Whenever a new GNN variant is introduced, the corresponding theoretical analysis usually shows it to be more powerful than 1-WL, and sometimes also compares it to the classical k-WL hierarchy… Can we find a more meaningful way to measure the expressiveness of GNN extensions?</blockquote><p>The authors categorize the literature of expressive GNNs into 4 families: 1️⃣ k-WL and approximations; 2️⃣ substructure counting <strong>(S)</strong>; 3️⃣ subgraph- and neighborhood-aware GNNs <strong>(N)</strong> (<a href="https://towardsdatascience.com/using-subgraphs-for-more-expressive-gnns-8d06418d5ab">covered extensively in the recent post by Michael Bronstein</a>); 4️⃣ GNNs with marking — those are node/edge perturbation approaches and node/edge labeling approaches <strong>(M)</strong>.
Then, the authors come up with a theoretical framework of how all those <strong>k-WL, S, N, and M</strong> families are related and which one is more powerful, and to what extent. The hierarchy is more fine-grained than k-WL and helps design GNNs that are just expressive enough for particular downstream tasks while saving compute.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/822/0*qF6OqUtucFYKFZEw" /><figcaption>The proposed hierarchy of different expressive GNN families. N=subgraph GNNs, S=substructure counting, M=GNNs with markings. Source: <a href="https://proceedings.mlr.press/v162/papp22a/papp22a.pdf">Papp and Wattenhofer</a></figcaption></figure><p>➡️ Perhaps the tastiest ICML’22 work is cooked by chefs <a href="https://proceedings.mlr.press/v162/morris22a/morris22a.pdf">Morris et al</a> with 🥓<a href="https://github.com/chrsmrrs/speqnets">SpeqNets</a> 🥓 (<em>Speck</em> is <em>bacon</em> in German). Known higher-order k-WL GNNs either operate on k-order tensors or consider all <em>k</em>-node subgraphs, implying an exponential dependence on <em>k</em> in memory requirements, and they do not adapt to the sparsity of the graph. <strong>SpeqNets</strong> introduce a new class of heuristics for the graph isomorphism problem, the <strong>(k,s)-WL</strong>, which offers more fine-grained control of the trade-off between expressivity and scalability.</p><p>Essentially, the algorithm is a variant of the <a href="https://arxiv.org/abs/1904.01543">local k-WL</a> but only considers specific tuples to avoid the exponential memory complexity of the k-WL.
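</p><p>To see what the (k,s) restriction buys, here is a brute-force enumeration of the surviving tuples (purely illustrative; SpeqNets never materialize these sets explicitly):</p>

```python
from itertools import combinations

def n_components(adj, nodes):
    """Number of connected components of the subgraph induced by `nodes`."""
    nodes, seen, comps = set(nodes), set(), 0
    for start in nodes:
        if start in seen:
            continue
        comps += 1
        stack = [start]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            stack.extend(v for v in adj[u] if v in nodes and v not in seen)
    return comps

def ks_tuples(adj, k, s):
    """All k-node subsets whose induced subgraph has at most s components."""
    return [c for c in combinations(sorted(adj), k) if n_components(adj, c) <= s]

# Path graph 0-1-2-3: of the four 3-subsets, only two are connected (s=1).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(ks_tuples(adj, k=3, s=1))  # -> [(0, 1, 2), (1, 2, 3)]
```

<p>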
Concretely, the algorithm only considers <strong>k-tuples</strong> or subgraphs on k nodes with at most <strong>s connected</strong> components, effectively exploiting the potential sparsity of the underlying graph — varying <strong>k</strong> and <strong>s</strong> leads to a tradeoff between scalability and expressivity on the theoretical side.</p><p>The authors derive a new hierarchy of permutation-equivariant graph neural networks, denoted <strong>SpeqNets</strong>, based on the above combinatorial insights, reaching universality in the limit. These architectures vastly reduce computation times compared to standard higher-order graph networks in the supervised node- and graph-level classification and regression regimes, and significantly outperform standard graph neural network and graph kernel architectures in predictive performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GGXN-0JHu6jnkz5E" /><figcaption>The hierarchy of 🥓 SpeqNets 🥓. Source: <a href="https://proceedings.mlr.press/v162/morris22a/morris22a.pdf">Morris et al</a></figcaption></figure><p>➡️ Next, <a href="https://proceedings.mlr.press/v162/huang22l/huang22l.pdf">Huang et al</a> take an unorthodox look at permutation-invariant GNNs and suggest that carefully designed<strong> permutation-sensitive</strong> GNNs are actually more expressive. The theory of <a href="https://openreview.net/forum?id=BJluy2RcFm">Janossy pooling</a> says a model becomes invariant to a group of transformations if we show all possible examples of such a transformation, and for permutations of <em>n</em> elements there are an intractable <em>n!</em> of them.
Instead, the authors show that considering only pairwise 2-ary permutations of a node’s neighborhood is enough and is provably more powerful than 2-WL and not less powerful than 3-WL.</p><p>Practically, the proposed <a href="https://github.com/zhongyu1998/PG-GNN"><strong>PG-GNN</strong></a> extends the idea of GraphSAGE and encodes every random permutation of a node’s neighborhood through a 2-layer LSTM instead of traditional <em>sum/mean/min/max</em>. Additionally, the authors design a linear permutation sampling approach based on Hamiltonian cycles.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Rh17chscpE5r5UbK" /><figcaption>PG-GNN permutation-sensitive aggregation idea. Source: <a href="https://proceedings.mlr.press/v162/huang22l/huang22l.pdf">Huang et al</a></figcaption></figure><p>Some other interesting works you might want to check:</p><ul><li><a href="https://proceedings.mlr.press/v162/cai22b/cai22b.pdf">Cai and Wang</a> study convergence properties of <a href="https://arxiv.org/abs/1812.09902">Invariant Graph Networks</a>, different from vanilla MPNNs in that they operate on node and edge features as equivariant operations over monolithic tensors. Based on <a href="https://en.wikipedia.org/wiki/Graphon">graphon</a> theory, the authors find a class of IGNs that provably converge.
More technical details are in the <a href="https://twitter.com/ChenCaiUCSD/status/1550109192803045376">awesome Twitter thread</a>!</li><li><a href="https://proceedings.mlr.press/v162/gao22e/gao22e.pdf">Gao and Ribeiro</a> study ⏳ temporal GNNs ⏳, devising two families: (1) <em>time-and-graph</em> — where we first embed graph snapshots via some GNN and then apply some RNN; (2) <em>time-then-graph</em>, where we first encode all node and edge features (over a unified graph of all snapshots) through an RNN, and only then apply a single GNN pass, e.g., <a href="https://arxiv.org/abs/2006.10637">TGN</a> and <a href="https://openreview.net/forum?id=rJeW1yHYwH">TGAT</a> can be considered instances of this family. Theoretically, the authors find that <em>time-then-graph</em> models are more expressive than <em>time-and-graph</em> models when using standard 1-WL GNN encoders like GCN or GIN, and propose a simple model with a GRU time encoder and a GCN graph encoder. The model shows very competitive performance on temporal node classification and regression tasks while being 3–10x faster and more GPU memory-efficient. Interestingly, the authors find that <strong>neither</strong> <em>time-and-graph</em> nor <em>time-then-graph</em> <strong>is expressive enough</strong> for temporal link prediction 🤔.</li><li>Finally, “<em>Weisfeiler-Lehman Meets Gromov-Wasserstein</em>” by <a href="https://proceedings.mlr.press/v162/chen22o/chen22o.pdf">Chen, Lim, Mémoli, Wan, Wang</a> (a joint 5-first authors paper 👀) derives a polynomial-time <a href="https://github.com/chens5/WL-distance">WL distance</a> from the WL kernel such that we can measure a dissimilarity of two graphs — the WL distance is 0 if and only if they cannot be distinguished by the WL test, and positive otherwise.
The authors further realize that the proposed WL distance has deep connections to the <a href="https://arxiv.org/abs/1808.04337">Gromov-Wasserstein distance</a>!</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/497/0*ybT4mgocmE8xHq5k" /><figcaption>How Weisfeiler-Leman meets Gromov-Wasserstein in practice. Should have been in the paper by <a href="https://proceedings.mlr.press/v162/chen22o/chen22o.pdf">Chen, Lim, Mémoli, Wan, Wang</a>. Source: <a href="https://tenor.com/view/predator-arnold-schwarzenegger-hand-shake-arms-gif-3468629">Tenor</a></figcaption></figure><h3>Spectral GNNs</h3><p>➡️ Spectral GNNs tend to be overlooked in the mainstream of spatial GNNs, but now there is a reason for you to take a look at spectral GNNs 🧐. In “<a href="https://proceedings.mlr.press/v162/wang22am/wang22am.pdf"><em>How Powerful are Spectral Graph Neural Networks</em></a>” by <a href="https://proceedings.mlr.press/v162/wang22am/wang22am.pdf">Wang and Zhang</a>, the authors show that a linear spectral GNN is a universal approximator for any function on a graph under some mild assumptions. What’s even more exciting is that the assumptions turn out to be empirically true for real-world graphs, suggesting that a linear spectral GNN is <strong>powerful enough</strong> for the node classification task.</p><p>But how do we explain the difference in the empirical results of spectral GNNs? The authors prove that different parameterizations (specifically, polynomial filters) of the spectral GNNs influence the convergence speed. We know that the condition number of the Hessian matrix (how round the iso-loss contours are) is highly related to the convergence speed. Based on this intuition, the authors come up with orthogonal polynomials that benefit optimization.
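</p><p>As a refresher, a polynomial spectral filter is a weighted sum of Laplacian powers applied to the signal; the basis in which the weights are parameterized (monomial below, Chebyshev or Jacobi in the papers) is exactly what affects optimization. A toy numpy sketch:</p>

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^(-1/2) A D^(-1/2) for an adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.divide(1.0, np.sqrt(d), out=np.zeros_like(d), where=d > 0)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def poly_filter(L, x, theta):
    """Apply g(L) x = sum_i theta_i * L^i x using only matrix-vector products."""
    out, p = np.zeros_like(x), x.copy()
    for t in theta:
        out = out + t * p
        p = L @ p
    return out

# 4-cycle: a constant signal is the smoothest one, so a low-pass filter keeps it.
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L = normalized_laplacian(A)
smooth = poly_filter(L, np.ones(4), theta=[1.0, -0.5])
```

<p>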
The polynomials, named <a href="https://en.wikipedia.org/wiki/Jacobi_polynomials">Jacobi bases</a>, are a generalization of the <a href="https://en.wikipedia.org/wiki/Chebyshev_polynomials">Chebyshev bases</a> used in <a href="https://proceedings.neurips.cc/paper/2016/file/04df4d434d481c5bb723be1b6df1ee65-Paper.pdf">ChebyNet</a>. Jacobi bases are defined by two hyperparameters, <em>a</em> and <em>b</em>. By tuning these hyperparameters, one may find a group of bases well suited to the input graph signal.</p><p>Experimentally, <strong>JacobiConv</strong> performs well on both homophilic and heterophilic datasets, even as a linear model. Probably it’s time to desert those gaudy GNNs, at least for the node classification task 😏.</p><p>➡️ There are two more papers on spectral GNNs. One is <a href="https://proceedings.mlr.press/v162/li22h/li22h.pdf">Graph Gaussian Convolutional Networks</a> (G2CN), based on spectral concentration analysis, which shows good results on heterophilic datasets. The other one, from <a href="https://proceedings.mlr.press/v162/yang22n/yang22n.pdf">Yang et al</a>, analyzes the correlation issue in graph convolutions based on spectral smoothness, achieving an exceptionally good result of <strong>0.0698</strong> MAE on ZINC.</p><h3>Explainable GNNs</h3><p>As most GNN models are black boxes, it is important to explain the predictions of GNNs for applications in critical areas. This year we have two awesome papers in this direction: an efficient and powerful post-hoc model from <a href="https://proceedings.mlr.press/v162/xiong22a/xiong22a.pdf">Xiong et al</a>, and an inherently interpretable model from <a href="https://proceedings.mlr.press/v162/miao22a/miao22a.pdf">Miao et al</a>.</p><p>➡️ <a href="https://proceedings.mlr.press/v162/xiong22a/xiong22a.pdf">Xiong et al</a> extend their previous GNN explanation method, <a href="https://arxiv.org/pdf/2006.03589.pdf">GNN-LRP</a>, to be way more scalable.
Unlike other methods (<a href="https://arxiv.org/pdf/1903.03894.pdf">GNNExplainer</a>, <a href="https://arxiv.org/pdf/2011.04573.pdf">PGExplainer</a>, <a href="https://arxiv.org/pdf/2010.05788.pdf">PGM-Explainer</a>), <a href="https://arxiv.org/pdf/2006.03589.pdf">GNN-LRP</a> is a higher-order subgraph attribution method that considers the joint contribution of nodes in a subgraph. Such a property is necessary for tasks where a subgraph is not simply a set of nodes. For example, in molecules, a subgraph of six carbons (hydrogens are ignored) can be either a benzene (a ring) or a hexane (a chain). As shown in the figure below, a higher-order method can figure out such subgraphs (right) while a lower-order method (left) may not.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*dR_GvltDxvRKpSCh" /><figcaption>Source: <a href="https://proceedings.mlr.press/v162/xiong22a/xiong22a.pdf">Xiong et al</a>.</figcaption></figure><p>However, the drawback of GNN-LRP is that it needs to compute the gradient w.r.t. each random walk in a subgraph, which takes <em>O(|S|^L)</em> for a subgraph <em>S</em> and <em>L</em>-hop random walks. Here, dynamic programming comes to the rescue 😎. Notice that the gradient w.r.t. a random walk is multiplicative (chain rule), and different random walks are aggregated by summation. This can be efficiently computed by the sum-product algorithm. The idea is to use the distributive property of summation over multiplication (more generally, <a href="https://en.wikipedia.org/wiki/Semiring">semiring</a>), and aggregate partial random walks at each step. This constitutes the model, <a href="https://github.com/xiong-ping/sgnn_lrp_via_mp"><strong>subgraph GNN-LRP (sGNN-LRP)</strong></a>.</p><p><strong>sGNN-LRP</strong> also improves over GNN-LRP with a generalized subgraph attribution, which considers both random walks in the subgraph <em>S</em> and its complement <em>G\S</em>.
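</p><p>The sum-product trick itself fits in a few lines: naively summing products over all L-step walks is exponential, while aggregating partial walks step by step reduces it to repeated matrix-vector products (a generic sketch of the idea, not sGNN-LRP’s code):</p>

```python
import numpy as np
from itertools import product

def walk_sum_bruteforce(W, L):
    """Sum over all L-step walks of the product of edge weights along the walk.
    Walks through zero-weight non-edges contribute a zero product, so they are harmless."""
    n = len(W)
    total = 0.0
    for walk in product(range(n), repeat=L + 1):
        w = 1.0
        for a, b in zip(walk, walk[1:]):
            w *= W[a][b]
        total += w
    return total

def walk_sum_dp(W, L):
    """Same quantity via the sum-product trick: aggregate partial walks each step."""
    W = np.asarray(W, dtype=float)
    state = np.ones(len(W))  # contribution of all length-0 walks per node
    for _ in range(L):
        state = W @ state    # extend every partial walk by one step
    return float(state.sum())

W = [[0.0, 0.5], [0.3, 0.2]]
print(walk_sum_bruteforce(W, 3), walk_sum_dp(W, 3))  # identical values
```

<p>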
Complicated as it looks, the generalized subgraph attribution can be computed by two sum-product algorithm passes. Experimentally, <strong>sGNN-LRP</strong> not only finds better attributions than all existing explanation methods, but also runs as fast as a regular message passing GNN. Might be a useful tool for interpretation and visualization! 🔨</p><p>💡 By the way, it is not new to see that models based on random walks are more expressive than simple node or edge models. The NeurIPS’21 paper <a href="https://papers.nips.cc/paper/2021/file/f6a673f09493afcd8b129a0bcf1cd5bc-Paper.pdf">NBFNet</a> solves knowledge graph reasoning with random walks and dynamic programming, and achieves amazing results in both transductive and inductive settings.</p><p>➡️ <a href="https://proceedings.mlr.press/v162/miao22a/miao22a.pdf">Miao et al</a> take another perspective and study inherently interpretable GNN models. They show that post-hoc explanation methods, such as <a href="https://arxiv.org/pdf/1903.03894.pdf">GNNExplainer</a>, are subpar for interpretation since they merely use a fixed pretrained GNN model. By contrast, an inherently interpretable GNN that jointly optimizes the predictor and the interpretation modules is a better solution. Following this idea, the authors derive <a href="https://github.com/Graph-COM/GSAT"><strong>graph stochastic attention (GSAT)</strong></a> from the graph information bottleneck (<strong>GIB</strong>) principle. <strong>GSAT</strong> encodes the input graph and randomly samples a subgraph (interpretation) from the posterior distribution. It makes the prediction based on the sampled subgraph.
As an advantage, <strong>GSAT</strong> doesn’t need to constrain the size of a sampled subgraph.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qBxaIzgzWvEs8UA7" /><figcaption>Source: <a href="https://proceedings.mlr.press/v162/miao22a/miao22a.pdf">Miao et al</a></figcaption></figure><p>Experimentally, <strong>GSAT</strong> is much better than post-hoc methods in terms of both interpretation and prediction performance. It can also be coupled with a pretrained GNN model. GSAT should be a good candidate if you are building interpretable GNNs for your applications.</p><h3>Graph Augmentation: Beyond Edge Dropout</h3><p>This year brought a few works on improving self-supervised capabilities of GNNs that go beyond random edge index perturbations like node/edge dropout.</p><p>➡️ <a href="https://arxiv.org/pdf/2202.07179.pdf">Han et al</a> bring the idea of <a href="https://github.com/facebookresearch/mixup-cifar10">mixups</a> used in image augmentation since 2017 to graphs with <strong>G-Mixup</strong> (Outstanding Paper Award at ICML 2022 🏅). The idea of mixups is to take two images, mix their features together and mix their labels together (according to a pre-defined weighting factor), and ask the model to predict this mixed label. Such a mixup improves the robustness and generalization of classifiers.</p><blockquote>But how do we mix two graphs that in general might have different numbers of nodes and edges?</blockquote><p>The authors find an elegant answer — let’s mix not the graphs but their <a href="https://en.wikipedia.org/wiki/Graphon">graphons</a>, which are, in simple words, graph generators. Graphs coming from the same generator have the same underlying graphon.
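</p><p>Mixing graphs at the graphon level can be sketched in a few lines (a deliberately crude illustration: degree-sorted block averaging as the estimator; the paper uses proper step-function estimators):</p>

```python
import numpy as np

def estimate_graphon(graphs, resolution=4):
    """Crude step-function graphon estimate: align nodes by sorted degree,
    then average edge densities over a resolution x resolution block grid."""
    W = np.zeros((resolution, resolution))
    for A in graphs:
        order = np.argsort(-A.sum(axis=1))   # degree-based node alignment
        A = A[np.ix_(order, order)]
        blocks = np.array_split(np.arange(len(A)), resolution)
        for i, bi in enumerate(blocks):
            for j, bj in enumerate(blocks):
                W[i, j] += A[np.ix_(bi, bj)].mean()
    return W / len(graphs)

def gmixup_sample(W1, W2, lam, n_nodes, rng):
    """Mix two graphons with weight lam and sample a new graph from the mixture."""
    W = lam * W1 + (1.0 - lam) * W2
    u = rng.integers(0, len(W), size=n_nodes)      # block assignment per node
    P = W[np.ix_(u, u)]                            # edge probabilities
    A = (rng.random((n_nodes, n_nodes)) < P).astype(int)
    A = np.triu(A, 1)
    return A + A.T                                 # symmetric, no self-loops

rng = np.random.default_rng(0)
A_new = gmixup_sample(np.ones((4, 4)), np.zeros((4, 4)), lam=0.5, n_nodes=8, rng=rng)
# A_new is a random graph whose density interpolates between the two sources
```

<p>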
So the algorithm becomes rather straightforward (see the illustration below) — for a pair of graphs, we 1️⃣ estimate their graphons; 2️⃣ mix the two graphons into a new one through a weighted sum; 3️⃣ sample a graph and its new label from the mixed graphon; and 4️⃣ send this to a classifier. In the illustrative example, we have two graphs with 2 and 8 connected components, respectively, and after mixing their graphons we get a new graph of 2 major communities with 4 minor ones in each. Estimating graphons can be done with step functions via several methods of varying computational complexity (the authors mostly resort to <a href="https://arxiv.org/abs/1110.6517">“largest gap”</a>).</p><p>Experimentally, <strong>G-Mixup</strong> stabilizes model training, performs better than or on par with traditional node/edge perturbation methods, but outperforms them by a large margin in the robustness scenarios with label noise or many added/removed edges. Cool adaptation of a well-known augmentation method to graphs 👏! If you are interested, ICML’22 offers a few more general works on mixups: a <a href="https://proceedings.mlr.press/v162/zhang22f.html">study</a> of how mixups improve calibration and <a href="https://proceedings.mlr.press/v162/sohn22a/sohn22a.pdf">how to use them</a> in generative models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pLo1sPxfRu50w2SO" /><figcaption>G-Mixup. Source: <a href="https://arxiv.org/pdf/2202.07179.pdf">Han et al</a></figcaption></figure><p>➡️ <a href="https://arxiv.org/pdf/2109.03856.pdf">Liu et al</a> take another look at augmentation, particularly in setups where nodes have small neighborhoods. The idea of <a href="https://github.com/SongtaoLiu0823/LAGNN"><strong>Local Augmentation GNNs (LA-GNN)</strong></a> is to train a generative model to yield an additional feature vector for each node.
The generative model is a conditional VAE trained (on the whole graph) to predict features of connected neighbors conditioned on a center node. That is, once the CVAE is trained, we just pass a feature vector of each node and get another feature vector that is supposed to capture more information than the plain neighborhood.</p><p>We then concatenate the two feature vectors per node and send them to any downstream GNN and task. Note that the CVAE is pre-trained beforehand and doesn’t need to be trained jointly with the GNN. Interestingly, the CVAE can generate features for unseen graphs, i.e., local augmentation can be used in inductive tasks as well! The initial hypothesis is confirmed experimentally — the augmentation approach works particularly well for nodes of small degrees.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VLK_Q-Jz3bpO5HXy" /><figcaption>The Local Augmentation idea. Source: <a href="https://arxiv.org/pdf/2109.03856.pdf">Liu et al</a></figcaption></figure><p>➡️ Next, <a href="https://arxiv.org/pdf/2206.07161.pdf">Yu, Wang, Wang, et al</a> tackle the GNN scalability task where using standard neighbor samplers a-la GraphSAGE might lead to exponential neighborhood size expansion and stale historical embeddings. The authors propose <a href="https://github.com/divelab/DIG/tree/dig/dig/lsgraph"><strong>GraphFM</strong></a>, a feature momentum approach, where historical node embeddings get updates from their 1-hop neighbors through a momentum step. Generally, momentum updates are often seen in SSL approaches like <a href="https://papers.nips.cc/paper/2020/file/f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf">BYOL</a> and <a href="https://arxiv.org/abs/2102.06514">BGRL</a> for updating model parameters of a <em>target</em> network.
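</p><p>Concretely, a momentum (exponential moving average) update is a one-liner; a tiny numpy sketch with illustrative values:</p>

```python
import numpy as np

def momentum_update(hist, new, m=0.9):
    """EMA: keep a fraction m of the historical embedding, blend in the fresh one."""
    return m * hist + (1.0 - m) * new

# Historical node embeddings drift smoothly toward the fresh mini-batch estimates,
# damping the variance that comes from small sampled neighborhoods.
hist = np.zeros(4)
for _ in range(50):
    hist = momentum_update(hist, np.ones(4))
```

<p>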
Here, GraphFM employs momentum to alleviate the variance of historical representations across different mini-batch sizes and to provide an unbiased estimation of feature updates for differently-sized neighborhoods.</p><p>Generally, GraphFM comes in two flavors: <strong>GraphFM-<em>InBatch</em></strong> and <strong>GraphFM-<em>OutOfBatch</em></strong>. (1) GraphFM-InBatch works for the GraphSAGE-style neighbor sampling by dramatically reducing the number of necessary neighbors — whereas GraphSAGE required 10–20 depending on the level, GraphFM needs only 1 random neighbor per node per layer. Only one 👌! And (2) GraphFM-OutOfBatch builds on top of <a href="https://arxiv.org/pdf/2106.05609.pdf">GNNAutoScale</a>, where we first apply graph partitioning to cut the graph into k mini-batches.</p><p>Experimentally, feature momentum looks especially useful for SAGE-style sampling (the in-batch version) — it seems like a good default choice for all neighbor sampling-based approaches!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*721I25AT_kQq-dhI" /><figcaption>Compared to <a href="https://arxiv.org/pdf/2106.05609.pdf">GNNAutoScale (GAS)</a>, historical node states are also updated from new embeddings and feature momentum (moving average). Source: <a href="https://arxiv.org/pdf/2206.07161.pdf">Yu, Wang, Wang, et al</a></figcaption></figure><p>➡️ Finally, <a href="https://arxiv.org/pdf/2106.02172.pdf">Zhao et al</a> propose a clever augmentation trick for link prediction based on counterfactual links. In essence, the authors ask:</p><blockquote>“would the link still exist if the graph structure became different from observation?”</blockquote><p>It means that we would like to find links that are structurally similar to the given link according to some 💊 <em>treatment</em> (here, those are classical metrics like SBM clustering, k-core decomposition, Louvain, and more) but give the opposite result.
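</p><p>Mining such a counterfactual candidate can be sketched as a nearest-neighbor search under the opposite treatment (a toy illustration; the treatment and pair features here are placeholders, not the paper’s exact choices):</p>

```python
import numpy as np

def counterfactual_pair(pairs, feats, treat, anchor):
    """Return the pair most similar to `anchor` that has the opposite treatment.
    treat[(u, v)] is a 0/1 indicator (e.g., 'u and v fall into the same cluster')."""
    target = 1 - treat[anchor]
    candidates = [p for p in pairs if treat[p] == target]
    dists = [np.linalg.norm(feats[p] - feats[anchor]) for p in candidates]
    return candidates[int(np.argmin(dists))]

# Toy example: the anchor pair is "treated" (same cluster); its counterfactual is
# the nearest untreated pair in some pair-feature space.
pairs = [(0, 1), (0, 2), (1, 2)]
treat = {(0, 1): 1, (0, 2): 0, (1, 2): 0}
feats = {(0, 1): np.array([0.0, 0.0]),
         (0, 2): np.array([1.0, 0.0]),
         (1, 2): np.array([5.0, 0.0])}
cf = counterfactual_pair(pairs, feats, treat, anchor=(0, 1))  # -> (0, 2)
```

<p>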
With <a href="https://github.com/DM2-ND/CFLP"><strong>CFLP</strong></a>, the authors hypothesize that training a GNN to correctly predict both true and counterfactual links helps the model to get rid of spurious correlations and capture only meaningful features for inferring a link between two nodes.</p><p>After obtaining a set of counterfactual links (a pre-processing step based on the chosen <em>treatment function</em>), <strong>CFLP</strong> is first trained on both factual and counterfactual links, then the link prediction decoder is fine-tuned with some balancing and regularization terms. In some sense, the approach resembles mining hard negatives to augment the set of true positive links 🤔. Experimentally, <strong>CFLP</strong> paired with a GNN encoder largely outperforms that single GNN encoder on Cora/Citeseer/Pubmed, and is still in the <a href="https://ogb.stanford.edu/docs/leader_linkprop/">top-3 of the OGB-DDI</a> link prediction task!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1Ww02qcZ0CKcq-fM" /><figcaption>Counterfactual links (right). Source: <a href="https://arxiv.org/pdf/2106.02172.pdf">Zhao et al</a></figcaption></figure><h3>Algorithmic Reasoning and Graph Algorithms</h3><p>🎆 A huge milestone for the algorithmic reasoning community — the appearance of the <a href="https://github.com/deepmind/clrs"><strong>CLRS benchmark</strong></a> (named after the classical textbook <a href="https://en.wikipedia.org/wiki/Introduction_to_Algorithms">Introduction to Algorithms</a> by Cormen, Leiserson, Rivest, and Stein) by <a href="https://arxiv.org/pdf/2205.15659.pdf">Veličković et al</a>!
Now, there is no need to invent toy evaluation tasks — CLRS contains 30 classical algorithms (sort, search, MST, shortest paths, graphs, dynamic programming, and many more) converting an <a href="https://icpc.global/">ICPC</a> data generator into an ML dataset 😎.</p><p>In <strong>CLRS</strong>, each dataset element is a <em>trajectory</em>, i.e., a collection of inputs, outputs, and intermediate steps. The underlying representation format is a set of nodes (often not a graph, as edges might not be necessary), for example, sorting a list of 5 elements is framed as operations over a set of 5 nodes. Trajectories consist of <em>probes</em> — tuples of format (stage, location, type, values) that encode a current execution step of an algorithm with its states. The output decoder depends on the expected type — in the example illustration 👇 sorting is modeled with pointers.</p><p>Split-wise, training and validation trajectories have 16 nodes (e.g., sort lists of length 16), but the test set probes out-of-distribution (OOD) capabilities of models on tasks with 64 nodes. Interestingly, vanilla GNNs and MPNNs fit training data very well but underperform in the OOD setup where <a href="https://proceedings.neurips.cc//paper/2020/file/176bf6219855a6eb1f3a30903e34b6fb-Paper.pdf">Pointer Graph Network</a> shows better numbers. It is one more data point in the collection of observations that GNNs can’t generalize to larger inference graphs — it’s still an open question how to fix this 🤔. The code is <a href="https://github.com/deepmind/clrs">already available</a> and could be extended with more custom algorithmic tasks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Qc3fqqryPRbinpFl" /><figcaption>Representation of hints in CLRS.
Source: <a href="https://arxiv.org/pdf/2205.15659.pdf">Veličković et al</a></figcaption></figure><p>➡️ On a more theoretical side, <a href="https://proceedings.mlr.press/v162/sanmarti-n22a/sanmarti-n22a.pdf">Sanmartín et al</a> generalize the notion of graph metrics through the <a href="https://www.youtube.com/watch?v=ZzBWh6orSHk">Algebraic Path Problem</a> (APP). APP is a higher-level framework (with <a href="https://arxiv.org/abs/2005.06682">some roots</a> in category theory) unifying many existing graph metrics like the shortest path, <a href="https://en.wikipedia.org/wiki/Cost_distance_analysis">commute cost distance</a>, and minimax distance through the notion of semirings — algebraic structures over sets with specific operators and properties. For instance, shortest paths can be described as a semiring with “<em>min</em>” and “<em>+</em>” operators with neutral elements “<em>+inf</em>” and “<em>0</em>”.</p><p>Here, the authors create a single APP framework of <strong>log-norm distances</strong> that allows one to interpolate between shortest paths, commute costs, and minimax using only two parameters. In essence, you could vary and mix the influence of edge weights and surrounding graph structure (other paths) on the final distance. Although there are no experiments, this is a solid theoretical contribution — if you are learning category theory as “eating your veggies” 🥦, this paper is a blast to read — and will surely find applications in GNNs. 👏</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/570/0*Rxb2p7BVGojtW4Dj" /><figcaption>Log-norm distances. Source: <a href="https://proceedings.mlr.press/v162/sanmarti-n22a/sanmarti-n22a.pdf">Sanmartín et al</a></figcaption></figure><p>➡️ Finally, we’d add to this category a work <a href="https://arxiv.org/pdf/2206.08119.pdf"><em>“Learning to Infer the Structures of Network Games”</em></a> by <strong>Rossi et al</strong> who combine graph theory with game theory.
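</p><p>As a quick aside before we get to games: the semiring view above is easy to play with in code. Below is a minimal sketch (ours, not the paper’s log-norm construction) of a generalized Floyd-Warshall closure parameterized by a semiring:</p>

```python
import math

def closure(adj, add, mul, one):
    """Generalized Floyd-Warshall over a semiring (add, mul).
    `one` is the multiplicative identity (used on the diagonal);
    missing edges in `adj` should be pre-filled with the semiring's
    additive identity, e.g. +inf for (min, +). A sketch of the
    Algebraic Path Problem view, not the log-norm construction."""
    n = len(adj)
    d = [[one if i == j else adj[i][j] for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = add(d[i][j], mul(d[i][k], d[k][j]))
    return d

INF = math.inf
# a complete 3-node graph (diagonal entries are replaced by `one`)
adj = [[INF, 1.0, 4.0],
       [1.0, INF, 2.0],
       [4.0, 2.0, INF]]

# (min, +) semiring -> shortest-path distances
shortest = closure(adj, min, lambda a, b: a + b, 0.0)
# (max, min) semiring -> widest (minimax-style bottleneck) paths
widest = closure(adj, max, min, INF)
```

<p>Instantiating it with (<em>min</em>, <em>+</em>) recovers shortest paths, while (<em>max</em>, <em>min</em>) yields widest paths: the same algorithm, just a different algebra.</p><p>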
Game theory is used a lot in economics and other multidisciplinary studies — you’ve probably heard about the <a href="https://en.wikipedia.org/wiki/Nash_equilibrium">Nash Equilibrium</a> that defines a solution for non-cooperative games. In this work, the authors consider 3 game types: <em>linear quadratic</em>, <em>linear influence</em>, and <em>Barik-Honorio graphical games</em>. Games are usually defined through their utility functions, but in this work, we assume we don’t know anything about the game’s utility function.</p><p>Games are defined as N players (nodes in a graph) that take specific actions (for simplicity, let’s say we can describe them with a certain numerical feature — check the illustration below 🖼️). Actions can influence neighboring players — and the task is framed as inferring the graph of players given their actions. In essence, this is a graph generation task — given node features X, predict a (normalized) adjacency matrix A. Usually, a game is played K times, and those are independent games, so the encoder model should be invariant to permutations of games (and equivariant to permutation of nodes in each game). The authors propose the <strong>NuGgeT</strong> 🍗 encoder-decoder model where a transformer encoder processes K games of N players and yields latent representations; the decoder is an MLP over a sum of Hadamard products of latent pairwise player features, such that the decoder is permutation-invariant to the order of K games.</p><p>Experimentally, the model works well on both synthetic and real datasets.
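</p><p>The invariance requirement itself fits in a tiny numpy sketch (ours, far simpler than NuGgeT’s transformer): per-game features are summed over the K games, so shuffling the games cannot change the predicted adjacency.</p>

```python
import numpy as np

def infer_graph(actions, w=0.7):
    """Toy game-graph inference: `actions` is (K, N) -- K independent
    plays by N players. Summing per-game features over K makes the
    encoding invariant to the order of games; symmetric pairwise
    products play the role of the decoder. All design choices are ours.
    """
    z = np.tanh(w * actions).sum(axis=0)       # (N,) K-invariant latent
    logits = np.outer(z, z)                    # symmetric pair scores
    np.fill_diagonal(logits, 0.0)              # no self-loops
    return 1.0 / (1.0 + np.exp(-logits))       # soft adjacency in (0, 1)
```

<p>Shuffling the K games leaves the output unchanged, while permuting the N players permutes rows and columns of the predicted adjacency accordingly, which are exactly the two symmetries the paper bakes in.</p><p>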
The paper is definitely a “broaden your horizons” 🔭 work that you might not expect to see at ICML, but later find to be a fascinating read that teaches you a lot of new concepts 👏.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QdpAB57Qa1a0744B" /><figcaption>Source: <a href="https://arxiv.org/pdf/2206.08119.pdf">Rossi et al</a></figcaption></figure><h3>Knowledge Graph Reasoning</h3><p>Knowledge graph reasoning has long been a playground for GraphML methods. At this year’s ICML, there are quite a few interesting papers on this topic. As a trend of this year, we see a significant drift from embedding methods (<a href="https://proceedings.neurips.cc/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf">TransE</a>, <a href="http://proceedings.mlr.press/v48/trouillon16.pdf">ComplEx</a>, <a href="https://arxiv.org/pdf/1902.10197.pdf">RotatE</a>, <a href="https://ojs.aaai.org/index.php/AAAI/article/view/5701/5557">HAKE</a>) to GNNs and logic rules (in fact, GNNs are also <a href="https://openreview.net/pdf?id=r1lZ7AEKvB">related to logic rules</a>). There are four papers based on GNNs or logic rules, and two papers extending the conventional embedding methods.</p><p>➡️ Let’s begin with the <a href="https://github.com/pkuyzy/CBGNN"><strong>cycle basis GNN (CBGNN)</strong></a> proposed by <a href="https://proceedings.mlr.press/v162/yan22a/yan22a.pdf">Yan et al</a>. The authors draw an interesting connection between logic rules and cycles. For any chain-like logic rule, the head and the body of the logic rule always form a cycle in the knowledge graph. For example, the right plot of the following figure shows the cycle for (X, part of, Y) ∧ (X, lives in, Z) → (Y, located in, Z).
In other words, the inference of a logic rule can be viewed as predicting the plausibility of a cycle, which boils down to learning the representations of cycles.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Vc0atVKc1jvi5cPc" /><figcaption>Blue and Red triangles are cycles within the bigger Green cycle. Source: <a href="https://proceedings.mlr.press/v162/yan22a/yan22a.pdf">Yan et al</a></figcaption></figure><p>An interesting observation is that cycles form a linear space under <em>modulo-2</em> addition and multiplication. In the above example, the summation of the red ❤️ and blue 💙 cycles, which cancels out their common edge, results in the green 💚 cycle. Therefore, we don’t need to learn the representations of all cycles — instead, only a <strong>few cycle bases</strong> of the linear space. The authors generate the cycle bases by picking cycles that have a large overlap with the shortest-path tree. To learn the representations of cycles, they create a cycle graph, where each node is a cycle in the original graph, and each edge indicates an overlap between two cycles. A GNN is applied to the cycle graph to learn representations of its nodes (which are cycles of the original graph).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*G3ePyiK1NED60ZU3" /><figcaption>CBGNN encoding. Source: <a href="https://proceedings.mlr.press/v162/yan22a/yan22a.pdf">Yan et al</a></figcaption></figure><p>To apply <strong>CBGNN</strong> to inductive relation prediction, the authors construct an inductive input representation for each cycle by encoding the relations in the cycle with an LSTM. Experimentally, CBGNN achieves SotA results on the inductive versions of FB15k-237/WN18RR/NELL-995.</p><p>➡️ Next, <a href="https://proceedings.mlr.press/v162/das22a/das22a.pdf">Das and Godbole et al</a> propose <a href="https://github.com/rajarshd/CBR-SUBG"><strong>CBR-SUBG</strong></a>, a case-based reasoning (CBR) method for KBQA.
The core idea is to retrieve similar query-answer pairs from the training set when solving a query. The idea of retrieval is very popular in OpenQA tasks (<a href="https://arxiv.org/pdf/2106.05346.pdf">EMDR</a>, <a href="https://arxiv.org/abs/2005.11401">RAG</a>, <a href="https://arxiv.org/pdf/2010.12688.pdf">KELM</a>, <a href="https://openreview.net/forum?id=OY1A8ejQgEX">Mention Memory LMs</a>), but this is the first time we see such an idea adopted on graphs.</p><p>Given a natural language query, CBR first retrieves similar k-nearest neighbor (kNN) queries based on the query representation encoded by a pretrained language model. All the retrieved queries are from the training set, and therefore their answers are accessible. Then we generate a local subgraph for each query-answer pair, which is believed to be the reasoning pattern (though not necessarily an exact one) for the answer. The local subgraph of the current query (for which we can’t access the answer) is generated by following the relation paths in the subgraphs of its kNN queries. <strong>CBR-SUBG</strong> then applies a GNN to every subgraph, and predicts the answer by comparing the node representations with the answers in the kNN queries.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EWoo_QevY7Ohfyt3" /><figcaption>Case-based reasoning intuition. Source: <a href="https://proceedings.mlr.press/v162/das22a/das22a.pdf">Das and Godbole et al</a></figcaption></figure><p>➡️ There are two neural-symbolic methods for reasoning this year. The first one is <a href="https://github.com/claireaoi/hierarchical-rule-induction"><strong>hierarchical rule induction (HRI)</strong></a> from <a href="https://proceedings.mlr.press/v162/glanois22a/glanois22a.pdf">Glanois et al</a>. HRI extends a previous work, <a href="https://arxiv.org/pdf/1809.02193.pdf">logic rule induction (LRI)</a>, on inductive logic programming.
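</p><p>Rewinding to the retrieval step of CBR-SUBG for a second, the kNN lookup over encoded queries is easy to sketch (ours; the paper encodes queries with a pretrained language model, here we simply assume the vectors are given):</p>

```python
import numpy as np

def knn_queries(query_vec, train_vecs, k=2):
    """Return indices of the k training queries most similar to the
    current query under cosine similarity -- the first stage of the
    case-based reasoning pipeline (a sketch, not the paper's code)."""
    q = query_vec / np.linalg.norm(query_vec)
    t = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    return np.argsort(-(t @ q))[:k]
```

<p>The answers attached to the retrieved neighbors then seed the subgraph construction described above.</p><p>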
The idea of rule induction is to learn a bunch of rules and apply them to deduce facts, as in <a href="https://en.wikipedia.org/wiki/Forward_chaining">forward chaining</a>.</p><p>In both <strong>LRI</strong> and <strong>HRI</strong>, each fact P(s,o) is represented by a predicate embedding <em>𝜃p</em> and a valuation vp (i.e., the probability of the fact being true). Each rule P(X,Y) ← P1(X,Z) ∧ P2(Z,Y) is represented by the embeddings of its predicates. The goal is to iteratively apply rules to deduce new facts. During each iteration, the rules and facts are matched through soft unification, which measures whether two facts satisfy certain rules in the embedding space. Once a rule is selected, a new fact is generated and added to the set of facts. All the embeddings and the soft unification operation are trained end-to-end to maximize the likelihood of observed facts.</p><p>The <strong>HRI</strong> model improves over the LRI model in three aspects: 1) using a hierarchical prior that separates the rules used in each iteration step; 2) using Gumbel-softmax to induce a sparse and interpretable solution for soft unification; 3) proving which logic rules HRI can express.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*iJAW14fQ4A7H9wHI" /><figcaption>Hierarchical Rule Induction. Source: <a href="https://proceedings.mlr.press/v162/glanois22a/glanois22a.pdf">Glanois et al</a></figcaption></figure><p>➡️ The second one is the GNN-QE paper from <a href="https://proceedings.mlr.press/v162/zhu22c/zhu22c.pdf">Zhu et al</a> (<strong>disclaimer</strong>: a paper from the authors of this blog post). GNN-QE solves complex logical queries on knowledge graphs with GNNs and fuzzy sets. It enjoys the advantages of both neural (e.g., strong performance) and symbolic (e.g., interpretability) methods. As there is a lot of interesting stuff in GNN-QE, we will have a separate blog post for it soon. Stay tuned!
🤗</p><p>➡️ Finally, <a href="https://proceedings.mlr.press/v162/kamigaito22a/kamigaito22a.pdf">Kamigaito and Hayashi</a> study the theoretical and empirical effects of <strong>negative sampling</strong> in knowledge graph embeddings. Starting from <a href="https://arxiv.org/pdf/1902.10197.pdf">RotatE</a>, knowledge graph embedding methods use a normalized negative sampling loss, plus a margin binary cross entropy loss. This is different from the negative sampling used in the original <a href="https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf">word2vec</a>. In this paper, the authors prove that the normalized negative sampling loss is necessary for distance-based models (<a href="https://proceedings.neurips.cc/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf">TransE</a>, <a href="https://arxiv.org/pdf/1902.10197.pdf">RotatE</a>) to reach the optimal solution. The margin also plays an important role in distance-based models. The <strong>optimal solution</strong> can only be reached if <em>𝛾 ≥ log|V|</em>, which is consistent with the empirical results. Based on this conclusion, now we can determine the optimal margin without hyperparameter tuning! 😄</p><h3>Computational Biology: Molecular Linking, Protein Binding, Property Prediction</h3><p>Generally, comp bio is represented at ICML pretty well. Here, we’ll have a look at new approaches for <strong>molecular linking</strong>, <strong>protein binding</strong>, conformer generation, and molecular property prediction.</p><p><strong>Molecular linking</strong> is a crucial part in designing <a href="https://en.wikipedia.org/wiki/Proteolysis_targeting_chimera">Proteolysis targeting chimera (PROTAC)</a> drugs. 
For us, mere GNN researchers 🤓 without a biological background, it means that given two molecules, we want to generate a valid <em>linker</em> molecule that would attach the two <em>fragment</em> molecules into a single molecule while retaining all properties of the original fragment molecules (check the illustration below for a good example).</p><p>➡️ For generating molecular links, <a href="https://arxiv.org/pdf/2205.07309.pdf">Huang et al</a> created <strong>3DLinker</strong>, an E(3)-equivariant generative model (VAE) that sequentially generates atoms (and connecting bonds) with <strong>absolute</strong> coordinates. Often, equivariant models generate relative coordinates or relative distance matrices, but here, the authors aim at generating absolute <em>(x, y, z)</em> coordinates. To allow a model to generate exact coordinates from equivariant (to coordinates) and invariant (to node features) transformations, the authors apply a clever idea of <a href="https://arxiv.org/pdf/2104.12229.pdf">Vector Neurons</a> which is essentially a ReLU-like nonlinearity for preserving feature equivariance with clever orthogonal projection tricks.</p><p>The E(3)-equivariant encoder enriched with <strong>Vector Neurons</strong> encodes features and coordinates while the decoder sequentially generates the linker step by step (illustrated below as well): 1️⃣ predict an anchor node to which the link will be attached; 2️⃣ predict the node type of a linker node; 3️⃣ predict an edge and its absolute coordinates; 4️⃣ repeat until we hit the stop node in the second fragment. <strong>3DLinker</strong> is (so far) the first equivariant model that generates the linker molecule with <strong>exact 3D coordinates</strong> and predicts the anchor points in fragment molecules — previous models required known anchors before generation — and shows the best experimental results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5sGzmdVs6oG1JtJ1" /><figcaption>3DLinker intuition.
Source: <a href="https://arxiv.org/pdf/2205.07309.pdf">Huang et al</a></figcaption></figure><p>➡️<strong> Protein-ligand binding</strong> is the other crucial drug discovery task — predicting where a small molecule could potentially attach to a certain region of a bigger protein. First, <a href="https://arxiv.org/pdf/2202.05146.pdf">Stärk, Ganea, et al </a>create <a href="https://github.com/HannesStark/EquiBind"><strong>EquiBind</strong></a> (ICML Spotlight 💡) that takes as input a protein and a random RDKit conformer of a ligand graph, and outputs the precise 3D location of the binding interaction. EquiBind has already garnered a very warm reception and publicity, as in <a href="https://news.mit.edu/2022/ai-model-finds-potentially-life-saving-drug-molecules-thousand-times-faster-0712">MIT News</a> and <a href="https://www.youtube.com/watch?v=706KjyR-wyQ&amp;list=PLoVkjhDgBOt11Q3wu8lr6fwWHn5Vh3cHJ&amp;index=14">reading groups</a>, so we encourage you to have a closer look at the technical details! <strong>EquiBind</strong> is orders of magnitude faster than commercial software while maintaining high prediction accuracy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2kndRG2X4H-MdRWm" /><figcaption>EquiBind. Source: <a href="https://arxiv.org/pdf/2202.05146.pdf">Stärk, Ganea, et al</a></figcaption></figure><p>➡️ If the binding molecule is unknown and we want to generate such a molecule, <a href="https://proceedings.mlr.press/v162/liu22m/liu22m.pdf">Liu et al</a> create <a href="https://github.com/divelab/GraphBP"><strong>GraphBP</strong></a>, an autoregressive molecule generation approach that takes as input a target protein site (denoted as the initial context).
Encoding the context with any 3D GNN (<a href="https://proceedings.neurips.cc/paper/2017/file/303ed4c69846ab36c2904d3ba8573050-Paper.pdf">SchNet</a> here), GraphBP generates atom type and spherical coordinates until there are no more contacting atoms available or the desired number of atoms is reached. Once the atoms are generated, the authors resort to <a href="https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-3-33">OpenBabel</a> to create bonds.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*y7Or_o3IBX2kkw4m" /><figcaption>Generating a binding molecule with GraphBP. Source: <a href="https://proceedings.mlr.press/v162/liu22m/liu22m.pdf">Liu et al</a></figcaption></figure><p>➡️ In<strong> molecular property prediction, </strong><a href="https://proceedings.mlr.press/v162/yu22a/yu22a.pdf">Yu and Gao</a> propose a simple and surprisingly powerful idea to enrich molecular representations with a bag of motifs. That is, they first mine a vocabulary of motifs in the training dataset and rank them according to <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">TF-IDF</a> scores (hello from NLP 😉). Then, each molecule can be represented as a bag of motifs (multi-hot encoding) and the whole dataset of molecules is converted to one heterogeneous graph with relations “motif-molecule” if any molecule contains this motif, and “motif-motif” if any two motifs share an edge in any molecule. Edge features are those TF-IDF scores mined before.</p><p>The final embedding of a molecule is obtained through a concatenation of any vanilla GNN over the molecule and another heterogeneous GNN over a sampled subgraph from the motif graph. 
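</p><p>The TF-IDF ranking over motifs is the classic text formula with molecules as “documents” and motifs as “words”. A quick sketch (ours; the paper’s exact weighting variant may differ):</p>

```python
import math
from collections import Counter

def motif_tfidf(molecule_motifs):
    """TF-IDF scores of motifs across a dataset of molecules, in the
    spirit of HM-GNN's motif vocabulary ranking (a sketch using the
    textbook TF-IDF formula).

    molecule_motifs: list of motif-name lists, one list per molecule
    """
    n = len(molecule_motifs)
    # document frequency: in how many molecules does each motif occur?
    df = Counter(m for mol in molecule_motifs for m in set(mol))
    scores = []
    for mol in molecule_motifs:
        tf = Counter(mol)
        scores.append({m: (tf[m] / len(mol)) * math.log(n / df[m])
                       for m in tf})
    return scores
```

<p>Motifs that appear in every molecule get a score of zero, while rare but repeated motifs are up-weighted; those scores then become edge features in the heterogeneous motif graph.</p><p>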
Such a <a href="https://github.com/ZhaoningYu1996/HM-GNN"><strong>Heterogeneous Motif GNN (HM-GNN)</strong></a> consistently outperforms <a href="https://arxiv.org/abs/2006.09252">Graph Substructure Networks (GSN)</a>, one of the first GNN architectures that proposed to count triangles in social networks and k-cycles in molecules, and even <a href="https://arxiv.org/pdf/2106.12575.pdf">Cell Isomorphism Networks (CIN)</a>, a top-notch higher-order message passing model. HM-GNNs can serve as a simple powerful baseline for subsequent research in the area of higher-order GNNs 💪.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nu5xOxemZSCuJZmp" /><figcaption>Building a motif vocabulary in HM-GNN. Source: <a href="https://proceedings.mlr.press/v162/yu22a/yu22a.pdf">Yu and Gao</a></figcaption></figure><p>➡️ Finally, a work by <a href="https://proceedings.mlr.press/v162/stark22a/stark22a.pdf">Stärk et al</a> demonstrates the benefits of pre-training GNNs both on 2D molecular graphs and their 3D conformers with the <a href="https://github.com/HannesStark/3DInfomax"><strong>3D Infomax</strong></a> approach. The idea of <strong>3D Infomax</strong> is in maximizing mutual information between 2D and 3D representations such that at inference time over 2D graphs, when no 3D structure is given, the model could still benefit from implicit knowledge of the 3D structure.</p><p>For that, 2D molecules are encoded with the <a href="https://arxiv.org/abs/2004.05718">Principal Neighborhood Aggregation (PNA)</a> net, 3D conformers are encoded with the <a href="https://openreview.net/forum?id=givsRXsOt9r">Spherical Message Passing (SMP)</a> net, we take the cosine similarity of their representations and pass through the contrastive loss maximizing the similarity of a molecule with its true 3D conformers and treating other samples as negatives. 
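</p><p>The objective is a standard InfoNCE-style contrastive loss over cosine similarities. A minimal numpy sketch (ours: batch rows are matched 2D/3D pairs, the temperature value is arbitrary, and the paper’s exact loss may differ):</p>

```python
import numpy as np

def contrastive_loss(h2d, h3d, tau=0.1):
    """InfoNCE-style loss between matched 2D-graph and 3D-conformer
    embeddings: row i of h2d should be most similar to row i of h3d;
    every other row in the batch acts as a negative."""
    a = h2d / np.linalg.norm(h2d, axis=1, keepdims=True)
    b = h3d / np.linalg.norm(h3d, axis=1, keepdims=True)
    sims = (a @ b.T) / tau                 # scaled cosine similarities
    # cross-entropy with the diagonal (the true conformer) as target
    log_z = np.log(np.exp(sims).sum(axis=1))
    return float(np.mean(log_z - np.diag(sims)))
```

<p>Matched pairs drive the loss toward zero; shuffling conformers against molecules makes it blow up, which is exactly the signal that distills 3D knowledge into the 2D encoder.</p><p>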
Having pre-trained 2D and 3D nets, we can fine-tune the weights of the 2D net on a downstream task — QM9 property prediction in this case — and the results definitely show that pretraining works. By the way, if you are further interested in pre-training, you can check out <a href="https://openreview.net/forum?id=xQUe1pOKPam">GraphMVP</a> published at ICLR 2022 as another 2D/3D pre-training approach.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*3UpGqTxkVbAGVQGm" /><figcaption>In 3D Infomax, we first pre-train 2D and 3D nets, and use a trained 2D net at inference time. Source: <a href="https://proceedings.mlr.press/v162/stark22a/stark22a.pdf">Stärk et al</a></figcaption></figure><h3>Cool Graph Applications</h3><p>Physical simulation along with molecular dynamics received a huge boost with GNNs. A standard setup of physical simulation is a system of particles where node features are several recent velocities and edge features are relative displacements, and the task is to predict where the particles move at the next time step.</p><p>⚛️ This year, <a href="https://proceedings.mlr.press/v162/rubanova22a/rubanova22a.pdf">Rubanova, Sanchez-Gonzalez et al</a> further improve physical simulations by incorporating explicit scalar constraints in the <strong>C-GNS</strong> <strong>(Constraint-based Graph Network Simulator)</strong>. Conceptually, the output of an MPNN encoder is further refined through a solver that minimizes some learned (or specified at inference time) constraint. The solver itself is a differentiable function (5-iteration gradient descent in this case), so we can backprop through the solver as well.
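</p><p>The inner loop is easy to picture: a few differentiable gradient steps on a constraint C(y), applied on top of whatever the network predicted. A toy numpy sketch with a hand-picked constraint (the real C-GNS learns the constraint):</p>

```python
import numpy as np

def refine(y0, grad_c, lr=0.1, steps=5):
    """Refine a prediction y0 with a few gradient steps on a constraint
    C(y) (C-GNS uses 5 iterations). Every step is differentiable, so in
    a real model the refinement sits inside backprop."""
    y = np.asarray(y0, dtype=float).copy()
    for _ in range(steps):
        y = y - lr * grad_c(y)
    return y

# toy constraint: total "momentum" should be zero, C(y) = 0.5 * sum(y)^2
grad_c = lambda y: y.sum() * np.ones_like(y)
refined = refine([1.0, 2.0, 3.0], grad_c)
```

<p>Each step shrinks the constraint violation (here the total sum decays by a factor of 1 − lr·n per iteration) without requiring the encoder itself to output perfectly constraint-satisfying states.</p><p>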
C-GNS is inherently connected to <a href="http://implicit-layers-tutorial.org/">deep implicit layers</a> that are getting more and more visibility, including <a href="https://fabianfuchsml.github.io/equilibriumaggregation/">the GNN applications</a>.</p><p>Physical simulation works are often a source of fancy simulation visualizations — check out the <a href="https://sites.google.com/view/constraint-based-simulator">website with video demos</a>!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0zUhtBrLrdGLJS7z" /><figcaption>Constraint-based Graph Network Simulator. Source: <a href="https://proceedings.mlr.press/v162/rubanova22a/rubanova22a.pdf">Rubanova, Sanchez-Gonzalez et al</a></figcaption></figure><p>A few other cool applications you might want to have a look at:</p><ul><li><strong>Traffic Prediction</strong>: <a href="https://proceedings.mlr.press/v162/lan22a/lan22a.pdf">Lan, Ma, et al</a> created <a href="https://github.com/SYLan2019/DSTAGNN"><strong>DSTA-GNN</strong></a> (Dynamic Spatial-Temporal Aware Graph Neural Network) for traffic prediction 🚥 evaluated on real-world datasets of busy California roads — predicting traffic with graphs received a boost last year after the massive work by Google and DeepMind on improving Google Maps ETA, <a href="https://towardsdatascience.com/graph-ml-in-2022-where-are-we-now-f7f8242599e0#2ddd">which we covered in the 2021 results</a>.</li><li><strong>Neural Network Pruning</strong>: <a href="https://proceedings.mlr.press/v162/yu22e/yu22e.pdf">Yu et al</a> design <a href="https://github.com/yusx-swapp/GNN-RL-Model-Compression"><strong>GNN-RL</strong></a> to iteratively prune weights of deep neural nets given a desired ratio of FLOPs reduction. For that, the authors treat a neural net’s computational graph as a hierarchical graph of blocks and send it to a hierarchical GNN (with intermediate learnable pooling to coarse-grain the NN architecture).
Encoded representations are sent to the RL agent that decides which block to prune.</li><li><strong>Ranking</strong>: <a href="https://arxiv.org/pdf/2202.00211.pdf">He et al</a> tackle an interesting task — given a matrix of pairwise interactions, e.g., between teams in a football league where <em>Aij &gt; 0 </em>means team <em>i</em> got a better score than team <em>j</em>, find the final ranking of nodes (teams) who scored best. In other words, we want to predict who is the winner of a league after seeing pair-wise results of all games. The authors propose <a href="https://github.com/SherylHYX/GNNRank"><strong>GNNRank</strong></a> that represents pairwise results as a directed graph and applies a directional GNN to get latent node states and compute the <a href="https://en.wikipedia.org/wiki/Algebraic_connectivity">Fiedler vector</a> of the graph Laplacian. Then, they frame the task as a constrained optimization problem with <em>proximal</em> gradient steps as we can’t easily backprop through the computation of the Fiedler vector.</li></ul><p>That’s finally it for ICML 2022! 😅</p><p>Looking forward to seeing NeurIPS 2022 papers as well as submissions to the brand-new <a href="https://logconference.org/"><strong>Learning on Graphs (LoG)</strong></a><strong> </strong>conference!</p><hr><p><a href="https://medium.com/data-science/graph-machine-learning-icml-2022-252f39865c70">Graph Machine Learning @ ICML 2022</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GraphGPS: Navigating Graph Transformers]]></title>
            <link>https://medium.com/data-science/graphgps-navigating-graph-transformers-c2cc223a051c?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/c2cc223a051c</guid>
            <category><![CDATA[graph-machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[computer-science]]></category>
            <category><![CDATA[geometric-deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Tue, 14 Jun 2022 02:04:43 GMT</pubDate>
            <atom:updated>2022-06-14T18:44:06.909Z</atom:updated>
            <content:encoded><![CDATA[<h4>Recent Advances in Graph ML</h4><h4>Recipes for cooking the best graph transformers</h4><p>In 2021, graph transformers (GT) won recent molecular property prediction challenges thanks to alleviating many issues pertaining to vanilla message passing GNNs. Here, we try to organize numerous freshly developed GT models into a single GraphGPS framework to enable general, powerful, and scalable graph transformers with linear complexity for all types of Graph ML tasks. Turns out, just a well-tuned GT is enough to show SOTA results on many practical tasks!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gOUffNZRkXJB5utkz6hlWg.png" /><figcaption>Message passing GNNs, fully-connected Graph Transformers, and positional encodings. Image by Authors</figcaption></figure><p><em>This post was written together with</em><a href="https://rampasek.github.io/"><em> Ladislav Rampášek</em></a><em>, </em><a href="https://twitter.com/dom_beaini?lang=en"><em>Dominique Beaini</em></a><em>, and </em><a href="https://vijaydwivedi.com.np/"><em>Vijay Prakash Dwivedi</em></a><em> and is based on the paper </em><a href="https://arxiv.org/abs/2205.12454"><em>Recipe for a General, Powerful, Scalable Graph Transformer (2022)</em></a> <em>by Rampášek et al. 
You can also follow </em><a href="https://twitter.com/michael_galkin"><em>me</em></a><em>, </em><a href="https://twitter.com/rampasek"><em>Ladislav</em></a><em>, </em><a href="https://twitter.com/vijaypradwi"><em>Vijay</em></a><em>, and </em><a href="https://twitter.com/dom_beaini"><em>Dominique</em></a><em> on Twitter.</em></p><p>Outline:</p><ol><li><a href="#675f">Message Passing GNNs vs Graph Transformers</a></li><li><a href="#3975">Pros, Cons, and Variety of Graph Transformers</a></li><li><a href="#40a9">The GraphGPS framework</a></li><li><a href="#e373">General: The Blueprint</a></li><li><a href="#cf65">Powerful: Structural and Positional Features</a></li><li><a href="#6be7">Scalable: Linear Transformers</a></li><li><a href="#beae">Recipe time — how to get the best out of your GT</a></li></ol><h3><strong>Message Passing GNNs vs Graph Transformers</strong></h3><p>Message passing GNNs (conventionally analyzed from the <a href="https://towardsdatascience.com/graph-neural-networks-beyond-weisfeiler-lehman-and-vanilla-message-passing-bc8605fa59a">Weisfeiler-Leman perspective</a>) notoriously suffer from <a href="https://openreview.net/forum?id=S1ldO2EFPr"><strong>over-smoothing</strong></a> (increasing the number of GNN layers, the features tend to converge to the same value), <a href="https://openreview.net/forum?id=i80OPhOCVH2"><strong>over-squashing</strong></a> (losing information when trying to aggregate messages from many neighbors into a single vector), and perhaps most importantly, poor capturing of long-range dependencies which is noticeable already on small but sparse molecular graphs.</p><p>Today, we know many ways to break through the glass ceiling of message passing — including <a href="https://towardsdatascience.com/using-subgraphs-for-more-expressive-gnns-8d06418d5ab">higher-order GNNs</a>, better <a href="https://openreview.net/forum?id=7UmjRGzp-A">understanding of graph topology</a>, <a 
href="https://towardsdatascience.com/graph-neural-networks-as-neural-diffusion-pdes-8571b8c0c774">diffusion models</a>, <a href="https://arxiv.org/abs/2110.09443">graph rewiring</a>, and <a href="https://arxiv.org/abs/2012.09699">graph transformers</a>!</p><p>Whereas in the message passing scheme a node’s update is a function over its <strong>neighbors</strong>, in GTs, a node’s update is a function of <strong>all</strong> nodes in a graph (thanks to the self-attention mechanism in the Transformer layer). That is, an input to a GT instance is the whole graph.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1OG1LhGV9klzHkOCCUYF3Q.png" /><figcaption>Updating a target node representation (red), local message passing aggregates only immediate neighbors while global attention is a function of all nodes in a graph. In GraphGPS, we combine both! Image by Authors</figcaption></figure><h3><strong>Pros and Cons of Graph Transformers</strong></h3><p>Feeding the whole graph into the Transformer layer brings several immediate benefits and drawbacks.</p><p>✅ Pros:</p><ul><li>Akin to graph rewiring, we now decouple the node update procedure from the graph structure.</li><li>No problem handling long-range connections as all nodes are now connected to each other (we often separate <em>true</em> edges coming from the original graph and <em>virtual</em> edges added when computing the attention matrix — check the illustration above where solid lines denote real edges and dashed lines — virtual ones).</li><li>Bringing the “Navigating a Maze” analogy, instead of walking and looking around, we can use a map, destroy maze walls, and use magic wings 🦋. We have to learn the map beforehand though, and later we’ll see how to make the navigation more precise and efficient.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fD-jDP5ewyvC7AlMq_2pPw.png" /><figcaption>Graph Transformers give you wings 🦋. 
Source: <a href="https://twitter.com/dom_beaini/status/1499019741234704385">Dominique Beaini @ Twitter</a></figcaption></figure><p>🛑 The cons are similar to those stemming from using transformers in NLP:</p><ul><li>Whereas language input is sequential, graphs are permutation-invariant to node ordering. We need better <strong>identifiability</strong> of nodes in a graph — this is often achieved through some form of positional features. For instance, in NLP, the <a href="https://arxiv.org/abs/1706.03762">original Transformer</a> uses sinusoidal positional features for the absolute position of a token in a sequence, whereas the more recent <a href="https://openreview.net/forum?id=R8sQPpGCv0">ALiBi</a> introduces a relative positional encoding scheme.</li><li>Loss of the inductive bias that enables GNNs to work so well on graphs with <strong>pronounced locality</strong>, which is the case in many real-world graphs, particularly those where edges represent relatedness/closeness. By rewiring the graph to be fully connected, we have to put the structure back in some way, otherwise we are likely to “throw the baby out with the bathwater”.</li><li>Last but not least, a limitation is the <strong>quadratic</strong> computational <strong>complexity</strong> O(N²) in the number of nodes, whereas message passing GNNs are linear in the number of edges, O(E). Graphs are often sparse, i.e., E is on the order of N, so on large graphs full attention becomes dramatically more expensive than message passing.
Can we do something about it?</li></ul><p>2021 brought a great variety of positional and structural features for GTs to make nodes more distinguishable.</p><p>The <a href="https://arxiv.org/abs/2012.09699">first GT architecture by Dwivedi &amp; Bresson</a> used Laplacian <strong>eigenvectors</strong> as positional encodings, <a href="https://arxiv.org/abs/2106.03893">SAN by Kreuzer et al</a> also added Laplacian <strong>eigenvalues</strong> to re-weight attention accordingly, <a href="https://arxiv.org/pdf/2106.05234.pdf">Graphormer by Ying et al</a> added <strong>shortest path distances</strong> as an attention bias, <a href="https://arxiv.org/abs/2201.08821">GraphTrans by Wu, Jain et al</a> runs a GT after passing a graph through a GNN, and the <a href="https://arxiv.org/abs/2202.03036">Structure-Aware Transformer by Chen et al</a> aggregates a k-hop subgraph around each node as its positional feature.</p><p>In the land of graph positional features, in addition to the Laplacian-derived features, a recent batch of works includes <a href="https://arxiv.org/abs/2110.07875">Random Walk Structural Encodings (RWSE) by Dwivedi et al</a>, which take the diagonal of the m-th power of the random walk matrix, <a href="https://arxiv.org/abs/2202.13013">SignNet by Lim, Robinson et al</a>, which ensures sign invariance of Laplacian eigenvectors, and <a href="https://openreview.net/pdf?id=e95i1IHcWj">Equivariant and Stable PEs by Wang et al</a>, which ensure permutation and rotation equivariance of node and position features, respectively.</p><p>Well, there are so many of them 🤯 How do I know what suits my task best?</p><blockquote>Is there a principled way to organize and work with all those graph transformer layers and positional features?</blockquote><p>Yes!
That is what we present in our recent paper <a href="https://arxiv.org/abs/2205.12454">Recipe for a General, Powerful, Scalable Graph Transformer</a>.</p><h3><strong>The GraphGPS framework</strong></h3><p>In GraphGPS, GPS stands for:</p><p><strong>🧩 General</strong> — we propose a blueprint for building graph transformers by combining modules for feature (pre)processing, local message passing, and global attention into a single pipeline</p><p><strong>🏆 Powerful</strong> — the GPS graph transformer is provably more powerful than the 1-WL test when paired with proper positional and structural features</p><p><strong>📈 Scalable</strong> — we introduce linear global attention modules and break through the long-standing limitation of running graph transformers only over molecular graphs (fewer than 100 nodes on average). Now we can do it on graphs of many thousands of nodes each!</p><p>Or maybe it means Graph with Position and Structure? 😉</p><h3><strong>General: The Blueprint</strong></h3><p>Why tinker with message passing GNNs or graph transformers to enable a certain feature when we could use the best of both worlds? Let the model decide what’s important for a given set of tasks and graphs. Generally, the blueprint can be described in one picture:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QKN2j0vBNS8fF-W2EuW5NQ.png" /><figcaption>The GraphGPS blueprint proposes a modular architecture for building graph transformers with various positional and structural features, as well as local and global attention. Source: <a href="https://arxiv.org/abs/2205.12454">arxiv</a>. Click to enlarge</figcaption></figure><p>It looks a bit massive, so let’s break it down part by part and check what is happening there.</p><p>Overall, the blueprint consists of <strong>3 major components</strong>:</p><ol><li>Node identification through positional and structural encodings.
After analyzing many recently published methods for adding positionality in graphs, we found they can be broadly grouped into 3 buckets: <strong>local</strong>, <strong>global</strong>, and <strong>relative</strong>. Such features are provably powerful and help to overcome the notorious 1-WL limitation. More on that below!</li><li>Aggregation of node identities with original graph features — those are your input node, edge, and graph features.</li><li>Processing layers (GPS layers) — how we actually process the graphs with the constructed features; here we combine both local message passing (any MPNN) and global attention models (any graph transformer)</li><li>(Bonus 🎁) You can combine any positional and structural feature with any processing layer in our new <a href="https://github.com/rampasek/GraphGPS">GraphGPS</a> library based on <a href="https://www.pyg.org/">PyTorch-Geometric</a>!</li></ol><h3><strong>Powerful: Structural and Positional Features</strong></h3><p>Structural and positional features aim at encoding a unique characteristic of each node or edge. In the most basic case (illustrated below), when all nodes have the same initial features or no features at all, applying positional and structural features helps to distinguish nodes in a graph, assign them diverse features, and provide at least some sense of graph structure.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Uc094aOMW4zd0Ycng0H2Xw.png" /><figcaption>Structural and positional features help distinguish nodes in a graph.
Image by Authors.</figcaption></figure><p>We usually separate <em>positional</em> from <em>structural</em> features (although there are works on the theoretical equivalence of the two, as in <a href="https://arxiv.org/abs/1910.00452">Srinivasan &amp; Ribeiro</a>).</p><blockquote>Intuitively, positional features help nodes to answer the question <strong>“Where am I?”</strong> while structural features answer “What does my neighborhood look like?”</blockquote><p><strong>Positional encodings (PEs)</strong> provide some notion of the position in space of a given node within a graph. They help a node to answer the question <strong>“Where am I?”</strong>. Ideally, we’d like to have some sort of Cartesian coordinates for each node, but since graphs are topological structures and there exists an infinite number of ways to position a graph on a 2D plane, we have to think of something different. Talking about PEs, we categorize existing (and theoretically possible) approaches into 3 branches — local PEs, global PEs, and relative PEs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lfGeiGoxi1Yyf2BzdYrKWQ.png" /><figcaption>Categorization of Positional encodings (PEs). Click to enlarge. Image by Authors.</figcaption></figure><p>👉 Local PEs (as node features) — within a <strong>cluster</strong>, the closer two nodes are to each other, the closer their local PEs will be, such as the position of a word in a sentence (but not in the whole text). Examples: (1) distance between a node and the centroid of the cluster containing this node; (2) sum of the non-diagonal elements of the m-step random walk matrix (m-th power).</p><p>👉 Global PEs (as node features) — within a <strong>graph</strong>, the closer two nodes are, the closer their global PEs are, such as the position of a word in a text.
Examples: (1) eigenvectors of the adjacency or Laplacian matrix, used in the original <a href="https://arxiv.org/abs/2012.09699">Graph Transformer</a> and <a href="https://arxiv.org/abs/2106.03893">SAN</a>; (2) distance from the centroid of the whole graph; (3) a unique identifier for each connected component</p><p>👉 Relative PEs (as edge features) — edge representations that correlate with the distance given by any local or global PE, such as the distance between two words. Examples: (1) pair-wise distances obtained from heat kernels, random walks, <a href="https://en.wikipedia.org/wiki/Green%27s_function">Green’s function</a>, or the graph geodesic; (2) gradient of eigenvectors of the adjacency or Laplacian, or gradient of any local/global PEs.</p><p>Let’s check an example of various PEs on this famous molecule ☕️ (<a href="https://youtu.be/w6Pw4MOzMuo?t=387">the favorite molecule of Michael Bronstein according to his ICLR’21 keynote</a>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PvFUVl8Avp6PuPEV8jhGGA.png" /><figcaption>Illustration of local, global, and relative <strong>positional</strong> encodings on a caffeine molecule ☕️. Image by Authors</figcaption></figure><p><strong>Structural encodings</strong> <strong>(SEs)</strong> provide a representation of the structure of graphs and subgraphs. They help a node to answer the question <strong>“What does my neighborhood look like?”</strong>. Similarly, we categorize possible SEs into local, global, and relative, although under a different sauce.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*L-xk6SYPVpPMiz5Fjxt0Sw.png" /><figcaption>Categorization of Structural encodings (SEs). Click to enlarge. Image by Authors.</figcaption></figure><p>👉 Local SEs (as node features) allow a node to understand what substructures it is a part of. That is, given two nodes and an SE of radius m, the more similar the m-hop subgraphs around those nodes are, the closer their local SEs will be.
Examples: (1) node degree (used in <a href="https://arxiv.org/pdf/2106.05234.pdf">Graphormer</a>); (2) diagonals of the m-step random walk matrix <a href="https://arxiv.org/abs/2110.07875">(RWSE)</a>; (3) <a href="https://openreview.net/forum?id=7UmjRGzp-A">Ricci curvature</a>; (4) enumerating or counting substructures like triangles and rings (<a href="https://arxiv.org/abs/2006.09252">Graph Substructure Networks</a>, <a href="https://openreview.net/forum?id=Mspk_WYKoEH">GNN As Kernel</a>).</p><p>👉 Global SEs (as a graph feature) provide the network with information about the global structure of a graph. If we compare two graphs, their global SEs will be close if their structure is similar. Examples: (1) eigenvalues of the adjacency or Laplacian (used in <a href="https://arxiv.org/abs/2106.03893">SAN</a>); (2) well-known graph properties like the diameter, number of connected components, girth, or average degree.</p><p>👉 Relative SEs (as edge features) allow two nodes to understand how much their structures differ. Those can be gradients of any local SE or a boolean indicator of whether two nodes are in the same substructure (e.g., as in <a href="https://arxiv.org/abs/2006.09252">GSN</a>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*A5ZCFTBEmKsdf4aKKIKhNg.png" /><figcaption>Illustration of local, global, and relative <strong>structural</strong> encodings on a caffeine molecule ☕️. Image by Authors</figcaption></figure><p>Depending on the graph structure, positional and structural features can bring a lot of expressive power, surpassing the 1-WL limit.
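</p><p><em>As a concrete illustration of two of the encodings above (Laplacian-eigenvector global PEs and random-walk RWSE), here is a minimal NumPy sketch on a toy graph. This is an illustrative sketch only, not the actual GraphGPS pre-processing, which lives in the library:</em></p>

```python
import numpy as np

# Toy graph: a 4-cycle (0-1-2-3-0), given as an adjacency matrix.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

deg = A.sum(axis=1)
L = np.diag(deg) - A                      # combinatorial Laplacian

# Global PE: smallest non-trivial Laplacian eigenvectors per node.
eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues sorted ascending
lap_pe = eigvecs[:, 1:3]                  # skip the constant 0-eigenvector

# Local SE (RWSE): diagonals of powers of the random-walk matrix,
# i.e., the probability of returning to the start node after m steps.
RW = A / deg[:, None]
rwse = np.stack([np.diag(np.linalg.matrix_power(RW, m))
                 for m in (2, 3, 4)], axis=1)

# Concatenate into a per-node vector to append to the input features.
pe_se = np.concatenate([lap_pe, rwse], axis=1)   # shape (4, 5)
```

<p><em>On this bipartite 4-cycle, for instance, the 2-step return probability is 0.5 for every node while odd-step return probabilities are zero: exactly the kind of structural signal RWSE feeds to the model.</em></p><p>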
For instance, in the highly regular <a href="https://github.com/PurdueMINDS/RelationalPooling/tree/master/">Circular Skip Link (CSL)</a> graphs, eigenvectors of the Laplacian (Global PEs in our framework) assign unique and different node features to the CSL (11, 2) and CSL (11, 3) graphs, making them clearly distinguishable (where 1-WL fails).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ztohsfwxp2rB9naHmoAj_A.png" /><figcaption>Positional (PEs) and Structural (SEs) features are more powerful than 1-WL but their effectiveness might depend on the nature of the graphs. Image by Authors.</figcaption></figure><p><strong>Aggregation of PEs and SEs</strong></p><p>Given that PEs and SEs might be beneficial in different scenarios, why limit the model to just one positional / structural feature?</p><p>In GraphGPS, we allow combining an arbitrary number of PEs and SEs, e.g., 16 Laplacian eigenvectors + eigenvalues + 8d RWSE. In pre-processing, we might have many vectors for each node, so we use set aggregation functions that map them to a single vector added to the node features.</p><p>This mechanism enables using <a href="https://towardsdatascience.com/using-subgraphs-for-more-expressive-gnns-8d06418d5ab">subgraph GNNs</a>, <a href="https://arxiv.org/abs/2202.13013">SignNets</a>, <a href="https://openreview.net/pdf?id=e95i1IHcWj">Equivariant and Stable PEs</a>, <a href="https://arxiv.org/abs/2202.03036">k-Subtree SATs</a>, and other models that build node features as complex aggregation functions.</p><p>Now, equipped with expressive positional and structural features, we can tackle the final challenge — scalability.</p><h3><strong>Scalable: Linear Transformers 🚀</strong></h3><p>Pretty much all existing graph transformers employ a standard self-attention mechanism materializing the whole N² attention matrix for a graph of N nodes (thus assuming the graph is fully connected).
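</p><p><em>To make the contrast concrete, here is a minimal, self-contained sketch (plain NumPy, not the GraphGPS code) of full softmax attention next to a kernelized linear attention. The elu(x)+1 feature map below follows the linear transformer of Katharopoulos et al.; Performer would instead use a random-feature approximation of the softmax kernel:</em></p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4                                      # a tiny "graph": N nodes, d-dim features
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

# Full self-attention: materializes the N x N score matrix (O(N^2) memory).
scores = np.exp(Q @ K.T / np.sqrt(d))
full = (scores / scores.sum(axis=1, keepdims=True)) @ V

# Kernelized linear attention: phi(Q) (phi(K)^T V) never forms an N x N matrix.
def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1, keeps features positive

Qp, Kp = phi(Q), phi(K)
KV = Kp.T @ V                                    # (d, d) summary: O(N d^2) time
norm = Qp @ Kp.sum(axis=0)                       # per-node normalizer, shape (N,)
linear = (Qp @ KV) / norm[:, None]               # same output shape as full attention
```

<p><em>Both variants produce an (N, d) update for every node, but the linear one only ever stores d×d summaries, which is what lets attention scale past the 50–100-node regime.</em></p><p>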
On one hand, the full attention matrix makes it easy to imbue GTs with edge features (like in <a href="https://arxiv.org/pdf/2106.05234.pdf">Graphormer</a>, which used edge features as an attention bias) and to separate true edges from virtual edges (as in <a href="https://arxiv.org/abs/2106.03893">SAN</a>). On the other hand, materializing the attention matrix has quadratic complexity O(N²), making GTs hardly scalable beyond molecular graphs of 50–100 nodes.</p><p>Luckily, the vast research around Transformers in NLP has recently produced a number of Linear Transformer architectures such as <a href="https://arxiv.org/abs/2006.04768">Linformer</a>, <a href="https://openreview.net/forum?id=Ua6zuk0WRH">Performer</a>, and <a href="https://arxiv.org/abs/2007.14062">BigBird</a> that scale attention linearly in the input sequence length, O(N). The whole <a href="https://arxiv.org/abs/2011.04006">Long Range Arena</a> benchmark has been created to evaluate linear transformers on extremely long sequences. The essence of linear transformers is to bypass the computation of the full attention matrix and instead approximate its result with various mathematical “tricks” such as the low-rank decomposition in Linformer or the softmax kernel approximation in Performer. Generally, this is a very active research area 🔥 and we expect more and more effective approaches coming soon.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tcPpbRskq_tr7DlknezIYA.png" /><figcaption>Vanilla N² transformers (left) materialize the full attention matrix whereas linear transformers (right) bypass this stage through various approximations without significant performance loss.
Image by Authors.</figcaption></figure><p>Interestingly, there is not much research on linear attention models for graph transformers — to date, we are only aware of the recent <a href="https://arxiv.org/abs/2107.07999">ICML 2022 work of Choromanski et al</a>, which, unfortunately, did not run experiments on reasonably large graphs.</p><p>In GraphGPS, we propose to replace the global attention module (a vanilla Transformer) with pretty much <strong>any</strong> available linear attention module. Applying linear attention raises two important questions that took us a great deal of experimentation to answer:</p><ol><li>Since there is no explicit attention matrix computation, how do we incorporate edge features? Do we need edge features in GTs at all?<br><strong>Answer</strong>: Empirically, on the datasets we benchmarked, we found that linear <em>global</em> attention in GraphGPS works well even without edge features (given that edge features are processed by some <em>local</em> message passing GNN). Further, we theoretically demonstrate that linear global attention does not lose edge information when the input node features already encode the edge features.</li><li>What is the tradeoff between the speed and performance of linear attention models?<br><strong>Answer</strong>: The tradeoff is quite beneficial — we did not find major performance drops when switching from quadratic to linear attention models, but we did find a huge memory improvement. That is, at least on the current benchmarks, you can simply swap full attention for linear attention and train models on dramatically larger graphs without huge performance losses.
Still, if we want to be more confident about linear global attention performance, there is a need for larger benchmarks with larger graphs and long-range dependencies.</li></ol><h3><strong>👨‍🍳 Recipe time — how to get the best out of your GT</strong></h3><p>Long story short — a tuned GraphGPS, combining local and global attention, performs very competitively with more sophisticated and computationally more expensive models and sets a <strong>new SOTA</strong> on many benchmarks!</p><p>For example, on the molecular regression benchmark ZINC, GraphGPS reaches a new all-time low of 0.07 MAE. The progress in the field is really fast — last year’s SAN set the previous SOTA of 0.139, so we improved the error rate by a solid 50%! 📉</p><p>Furthermore, thanks to an efficient implementation, we dramatically improved the speed of graph transformers — about 4.5× faster 🚀 — 196 s/epoch on ogbg-molpcba compared to 883 s/epoch for SAN, the previous SOTA graph transformer model.</p><p>We experiment with <a href="https://openreview.net/forum?id=Ua6zuk0WRH">Performer</a> and <a href="https://arxiv.org/abs/2007.14062">BigBird</a> as linear global attention models and scale GraphGPS to graphs of up to 10,000 nodes fitting on a standard 32 GB GPU, which was previously unattainable by any graph transformer.</p><p>Finally, we open-source the <a href="https://github.com/rampasek/GraphGPS">GraphGPS library</a> (akin to the <a href="https://arxiv.org/abs/2011.08843">GraphGym</a> environment), where you can easily plug, combine, and configure:</p><ul><li>Any local message passing model, with or without edge features</li><li>Any global attention model, e.g., a full Transformer or any linear architecture</li><li>Any structural (SE) and positional (PE) encoding method</li><li>Any combination of SEs and PEs, e.g., Laplacian PEs with Random-Walk RWSE!</li><li>Any method for aggregating SEs and PEs, e.g., SignNet or DeepSets</li><li>Run it on any graph dataset supported by PyG or with a custom wrapper</li><li>Run
large-scale experiments with Wandb tracking</li><li>And, of course, replicate the results of our experiments</li></ul><p>📜 arxiv preprint: <a href="https://arxiv.org/abs/2205.12454">https://arxiv.org/abs/2205.12454</a></p><p>🔧 Github repo: <a href="https://github.com/rampasek/GraphGPS">https://github.com/rampasek/GraphGPS</a></p><hr><p><a href="https://medium.com/data-science/graphgps-navigating-graph-transformers-c2cc223a051c">GraphGPS: Navigating Graph Transformers</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>