<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Michael Galkin on Medium]]></title>
        <description><![CDATA[Stories by Michael Galkin on Medium]]></description>
        <link>https://medium.com/@mgalkin?source=rss-4d4f8ddd1e68------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/2*R6303tLavDAf6jJAsMlaJQ.jpeg</url>
            <title>Stories by Michael Galkin on Medium</title>
            <link>https://medium.com/@mgalkin?source=rss-4d4f8ddd1e68------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 07 Apr 2026 07:19:43 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@mgalkin/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Foundation Models in Graph & Geometric Deep Learning]]></title>
            <link>https://medium.com/data-science/foundation-models-in-graph-geometric-deep-learning-f363e2576f58?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/f363e2576f58</guid>
            <category><![CDATA[graph-neural-networks]]></category>
            <category><![CDATA[generative-ai-tools]]></category>
            <category><![CDATA[geometric-deep-learning]]></category>
            <category><![CDATA[editors-pick]]></category>
            <category><![CDATA[foundation-models]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Tue, 18 Jun 2024 18:18:06 GMT</pubDate>
            <atom:updated>2024-06-18T18:18:06.570Z</atom:updated>
<content:encoded><![CDATA[<p>Foundation Models in language, vision, and audio have been among the primary research topics in Machine Learning in 2024, whereas FMs for graph-structured data have somewhat lagged behind. In this post, we argue that the era of Graph FMs has already begun and provide a few examples of how one can use them today.</p><p><em>This post was written and edited by </em><a href="https://twitter.com/michael_galkin"><em>Michael Galkin</em></a><em> and </em><a href="https://twitter.com/mmbronstein"><em>Michael Bronstein</em></a><em> with significant contributions from </em><a href="https://twitter.com/AndyJiananZhao"><em>Jianan Zhao</em></a><em>, </em><a href="https://twitter.com/haitao_mao_"><em>Haitao Mao</em></a><em>, </em><a href="https://twitter.com/zhu_zhaocheng"><em>Zhaocheng Zhu</em></a><em>.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*TGMQ_AUPRWSDfRcZ" /><figcaption>The timeline of emerging foundation models in graph- and geometric deep learning. Image by Authors.</figcaption></figure><h3><strong>Table of Contents</strong></h3><ol><li><a href="#6f0a">What are Graph Foundation Models and how to build them?</a></li><li><a href="#b4b7">Node Classification: GraphAny</a></li><li><a href="#89a1">Link Prediction: Not yet</a></li><li><a href="#bda1">Knowledge Graph Reasoning: ULTRA and UltraQuery</a></li><li><a href="#11c3">Algorithmic Reasoning: Generalist Algorithmic Learner</a></li><li><a href="#65b0">Geometric and AI4Science Foundation Models</a><strong><br></strong>a. <a href="#4d3e">ML Potentials: JMP-1, DPA-2 for molecules, MACE-MP-0 and MatterSim for inorganic crystals </a><br>b. <a href="#b2cd">Protein LMs: ESM-2</a><br>c. <a href="#cc8c">2D Molecules: MiniMol and MolGPS</a></li><li><a href="#1443">Expressivity &amp; Scaling Laws: Do Graph FMs scale?</a></li><li><a href="#40c5">The Data Question: What should be scaled? 
Is there enough graph data to train Graph FMs?</a></li><li><a href="#2add">👉 Key Takeaways 👈</a></li></ol><h3>What are Graph Foundation Models and how to build them?</h3><p>Since there is a certain degree of ambiguity in what counts as a “foundational” model, it would be appropriate to start with a definition to establish common ground:</p><blockquote>“A Graph Foundation Model is a single (neural) model that learns transferable graph representations that can generalize to any new, previously unseen graph”</blockquote><p>One of the challenges is that graphs come in all shapes and forms and their connectivity and feature structure can be very different. Standard Graph Neural Networks (GNNs) are not “foundational” because, at best, they work only on graphs with the same type and dimension of features. Graph heuristics like <a href="https://en.wikipedia.org/wiki/Label_propagation_algorithm">Label Propagation</a> or <a href="https://en.wikipedia.org/wiki/PageRank">Personalized PageRank</a> that can run on any graph cannot be considered Graph FMs either, because they do not involve any learning. As much as we love Large Language Models, it is still unclear whether parsing graphs into sequences that can then be passed to an LLM (like in <a href="https://arxiv.org/abs/2310.01089">GraphText</a> or <a href="https://openreview.net/forum?id=IuXR1CCrSi">Talk Like A Graph</a>) is a suitable approach for retaining graph symmetries and scaling to anything larger than toy-sized datasets (we leave LLMs + Graphs to a separate post).</p><p>Perhaps the most important question in designing Graph FMs is that of transferable graph representations. LLMs, as suggested in the recent <a href="https://arxiv.org/abs/2402.02216">ICML 2024 position paper by Mao, Chen et al</a>., can squash any text in any language into tokens from a fixed-size vocabulary. Vision-language FMs resort to patches that can always be extracted from an image (one always has RGB channels in any image or video). 
It is not immediately clear what a universal featurization (à la tokenization) scheme could be for graphs, which might have very diverse characteristics, e.g.:</p><ul><li>One large graph with node features and some given node labels (typical for node classification tasks)</li><li>One large graph without node features and classes, but with meaningful edge types (typical for link prediction and KG reasoning)</li><li>Many small graphs with/without node/edge features, with graph-level labels (typical for graph classification and regression)</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gDBGn9ckqNeNqTLD" /><figcaption>🦄 An ideal graph foundation model that takes any graph with any node/edge/graph features and performs any node- / edge- / graph-level task. Such Graph FMs do not exist in pure form as of mid-2024. Image by Authors</figcaption></figure><p>So far, there is a handful of open research questions for the graph learning community when designing Graph FMs:</p><p><strong>1️⃣ How to generalize across graphs with heterogeneous node/edge/graph features? </strong>For example, the popular <a href="https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.Planetoid.html#torch_geometric.datasets.Planetoid">Cora</a> dataset for node classification is one graph with node features of dimension 1,433, whereas the Citeseer dataset has 3,703-dimensional features. How can one define a single representation space for such diverse graphs?</p><p><strong>2️⃣ How to generalize across prediction tasks?</strong> Node classification tasks may have a different number of node classes (e.g., Cora has 7 classes and Citeseer 6). Even further, can a node classification model perform well in link prediction?</p><p><strong>3️⃣ What should the foundational model expressivity be?</strong> Much research has been done on the expressive power of GNNs, typically resorting to the analogy with Weisfeiler-Lehman isomorphism tests. 
Since graph foundation models should ideally handle a broad spectrum of problems, the right expressive power is elusive. For instance, in node classification tasks, node features are important along with graph homophily or heterophily. In link prediction, structural patterns and breaking automorphisms are more important (node features often don’t give a huge performance boost). In graph-level tasks, graph isomorphism starts to play a crucial role. In 3D geometric tasks like molecule generation, there is an additional complexity of continuous symmetries to take care of (see the <a href="https://arxiv.org/abs/2312.07511">Hitchhiker’s Guide to Geometric GNNs</a>).</p><p>In the following sections, we will show that at least in some tasks and domains, Graph FMs are already available. We will highlight their design choices when it comes to transferable features and practical benefits when it comes to inductive inference on new unseen graphs.</p><p><strong>📚Read more in references [1][2] and </strong><a href="https://github.com/CurryTang/Awesome_Graph_Foundation_Models"><strong>Github Repo</strong></a></p><h3>Node Classification: GraphAny</h3><p>For years, GNN-based node classifiers have been limited to a single graph dataset. That is, given e.g. the Cora graph with 2.7K nodes, 1433-dimensional features, and 7 classes, one has to train a GNN specifically on the Cora graph with its labels and run inference on the same graph. Applying a trained model to another graph, e.g. Citeseer with 3703-dimensional features and 6 classes, would run into an insurmountable difficulty: how would one model generalize to different input feature dimensions and a different number of classes? 
Usually, prediction heads are hardcoded to a fixed number of classes.</p><p><a href="https://arxiv.org/abs/2405.20445"><strong>GraphAny</strong></a> is, to the best of our knowledge, the first Graph FM where a single pre-trained model can perform node classification on any graph with any feature dimension and any number of classes. A single GraphAny model pre-trained on 120 nodes of the standard <a href="https://pytorch-geometric.readthedocs.io/en/latest/generated/torch_geometric.datasets.WebKB.html#torch_geometric.datasets.WebKB">Wisconsin</a> dataset successfully generalizes to 30+ other graphs of different sizes and features and, on average, outperforms GCN and GAT graph neural network architectures trained from scratch on each of those graphs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*N1oNlLJBZgbR-NzO" /><figcaption>Overview of GraphAny: LinearGNNs are used to perform non-parametric predictions and derive the entropy-normalized distance features. The final prediction is generated by fusing multiple LinearGNN predictions on each node with attention learned based on the distance features. Source: <a href="https://arxiv.org/abs/2405.20445">Zhao et al</a>.</figcaption></figure><p><strong>Setup: </strong>Semi-supervised node classification: given a graph G, node features X, and a few labeled nodes from C classes, predict labels of target nodes (binary or multi-class classification). The dimension of node features and the number of unique classes are not fixed and are graph-dependent.</p><p><strong>What is transferable: </strong>Instead of modeling a universal latent space for all possible graphs (which is quite cumbersome or maybe even practically impossible), GraphAny bypasses this problem and focuses on the <em>interactions between predictions of spectral filters</em>. 
Given a collection of high-pass and low-pass filters akin to <a href="https://arxiv.org/abs/1902.07153">Simplified Graph Convolutions</a> (for instance, operations of the form AX and (I-A)X, dubbed “LinearGNNs” in the paper) and known node labels:</p><p>0️⃣ GraphAny applies the filters to all nodes;</p><p>1️⃣ GraphAny obtains optimal weights for each predictor from nodes with known labels by solving a least squares optimization problem in closed form (the optimal weights are expressed as a pseudoinverse);</p><p>2️⃣ GraphAny applies the optimal weights to unknown nodes to get tentative prediction logits;</p><p>3️⃣ GraphAny computes pair-wise distances between those logits and applies entropy regularization (so that different graph and feature sizes do not affect the distribution). For example, for 5 LinearGNNs, this would result in 5 x 4 = 20 combinations of logit scores;</p><p>4️⃣ GraphAny learns an inductive attention matrix over those logits to weight the predictions most effectively (e.g., putting more attention on high-pass filters for heterophilic graphs).</p><p>In the end, the only learnable component in the model is the parameterization of attention (via an MLP), which <em>does not depend</em> on the target number of unique classes, but only on the number of LinearGNNs used. 
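</p><p><em>To make steps 1️⃣ and 2️⃣ concrete, here is a minimal numpy sketch of a single LinearGNN: filter the features, fit class weights on the labeled nodes in closed form via the pseudoinverse, and score every node. The toy graph and labels are invented for illustration; GraphAny additionally fuses several such predictors with learned attention.</em></p>

```python
import numpy as np

def linear_gnn_predict(A, X, train_idx, Y_train, hops=1):
    """One 'LinearGNN': apply a fixed graph filter (here A^k X), then
    fit class weights on the labeled nodes in closed form."""
    F = X.copy()
    for _ in range(hops):             # non-parametric low-pass filter
        F = A @ F
    W = np.linalg.pinv(F[train_idx]) @ Y_train  # least-squares solution
    return F @ W                      # tentative logits for every node

# toy path graph with 4 nodes, 2 classes, nodes 0 and 3 labeled
A = np.array([[0, 1, 0, 0], [1, 0, 1, 0],
              [0, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
A = A / A.sum(1, keepdims=True)       # row-normalized adjacency
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
Y_train = np.eye(2)                   # one-hot labels of nodes 0 and 3
logits = linear_gnn_predict(A, X, [0, 3], Y_train)
print(logits.argmax(1))               # -> [0 0 1 1]
```

<p>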
In the same vein, all LinearGNN predictors are non-parametric; their updated node features and optimal weights can be pre-computed for faster inference.</p><p><strong>📚Read more in references [3]</strong></p><h3>Link Prediction: Not yet</h3><p><strong>Setup</strong>: given a graph G, with or without node features, predict whether a link exists between a pair of nodes (v1, v2).</p><p>😢 For graphs with node features, we are not aware of any single transferable model for link prediction.</p><p>For non-featurized graphs (or when you decide to omit node features deliberately), there is more to say — basically, all GNNs with a labeling trick can <em>potentially</em> transfer to new graphs thanks to the uniform node featurization strategy.</p><p>It is known that in link prediction, the biggest hurdle is the presence of automorphic nodes (nodes that have the same structural roles) — vanilla GNNs assign them the same feature, making two links (v1, v2) and (v1, v3) in the image below 👇 indistinguishable. <a href="https://arxiv.org/abs/2010.16103">Labeling tricks</a> like <a href="https://proceedings.neurips.cc/paper/2018/hash/53f0d7c537d99b3824f0f99d62ea2428-Abstract.html">Double Radius Node Labeling</a> or <a href="https://proceedings.neurips.cc/paper_files/paper/2020/hash/2f73168bf3656f697507752ec592c437-Abstract.html">Distance Encoding</a> are node featurization strategies that break such automorphism symmetries.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/912/0*ni6nDAFit_qzND1l" /><figcaption>v2 and v3 are automorphic nodes, and standard GNNs score (v1,v2) and (v1,v3) equally. When we predict (v1, v2), we will label these two nodes differently from the rest, so that a GNN is aware of the target link when learning v1 and v2’s representations. Similarly, when predicting (v1, v3), nodes v1 and v3 will be labeled differently. 
This way, the representation of v2 in the left graph will be different from that of v3 in the right graph, enabling GNNs to distinguish the non-isomorphic links (v1, v2) and (v1, v3). Source: <a href="https://arxiv.org/abs/2010.16103">Zhang et al</a>.</figcaption></figure><p>Perhaps the only approach with a labeling trick (for non-featurized graphs) that was evaluated on link prediction on unseen graphs is <a href="https://arxiv.org/abs/2402.07738">UniLP</a>. UniLP is an in-context, contrastive learning model that requires a set of positive and negative samples for each target link to be predicted. Practically, UniLP uses <a href="https://proceedings.neurips.cc/paper/2018/hash/53f0d7c537d99b3824f0f99d62ea2428-Abstract.html">SEAL</a> as a backbone GNN and learns attention over a fixed number of positive and negative samples. However, SEAL is notoriously slow, so the first step towards making UniLP scale to large graphs is to replace subgraph mining with more efficient approaches like <a href="https://arxiv.org/abs/2209.15486">ELPH</a> and <a href="https://arxiv.org/abs/2209.15486">BUDDY</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1S7-wj4N-5zLgUlG" /><figcaption>Overview of the Universal Link Predictor framework. (a) For predicting a query link 𝑞, we initially sample positive (𝑠+) and negative (𝑠-) in-context links from the target graph. Both the query link and these in-context links are independently processed through a shared subgraph GNN encoder. An attention mechanism then calculates scores based on the similarity between the query link and the in-context links. (b) The final representation of the query link, contextualized by the target graph, is obtained through a weighted summation, which combines the representations of the in-context links with their respective labels. 
Source: <a href="https://arxiv.org/abs/2402.07738">Dong et al.</a></figcaption></figure><p><strong>What is transferable: </strong>structural patterns learned by labeling-trick GNNs — it is proven that methods like <a href="https://arxiv.org/abs/2106.06935">Neural Bellman-Ford</a> capture metrics over node pairs, e.g., Personalized PageRank or the Katz index (often used for link prediction).</p><p>Now, as we know how to deal with automorphisms, the only remaining step towards a single graph FM for link prediction would be to add support for heterogeneous node features — perhaps GraphAny-style approaches might be an inspiration?</p><p><strong>📚Read more in references [4][5][6][7]</strong></p><h3>Knowledge Graph Reasoning: ULTRA and UltraQuery</h3><p>Knowledge graphs have graph-specific sets of entities and relations, e.g., common encyclopedia facts from Wikipedia / Wikidata or biomedical facts in Hetionet; those relations have different semantics and are not directly mappable to each other. For years, KG reasoning models were hardcoded to a given vocabulary of relations and could not transfer to new, unseen KGs with completely new entities and relations.</p><p><a href="https://openreview.net/forum?id=jVEoydFOl9">ULTRA</a> is the first foundation model for KG reasoning that transfers to any KG at inference time in a zero-shot manner. That is, a single pre-trained model can run inference on any multi-relational graph of any size and entity/relation vocabulary. Averaged over 57 graphs, ULTRA significantly outperforms baselines trained specifically on each graph. Recently, ULTRA was extended to <a href="https://arxiv.org/abs/2404.07198">UltraQuery</a> to support even more complex logical queries on graphs involving conjunctions, disjunctions, and negation operators. 
UltraQuery transfers to unseen graphs and to 10+ complex query patterns on those unseen graphs, outperforming much larger baselines trained from scratch.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9StOqWdoahiif204" /><figcaption>Given a query (Michael Jackson, genre, ?), ULTRA builds a graph of relations (edge types) to capture their interactions in the original graph conditioned on the query relation (genre) and derives relational representations from this smaller graph. Those features are then used as edge type features in the original bigger graph to answer the query. Source: <a href="https://openreview.net/forum?id=jVEoydFOl9">Galkin et al</a>.</figcaption></figure><p><strong>Setup: </strong>Given a multi-relational graph G with |E| nodes and |R| edge types and no node features, answer simple KG completion queries <em>(head, relation, ?)</em> or complex queries involving logical operators by returning a probability distribution over all nodes in the given graph. The set of nodes and relation types depends on the graph and can vary.</p><p><strong>What is transferable: </strong>ULTRA relies on modeling relational interactions. Forgetting about relation identities and the target graph domain for a second, if we see that relations “authored” and “collaborated” can share the same starting node in one graph, and relations “student” and “coauthor” can share a starting node in another graph, then the relative, structural representations of those two pairs of relations might be similar. This holds for any multi-relational graph in any domain, be it encyclopedic or biomedical KGs. ULTRA goes further and captures 4 such “fundamental” interactions between relations. 
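</p><p><em>A rough sketch of how those four interaction types can be extracted from raw triples (the toy triples below are invented for illustration; ULTRA then runs a GNN over the resulting relation graph):</em></p>

```python
from itertools import product

# toy KG triples (head, relation, tail)
triples = [("a", "authored", "p1"), ("a", "collaborated", "b"),
           ("b", "authored", "p2"), ("p1", "cites", "p2")]

def relation_graph(triples):
    """Connect two relations whenever they share a node, labeling the
    edge with one of four fundamental types: h2h, h2t, t2h, t2t."""
    heads, tails = {}, {}
    for h, r, t in triples:
        heads.setdefault(r, set()).add(h)
        tails.setdefault(r, set()).add(t)
    ends = {"h": heads, "t": tails}
    rels = sorted(heads.keys() | tails.keys())
    edges = set()
    for r1, r2 in product(rels, rels):
        if r1 == r2:
            continue
        for e1, e2 in product("ht", "ht"):
            if ends[e1].get(r1, set()) & ends[e2].get(r2, set()):
                edges.add((r1, r2, f"{e1}2{e2}"))
    return edges

edges = relation_graph(triples)
# "authored" and "collaborated" share the head node "a"
print(("authored", "collaborated", "h2h") in edges)  # -> True
```

<p>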
Those fundamental interactions are transferable to any KG (together with learned GNN weights) — this way, a single pre-trained model is ready for inference on any unseen graph and any simple or complex reasoning query.</p><p>Read more in the dedicated Medium post:</p><p><a href="https://towardsdatascience.com/ultra-foundation-models-for-knowledge-graph-reasoning-9f8f4a0d7f09">ULTRA: Foundation Models for Knowledge Graph Reasoning</a></p><p><strong>📚Read more in references [8][9]</strong></p><h3>Algorithmic Reasoning: Generalist Algorithmic Learner</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*whg61mwz584iEdBV" /><figcaption>A generalist neural algorithmic learner is a single processor GNN P, with a single set of weights, capable of solving several algorithmic tasks in a shared latent space (each of which is attached to P with simple encoders/decoders f and g). Among others, the processor network is capable of sorting (top), shortest path-finding (middle), and convex hull finding (bottom). Source: <a href="https://proceedings.mlr.press/v198/ibarz22a/ibarz22a.pdf">Ibarz et al.</a></figcaption></figure><p><strong>Setup: </strong><a href="https://arxiv.org/abs/2105.02761">Neural algorithmic reasoning</a> (NAR) studies the execution of standard algorithms (e.g., sorting, searching, dynamic programming) in the latent space and generalization to inputs of arbitrary size. Many such algorithms can be represented with a graph input and pointers. Given a graph G with node and edge features, the task is to simulate the algorithm and produce the correct output. Optionally, you can get access to hints — time series of intermediate states of the algorithm that can act as an intermediate supervision signal. Obviously, different algorithms require a different number of steps to execute, so the length is not fixed here.</p><p><strong>What is transferable: </strong>Homogeneous feature space and similar control flow for similar algorithms. 
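</p><p><em>This shared control flow is easy to show in code: a single best-first skeleton, parameterized only by a key function, yields both shortest paths (Dijkstra) and minimum-spanning-tree growth (Prim). A toy sketch, not the CLRS benchmark implementation:</em></p>

```python
import heapq

def best_first(graph, source, key):
    """Shared skeleton: a priority-queue sweep over the graph; only
    `key` differs between Dijkstra (path length) and Prim (edge weight)."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    visited = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in visited:
            continue
        visited.add(u)
        for v, w in graph[u]:
            new = key(d, w)
            if v not in visited and new < best.get(v, float("inf")):
                best[v] = new
                heapq.heappush(heap, (new, v))
    return best

# toy weighted triangle graph as adjacency lists of (neighbor, weight)
graph = {0: [(1, 4.0), (2, 1.0)], 1: [(0, 4.0), (2, 2.0)],
         2: [(0, 1.0), (1, 2.0)]}
dijkstra = best_first(graph, 0, key=lambda d, w: d + w)  # shortest paths
prim     = best_first(graph, 0, key=lambda d, w: w)      # MST edge keys
print(dijkstra[1], prim[1])  # -> 3.0 2.0
```

<p>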
For instance, Prim’s and Dijkstra’s algorithms share a similar structure, differing only in the choice of the key function and the edge relaxation subroutine. Besides, there are <a href="https://arxiv.org/abs/1905.13211">several</a> <a href="https://arxiv.org/abs/2203.15544">proofs</a> of a direct alignment between message passing and dynamic programming. This is the main motivation behind one “processor” neural network that updates latent states for all considered algorithms (<a href="https://github.com/google-deepmind/clrs">30 classic algos</a> from the CLRS book).</p><p><a href="https://proceedings.mlr.press/v198/ibarz22a/ibarz22a.pdf">Triplet-GMPNN</a> was the first such universal processor neural net (by 2024 it became rather standard in the NAR literature) — it is a GNN that operates on triples of nodes and their features (akin to <a href="https://arxiv.org/abs/2112.00578">Edge Transformers</a> and triangular attention in AlphaFold). The model is trained in the multi-task mode on all algorithmic tasks in the benchmark with a handful of optimization tricks. A single model bumps the average performance on 30 tasks by over 20% (in absolute numbers) compared to single-task specialist models.</p><p>Still, encoders and decoders are parameterized specifically for each task — one of the ways to unify the input and output formats may well be text with LLM processors, as done in the recent <a href="https://arxiv.org/abs/2406.04229">text version of CLRS</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*T4qoNOb0Q0DYiozX" /><figcaption><strong>Top</strong>: The graph algorithmic trace of insertion sorting a list <em>[5, 2, 4, 3, 1]</em> in graph form. <strong>Bottom</strong>: The same algorithmic trace, represented textually, by using the CLRS-Text generator. 
The model receives as input (depicted in green) the input array (key) and the initial value of the sorting trace (initial_trace), using which it is prompted to predict the trace (depicted in blue) of gradually sorting the list, by inserting one element at a time into a partially sorted list, from left to right. At the end, the model needs to output the final sorted array (depicted in red), and it is evaluated on whether this array is predicted correctly. Source: <a href="https://arxiv.org/abs/2406.04229">Markeeva, McLeish, Ibarz, et al.</a></figcaption></figure><p>Perhaps the most interesting question of 2024 and 2025 in NAR is:</p><blockquote><em>Can algorithmic reasoning ideas for OOD generalization be the key to generalizable LLM reasoning?</em></blockquote><p>LLMs notoriously struggle with complex reasoning problems; dozens of papers appear on arXiv every month trying a new prompting method to bump benchmark performance by another percentage point or two, but most of them do not transfer across tasks with similar graph structures (see the example below). There is a need for more principled approaches, and NAR has the potential to fill this gap!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*myXuP9-sJXrcjlOc" /><figcaption>Failure of LLMs on reasoning problems with similar graph structures. Image by Authors.</figcaption></figure><p><strong>📚Read more in references [10][11]</strong></p><h3>Geometric and AI4Science Foundation Models</h3><p>In the world of Geometric Deep Learning and scientific applications, foundation models are becoming prevalent as universal ML potentials, protein language models, and universal molecular property predictors. 
Although a universal vocabulary exists in most such cases (e.g., atom types in small molecules or amino acids in proteins) and we do not have to think about universal featurization, the main complexity lies in the real-world physical nature of atomistic objects — they have a pronounced 3D structure and properties (like energy), which have theoretical justifications rooted in chemistry, physics, and quantum mechanics.</p><h3>ML Potentials: JMP-1, DPA-2 for molecules, MACE-MP-0 and MatterSim for inorganic crystals</h3><p><strong>Setup</strong>: given a 3D structure, predict the energy of the structure and per-atom forces;</p><p><strong>What is transferable</strong>: a vocabulary of atoms from the periodic table.</p><p>ML potentials estimate the potential energy of a chemical compound — like molecules or periodic crystals — given their 3D coordinates and optional inputs (like periodic boundary conditions for crystals). For any atomistic model, the vocabulary of possible atoms is always bound by the <a href="https://en.wikipedia.org/wiki/Periodic_table">Periodic Table</a>, which currently includes 118 elements. The “foundational” aspect of ML potentials is to generalize to any atomistic structure (there can be combinatorially many) and be stable enough to be used in molecular dynamics (MD), drug-, and materials-discovery pipelines.</p><p><a href="https://arxiv.org/abs/2310.16802">JMP-1</a> and <a href="https://arxiv.org/abs/2312.15492">DPA-2</a>, released around the same time, aim to be such universal ML potential models — they are trained on a wide variety of structures, from organic molecules to crystals to MD trajectories. For example, a single pre-trained JMP-1 excels at QM9 and rMD17 for small molecules, MatBench and QMOF for crystals, and MD22 and SPICE for large molecules, being on par with or better than specialized per-dataset models. 
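</p><p><em>The interface such models expose is uniform: Cartesian coordinates in, a scalar energy and per-atom forces (the negative energy gradient) out. Below is a toy stand-in using an analytic Lennard-Jones pair potential; real FMs replace this closed form with a trained equivariant GNN.</em></p>

```python
import numpy as np

def lj_energy_forces(pos, eps=1.0, sigma=1.0):
    """Toy pair potential: total energy and per-atom forces F_i = -dE/dx_i.
    Foundation-model potentials expose the same positions->(energy, forces)
    interface, with a learned model instead of this analytic form."""
    n = len(pos)
    energy = 0.0
    forces = np.zeros_like(pos)
    for i in range(n):
        for j in range(i + 1, n):
            rij = pos[i] - pos[j]
            r = np.linalg.norm(rij)
            sr6 = (sigma / r) ** 6
            energy += 4 * eps * (sr6 ** 2 - sr6)
            # pair force on atom i, directed along rij
            f = 24 * eps * (2 * sr6 ** 2 - sr6) / r ** 2 * rij
            forces[i] += f
            forces[j] -= f
    return energy, forces

# two atoms at the LJ equilibrium distance 2^(1/6) * sigma
pos = np.array([[0.0, 0.0, 0.0], [2 ** (1 / 6), 0.0, 0.0]])
e, f = lj_energy_forces(pos)
print(round(e, 6))  # -> -1.0 (energy minimum, forces vanish)
```

<p>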
Similarly, <a href="https://arxiv.org/abs/2401.00096">MACE-MP-0</a> and <a href="https://arxiv.org/abs/2405.04967">MatterSim</a> are the most advanced FMs for inorganic crystals (MACE-MP-0 is already available with weights), evaluated on 20+ crystal tasks ranging from multicomponent alloys to combustion and molten salts. Equivariant GNNs are at the heart of those systems, helping to process equivariant features (Cartesian coordinates) and invariant features (like atom types).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QaoSPDK2f0EhOiGt" /><figcaption>Sources: (1) Pre-training and fine-tuning of <strong>JMP-1</strong> for molecules and crystals, <a href="https://arxiv.org/abs/2310.16802">Shoghi et al</a> (2) <strong>MACE-MP-0</strong> is trained only on the Materials Project data and transfers to molecular dynamics simulation across a wide variety of chemistries in the solid, liquid and gaseous phases, <a href="https://arxiv.org/abs/2401.00096">Batatia, Benner, Chiang, Elena, Kovács, Riebesell et al</a>.</figcaption></figure><p>The next frontier seems to be ML-accelerated molecular dynamics simulations — traditional computational methods work at the femtosecond scale (10<sup>−15</sup> s) and require millions to billions of steps to simulate a molecule, crystal, or protein. Speeding up such computations would have an immense scientific impact.</p><p><strong>📚Read more in references [12][13][14][15]</strong></p><h3>Protein LMs: ESM-2</h3><p><strong>Setup</strong>: given a protein sequence, predict the masked tokens akin to masked language modeling;</p><p><strong>What is transferable</strong>: a vocabulary of 20 (22) amino acids.</p><p>Protein sequences resemble natural language with amino acids as tokens, and Transformers excel at encoding sequence data. Although the vocabulary of amino acids is relatively small, the space of possible proteins is enormous, so training on large volumes of known proteins might hint at the properties of unseen combinations. 
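</p><p><em>A toy illustration of that masked-token objective over the 20-amino-acid vocabulary (the sequence below is an arbitrary invented example; real tokenizers also add special and rare-residue tokens):</em></p>

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def mask_sequence(seq, rate=0.15, mask_token="<mask>", seed=0):
    """BERT-style masking: hide a fraction of residues; the training
    objective is to recover the original amino acids at those positions."""
    rng = random.Random(seed)
    tokens, targets = [], {}
    for i, aa in enumerate(seq):
        if rng.random() < rate:
            targets[i] = aa           # ground truth the model must predict
            tokens.append(mask_token)
        else:
            tokens.append(aa)
    return tokens, targets

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # invented example sequence
assert all(aa in AMINO_ACIDS for aa in seq)
tokens, targets = mask_sequence(seq)
print(len(targets), "positions masked out of", len(seq))
```

<p>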
<a href="https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1">ESM-2</a> is perhaps the most popular protein LM thanks to its pre-training data size, the variety of available checkpoints, and informative features.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*J2R_VoLXp6ct8Arm" /><figcaption>ESM2 as a masked LM and ESMFold for protein structure prediction. Source: <a href="https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1">Lin, Akin, Rao, Hie, et al.</a></figcaption></figure><p>ESM features are used in countless applications from predicting 3D structure (in <a href="https://github.com/facebookresearch/esm">ESMFold</a>) to protein-ligand binding (<a href="https://arxiv.org/abs/2210.01776">DiffDock</a> and its descendants) to protein structure generative models (like the recent <a href="https://www.dreamfold.ai/blog/foldflow-2">FoldFlow 2</a>). Bigger transformers and more data are likely to increase protein LMs’ performance even further — at this scale, however, the data question becomes more prevalent (we also discuss the interplay between architecture and data in the dedicated section), e.g., the <a href="https://esmatlas.com/">ESM Metagenomic Atlas</a> already encodes 700M+ structures, including proteins found outside humans — in soil, oceans, or hydrothermal vents. Is there a way to get to trillions of tokens as in common LLM training datasets?</p><p><strong>📚Read more in references [16][17]</strong></p><h3>2D Molecules: MiniMol and MolGPS</h3><p><strong>Setup</strong>: given a 2D graph structure with atom types and bond types, predict molecular properties</p><p><strong>What is transferable</strong>: a vocabulary of atoms from the periodic table and bond types</p><p>With 2D graphs (without 3D atom coordinates), universal encoding and transferability come from a fixed vocabulary of atom and bond types, which you can send to any GNN or Transformer encoder. 
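</p><p><em>A minimal sketch of such vocabulary-based featurization (the vocabularies below are truncated examples for illustration; real models cover the full periodic table and all bond types):</em></p>

```python
# truncated example vocabularies; real models use far larger ones
ATOM_VOCAB = ["C", "N", "O", "F", "S", "Cl"]
BOND_VOCAB = ["single", "double", "triple", "aromatic"]

def one_hot(item, vocab):
    """Fixed-vocabulary encoding: the same feature space for any molecule."""
    vec = [0] * len(vocab)
    vec[vocab.index(item)] = 1
    return vec

# toy ethanol-like fragment: atoms plus (src, dst, bond type) edges
atoms = ["C", "C", "O"]
bonds = [(0, 1, "single"), (1, 2, "single")]

node_feats = [one_hot(a, ATOM_VOCAB) for a in atoms]
edge_feats = [one_hot(b, BOND_VOCAB) for _, _, b in bonds]
print(node_feats[2])  # -> [0, 0, 1, 0, 0, 0] (the oxygen atom)
```

<p>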
Although molecular fingerprints have been used since the 1960s (<a href="https://pubs.acs.org/doi/abs/10.1021/c160017a018">Morgan fingerprints</a> [18]), their primary goal was to evaluate similarity, not to model a latent space. The task of a single (large) neural encoder is to learn useful representations that might hint at certain physical molecular properties.</p><p>Recent examples of generalist models for learning molecular representations are <a href="https://arxiv.org/pdf/2404.14986">MiniMol</a> and <a href="https://arxiv.org/abs/2404.11568v1">MolGPS</a>, which have been trained on a large corpus of molecular graphs and probed on dozens of downstream tasks. That said, you still need to fine-tune a separate task-specific decoder/predictor on top of the models’ representations — in that sense, one single pre-trained model will not be able to run zero-shot inference on all possible unseen tasks, but rather on those for which decoders have been trained. Fine-tuning is still a good, cheap option, though, since those models are orders of magnitude smaller than LLMs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pvGmT5XrDSNuzmks" /><figcaption>Source: (1) Workflow overview of the <a href="https://arxiv.org/pdf/2404.14986">MiniMol</a> pre-training and downstream task evaluation. (2) Criteria of the scaling study of <a href="https://arxiv.org/abs/2404.11568v1">MolGPS</a></figcaption></figure><p><strong>📚Read more in references [19][20]</strong></p><h3>Expressivity &amp; Scaling Laws: Do Graph FMs scale?</h3><p>Transformers in LLMs and multi-modal frontier models are rather standard, and we know some basic scaling principles for them. Do transformers (as an architecture, not LLMs) work equally well on graphs? 
What are the general challenges when designing a backbone for Graph FMs?</p><p>If you categorize the models highlighted in the previous sections, only two areas feature transformers — protein LMs (ESM) with a natural sequential bias and small molecules (MolGPS). The rest are GNNs. There are several reasons for that:</p><ul><li>Vanilla transformers do not scale to graphs larger than a standard context length (&gt;4–10k nodes). Anything above that range requires tricks like feeding only subgraphs (losing the whole graph structure and long-range dependencies) or linear attention (which might not have good scaling properties). In contrast, GNNs are linear in the number of edges, and, in the case of sparse graphs (V ~ E), linear in the number of nodes.</li><li>Vanilla transformers without positional encodings are <a href="https://arxiv.org/abs/2302.04181">less expressive than GNNs</a>. Computing positional encodings like Laplacian PEs on a graph with V nodes is O(V³).</li><li>What should be a “token” when encoding graphs via transformers? There is no clear winner in the literature, e.g., <a href="https://arxiv.org/abs/2106.05234">nodes</a>, <a href="https://arxiv.org/abs/2406.03148">nodes + edges</a>, or <a href="https://arxiv.org/abs/2212.13350">subgraphs</a> are all viable options.</li></ul><p><strong>➡️ </strong>Touching upon <strong>expressivity</strong>, different graph tasks need to deal with different symmetries, e.g., automorphic nodes in link prediction lead to indistinguishable representations, whereas in graph classification/regression going beyond 1-WL is necessary for distinguishing molecules which otherwise might look isomorphic to vanilla GNNs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZriY6GNKtUx4ufHK" /><figcaption>Different tasks need to deal with different symmetries. Image by Authors. 
Sources of graphs: (1) <a href="https://arxiv.org/abs/2010.16103">Zhang et al</a>, (2) <a href="https://arxiv.org/abs/2112.09992">Morris et al</a></figcaption></figure><p>This fact raises two questions:</p><blockquote><em>How expressive should GFMs be? What is the trade-off between expressivity and scalability?</em></blockquote><p>Ideally, we want a single model to resolve all those symmetries equally well. However, more expressive models lead to more computationally expensive architectures both in training and inference. We agree with the recent <a href="https://arxiv.org/abs/2402.02287">ICML’24 position paper on the future directions in Graph ML theory</a> that the community should seek the balance between expressivity, generalization, and optimization.</p><p>Still, it is worth noting that with the growing availability of training data, it might be computationally cheaper to let models learn complex symmetries and invariances directly from the data (instead of baking them into the architecture). Good recent examples of this thesis are <a href="https://www.nature.com/articles/s41586-024-07487-w">AlphaFold 3</a> and <a href="https://arxiv.org/abs/2311.17932">Molecular Conformer Fields</a> that reach SOTA in many generative applications <em>without</em> expensive equivariant geometric encoders.</p><p><strong>📚Read more in reference [21]</strong></p><p><strong>➡️ </strong>When it comes to <strong>scaling</strong>, both model and data should be scaled up. However:</p><p>❌ Non-geometric graphs: There is no principled study on scaling GNNs or Transformers to large graphs and common tasks like node classification and link prediction. A 2-layer GraphSAGE is often not very far away from huge 16-layer graph transformers. Similarly, in the KG reasoning domain, a single ULTRA model (discussed above) with &lt;200k parameters outperforms million-sized shallow embedding models on 50+ graphs. Why is it happening? 
We’d hypothesize the crux is in 1️⃣ the task nature — most non-geometric graphs are noisy similarity graphs not grounded in a concrete physical phenomenon the way molecules are; and 2️⃣ the feature regime — given rich node and edge features, models have to learn <em>representations of graph structures</em> (common for link prediction) or just <em>functions over given features</em> (a good example is <a href="https://ogb.stanford.edu/docs/leader_nodeprop/">node classification in OGB</a>, where most gains are achieved by adding an LLM feature encoder).</p><p>✅ Geometric graphs: There are several recent works focusing on molecular graphs:</p><ul><li><a href="https://www.nature.com/articles/s42256-023-00740-3">Frey et al</a> (2023) study scaling of geometric GNNs for ML potentials;</li><li><a href="https://arxiv.org/abs/2404.11568v1">Sypetkowski, Wenkel et al</a> (2024) introduce MolGPS and study scaling MPNNs and Graph Transformers up to 1B parameters on a large dataset of 5M molecules;</li><li><a href="https://arxiv.org/abs/2402.02054">Liu et al</a> (2024) probe GCN, GIN, and GraphGPS up to 100M parameters on molecular datasets up to 4M molecules.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1gK0G_v_Kj_aO4ef" /><figcaption>Scaling molecular GNNs and GTs. Sources: (1) <a href="https://arxiv.org/abs/2404.11568v1">Sypetkowski, Wenkel et al</a>, (2) <a href="https://arxiv.org/abs/2402.02054">Liu et al</a></figcaption></figure><h3>The Data Question: What should be scaled? Is there enough graph data to train Graph FMs?</h3><p>1️⃣ <strong>What should be scaled in graph data? </strong>Nodes? Edges? The number of graphs? Something else?</p><p>There is no clear winner in the literature; we would rather gravitate towards the broader term <strong><em>diversity</em></strong>, that is, a diversity of patterns in the graph data. 
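</p><p>One such pattern is label homophily, the tendency of edges to connect same-label nodes. A minimal sketch (pure Python; the toy graph and labels are invented for illustration) of the edge-homophily ratio commonly used to characterize node-classification benchmarks:</p>

```python
def edge_homophily(edge_list, labels):
    """Fraction of edges joining same-label endpoints: one simple,
    scale-free 'pattern' a pre-training corpus can be diversified over."""
    same = sum(labels[u] == labels[v] for u, v in edge_list)
    return same / len(edge_list)

# Toy graph: two label-pure triangles joined by a single bridge edge
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = [0, 0, 0, 1, 1, 1]
h = edge_homophily(edges, labels)  # 6 of the 7 edges are intra-class
```

<p>A ratio near 1 means neighbors mostly share labels (homophily); a ratio near 0 means they mostly do not (heterophily). 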
For example, in node classification on large product graphs, it likely would not matter much whether you train on a graph with 100M nodes or 10B nodes, since both share the same user-item nature. However, showing examples with homophily and heterophily on different scales and sparsities might be quite beneficial. In <strong>GraphAny</strong>, showing examples of such graphs made it possible to build a robust node classifier that generalizes to different graph distributions.</p><p>In KG reasoning with <strong>ULTRA</strong>, it was found that the <strong><em>diversity of relational patterns</em></strong> in pre-training plays the biggest role in inductive generalization, e.g., one large dense graph is worse than a collection of smaller but sparse, dense, few-relational, and many-relational graphs.</p><p>In molecular graph-level tasks, e.g., in <strong>MolGPS</strong>, scaling the number of unique molecules with different physical properties helps a lot (as shown in the charts above 👆).</p><p>Besides, <a href="https://arxiv.org/abs/2406.01899">UniAug</a> finds that increased coverage of the structural patterns in pre-training data adds to the performance across different downstream tasks from various domains.</p><p><strong>2️⃣ Is there enough data to train Graph FMs?</strong></p><p>Openly available graph data is orders of magnitude smaller than natural language, image, or video corpora, and that is fine. This very article includes thousands of language and image tokens and no explicit graphs (unless you try to parse this text into a graph like an <a href="https://en.wikipedia.org/wiki/Abstract_Meaning_Representation">abstract meaning representation</a> graph). The number of ‘good’ proteins with known structures in PDB is small, and the number of known ‘good’ molecules for drugs is small.</p><blockquote>Are Graph FMs doomed because of data scarcity?</blockquote><p>Well, not really. 
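</p><p>Synthetic graphs are one reason why not: a generator can be dialed to produce arbitrarily diverse structure. A toy sketch (pure Python, standard library only; the block sizes and probabilities are invented for illustration) of a stochastic block model sampler in the spirit of synthetic graph benchmarks:</p>

```python
import random

def sample_sbm(block_sizes, p_in, p_out, seed=0):
    """Tiny stochastic block model sampler: edges are denser inside blocks
    (probability p_in) than between them (probability p_out)."""
    rng = random.Random(seed)            # fixed seed -> reproducible graph
    block = [b for b, size in enumerate(block_sizes) for _ in range(size)]
    edges = []
    for u in range(len(block)):
        for v in range(u + 1, len(block)):
            if rng.random() < (p_in if block[u] == block[v] else p_out):
                edges.append((u, v))
    return edges

g = sample_sbm([5, 5], p_in=0.9, p_out=0.05)  # 10 nodes, two communities
```

<p>Sweeping p_in and p_out yields graphs ranging from strongly clustered to near-random, i.e., a controllable source of structural diversity. 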
The two open avenues are: (1) more sample-efficient architectures; (2) using more black-box and synthetic data.</p><p>Synthetic benchmarks like <a href="https://arxiv.org/abs/2203.00112">GraphWorld</a> can help increase the diversity of training data and improve generalization to real-world datasets. Black-box data obtained from scientific experiments, in turn, is likely to become the key factor in building successful foundation models in AI4Science — those who master it will prevail on the market.</p><p><a href="https://towardsdatascience.com/the-road-to-biology-2-0-will-pass-through-black-box-data-bbd00fabf959">The Road to Biology 2.0 Will Pass Through Black-Box Data</a></p><p><strong>📚Read more in references [20][22][23]</strong></p><h3>👉 Key Takeaways 👈</h3><p><strong>➡️ How to generalize across graphs with heterogeneous node/edge/graph features?</strong></p><ul><li>Non-geometric graphs: Relative information transfers (such as prediction differences in <em>GraphAny</em> or relational interactions in <em>ULTRA</em>); absolute information does not.</li><li>Geometric graphs: transfer is easier thanks to the fixed set of atoms, but models have to learn some notion of physics to be reliable.</li></ul><p><strong>➡️ How to generalize across prediction tasks?</strong></p><ul><li>To date, there is no single model (among non-geometric GNNs) that would be able to perform node classification, link prediction, and graph classification in the zero-shot inference mode.</li><li>Framing all tasks through the lens of one might help, e.g., node classification can be framed as link prediction.</li></ul><p><strong>➡️ What is the optimal model expressivity?</strong></p><ul><li>Node classification, link prediction, and graph classification leverage different symmetries.</li><li>Blunt application of maximally expressive models quickly leads to exponential runtime complexity or enormous memory costs — need to maintain the <em>expressivity vs efficiency</em> 
balance.</li><li>The link between expressivity, sample complexity (how much training data you need), and inductive generalization is still unknown.</li></ul><p><strong>➡️ Data</strong></p><ul><li>Openly available graph data is orders of magnitude smaller than text/vision data, so models have to be sample-efficient.</li><li>Scaling laws are still at an emerging stage, and it is unclear what to scale — the number of nodes? Edges? Motifs? What is the notion of a token in graphs?</li><li>Geometric GNNs: there is much more experimental data available that makes little sense to domain experts but might be of value to neural nets.</li></ul><ol><li>Mao, Chen, et al. <a href="https://arxiv.org/abs/2402.02216">Graph Foundation Models Are Already Here</a>. ICML 2024</li><li>Morris et al. <a href="https://arxiv.org/abs/2402.02287">Future Directions in Foundations of Graph Machine Learning</a>. ICML 2024</li><li>Zhao et al. <a href="https://arxiv.org/abs/2405.20445">GraphAny: A Foundation Model for Node Classification on Any Graph</a>. arXiv 2024. <a href="https://github.com/DeepGraphLearning/GraphAny">Code on GitHub</a></li><li>Dong et al. <a href="https://arxiv.org/abs/2402.07738">Universal Link Predictor By In-Context Learning on Graphs</a>. arXiv 2024</li><li>Zhang et al. <a href="https://arxiv.org/abs/2010.16103">Labeling Trick: A Theory of Using Graph Neural Networks for Multi-Node Representation Learning</a>. NeurIPS 2021</li><li>Chamberlain, Shirobokov, et al. <a href="https://arxiv.org/abs/2209.15486">Graph Neural Networks for Link Prediction with Subgraph Sketching</a>. ICLR 2023</li><li>Zhu et al. <a href="https://arxiv.org/abs/2106.06935">Neural Bellman-Ford Networks: A General Graph Neural Network Framework for Link Prediction</a>. NeurIPS 2021</li><li>Galkin et al. <a href="https://openreview.net/forum?id=jVEoydFOl9">Towards Foundation Models for Knowledge Graph Reasoning</a>. ICLR 2024</li><li>Galkin et al. 
<a href="https://arxiv.org/abs/2404.07198">Zero-shot Logical Query Reasoning on any Knowledge Graph</a>. arXiv 2024. <a href="https://github.com/DeepGraphLearning/ULTRA">Code on GitHub</a></li><li>Ibarz et al. <a href="https://proceedings.mlr.press/v198/ibarz22a/ibarz22a.pdf">A Generalist Neural Algorithmic Learner</a>. LoG 2022</li><li>Markeeva, McLeish, Ibarz, et al. <a href="https://arxiv.org/abs/2406.04229">The CLRS-Text Algorithmic Reasoning Language Benchmark</a>. arXiv 2024</li><li>Shoghi et al. <a href="https://arxiv.org/abs/2310.16802">From Molecules to Materials: Pre-training Large Generalizable Models for Atomic Property Prediction</a>. ICLR 2024</li><li>Zhang, Liu et al. <a href="https://arxiv.org/abs/2312.15492">DPA-2: Towards a universal large atomic model for molecular and material simulation</a>. arXiv 2023</li><li>Batatia et al. <a href="https://arxiv.org/abs/2401.00096">A foundation model for atomistic materials chemistry</a>. arXiv 2024</li><li>Yang et al. <a href="https://arxiv.org/abs/2405.04967">MatterSim: A Deep Learning Atomistic Model Across Elements, Temperatures and Pressures</a>. arXiv 2024</li><li>Rives et al. <a href="https://www.pnas.org/doi/full/10.1073/pnas.2016239118">Biological Structure and Function Emerge from Scaling Unsupervised Learning to 250 Million Protein Sequences</a>. PNAS 2021</li><li>Lin, Akin, Rao, Hie, et al. <a href="https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1">Language models of protein sequences at the scale of evolution enable accurate structure prediction</a>. Science 2023. <a href="https://github.com/facebookresearch/esm">Code</a></li><li>Morgan HL (1965). <a href="https://pubs.acs.org/doi/abs/10.1021/c160017a018">The generation of a unique machine description for chemical structures — a technique developed at chemical abstracts service</a>. J Chem Doc 5:107–113.</li><li>Kläser, Banaszewski, et al. 
<a href="https://arxiv.org/pdf/2404.14986">MiniMol: A Parameter Efficient Foundation Model for Molecular Learning</a>. arXiv 2024</li><li>Sypetkowski, Wenkel et al. <a href="https://arxiv.org/abs/2404.11568v1">On the Scalability of GNNs for Molecular Graphs</a>. arXiv 2024</li><li>Morris et al. <a href="https://arxiv.org/abs/2402.02287">Future Directions in Foundations of Graph Machine Learning</a>. ICML 2024</li><li>Liu et al. <a href="https://arxiv.org/abs/2402.02054">Neural Scaling Laws on Graphs</a>. arXiv 2024</li><li>Frey et al. <a href="https://www.nature.com/articles/s42256-023-00740-3">Neural scaling of deep chemical models</a>. Nature Machine Intelligence 2023</li></ol><hr><p><a href="https://medium.com/data-science/foundation-models-in-graph-geometric-deep-learning-f363e2576f58">Foundation Models in Graph &amp; Geometric Deep Learning</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph & Geometric ML in 2024: Where We Are and What’s Next (Part II — Applications)]]></title>
            <link>https://medium.com/data-science/graph-geometric-ml-in-2024-where-we-are-and-whats-next-part-ii-applications-1ed786f7bf63?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/1ed786f7bf63</guid>
            <category><![CDATA[aritificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[graph-machine-learning]]></category>
            <category><![CDATA[geometric-deep-learning]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Tue, 16 Jan 2024 05:59:01 GMT</pubDate>
            <atom:updated>2024-01-18T19:45:00.945Z</atom:updated>
            <content:encoded><![CDATA[<h4>State-of-the-Art Digest</h4><h3>Graph &amp; Geometric ML in 2024: Where We Are and What’s Next (Part II — Applications)</h3><h4>Following the tradition from previous years, we interviewed a cohort of distinguished and prolific academic and industrial experts in an attempt to summarise the highlights of the past year and predict what is in store for 2024. Past 2023 was so ripe with results that we had to break this post into two parts. This is Part II focusing on applications, see also <a href="https://towardsdatascience.com/graph-geometric-ml-in-2024-where-we-are-and-whats-next-part-i-theory-architectures-3af5d38376e1">Part I</a> for theory &amp; new architectures.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Lz_A1l6i036AtJ-FBFOe2w.png" /><figcaption>Image by Authors with some help from DALL-E 3.</figcaption></figure><p><em>The post is written and edited by </em><a href="https://twitter.com/michael_galkin"><em>Michael Galkin</em></a><em> and </em><a href="https://twitter.com/mmbronstein"><em>Michael Bronstein</em></a><em> with significant contributions from </em><a href="https://twitter.com/dom_beaini"><em>Dominique Beaini</em></a><em>, </em><a href="https://twitter.com/nathanbenaich"><em>Nathan Benaich</em></a><em>, </em><a href="https://twitter.com/bose_joey"><em>Joey Bose</em></a><em>, </em><a href="https://twitter.com/jo_brandstetter"><em>Johannes Brandstetter</em></a><em>, </em><a href="https://twitter.com/befcorreia"><em>Bruno Correia</em></a><em>, </em><a href="https://twitter.com/Ahmed_AI035"><em>Ahmed Elhag</em></a><em>, </em><a href="https://twitter.com/KexinHuang5"><em>Kexin Huang</em></a><em>, </em><a href="https://twitter.com/chaitjo"><em>Chaitanya Joshi</em></a><em>, </em><a href="https://twitter.com/leonklein26"><em>Leon Klein</em></a><em>, </em><a href="https://twitter.com/anoopnm007"><em>N M Anoop Krishnan</em></a><em>, </em><a href="https://twitter.com/WillLin1028"><em>Chen 
Lin</em></a><em>, </em><a href="https://twitter.com/loukasa_tweet"><em>Andreas Loukas</em></a><em>, </em><a href="https://www.linkedin.com/in/santiago-miret"><em>Santiago Miret</em></a><em>, </em><a href="https://twitter.com/NaefLuca"><em>Luca Naef</em></a><em>, </em><a href="https://twitter.com/LProkhorenkova"><em>Liudmila Prokhorenkova</em></a><em>, </em><a href="https://twitter.com/emaros96"><em>Emanuele Rossi</em></a><em>, </em><a href="https://twitter.com/HannesStaerk"><em>Hannes Stärk</em></a><em>, </em><a href="https://twitter.com/AlexanderTong7"><em>Alex Tong</em></a><em>, </em><a href="https://twitter.com/tsitsulin_"><em>Anton Tsitsulin</em></a><em>, </em><a href="https://twitter.com/PetarV_93"><em>Petar Veličković</em></a><em>, </em><a href="https://twitter.com/MinkaiX"><em>Minkai Xu</em></a><em>, and </em><a href="https://twitter.com/zhu_zhaocheng"><em>Zhaocheng Zhu</em></a><em>.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Z5Ncv43RzAe1o-yI" /><figcaption>Geometric ML methods and applications filled the covers of high-profile journals in 2023 (Figure sources: the papers by <a href="https://www.nature.com/articles/s42256-023-00609-5">Wang et al.</a>, <a href="https://www.nature.com/articles/s42256-023-00684-8">Viñas et al.</a>, <a href="https://www.nature.com/articles/s42256-023-00716-3">Deng et al.</a>, <a href="https://www.nature.com/articles/s43588-023-00532-0">Weiss et al.</a>, <a href="https://www.nature.com/articles/s42256-023-00744-z">Lagemann et al.</a>, <a href="https://www.nature.com/articles/s43588-023-00563-7">Duan et al.</a>, and <a href="https://www.science.org/doi/10.1126/science.adi2336">Lam et al.</a>)</figcaption></figure><ol><li><a href="#2626">Structural Biology (Molecules &amp; Proteins)</a><br>a. <a href="#2f16">A Structural Biologist’s Perspective</a><br>b. <a href="#ade6">Industrial Perspective</a><br>c. 
<a href="#7a08">Systems Biology</a></li><li><a href="#6211">Materials Science (Crystals)</a></li><li><a href="#0924">Molecular Dynamics &amp; ML Potentials</a></li><li><a href="#34b8">Geometric Generative Models (Manifolds)</a></li><li><a href="#3f98">BIG Graphs, Scalability: When GNNs are too expensive</a></li><li><a href="#f6d7">Algorithmic Reasoning &amp; Alignment</a></li><li><a href="#5854">Knowledge Graphs: Inductive Reasoning is Solved?</a></li><li><a href="#add1">Temporal Graph Learning</a></li><li><a href="#ad3d">LLMs + Graphs for Scientific Discovery</a></li><li><a href="#3d00">Cool GNN Applications</a></li><li><a href="#986f">Geometric Wall Street Bulletin 💸</a></li></ol><p>The legend we will be using throughout the text:<br>🔥 hot topics <br>💡 year’s highlight<br>🏋️ challenges<br>➡️ current/next developments <br>🔮 predictions/speculations <br>💰 financial transactions</p><h3>Structural Biology (Molecules &amp; Proteins)</h3><p><em>Dominique Beaini (Valence), Joey Bose (Mila &amp; Dreamfold), Michael Bronstein (Oxford), Bruno Correia (EPFL), Michael Galkin (Intel), Kexin Huang (Stanford), Chaitanya Joshi (Cambridge), Andreas Loukas (Genentech), Luca Naef (VantAI), Hannes Stärk (MIT), Minkai Xu (Stanford)</em></p><blockquote>Structural biology was definitely at the forefront of Geometric Deep Learning in 2023.</blockquote><p>Following the 2020 discovery of <a href="https://pubmed.ncbi.nlm.nih.gov/32084340/">halicin</a> as a potential new antibiotic, in 2023, two new antibiotics were discovered with the help of GNNs! First, it is <a href="https://www.nature.com/articles/s41589-023-01349-8">abaucin</a> (by McMaster and MIT), which targets a stubborn pathogen resistant to many drugs. 
Second, MIT and Harvard researchers <a href="https://www.nature.com/articles/s41586-023-06887-8">discovered a new structural class of antibiotics</a> where the screening process was supported by <a href="https://github.com/chemprop/chemprop">ChemProp</a>, a suite of GNNs for molecular property prediction. We also observe a convergence of ML and experimental techniques (“lab-in-the-loop”) in the recent work on <a href="https://www.science.org/doi/10.1126/science.adi1407">autonomous molecular discovery</a> (a trend we will also see in Materials Design in the following sections).</p><p><strong>Flow Matching</strong> has been one of the biggest generative ML trends of 2023, allowing for faster sampling and deterministic sampling trajectories compared to diffusion models. The most prominent examples of Flow Matching models we have seen in biological applications are <strong>FoldFlow</strong> (<a href="https://arxiv.org/abs/2310.02391">Bose, Akhound-Sadegh, et al</a>.) for protein backbone generation, <strong>FlowSite</strong> (<a href="https://arxiv.org/abs/2310.05764">Stärk et al</a>.) for protein binding site design, and <strong>EquiFM</strong> (<a href="https://openreview.net/forum?id=hHUZ5V9XFu">Song, Gong, et al</a>.) for molecule generation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UQCCpeJV1xdx7ncGx8CdZw.png" /><figcaption><em>Conditional probability paths learned by different versions of FoldFlow, visualizing the rotation trajectory of a single residue by the action of SO(3) on its homogeneous space </em>𝕊²<em>. 
Figure source: </em><a href="https://arxiv.org/abs/2310.02391"><em>Bose, Akhound-Sadegh, et al</em></a><em>.</em></figcaption></figure><p>Efficient Flow Matching on complex geometries with necessary equivariances became possible thanks to a handful of theory papers including Riemannian Flow Matching (<a href="https://arxiv.org/abs/2302.03660">Chen and Lipman</a>), Minibatch Optimal Transport (<a href="https://arxiv.org/abs/2302.00482">Tong et al</a>), and Simulation-Free Schrödinger bridges (<a href="https://arxiv.org/abs/2307.03672">Tong, Malkin, Fatras, et al</a>). Great resources to learn Flow Matching with code examples and notebooks are the <a href="https://github.com/atong01/conditional-flow-matching">TorchCFM</a> repo on GitHub as well as talks by <a href="https://www.youtube.com/watch?v=5ZSwYogAxYg">Yaron Lipman</a>, <a href="https://www.youtube.com/watch?v=EPxDI0ytfQU">Joey Bose</a>, <a href="https://www.youtube.com/watch?v=Xl7YNR1-CN8">Hannes Stärk</a>, and <a href="https://www.youtube.com/watch?v=UhDtH7Ia9Ag">Alex Tong</a>.</p><p><strong>Diffusion models</strong> nevertheless continue to be the main workhorse of generative modeling in structural biology. 
In 2023, we saw several landmark works: <strong>FrameDiff</strong> (<a href="https://arxiv.org/abs/2302.02277">Yim, Trippe, De Bortoli, Mathieu, et al</a>) for protein backbone generation, <strong>EvoDiff</strong> (<a href="https://www.biorxiv.org/content/10.1101/2023.09.11.556673v1">Alamdari et al</a>) for generating protein sequences with discrete diffusion, <strong>AbDiffuser</strong> (<a href="https://arxiv.org/abs/2308.05027">Martinkus et al</a>) for full-atom antibody design with frame averaging and discrete diffusion (and with successful wet lab experiments), <strong>DiffMaSIF</strong> (<a href="https://www.mlsb.io/papers_2023/DiffMaSIF_Surface-based_Protein-Protein_Docking_with_Diffusion_Models.pdf">Sverrison, Akdel, et al</a>) and <strong>DiffDock-PP</strong> (<a href="https://arxiv.org/abs/2304.03889">Ketata, Laue, Mammadov, Stärk, et al</a>) for protein-protein docking, <strong>DiffPack</strong> (<a href="https://arxiv.org/abs/2306.01794">Zhang, Zhang, et al</a>) for side-chain packing, and the <strong>RFDiffusion</strong> <strong>all-atom</strong> version published by the Baker Lab (<a href="https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1.full">Krishna, Wang, Ahern, et al</a>). Among latent diffusion models (like Stable Diffusion in image generation applications), <strong>GeoLDM</strong> (<a href="https://arxiv.org/abs/2305.01140">Xu et al</a>) was the first for 3D molecule conformations, followed by <a href="https://openreview.net/forum?id=DP4NkPZOpD">OmniProt</a> for protein sequence-structure generation.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gzs5s5iJug9SXAyF" /><figcaption>FrameDiff: parameterization of the backbone frame with rotation, translation, and torsion angle for the oxygen atom. 
Figure Source: <a href="https://arxiv.org/abs/2302.02277">Yim, Trippe, De Bortoli, Mathieu, et al</a></figcaption></figure><p>Finally, Google DeepMind and Isomorphic Labs <a href="https://www.isomorphiclabs.com/articles/a-glimpse-of-the-next-generation-of-alphafold">announced</a> <strong>AlphaFold 2.3</strong> — the latest iteration is significantly improving upon the baselines in 3 tasks: docking benchmarks (almost 2× better than DiffDock on the new <a href="https://arxiv.org/abs/2308.05777">PoseBusters</a> benchmark), protein-nucleic acid interactions, and antibody-antigen prediction.</p><p><strong><em>Chaitanya Joshi (Cambridge)</em></strong></p><p>💡There have been two emerging trends for biomolecular modeling and design that I am very excited about in 2023:</p><p>1️⃣ Going from protein structure prediction to conformational ensemble generation. There were several interesting approaches to the problem, including <a href="https://www.nature.com/articles/s41586-023-06832-9">AlphaFold with MSA clustering</a>, <a href="https://www.nature.com/articles/s41467-023-36443-x">idpGAN</a>, <a href="https://arxiv.org/abs/2306.05445">Distributional Graphormer</a> (a diffusion model), and <a href="https://www.mlsb.io/papers_2023/AlphaFold_Meets_Flow_Matching_for_Generating_Protein_Ensembles.pdf">AlphaFold Meets Flow Matching for Generating Protein Ensembles</a>.</p><p>2️⃣ Modelling of biomolecular complexes and design of biomolecular interactions among proteins + X: <a href="https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1.full">RFdiffusion all-atom</a> and <a href="https://www.biorxiv.org/content/10.1101/2023.12.22.573103v1.full">Ligand MPNN</a>, both from the Baker Lab, are representative examples of the trend towards designing interactions. 
The new in-development <a href="https://www.isomorphiclabs.com/articles/a-glimpse-of-the-next-generation-of-alphafold">AlphaFold report</a> claims that a unified structure prediction model can outperform or match specialised models across solo protein and protein complex structure prediction as well as protein-ligand and protein-nucleic acid co-folding.</p><blockquote>“However, for all the exciting methodology development in biomolecular modelling and design, perhaps the biggest lesson for the ML community this year should be to focus more on meaningful <strong>in-silico evaluation</strong> and, if possible, <strong>experimental validation</strong>.” — <strong>Chaitanya Joshi </strong>(Cambridge)</blockquote><p>1️⃣ In early 2023, Guolin Ke’s team at DP Technology released two excellent re-evaluation papers highlighting how we may have been largely overestimating the performance of prominent geometric deep learning-based methods for molecular <a href="https://arxiv.org/abs/2302.07061">conformation generation</a> and <a href="https://arxiv.org/abs/2302.07134">docking</a> w.r.t. traditional baselines.</p><p>2️⃣ <a href="https://arxiv.org/abs/2308.07413">PoseCheck</a> and <a href="https://arxiv.org/abs/2308.05777">PoseBusters</a> shed further light on the failure modes of current molecular generation and docking methods. Critically, generated molecules and their 3D poses are often ‘nonphysical’ and contain steric clashes, hydrogen placement issues, and high strain energies.</p><p>3️⃣ Very few papers attempt any experimental validation of new ML ideas. 
Perhaps collaborating with a wet lab is challenging for those focussed on new methodology development, but I hope that we ML-ers, as a community, will at least be a lot more cautious about the in-silico evaluation metrics we are constantly pushing as we create new models.</p><p><strong><em>Hannes Stärk (MIT)</em></strong></p><p>💡I am reading quite some hype here about Flow Matching, stochastic interpolants, and Rectified Flows (I will call them “Bridge Matching,” or “BM”). I do not think there is much value in just replacing diffusion models with BM in all the existing applications. For pure generative modeling, the main BM advantage is simplicity.</p><p>I think we should instead be excited about BM for the new capabilities it unlocks. For example, training bridges between arbitrary distributions in a simulation-free manner (what are the best applications for this? I basically only saw <a href="https://arxiv.org/abs/2308.16212">retrosynthesis</a> so far.) or solving OT problems as in <a href="https://arxiv.org/abs/2303.16852">DSBM</a>, which does so for fluid flow downscaling. A lot of tools emerged in 2023 (also let us mention <a href="https://arxiv.org/abs/2310.03695">BM with multiple marginals</a>); maybe in 2024, the community will make good use of them?</p><p><strong><em>Joey Bose (Mila &amp; Dreamfold)</em></strong></p><p>💡 This year we have really seen the rise of geometric generative models from theory to practice. A few standouts for me include <a href="https://arxiv.org/abs/2302.03660">Riemannian Flow Matching</a> — in general any paper by Ricky Chen and Yaron Lipman on these topics is a must-read — and FrameDiff from <a href="https://arxiv.org/abs/2302.02277">Yim et al.</a>, which introduced a lot of the important machinery for protein backbone generation. 
Of course, standing on the shoulders of both RFM and FrameDiff, we built <a href="https://arxiv.org/abs/2310.02391">FoldFlow</a>, a cooler flow-matching approach to protein generative models.</p><blockquote>“Looking ahead, I foresee a lot <strong>more flow matching</strong>-based approaches coming into use. They are better for proteins and longer sequences and can start from any source distribution.” — Joey Bose (Mila &amp; Dreamfold)</blockquote><p>🔮 Moreover, I suspect we will soon see <strong>multi-modal generative models</strong> in this space, such as discrete + continuous models and also conditional models in the same vein as text-conditioned diffusion models for images. Perhaps, we might even see <strong>latent generative models</strong> here given that they scale so well!</p><p><strong><em>Minkai Xu (Stanford)</em></strong></p><blockquote>“This year, the community has further pushed forward the geometric generative models for 3D molecular generation in many perspectives.” — Minkai Xu (Stanford)</blockquote><p><strong>Flow matching</strong>: Ricky and Yaron proposed the Flow Matching method as an alternative to the widely used diffusion models, and EquiFM (<a href="https://openreview.net/forum?id=hHUZ5V9XFu">Song et al</a> and <a href="https://arxiv.org/abs/2306.15030">Klein et al</a>) realizes the variant for 3D molecule generation by parameterizing the flow dynamics with equivariant GNNs. In the meantime, <a href="https://arxiv.org/pdf/2310.05297.pdf">FrameFlow</a> and <a href="https://arxiv.org/abs/2310.02391">FoldFlow</a> construct FM models for protein generation.</p><p>🔮 Moving forward, similar to the vision and text domains, people have begun to explore generation in the lower-dimensional latent space instead of the complex original data space (<strong>latent generative models</strong>). 
GeoLDM (<a href="https://arxiv.org/abs/2305.01140">Xu et al</a>) proposed the first latent diffusion model (like Stable Diffusion in CV) for 3D molecule generation, while <a href="https://arxiv.org/abs/2305.04120">Fu et al</a> adopt a similar formulation for large protein generation.</p><h3>A Structural Biologist’s Perspective</h3><p><em>Bruno Correia (EPFL)</em></p><blockquote>“Current generative models still create “garbage” outputs that violate many of the physical and chemical properties that molecules are known to have. The advantage of current generative models is, of course, their speed, which affords them the possibility of generating many samples and brings front and center the ability to filter the best generated samples, which in the case of protein design has benefited immensely from the transformative development of AlphaFold2.” — Bruno Correia (EPFL)</blockquote><p>➡️ The next challenge for the community will perhaps be how to infuse generative models with <strong>meaningful physical and chemical priors</strong> to enhance sampling performance and generalization. Interestingly, we have not seen the same remarkable advances (experimentally validated) in applications to small molecule design, which we hope to see during 2024.</p><p>➡️ <strong>The rise of multimodal models.</strong> Generally, in biology-related tasks data sparsity is a given, and as such, strategies to extract the most signal out of the data are essential. One way to try to overcome such limitations is to improve the expressiveness of the data representations and perhaps obtain more performant neural networks this way. Likely in the short term, we will be able to explore architectures that encompass several types of representations of the objects of interest and harness the best predictions for the ever more complex tasks we are facing as progressively more of the basic problems get solved.
This notion of multimodality is of course intimately related to the overall aim of having models with stronger priors that, in a generative context, honour fundamental constraints of the objects of interest.</p><p>➡️ <strong>The models that know everything</strong>. As the power of machine learning models improves, we clearly tend towards more multi-objective optimization when attempting to solve real-life problems. Taking small molecule generation as an example: from a biochemical perspective, the drug design problem starts with a target to which a small molecule binds; therefore, one of the first and most important constraints is that the generative process ought to be conditioned on the protein pocket. However, such a constraint may not be enough to create real small molecules, as many such chemicals are simply impossible or very hard to synthesize; therefore, a model that has notions of chemical synthesizability and can integrate such constraints into the search space would be much more useful.</p><p>➡️ <strong>From chemotype to phenotype</strong>. In terms of data representation, atomic graph structures together with vector embeddings have achieved remarkable results, particularly in the search for new antibiotics. Making accurate predictions of which chemical structures have antimicrobial activity is, broadly speaking, an exercise in phenotype prediction from chemical structure.
Due to the simplicity of the approaches used and the impressive results obtained, one would expect that more sophisticated data representations on the molecule side, perhaps combined with richer phenotype assignment, could make critical contributions to such an important problem in drug development.</p><h3>Industrial perspective</h3><p><strong><em>Luca Naef (VantAI)</em></strong></p><p>🔥 <em>What are the biggest advancements in the field you noticed in 2023?</em></p><p>1️⃣ <strong>Increasing multi-modality &amp; modularity </strong>— as shown by the emergence of initial co-folding methods for both proteins &amp; small molecules, diffusion and non-diffusion-based, extending AF2’s success: <a href="https://www.biorxiv.org/content/10.1101/2022.12.20.521309v1.full.pdf">DiffusionProteinLigand</a> in the last days of 2022 and <a href="https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1">RFDiffusion</a>, <a href="https://www.isomorphiclabs.com/articles/a-glimpse-of-the-next-generation-of-alphafold">AlphaFold2</a> and <a href="https://www.biorxiv.org/content/10.1101/2023.11.03.565471v1">Umol</a> by the end of 2023. We are also seeing models that have sequence &amp; structure co-trained: <a href="https://www.biorxiv.org/content/10.1101/2023.10.01.560349v2">SAProt</a>, <a href="https://www.biorxiv.org/content/10.1101/2023.07.23.550085v1">ProstT5</a>, and sequence, structure &amp; surface co-trained with <a href="https://www.mlsb.io/papers_2023/Pre-training_Sequence_Structure_and_Surface_Features_for_Comprehensive_Protein_Representation_Learning.pdf">ProteinINR</a>.
There is a general revival of surface-based methods after a quieter 2021 and 2022: <a href="https://www.mlsb.io/papers_2023/DiffMaSIF_Surface-based_Protein-Protein_Docking_with_Diffusion_Models.pdf">DiffMasif</a>, <a href="https://arxiv.org/abs/2311.17050">SurfDock</a>, and <a href="https://www.biorxiv.org/content/10.1101/2023.12.03.567710v1">ShapeProt</a>.</p><p>2️⃣ <strong>Datasets and benchmarks</strong>. Datasets, especially synthetic/computationally derived: <a href="https://academic.oup.com/nar/article/52/D1/D384/7438909">ATLAS</a> and the <a href="https://mddbr.eu/">MDDB</a> for protein dynamics. <a href="https://www.biorxiv.org/content/10.1101/2023.05.24.542082v1">MISATO</a>, <a href="https://www.nature.com/articles/s41597-022-01882-6">SPICE</a>, <a href="https://www.nature.com/articles/s41597-023-02443-1">Splinter</a> for protein-ligand complexes, <a href="https://arxiv.org/abs/2311.01135">QM1B</a> for molecular properties. PINDER: large protein-protein docking dataset with matched apo/predicted pairs and benchmark suite with retrained docking models. <a href="https://chanzuckerberg.github.io/cryoet-data-portal/index.html#">CryoET data portal</a> for CryoET. And a whole host of welcome benchmarks: PINDER, <a href="https://arxiv.org/abs/2308.05777">PoseBusters</a>, and <a href="https://arxiv.org/abs/2308.07413">PoseCheck</a>, with a focus on more rigorous and practically relevant settings.</p><p>3️⃣ <strong>Creative pre-training strategies</strong> to get around the sparsity of diverse protein-ligand complexes. Van-der-mers training (<a href="https://openreview.net/forum?id=UfBIxpTK10">DockGen</a>) &amp; sidechain training strategies in <a href="https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1">RF-AA</a> and pre-training on ligand-only complexes in CCD in <a href="https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1">RF-AA</a>. 
Multi-task pre-training as in <a href="https://openreview.net/forum?id=6K2RM6wVqKu">Unimol</a> and others.</p><p>🏋️ <em>What are the open challenges that researchers might overlook?</em></p><p>1️⃣ <strong>Generalization. </strong><a href="https://openreview.net/forum?id=UfBIxpTK10">DockGen</a><strong> </strong>showed that current state-of-the-art protein-ligand docking models completely lose predictive power when asked to generalise to novel protein domains. We see a similar phenomenon in the <a href="https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/a-glimpse-of-the-next-generation-of-alphafold/alphafold_latest_oct2023.pdf">AlphaFold-latest report</a>, where performance on novel proteins &amp; ligands drops heavily to below biophysics-based baselines (which have access to holo structures), despite very generous definitions of novel protein &amp; ligand. This indicates that existing approaches might still largely rely on memorization, an observation that has been extensively argued over the <a href="https://pubs.acs.org/doi/10.1021/acs.jmedchem.2c00487">years</a>.</p><p>2️⃣ <strong>The curse of (simple) baselines. </strong>A recurring topic over the years: 2023 has again shown what industry practitioners have long known. In many practical problems such as molecular generation, property prediction, docking, and conformer prediction, simple baselines or classical approaches often still outperform ML-based approaches in practice. This has been documented increasingly in 2023 by <a href="https://arxiv.org/abs/2310.09267">Tripp et al.</a>, <a href="https://arxiv.org/abs/2302.07134">Yu et al.</a>, and <a href="https://arxiv.org/abs/2302.07061">Zhou et al.</a></p><p>🔮 <em>Predictions for 2024!</em></p><blockquote>“In 2024, data sparsity will remain top of mind and we will see a lot of smart ways to use models to generate synthetic training data.
Self-distillation in AlphaFold2 served as a big inspiration, as does Confidence Bootstrapping in <a href="https://openreview.net/forum?id=UfBIxpTK10">DockGen</a>, which leverages the insight, first realised in <a href="https://www.biorxiv.org/content/10.1101/2022.03.11.484043v1">2022</a>, that we now have sufficiently powerful models that can score poses but not always generate them.” — Luca Naef (VantAI)</blockquote><p>2️⃣ We will see more biological/chemical assays purpose-built for ML or only making sense in a machine learning context (i.e., they might not lead to biological insight by themselves but be primarily useful for training models). An example from 2023 is the large-scale protein folding experiments by <a href="https://www.nature.com/articles/s41586-023-06328-6">Tsuboyama et al.</a> This move might be driven by techbio startups, where we have seen the first foundation models built on such ML-purpose-built assays for structural biology, e.g., <a href="https://www.biorxiv.org/content/10.1101/2023.12.13.571579v1">ATOM-1</a>.</p><p><strong><em>Andreas Loukas (Prescient Design, part of Genentech)</em></strong></p><p>🔥 <em>What are the biggest advancements in the field you noticed in 2023?</em></p><blockquote>“In 2023, we started to see some of the challenges of equivariant generation and representation for proteins being resolved through diffusion models.” — Andreas Loukas (Prescient Design)</blockquote><p>1️⃣ We also noticed a <strong>shift towards approaches that model and generate molecular systems at higher fidelity</strong>.
For instance, the most recent models adopt a fully end-to-end approach by generating backbone, sequence and side-chains jointly (<a href="https://openreview.net/pdf?id=7GyYpomkEa">AbDiffuser</a>, <a href="https://arxiv.org/pdf/2302.00203.pdf">dyMEAN</a>) or at least solve the problem in two steps but with a partially joint model (<a href="https://www.nature.com/articles/s41586-023-06728-8">Chroma</a>); as compared to backbone generation followed by inverse folding as in <a href="https://www.nature.com/articles/s41586-023-06415-8">RFDiffusion</a> and <a href="https://openreview.net/pdf?id=m8OUBymxwv">FrameDiff</a>. Other attempts to improve the modelling fidelity can be found in the latest updates to co-folding tools like <a href="https://www.isomorphiclabs.com/articles/a-glimpse-of-the-next-generation-of-alphafold">AlphaFold2</a> and <a href="https://www.biorxiv.org/content/10.1101/2023.10.09.561603v1">RFDiffusion</a> which render them sensitive to non-protein components (ligands, prosthetic groups, cofactors); as well as in papers that attempt to account for conformational dynamics (see discussion above). In my view, this line of work is essential because the binding behaviour of molecular systems can be very sensitive to how atoms are placed, move, and interact.</p><p>2️⃣ In 2023, many works also attempted to get a handle on <strong>binding affinity</strong> by learning to predict the effect of mutations of a known crystal by pre-training on large corpora, such as computationally predicted mutations (<a href="https://github.com/oxpig/Graphinity">graphinity</a>), and on side-tasks, such as <a href="https://openreview.net/pdf?id=_X9Yl1K2mD">rotamer density estimation</a>. The obtained results are encouraging as they can significantly outperform semi-empirical baselines like Rosetta and FoldX. 
However, there is still significant work to be done to render these models reliable for binding affinity prediction.</p><p>3️⃣ I have further observed a growing recognition of <strong>protein Language Models (pLMs)</strong> and specifically <a href="https://www.science.org/doi/10.1126/science.ade2574">ESM</a> as valuable tools, even among those who primarily favour geometric deep learning. These embeddings are used to help docking models, allow the construction of simple yet competitive predictive models for binding affinity prediction (<a href="https://www.nature.com/articles/s41467-023-39022-2">Li et al 2023</a>), and can generally offer an efficient method to create residue representations for GNNs that are informed by the extensive proteome data without the need for extensive pretraining (<a href="https://www.mlsb.io/papers_2023/Evaluating_Representation_Learning_on_the_Protein_Structure_Universe.pdf">Jamasb et al 2023</a>). However, I do maintain a concern regarding the use of pLMs: it is unclear whether their effectiveness is due to data leakage or genuine generalisation. This is particularly pertinent when evaluating models on tasks like amino-acid recovery in inverse folding and conditional CDR design, where distinguishing between these two factors is crucial.</p><p>🏋️ <em>What are the open challenges that researchers might overlook?</em></p><p>1️⃣ Working with <strong>energetically relaxed crystal structures</strong> (and, even worse, folded structures) can significantly affect the performance of downstream predictive models. This is especially true for the prediction of protein-protein interactions (PPIs). In my experience, the performance of PPI predictors severely deteriorates when they are given a relaxed structure as opposed to the bound (holo) crystallised structure.</p><p>2️⃣ Though successful <em>in silico </em>antibody design has the capacity to revolutionise drug design, <strong>general protein models are not (yet?)
as good at folding, docking or generating antibodies as antibody-specific models are</strong>. This is perhaps due to the low conformational variability of the antibody fold and the distinct binding mode between antibodies and antigens (loop-mediated interactions that can involve a non-negligible entropic component). Perhaps for the same reasons, the <em>de novo</em> design of antibody binders (which I define as 0-shot generation of an antibody that binds to a previously unseen epitope) remains an open problem. Currently, experimentally confirmed cases of <em>de novo</em> binders involve mostly stable proteins, like <a href="https://www.nature.com/articles/s41586-023-06415-8">alpha-helical bundles</a>, that are common in the PDB and harbour interfaces that differ substantially from epitope-paratope interactions.</p><p>3️⃣ <strong>We are still lacking a general-purpose proxy for binding free energy</strong>. The main issue here is the lack of high-quality data of sufficient size and diversity (esp. co-crystal structures). We should therefore be cognizant of the limitations of any such learned proxy in model evaluation: though predicted binding scores that are out of distribution of known binders are a clear signal that something is off, we should avoid the typical pitfall of trying to demonstrate the superiority of our model in an empirical evaluation by showing how it leads to even higher scores.</p><p><strong><em>Dominique Beaini (Valence Labs, part of Recursion)</em></strong></p><blockquote>“I’m excited to see a very large community being built around the problem of drug discovery, and I feel we are on the brink of a new revolution in the speed and efficiency of discovering drugs.” — Dominique Beaini (Valence Labs)</blockquote><p><em>What work got me excited in 2023?</em></p><p>I am confident that machine learning will allow us to tackle rare diseases quickly, stop the next COVID-X pandemic before it can spread, and live longer and healthier.
But there’s a lot of work to be done and there are a lot of challenges ahead, some bumps in the road, and some canyons on the way. Speaking of communities, you can visit the <a href="https://portal.valencelabs.com/">Valence Portal</a> to keep up to date with the 🔥 news in ML for drug discovery.</p><p><em>What are the hard questions for 2024?</em></p><p>⚛️ <strong>A new generation of quantum mechanics.</strong> Machine learning force-fields, often based on equivariant and invariant GNNs, have been promising us a treasure: the precision of density functional theory, but thousands of times faster and at the scale of entire proteins. Although some steps were made in this direction with <a href="https://link.springer.com/chapter/10.1007/978-3-031-32041-5_12">Allegro</a> and <a href="https://arxiv.org/pdf/2401.00096.pdf">MACE-MP</a>, current models do not generalize well to unseen settings and very large molecules, and they are still too slow to be applicable on the timescale that is needed 🐢. For generalization, I believe that bigger and more diverse datasets are the most important stepping stones. For computation time, I believe we will see models that enforce equivariance less strictly, such as <a href="https://arxiv.org/pdf/2305.05577.pdf">FAENet</a>. But efficient sampling methods will play a bigger role: spatial sampling, such as using <a href="https://arxiv.org/abs/2210.01776">DiffDock</a> to get more interesting starting points, and time sampling, such as <a href="https://www.microsoft.com/en-us/research/publication/timewarp-transferable-acceleration-of-molecular-dynamics-by-learning-time-coarsened-dynamics/">TimeWarp</a> to avoid simulating every frame. I’m really excited by the big STEBS 👣 awaiting us in 2024: Spatio-temporal equivariant Boltzmann samplers.</p><p>🕸️ <strong>Everything is connected. Biology is inherently multimodal 🙋🐁 🧫🧬🧪.</strong> One cannot simply decouple the molecule from the rest of the biological system.
Of course, that’s how ML for drug discovery was done in the past: simply build a model of the molecular graph and fit it to experimental data. But we have reached a critical point 🛑, no matter how many trillion parameters the GNN model has, how much data is used to train it, or how many experts are mixtured together. It is time to bring biology into the mix, and the most straightforward way is with multi-modal models. One method is to condition the output of the GNNs on target protein sequences, as in <a href="https://www.biorxiv.org/content/10.1101/2023.09.13.557595v4.abstract">MocFormer</a>. Another is to use microscopy images or transcriptomics to better inform the model of the biological signature of molecules, as in <a href="https://www.biorxiv.org/content/10.1101/2023.11.12.566777v1.full">TranSiGen</a>. Yet another is to use LLMs to embed contextual information about the tasks, as in <a href="https://arxiv.org/pdf/2401.04478.pdf">TwinBooster</a>. Or even better, combining all of these together 🤯, but this could take years. The main issue for the broader community seems to be the availability of large amounts of quality and standardized data, but fortunately, this is not an issue for Valence.</p><p><strong>🔬 Relating biological knowledge and observables. </strong>Humans have been trying to map biology for a long time, building relational maps for genes 🧬, protein-protein interactions 🔄, metabolic pathways 🔀, etc. I invite you to read this <a href="https://academic.oup.com/bib/article/23/6/bbac404/6712301">review of knowledge graphs for drug discovery</a>. But all this knowledge often sits unused and ignored by the ML community. I feel that this is an area where GNNs for knowledge graphs could prove very useful, especially in 2024, and it could provide another modality for the 🕸️ point above. Considering that human knowledge is incomplete, we can instead recover relational maps from foundation models.
This is the route taken by <a href="https://arxiv.org/abs/2309.16064">Phenom1 </a>when trying to recall known genetic relationships. However, having to deal with various knowledge databases is an extremely complex task that we can’t expect most ML scientists to be able to tackle alone. But with the help of artificial assistants like <a href="https://www.valencelabs.com/lowe">LOWE</a>, this can be done in a matter of seconds.</p><p><strong>🏆 Benchmarks, benchmarks, benchmarks.</strong> I can’t repeat the word <strong><em>benchmark</em></strong> enough. Alas, benchmarks will stay the unloved kid on the ML block 🫥. But if the word benchmark is uncool, its cousin <strong><em>competition</em></strong> is way cooler 😎! Just as the <a href="https://ogb.stanford.edu/docs/lsc/">OGB-LSC</a> competition and <a href="https://opencatalystproject.org/challenge.html">Open Catalyst</a> challenge played a major role for the GNN community, it is now time for a new series of competitions 🥇. We even got the <a href="https://tgb.complexdatalab.com/">TGB (Temporal graph benchmark)</a> recently. If you were at NeurIPS’23, then you probably heard of Polaris coming up early 2024 ✨. Polaris is a consortium of multiple pharma and academic groups trying to improve the quality of available molecular benchmarks to better represent real drug discovery. Perhaps we’ll even see a benchmark suitable for molecular graph generation instead of optimizing QED and cLogP, but I wouldn’t hold my breath, I have been waiting for years. What kind of new, crazy competition will light up the GDL community this year 🤔?</p><h3>Systems Biology</h3><p><strong><em>Kexin Huang (Stanford)</em></strong></p><p>Biology is an interconnected, multi-scale, and multi-modal system. Effective modeling of this system can not only unravel fundamental biological questions but also significantly impact therapeutic discovery. 
The most natural data format for encapsulating this system is a relational database or a heterogeneous graph. This graph stores data from decades of wet lab experiments across various biological modalities, scaling up to billions of data points.</p><blockquote>“In 2023, we witnessed a range of innovative applications using GNNs on these biological system graphs. These applications have unlocked new biomedical capabilities and answered critical biological queries.” — Kexin Huang (Stanford)</blockquote><p>1️⃣ One particularly exciting field is <strong>perturbative biology</strong>. Understanding the outcomes of perturbations can lead to advancements in cell reprogramming, target discovery, and synthetic lethality, among others. In 2023, <a href="https://www.nature.com/articles/s41587-023-01905-6">GEARS</a> applied a GNN to gene perturbation relational graphs, predicting outcomes of genetic perturbations that have not been observed before.</p><p>2️⃣ Another cool application concerns <strong>protein representation</strong>. While current protein representations are fixed and static, we recognize that the same protein can exhibit different functions in varying cellular contexts. <a href="https://www.biorxiv.org/content/10.1101/2023.07.18.549602v1">PINNACLE</a> uses a GNN on protein interaction networks to contextualize protein embeddings. This approach has been shown to enhance 3D structure-based protein representations and outperform existing context-free models in identifying therapeutic targets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/804/0*LdQPRd76Wnb9BDeH" /><figcaption>PINNACLE has protein-, cell type-, and tissue-level attention mechanisms that enable the algorithm to generate contextualized representations of proteins, cell types, and tissues in a single unified embedding space.
Source: <a href="https://www.biorxiv.org/content/10.1101/2023.07.18.549602v1">Li et al</a></figcaption></figure><p>3️⃣ GNNs have also played a vital role in <strong>diagnosing rare diseases</strong>. <a href="https://www.medrxiv.org/content/10.1101/2022.12.07.22283238v1">SHEPHERD</a> utilizes a GNN over a massive knowledge graph to encode extensive biological knowledge into the ML model and is shown to facilitate causal gene discovery, identify ‘patients-like-me’ with similar genes or diseases, and provide interpretable insights into novel disease manifestations.</p><p>➡️ Moving beyond predictions, understanding the underlying mechanisms of biological phenomena is crucial. <strong>Graph XAI</strong> applied to system graphs is a natural fit for identifying mechanistic pathways. <a href="https://www.medrxiv.org/content/10.1101/2023.03.19.23287458v2">TxGNN</a>, for example, grounds drug-disease relation predictions in the biological system graph, generating multi-hop interpretable paths. These paths rationalize the potential of a drug in treating a specific disease. TxGNN designed <a href="http://txgnn.org/">visualizations</a> for these interpretations and conducted user studies, demonstrating their effectiveness for decision-making by clinicians and biomedical scientists.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5frNFjOVtUNiUyvE" /><figcaption>A web-based graphical user interface to support clinicians and scientists in exploring and analyzing the predictions and explanations generated by TxGNN. The ‘Control Panel’ allows users to select the disease of interest and view the top-ranked TxGNN predictions for the query disease. The ‘edge threshold’ module enables users to modify the sparsity of the explanation and thereby control the density of the multi-hop paths displayed. The ‘Drug Embedding’ panel allows users to compare the position of a selected drug relative to the entire repurposing candidate library.
The ‘Path Explanation’ panel displays the biological relations that have been identified as crucial for TxGNN’s predictions regarding therapeutic use. Source: <a href="https://www.medrxiv.org/content/10.1101/2023.03.19.23287458v2">Huang, Chandar, et al</a></figcaption></figure><p>➡️ Foundation models in biology have predominantly been unimodal (focused on proteins, molecules, diseases, etc.), primarily due to the scarcity of paired data. <strong>Bridging across modalities</strong> to answer multi-modal queries is an exciting frontier. For example, <a href="https://openreview.net/forum?id=jJCeMiwHdH">BioBridge</a> leverages biological knowledge graphs to learn transformations across unimodal foundation models, enabling multi-modal behaviors.</p><p>🔮 GNNs applied to system graphs have the potential to (1) encode vast biomedical knowledge, (2) bridge biological modalities, (3) provide mechanistic insights, and (4) contextualize biological entities. We anticipate even more groundbreaking applications of GNNs in biology in 2024, addressing some of the most pressing questions in the field.</p><h4><strong>Predictions from the 2023 post</strong></h4><p>(1) performance improvements of diffusion models such as faster sampling and more efficient solvers;<br>✅ yes, with flow matching</p><p>(2) more powerful conditional protein generation models;<br>❌ Chroma and RFDiffusion are still on top</p><p>(3) more successful applications of <a href="https://arxiv.org/abs/2111.09266">Generative Flow Networks</a> to molecules and proteins<br>❌ yet to be seen</p><h3>Materials Science (Crystals)</h3><p><em>Michael Galkin (Intel) and Santiago Miret (Intel)</em></p><p>In 2023, for a short period, the scientific news was all about <a href="https://en.wikipedia.org/wiki/LK-99">LK-99</a> — a supposed room-temperature superconductor created by a Korean team (spoiler: <a href="https://www.nature.com/articles/d41586-023-02585-7">it did not work, as of now</a>).</p><blockquote>This
highlights the huge potential ML has in materials science, where perhaps the biggest progress of the year has happened — we can now say that materials science and materials discovery are first-class citizens in the Geometric DL landscape.</blockquote><p>💡 Geometric DL applied to materials science and discovery saw significant advances across new modelling methods, the creation of new benchmarks and datasets, automated design with generative methods, and the identification of new research questions based on those advances.</p><p>1️⃣ Applications of geometric models as evaluation tools in automated discovery workflows. The <a href="https://github.com/IntelLabs/matsciml">Open MatSci ML Toolkit</a> consolidated all open-sourced crystal structure datasets, leading to 1.5 million data points for ground-state structure calculations that are now easily available for model development. The <a href="https://arxiv.org/abs/2309.05934">authors’ initial results</a> indicate that merging datasets improves performance if done carefully.</p><p>2️⃣ <a href="https://arxiv.org/abs/2308.14920">MatBench Discovery</a> is another good example of this integration of geometric models as evaluation tools for crystal stability: it tests models’ predictions of the <strong>energy above hull</strong> for various crystal structures. The energy above hull is the most reliable approximation of crystal structure stability and also represents an improvement over formation energy or raw energy prediction, which have practical limitations as stability metrics.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/978/1*ryQ3qBUoPRyR4JZ_vuylOQ.png" /><figcaption>Universal potentials are more reliable classifiers because they exit the red triangle earliest. These lines show the rolling MAE on the WBM test set as the energy to the convex hull of the MP training set is varied; lower is better.
The red-highlighted ‘triangle of peril’ shows where the models are most likely to misclassify structures. As long as a model’s rolling MAE remains inside the triangle, its mean error is larger than the distance to the convex hull. If the model’s error for a given prediction happens to point towards the stability threshold at 0 eV from the hull (the plot’s center), its average error will change the stability classification of a material from true positive/negative to false negative/positive. The width of the ‘rolling window’ box indicates the window over which hull distance prediction errors were averaged. Source: <a href="https://arxiv.org/abs/2308.14920">Riebesell et al</a></figcaption></figure><p>3️⃣ In terms of new geometric models for crystal structure prediction, the <strong>Crystal Hamiltonian Graph neural network</strong> (<a href="https://chgnet.lbl.gov/">CHGNet</a>, <a href="https://arxiv.org/abs/2302.14231">Deng et al</a>) is a new GNN trained on static and relaxation trajectories from the Materials Project that shows quite competitive performance compared to prior methods.
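<p>The stability-classification view behind these benchmarks can be made concrete: a structure is typically called stable when its energy above the convex hull is at or below a threshold, and a regression error only flips the label when it exceeds the structure's distance to the hull and points toward the threshold. A minimal sketch (the function name, threshold choice, and toy numbers are illustrative, not taken from MatBench Discovery's code):</p>

```python
import numpy as np

STABILITY_THRESHOLD = 0.0  # eV/atom above the convex hull

def classify_stability(e_hull_true, e_hull_pred, threshold=STABILITY_THRESHOLD):
    """Confusion counts for hull-distance-based stability screening.

    A structure counts as stable when its energy above the convex
    hull is at or below the threshold; predictions are compared to
    ground-truth hull distances to get TP/FP/FN/TN counts.
    """
    true_stable = e_hull_true <= threshold
    pred_stable = e_hull_pred <= threshold
    tp = int(np.sum(true_stable & pred_stable))
    fp = int(np.sum(~true_stable & pred_stable))
    fn = int(np.sum(true_stable & ~pred_stable))
    tn = int(np.sum(~true_stable & ~pred_stable))
    return tp, fp, fn, tn

# toy hull distances (eV/atom): errors on the 2nd and 4th entries are
# larger than those structures' distances to the hull, flipping labels
true_hull = np.array([-0.05, 0.02, 0.30, -0.20])
pred_hull = np.array([-0.01, -0.01, 0.25, 0.10])
tp, fp, fn, tn = classify_stability(true_hull, pred_hull)
print(tp, fp, fn, tn)  # 1 1 1 1
```

<p>This is why a small mean absolute error is not the whole story for discovery: what matters is how often the error crosses the hull, which is exactly the regime the 'triangle of peril' in the figure highlights.</p>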
The development of CHGNet suggests that finding better training objectives will be as important as (if not more important than) the development of new methods as the intersection of materials science and geometric deep learning continues to grow.</p><p>🔥 The other proof points of the further integration of Geometric DL and materials discovery are several massive works by big labs focused on crystal structure discovery with generative methods:</p><p>1️⃣ Google DeepMind released <a href="https://deepmind.google/discover/blog/millions-of-new-materials-discovered-with-deep-learning/"><strong>GNoME</strong></a> (Graph Networks for Materials Exploration by <a href="https://www.nature.com/articles/s41586-023-06735-9">Merchant et al</a>) as a successful example of an active learning pipeline for discovering new materials, and <a href="https://unified-materials.github.io/unimat/">UniMat</a> as an<em> ab initio</em> crystal generation model. Similar to the protein world, we see more examples of automated labs for materials science (“lab-in-the-loop”) such as the <a href="https://www.nature.com/articles/s41586-023-06734-w">A-Lab from UC Berkeley</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*AqWfkbgEvL_t02xy" /><figcaption>The active learning loop of GNoME. Source: <a href="https://www.nature.com/articles/s41586-023-06735-9">Merchant et al.</a></figcaption></figure><p>2️⃣ Microsoft Research released <a href="https://www.microsoft.com/en-us/research/blog/mattergen-property-guided-materials-design/">MatterGen</a>, a generative model for unconditional and property-guided materials design, and <a href="https://distributionalgraphormer.github.io/">Distributional Graphormer</a>, a generative model trained to recover the equilibrium energy distribution of a molecule/protein/crystal.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7Pq4uOFCOICQHbEg" /><figcaption>Unconditional and conditional generation of MatterGen.
Source: <a href="https://arxiv.org/abs/2312.03687">Zeni, Pinsler, Zügner, Fowler, Horton, et al.</a></figcaption></figure><p>3️⃣ Meta AI and CMU released the<a href="https://open-catalyst.metademolab.com/"> Open Catalyst Demo</a> where you can play around with relaxations (DFT approximations) of 11.5k catalyst materials on 86 adsorbates in 100 different configurations each (making it up to 100M combinations). The demo is powered by SOTA geometric models GemNet-OC and Equiformer-V2.</p><p><strong><em>Santiago Miret (Intel)</em></strong></p><p>While those works represent large-scale deployments of generative methods, there is also new work on using reinforcement learning (<a href="https://openreview.net/forum?id=VbjD8w2ctG">Govindarajan et al.</a>, <a href="https://openreview.net/forum?id=MNfVMjsL7S">Lacombe et al.</a>) and GFlowNets (<a href="https://openreview.net/forum?id=l167FjdPOv">Mistal et al.</a>, <a href="https://openreview.net/forum?id=dJuDv4MKLE">Nguyen et al.</a>) with geometric DL for crystal structure discovery as highlighted in the <a href="https://sites.google.com/view/ai4mat">AI for Accelerated Materials Design (AI4Mat)</a> workshop at NeurIPS’23. AI4Mat-2023 itself saw rapid expansion in participation with a 2× increase in the number of submitted and accepted papers and almost tripling in the number of attendees.</p><p>💡 Geometric DL and GNNs continue to be a major part of AI4Mat’s research content as we saw increased application of methods not only for property prediction but also for improving <strong>chemical synthesis</strong> and <strong>material characterization</strong>. 
One such promising example highlighted in the AI4Mat-2023 workshop is <strong>KREED</strong> (<a href="https://openreview.net/forum?id=jlZrTCccAb">Cheng, Lo, et al</a>), which uses equivariant diffusion to predict 3D structures of molecules based on incomplete information that can be obtained from real laboratory machines.</p><blockquote>“Given the importance of structural data in material characterization, the discussions at AI4Mat highlighted the opportunities for Geometric DL to enter the space of real-world materials modelling in addition to their continued successes in simulations including ML-based potentials.” — Santiago Miret (Intel)</blockquote><p>🔮 In 2024, I expect to see multiple developments:</p><p>1️⃣ More discovery architectures and workflows that directly integrate geometric models such as M3GNet, CHGNet, and MACE.</p><p>2️⃣ Geometric models might also see increased competition from text-based representations and LLMs as <a href="https://openreview.net/forum?id=0r5DE2ZSwJ">new methods are being proposed</a> that directly generate CIF files.</p><p>3️⃣ More deployment of geometric models and GNNs into real-world experimental data, likely in materials characterization such as KREED, which will run into regimes with less data than simulation-based modeling.</p><h3>Molecular Dynamics &amp; ML Potentials</h3><p><em>Michael Galkin (Intel), Leon Klein (FU Berlin), N M Anoop Krishnan (IIT Delhi), Santiago Miret (Intel)</em></p><blockquote>One of the pronounced trends of 2023 is the move towards foundation models for ML potentials that work on a variety of compounds, from small molecules to periodic crystals</blockquote><p>For example, <strong>JMP</strong> (<a href="https://arxiv.org/abs/2310.16802">Shoghi et al</a>) from FAIR and CMU, <strong>DPA-2</strong> (<a href="https://arxiv.org/abs/2312.15492">Zhang, Liu, et al</a>) from a large collaboration of Chinese institutions, and <strong>MACE-MP-0</strong> (<a href="https://arxiv.org/abs/2401.00096">Batatia
et al</a>) from a collaboration led by Cambridge. Practically, those are geometric GNNs pre-trained in a multi-task mode to predict the energy (or forces) of a certain atomic structure. Another notable mention goes to <strong>Equiformer V2</strong> (<a href="https://arxiv.org/abs/2306.12059">Liao et al</a>) as a strong equivariant transformer that holds SOTA in many tasks including the recent <a href="https://opencatalystproject.org/challenge.html">OpenCatalyst 2023 Challenge</a> and <a href="https://open-dac.github.io/index.html">OpenDAC</a> (Direct Air Capture) challenge.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*pbsN3z6DZbJuFfKc5-eTrg.png" /><figcaption>A foundation model for materials modelling. Trained only on Materials Project data which consists primarily of inorganic crystals and is skewed heavily towards oxides, MACE-MP-0 is capable of molecular dynamics simulation across a wide variety of chemistries in the solid, liquid and gaseous phases. Source: <a href="https://arxiv.org/abs/2401.00096">Batatia et al</a></figcaption></figure><p>⚛️ A common use case for ML potentials is molecular dynamics (MD), which aims to simulate a certain structure over a span of nanoseconds (10⁻⁹ s) to seconds. The main problem is that the fundamental timestep in classical methods is a femtosecond (10⁻¹⁵ s), that is, you’d need at least 1 million steps to simulate a nanosecond, and that’s expensive. Modern ML-based methods for MD aim to speed it up by applying coarse-graining and other approximation tricks that accelerate simulations by large margins (30–1000x). <a href="https://openreview.net/forum?id=y8RZoPjEUl">Fu, Xie, et al</a> (TMLR’23) apply coarse-graining to atomic structures and run a GNN over smaller graphs to predict the next-step position. Experimentally, the method brings 1,000–10,000x speedups compared to classical methods.
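</p><p>Conceptually, all of these potentials map atomic positions to a scalar energy whose negative gradient supplies the forces for each MD step. A minimal pure-Python sketch with a made-up pairwise Gaussian energy standing in for the pre-trained GNN (every function and parameter here is hypothetical, not taken from JMP, DPA-2, or MACE):</p>

```python
import math

# Toy stand-in for a learned potential: a pairwise Gaussian energy.
# A real ML potential replaces this with a pre-trained geometric GNN.
def energy(positions, a=1.0, sigma=1.0):
    e = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r2 = sum((positions[i][k] - positions[j][k]) ** 2 for k in range(3))
            e += a * math.exp(-r2 / (2 * sigma ** 2))
    return e

# Forces are the negative gradient of the predicted energy; central finite
# differences here, automatic differentiation in practice.
def forces(positions, eps=1e-5):
    f = []
    for i in range(len(positions)):
        row = []
        for k in range(3):
            plus = [list(p) for p in positions]
            minus = [list(p) for p in positions]
            plus[i][k] += eps
            minus[i][k] -= eps
            row.append(-(energy(plus) - energy(minus)) / (2 * eps))
        f.append(row)
    return f

# One velocity-Verlet MD step (unit masses, arbitrary units).
def verlet_step(pos, vel, dt=0.01):
    f0 = forces(pos)
    new_pos = [[p[k] + dt * v[k] + 0.5 * dt * dt * f0[i][k] for k in range(3)]
               for i, (p, v) in enumerate(zip(pos, vel))]
    f1 = forces(new_pos)
    new_vel = [[v[k] + 0.5 * dt * (f0[i][k] + f1[i][k]) for k in range(3)]
               for i, v in enumerate(vel)]
    return new_pos, new_vel
```

<p>Swapping the toy <code>energy</code> for a pre-trained model is what turns such a loop into ML-accelerated MD; the million-step cost above is exactly why learned potentials and coarse-graining matter.</p><p>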
<strong>Timewarp</strong> (<a href="https://arxiv.org/abs/2302.01170">Klein, Foong, Fjelde, Mlodozeniec, et al</a>, NeurIPS’23) can simulate large timesteps (10⁵–10⁶ femtoseconds) in a single forward pass by using a conditional normalizing flow model that approximates a distribution of next-step positions. A trained model is used with MCMC sampling and delivers ~33x speedups.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*KfzwdxDqrRUSMK5BIeKmqA.png" /><figcaption>(a) Initial state x(t) (Left) and accepted proposal state x(t+τ) (Right) sampled with Timewarp for the dipeptide HT (unseen during training). (b) TICA projections of simulation trajectories, showing transitions between metastable states, for a short MD simulation (Left) and Timewarp MCMC (Right), both run for 30 minutes of wall-clock time. Timewarp MCMC achieves a speed-up factor of ≈ 33 over MD in terms of effective sample size per second. Source: <a href="https://arxiv.org/abs/2302.01170">Klein, Foong, Fjelde, Mlodozeniec, et al</a></figcaption></figure><p><strong><em>Santiago Miret (Intel)</em></strong></p><p>💡 As the deployment of geometric models has seen greater success in property modelling, researchers have pushed the state-of-the-art by testing these models in real-world molecular dynamics simulations. The first work to highlight issues with training models on energy and forces alone was <a href="https://openreview.net/forum?id=A8pqQipwkt">Forces Are Not Enough</a>, published in TMLR in early 2023.
Nevertheless, advances in neighborhood-based methods such as <a href="https://arxiv.org/abs/2204.05249">Allegro</a> led to the successful deployment of large-scale simulations using geometric deep learning models, including a <a href="https://www.hpcwire.com/off-the-wire/sc23-spotlight-gordon-bell-prize-2023-finalists-showcase-diverse-supercomputing-applications/">nomination for the Gordon Bell Prize</a>.</p><blockquote>“Much work still remains in ensuring successful, generalised deployment of machine learning potentials across a variety of physical and chemical phenomena.” — Santiago Miret (Intel)</blockquote><p>➡️ <a href="https://arxiv.org/abs/2310.02428">EGraFFBench</a> highlights some new challenges, such as generalisation across temperatures and materials phase changes (e.g. <em>solid-to-liquid</em>), and proposes new metrics for evaluating the performance of machine learning potentials in real MD simulations. The AI4Mat-2023 workshop also showcased the development of new ML potentials for specialised use cases, such as <a href="https://openreview.net/forum?id=jtAXitX6dh">solid electrolytes for batteries</a>.</p><p><strong><em>Leon Klein (FU Berlin)</em></strong></p><p>💡 A notable constraint in the application of generative models to sample from the equilibrium Boltzmann distribution was the requirement for retraining with each new system, thereby limiting potential advantages over traditional MD simulations. However, recent advancements have seen the emergence of transferable models across various domains. Our contribution, <a href="https://arxiv.org/abs/2302.01170">Timewarp</a>, presents a transferable model capable of proposing large time steps for MD simulations focused on all-atom small-peptide systems.
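</p><p>Timewarp’s accept/reject scheme is, at its core, a Metropolis–Hastings step over the flow’s proposals. A toy sketch on a 1-D double-well energy (the target, proposal, and all constants here are made up; the real method additionally corrects for the asymmetric forward/backward proposal densities of the conditional flow):</p>

```python
import math
import random

random.seed(0)

def energy(x):
    # Toy double-well potential standing in for the molecular energy surface.
    return (x * x - 1.0) ** 2

def propose(x, step=0.8):
    # Stand-in for the conditional flow: a symmetric Gaussian jump,
    # so the Hastings proposal-density correction cancels out.
    return x + random.gauss(0.0, step)

def mh_step(x, kT=0.3):
    # Metropolis acceptance on the Boltzmann factor exp(-dE / kT).
    x_new = propose(x)
    log_alpha = -(energy(x_new) - energy(x)) / kT
    if math.log(random.random()) < log_alpha:
        return x_new, True
    return x, False

def sample(n=20000, x0=1.0):
    xs, x = [], x0
    for _ in range(n):
        x, _ = mh_step(x)
        xs.append(x)
    return xs
```

<p>Because every proposal is filtered through this acceptance test, the chain targets the exact Boltzmann distribution even when the learned proposal is imperfect; that is what makes large learned timesteps safe.</p><p>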
Similarly, <a href="https://arxiv.org/abs/2204.10348">Fu et al.</a> capture the time-coarsened dynamics of coarse-grained polymers, while <a href="https://arxiv.org/abs/2310.18278">Charron et al.</a> excel in learning a transferable force field for coarse-grained proteins.</p><blockquote>“Consequently, this year has demonstrated the feasibility of transferable generative models for MD simulations, showcasing their potential to speed up such simulations.” — Leon Klein (FU Berlin)</blockquote><p>🔮 In 2024, I expect that more tailored GNNs will be used to improve accuracy for the transferable models, with a potential focus on encoding more information about the system. For example, Timewarp, while lacking rotational symmetry in its model, employs data augmentation. Alternatively, rotational symmetry could be incorporated using the recently proposed <a href="https://arxiv.org/abs/2308.10364">SE(3) Equivariant Augmented Coupling Flows</a>. Similarly, <a href="https://arxiv.org/abs/2310.18278">Charron et al.</a> use a SchNet instead of a more complex GNN.</p><p><strong><em>N M Anoop Krishnan (IIT Delhi)</em></strong></p><blockquote>“One of the most exciting developments for the year in the realm of ML potentials is the development of “universal” interatomic potentials that can span almost all the elements of the periodic table.” — N M Anoop Krishnan (IIT Delhi)</blockquote><p>💡 Following M3GNet in 2022, this year witnessed the development of three such models based on CHGNet (<a href="https://www.nature.com/articles/s42256-023-00716-3">Deng et al</a>), NequIP (<a href="https://www.nature.com/articles/s41586-023-06735-9">Merchant et al</a>), and MACE (<a href="https://arxiv.org/abs/2401.00096">Batatia et al</a>).
These models have been used to demonstrate several challenging tasks, including materials discovery (<a href="https://www.nature.com/articles/s41586-023-06735-9">Merchant et al</a>) and a diverse set of MD simulations (<a href="https://arxiv.org/abs/2401.00096">Batatia et al</a>) such as phase transitions, amorphization, chemical reactions, 2D materials modeling, dissolution, defects, and combustion, to name a few. These approaches provide promising results towards the universality of these potentials, thereby allowing one to attack challenging problems including the discovery of crystals from their corresponding amorphous structure (<a href="https://arxiv.org/abs/2310.01117">Aykol et al</a>), a long-standing open problem in materials science.</p><p>🏋️ While these potentials do provide a handle to attack some outstanding problems, challenges remain in understanding the scenarios where these potentials can fail.</p><p><strong>1️⃣ </strong>Testing these potentials to their limits is an important aspect of understanding their capabilities and limitations. Modeling extreme environments such as <strong>high pressure</strong> and <strong>radiation conditions</strong>, simulating complex multicomponent systems such as <strong>glasses or high-entropy alloys</strong>, or simulating <strong>different phases</strong> of systems such as water or silica would all be interesting challenges.</p><p><strong>2️⃣ </strong>While some of these models have been termed “foundation” models, <strong>emergent behavior</strong> associated with FMs <strong>has not been demonstrated</strong> by them. Most of these models simply show extrapolation capability to potentially unseen regions in the phase space or to novel compositions. Developing truly foundational models in terms of emergent properties would be an interesting challenge.</p><p><strong>3️⃣ </strong>A third aspect that has received less attention is the ability of these models to <strong>simulate at scale</strong>.
While <a href="https://arxiv.org/abs/2204.05249">Allegro</a> has demonstrated some capability in terms of the length scales these potentials can achieve, simulating at larger time and length scales with stability, while respecting “universality”, remains an open challenge for these potentials.</p><p>🔮 <strong>What to expect in 2024?</strong></p><p><strong>1️⃣</strong> <strong>Benchmarking suite</strong>: While there exist several benchmarking studies on MD simulations, it is expected that 2024 will witness more formalized efforts in this direction, both in terms of datasets and tasks. A standard set of tasks that can automatically evaluate potentials and place them on leaderboards will enable easy ranking of potentials targeted for downstream tasks on different materials such as metals, polymers, or oxides.</p><p><strong>2️⃣ Model and dataset development</strong>: Further efforts will be made to make ML potentials more compact and efficient in terms of their architectures. Moreover, 2024 will also witness large-scale dataset development that will provide <em>ab initio</em> data for training these potentials.</p><p><strong>3️⃣ Differentiable MD/AIMD</strong>: Further, it is expected that developments in differentiable simulations will become a major avenue for fusing experiments and <em>ab initio</em> simulations towards the automated development of interatomic potentials for targeted applications. This year may also see advances in differentiable AIMD with machine-learned functionals that may allow economical simulations to scale beyond what has been achievable thus far.</p><p><strong>Predictions from the 2023 post</strong></p><p>We expect to see a lot more focus on computational efficiency and scalability of GNNs.
Current GNN-based force-fields are obtaining remarkable accuracy, but are still 2–3 orders of magnitude slower than classical force-fields and are typically only deployed on a few hundred atoms.</p><p>✅ Allegro for the Gordon Bell Prize, large-scale screening with GNoME</p><p>🔮 <strong>What to expect in 2024</strong>:</p><p><strong>1️⃣ </strong>More deployment of ML potentials into large-scale MD simulations that showcase new research opportunities and challenges and provide a better idea of what benefits ML potentials provide compared to traditional potentials.</p><p><strong>2️⃣ </strong>New datasets that outline previously unexplored challenges for ML potentials, such as new materials systems and new physical phenomena for those materials, such as phase changes at various temperatures and pressures.</p><p><strong>3️⃣ </strong>Exploration of multi-scale problems that might draw inspiration from classical techniques.</p><h3>Geometric Generative Models (Manifolds)</h3><p><em>Joey Bose (Mila &amp; Dreamfold) and Alex Tong (Mila &amp; Dreamfold)</em></p><p>While generative ML continued to dominate the field in 2023, an interesting trend of the year was the popularization of geometric generative models that incorporate geometric priors.</p><p><strong><em>Joey Bose (Mila &amp; Dreamfold)</em></strong></p><blockquote>“This year we saw the burgeoning subfield of geometric generative models really take a commanding step forward. With the success of diffusion models and flow matching in images, we saw more fundamental contributions to enable Generative AI for geometric data types.” — Joey Bose (Mila &amp; Dreamfold)</blockquote><p>While diffusion models for manifolds existed before, this year we really saw them being scaled up with <strong>Scaling Riemannian Diffusion Models</strong> by <a href="https://scholar.google.com/citations?view_op=view_citation&amp;hl=en&amp;user=54-actIAAAAJ&amp;sortby=pubdate&amp;citation_for_view=54-actIAAAAJ:_FxGoFyzp5QC">Lou et
al</a> and functional approaches in <strong>Manifold Diffusion Fields</strong> (<a href="https://arxiv.org/abs/2305.15586">Elhag et al.</a>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*SRn5B9QlL59Gy86O" /><figcaption>(Left) Visual depiction of a training iteration for a field on the bunny manifold M. (Right) Visual depiction of the sampling process for a field on the bunny manifold. Figure source: <a href="https://arxiv.org/abs/2305.15586">Elhag et al.</a></figcaption></figure><p>For normalizing flow-based methods, <strong>Riemannian Flow Matching</strong> by <a href="https://arxiv.org/abs/2302.03660">Chen and Lipman</a> stands out from the sea of papers as the most general framework for flow matching.</p><p>In general, a large theme of geometric generative models involves handling symmetries. Equivariant approaches shone this year, from SE(3) models including <strong>EDGI</strong> (<a href="https://arxiv.org/abs/2303.12410">Brehmer, Bose et al</a>) and <strong>SE(3) augmented coupling flows</strong> (<a href="https://arxiv.org/abs/2308.10364">Midgley et al</a>), to cool theoretical work on <strong>Geometric neural diffusion processes</strong> (<a href="https://arxiv.org/abs/2307.05431">Mathieu et al</a>) and important physics-based applications in the paper by <a href="https://arxiv.org/abs/2305.02402">Abbott et al</a>.</p><p><strong><em>Alex Tong (Mila &amp; Dreamfold)</em></strong></p><blockquote>“In 2023 we saw advancement both in terms of modelling and the rise of a new application — protein backbone design.
Much work is still needed to understand the properties of the SE(3)<em>ᴺ</em>₀ type of product manifold, where it is still unclear how to best combine modalities” — Alex Tong (Mila &amp; Dreamfold)</blockquote><p>2023 saw new models such as <a href="https://www.biorxiv.org/content/10.1101/2022.12.09.519842v1">RFDiffusion</a>, <a href="https://arxiv.org/abs/2302.02277">FrameDiff</a>, and <a href="https://arxiv.org/abs/2310.02391">FoldFlow</a>, which operate over the SE(3)<em>ᴺ</em>₀ manifold of protein backbones. This presents a new challenge for geometric generative models, in which I think we will see significant progress in the coming year.</p><p>On the modelling side, generative modelling with flow and bridge matching models in Euclidean domains led to a quick succession of Riemannian and equivariant extensions, with Riemannian Flow Matching by <a href="https://arxiv.org/abs/2302.03660">Chen and Lipman</a> and Equivariant Flow Matching (<a href="https://arxiv.org/abs/2306.15030">Klein et al.</a>, <a href="https://arxiv.org/abs/2312.07168">Song et al.</a>) on molecule generation tasks.</p><p>🔮 <strong>What to expect in 2024</strong>:</p><p><strong>1️⃣ </strong>More exploration into modelling the SE(3)<em>ᴺ</em>₀ manifold following successes in protein backbone design.</p><p><strong>2️⃣ </strong>Further investigation and theory of how to train generative models on multimodal and product manifolds.</p><p><strong>3️⃣ </strong>Domain-specific models exploiting features of more specific manifold and equivariant structures.</p><h3>BIG Graphs, Scalability: When GNNs are too expensive</h3><p><strong><em>Anton Tsitsulin (Google)</em></strong></p><p>This year has been fruitful for large graph fans.</p><blockquote>“Learning on Very Large Graphs has always been a challenge due to the unstructured sparsity not being supported by modern accelerators, losing in the <a href="https://hardwarelottery.github.io/">hardware lottery</a>.
<a href="https://cloud.google.com/blog/topics/systems/tpu-v4-enables-performance-energy-and-co2e-efficiency-gains">Tensor Processing Units</a> — you can think about them as very fast GPUs with tons (multi-terabyte) of HBM memory — were the rescue of 2023.” — <strong>Anton Tsitsulin</strong> (Google)</blockquote><p>In a KDD paper (<a href="https://arxiv.org/abs/2307.14490">Mayer et al.</a>), we showed that TPUs can solve large-scale node embedding problems more efficiently than GPU and CPU systems at a fraction of the cost. Many industrial applications of graph machine learning are fully unsupervised; there, it is hard to evaluate embedding quality. We wrote a paper (<a href="https://arxiv.org/abs/2305.16562">Tsitsulin et al.</a>) that performs <strong>unsupervised embedding analysis</strong> at scale.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*d9VT0pu8UHL5gpeE" /><figcaption>Scale of TpuGraphs compared to other graph property prediction datasets. Source: <a href="https://arxiv.org/abs/2308.13490">Phothilimthana et al.</a></figcaption></figure><p>➡️ This year, TPUs helped graph machine learning, so it was time to give back. We released a new <strong>TpuGraphs</strong> dataset (<a href="https://arxiv.org/abs/2308.13490">Phothilimthana et al.</a>) and ran a <a href="https://www.kaggle.com/competitions/predict-ai-model-runtime">Kaggle competition</a> “Google — Fast or Slow? Predict AI Model Runtime” on it that showed <a href="https://blog.research.google/2023/12/advancements-in-machine-learning-for.html">how to improve</a> learning models running on TPUs with graph machine learning. It had 792 Competitors, 616 Teams, and 10,507 Entries. The dataset provides 25x more graphs than the largest graph property prediction dataset (with comparable graph sizes), and 770x larger graphs on average compared to existing performance prediction datasets on machine learning programs. 
This dataset is so large that a new algorithm for graph-level predictions on large-scale graphs had to be developed by <a href="https://arxiv.org/abs/2305.12322">Cao et al</a>.</p><p>➡️ Large-scale graph clustering has seen significant contributions this year. A new approximation algorithm (<a href="https://arxiv.org/abs/2309.17243">Cohen-Addad et al.</a>) was proposed for correlation clustering, improving the approximation factor from 1.994 to a whopping 1.73. <strong>TeraHAC</strong> (<a href="https://arxiv.org/abs/2308.03578">Dhulipala et al</a>) is a major improvement over last year’s <strong>ParHAC</strong> (which we covered in the <a href="https://medium.com/towards-data-science/graph-ml-in-2023-the-state-of-affairs-1ba920cb9232#ca19">2023 post</a>) — an approximate (1+𝝐) hierarchical agglomerative clustering algorithm for trillion-edge graphs. The largest graph used in the experiments is a massive Web-Query graph with 31B nodes and 8.6 trillion edges 👀. Notable mentions also go to the fastest (to date) algorithm for the Euclidean minimum spanning tree (<a href="https://arxiv.org/abs/2308.00503">Jayaram et al</a>) and a new near-linear time algorithm for approximating the Chamfer distance between point sets (<a href="https://arxiv.org/abs/2307.03043">Bakshi et al.</a>).</p><p>🔮 <strong>What to expect in 2024</strong>:</p><p><strong>1️⃣ </strong>Algorithmic advances will help scale other popular graph algorithms.</p><p><strong>2️⃣ </strong>Novel hardware usage will help scale up different graph models.</p><p><strong>Predictions from the 2023 post</strong></p><p>(1) further reduction in compute costs and inference time for very large graphs<br>✅ We observed order-of-magnitude speedups in clustering and node embedding.</p><p>(2) Perhaps models for OGB LSC graphs could run on commodity machines instead of huge clusters?<br>❌ solid no</p><h3>Algorithmic Reasoning &amp; Alignment</h3><p><em>Petar Veličković (Google DeepMind) and Liudmila Prokhorenkova (Yandex
Research)</em></p><p>Algorithmic reasoning, a class of ML techniques able to execute algorithmic computation, has continued to make stable progress during 2023.</p><p><strong><em>Petar Veličković (Google DeepMind)</em></strong></p><blockquote>“2023 has been a year of steady progress for neural algorithmic reasoning models — it indeed remains one of the areas where GNN development gets most creative — probably because it has to be.” — <strong>Petar Veličković </strong>(Google DeepMind)</blockquote><p>Aside from the already discussed <a href="https://openreview.net/forum?id=ba4bbZ4KoF">asynchronous algorithmic alignment</a> work, there are three results we achieved this year that I am personally proudest of:</p><p>1️⃣ <a href="https://openreview.net/forum?id=tRP0Ydz5nN">DAR</a> showed that pre-trained multi-task neural algorithmic reasoners can be scalably deployed to downstream graph problems — even if they are 180,000x larger than the synthetic training distribution of the NAR. What’s more, we set the state-of-the-art in modelling mouse brain vessels 🐁🧠🩸. NAR is <strong>not</strong> a victim of the bitter lesson! 📈</p><p>2️⃣ <a href="https://openreview.net/forum?id=kP2p67F4G7">Hint-ReLIC</a> 🗿 was our response to the rich body of research in <a href="https://openreview.net/forum?id=xkrtvHlp3P">no-hint models</a>. We move away from the issue-ridden <em>hint</em> <em>autoregression</em> and instead model <em>hint invariants</em> using causal reasoning. We obtain a potent hint-based NAR, which still holds state-of-the-art on broad patches of CLRS-30! <em>“Hints can take you a long way, if used in the right way.”</em></p><p>3️⃣ Last but not least, we took the plunge and made the first in-depth analysis of the <a href="https://openreview.net/forum?id=tRP0Ydz5nN">latent space representations of trained NAR models</a>.
What we found was not only immensely beautiful to look at 🌺 but it also taught us a great deal about how these models work.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*c8vWpt6GPQmDvYVvPE_11w.png" /><figcaption>Left: Trajectory-wise PCA of eight clusters of reweighted graphs showing that they all contain a single dominant direction. Different clusters have different colors. Middle: Many embedding clusters with dominant directions overlaid in red. Right: Step-wise PCA of random graphs with the dominant cluster directions overlaid in red. Source: <a href="https://openreview.net/forum?id=tRP0Ydz5nN">Mirjanić, Pascanu, Veličković</a></figcaption></figure><p>Beyond growing our vibrant community, I find it important to state that many of NAR’s foundational ideas are at the crux of important LLM methodologies; to name just one example, hint following is directly related to <a href="https://arxiv.org/abs/2201.11903">chain-of-thought</a> prompting.</p><p>💡 What I am most happy about is that in 2023, this link is getting explicit recognition, and ideas from NAR are now directly or indirectly influencing the most potent AI systems in use today. Indeed, NAR is listed as a key motivation for studying <a href="https://arxiv.org/abs/2310.16028">length generalisation</a>, and more broadly <a href="https://arxiv.org/abs/2301.13105">generalisation on the unseen</a><em> (ICML’23 Best Paper Award)</em>. CLRS-30, the flagship NAR benchmark, is directly used to evaluate capabilities of LLMs in <a href="https://arxiv.org/abs/2302.14838">neural architecture search</a> and <a href="https://arxiv.org/abs/2310.03302">general AI research</a>. And, as a final cherry on top, CLRS-30 is recognised as one of only seven reasoning evaluations used by <a href="https://arxiv.org/abs/2312.11805">Gemini</a>, a frontier large language model from Google DeepMind. 
I am hopeful that this is a beacon of things to come in 2024, and that we will see even more ideas from NAR break into the design of frontier scalable AI models.</p><p><strong><em>Liudmila Prokhorenkova (Yandex Research)</em></strong></p><p>Throughout the year, substantial progress has been achieved on the path towards endowing models with various algorithmic inductive biases: the use of dual problems (<a href="https://arxiv.org/abs/2302.04496">Numeroso et al</a>), contrastive learning techniques (<a href="https://arxiv.org/abs/2302.10258">Bevilacqua et al</a>; <a href="https://arxiv.org/abs/2306.13411">Rodionov et al</a>), augmentation of models with data structures (<a href="https://arxiv.org/abs/2307.00337">Jürß et al</a>; <a href="https://arxiv.org/abs/2307.09660">Jain et al</a>), and in-depth examination of computational models (<a href="https://arxiv.org/abs/2307.04049">Engelmayer et al</a>). Another important direction is evaluating existing models in terms of scalability and data diversity (<a href="https://arxiv.org/abs/2309.12253">Minder et al</a>).</p><blockquote>“In 2024 it would be great to see more comprehensive analysis and understanding of neural reasoners: which operations they learn, how sensitive they are to different shifts in data distributions, what types of mistakes they tend to make and why.” — <strong>Liudmila Prokhorenkova </strong>(Yandex Research)</blockquote><p>Gaining such insights may contribute to the development of even more robust and scalable models.
Furthermore, robust neural reasoners have the potential to positively impact combinatorial optimization models.</p><p><strong>Predictions from the 2023 post</strong></p><p>(1) Algorithmic reasoning tasks are likely to scale to graphs of thousands of nodes and practical applications like code analysis or databases<br>✅ yes, <a href="https://openreview.net/forum?id=tRP0Ydz5nN">DAR</a> scales to the OGB vessel size</p><p>(2) even more algorithms in the benchmark<br>✅ yes, <a href="https://arxiv.org/abs/2309.12253">SALSA-CLRS</a></p><p>(3) most unlikely — there will appear a model capable of solving quickselect<br>❌ still unsolved ;(</p><h3>Knowledge Graphs: Inductive Reasoning is Solved?</h3><p><em>Michael Galkin (Intel) and Zhaocheng Zhu (Mila &amp; Google)</em></p><p>Since its inception in 2011, the grand challenge of KG representation learning has been truly inductive reasoning: a <strong>single</strong> model able to run inference (e.g., missing link prediction) on any graph, without input features and without learning hard-coded entity/relation embedding matrices. <a href="https://arxiv.org/abs/1911.06962">GraIL</a> (ICML’20) and <a href="https://arxiv.org/abs/2106.06935">Neural Bellman-Ford Nets</a> (NeurIPS’21) were instrumental in extending inference to unseen entities, but generalization to both new entities and relation types at inference time remained an unsolved challenge due to the main question: what can be learned and transferred when the whole entity/relation vocabulary can change?</p><p>🔮 Our prediction for 2023 (an inductive model fully transferable to different KGs with new sets of entities and relations, e.g., training on Wikidata and running inference on DBpedia or Freebase) came true in several works:</p><ul><li><a href="https://arxiv.org/abs/2302.01313">Gao et al</a> introduced the concept of double equivariance that forces the neural net to be equivariant to permutations of both node IDs and relation IDs.
The proposed ISDEA++ model employs a <a href="https://arxiv.org/abs/2110.02910">DSS-GNN</a>-like aggregation of a relation-induced subgraph and a subgraph induced by all other relation types.</li><li><a href="https://github.com/DeepGraphLearning/ULTRA">ULTRA</a>, introduced by <a href="https://arxiv.org/abs/2310.04562">Galkin et al</a>, learns the invariance of relation interactions (captured by a graph of relations) and transfers to absolutely any multi-relational graph. ULTRA achieves SOTA results on dozens of transductive and inductive datasets, even in the zero-shot inference setup. Besides, it enables a foundation model-like approach for KG reasoning with generic pre-training, zero-shot inference, and task-specific fine-tuning.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*I7ppzlhNllzuqFmj" /><figcaption>Three main steps taken by ULTRA: (1) building a relation graph; (2) running conditional message passing over the relation graph to get relative relation representations; (3) using those representations for an inductive link predictor GNN on the entity level. Source: <a href="https://arxiv.org/abs/2310.04562">Galkin et al</a></figcaption></figure><p>Learn more about inductive reasoning in the recent blog post:</p><p><a href="https://towardsdatascience.com/ultra-foundation-models-for-knowledge-graph-reasoning-9f8f4a0d7f09">ULTRA: Foundation Models for Knowledge Graph Reasoning</a></p><p>As the grand challenge seems to be solved now, is there anything left for KG research, or should we call it a day, throw a party, and move on?</p><p><strong><em>Michael Galkin (Intel)</em></strong></p><blockquote>“Indeed, with the grand challenge solved, it feels a bit like an existential crisis — everything important is invented, and Graph ML has enabled things that looked impossible just 5 years ago. Perhaps the KG community should re-invent itself and focus on practical problems that can be tackled with graph foundation models.
Otherwise, the subfield would disappear from research radars like Semantic Web” — Michael Galkin (Intel)</blockquote><p>Transductive and shallow KG embeddings are dead, and nobody in 2024 should work on them; it is time to retire them for good. ULTRA-like foundation models can now run on any graph without training, which is a sweet spot for many closed enterprise KGs.</p><p>➡️ The last uncharted territory is inductive reasoning beyond simple link prediction (<a href="https://medium.com/towards-data-science/neural-graph-databases-cc35c9e1d04f">complex database-like logical queries</a>), and I think it will also be solved in 2024. Adding temporal aspects, LLM node features, or scaling GNNs to larger graphs is a question of time and presents more of an engineering task than a research question.</p><p><strong><em>Zhaocheng Zhu (Mila &amp; Google)</em></strong></p><blockquote>“With the rise of LLMs and numerous prompt-based reasoning techniques, it looks like <strong>KG reasoning is coming to an end</strong>. Texts are more expressive and flexible than KGs, and meanwhile they are more available in quantity. However, I don’t think the reasoning techniques that the KG community developed are in vain.” — Zhaocheng Zhu (Mila &amp; Google)</blockquote><p>➡️ We see that many LLM reasoning methods coincide with well-known ideas on KGs. For instance, the difference between direct prompting and chain-of-thought (CoT) shares much of its spirit with embedding methods and path-based methods on KGs, where the latter parameterize smaller steps and thereby generalize better to new combinations of steps. In fact, topics like inductive and multi-step generalization were explored on KGs several years earlier than on LLMs.</p><p>When we develop new techniques for LLMs, it is essential to take a glance at similar goals and solutions on KGs.
In brief, while the modality of KGs <em>may fade at some point</em>, the insights we learned from KG reasoning will continue to be illuminating in the era of LLMs.</p><h3>Temporal Graph Learning</h3><p>Shenyang Huang, Emanuele Rossi, Andrea Cini, Ingo Scholtes, and Michael Galkin prepared a separate overview post on temporal graph learning!</p><p><a href="https://towardsdatascience.com/temporal-graph-learning-in-2024-feaa9371b8e2">Temporal Graph Learning in 2024</a></p><h3>LLMs + Graphs for Scientific Discovery</h3><p><em>Michael Galkin (Intel)</em></p><p>💡LLMs were everywhere in 2023 and it’s hard to miss the 🐘 in the room.</p><blockquote>“We have seen a flurry of approaches trying to marry graphs with LLMs. The subfield is emerging and <strong>making its tiny baby steps</strong> which are important to acknowledge.” — Michael Galkin (Intel)</blockquote><p>We have seen a flurry of approaches trying to marry graphs with LLMs (sometimes literally verbalizing the edges in a text prompt). Straightforward prompting with an edge index does not really work for running graph algorithms with language models, so the crux is in the "text linearization" and proper prompting. Among the notable mentions, you might be interested in <strong>GraphText</strong> by <a href="https://arxiv.org/abs/2310.01089">Zhao et al</a>, which devises a <em>graph syntax tree</em> prompt constructed from features and labels in the ego-subgraph of a target node — GraphText works for node classification. In <strong>Talk Like a Graph</strong> by <a href="https://arxiv.org/abs/2310.04560">Fatemi et al</a>, the authors study graph linearization strategies and how they impact LLM performance on basic tasks like edge existence, node count, or cycle check.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fzM6yf61zDMpGqLE" /><figcaption>Standard GNNs (left) and GraphText (right). GraphText encodes the graph information into text sequences and uses an LLM to perform inference.
The graph-syntax tree contains both node attributes (e.g. feature and label) and relationships (e.g. center-node, 1st-hop, and 2nd-hop). Source: <a href="https://arxiv.org/abs/2310.01089">Zhao et al</a></figcaption></figure><p>➡️ Despite the early stage, three recent surveys already exist (<a href="https://arxiv.org/abs/2311.12399">Li et al</a>, <a href="https://arxiv.org/abs/2312.02783">Jin et al</a>, <a href="https://arxiv.org/abs/2311.16534">Sun et al</a>) covering dozens of prompting approaches for graphs. Generally, it is yet to be seen <strong>whether</strong> <strong>LLMs are an appropriate hammer</strong> 🔨 for a specific <em>graph</em> nail, given the limitations of autoregressive decoding, small context sizes, and the permutation-invariant nature of graph tasks. If you are broadly interested in LLM reasoning, check out <a href="https://towardsdatascience.com/solving-reasoning-problems-with-llms-in-2023-6643bdfd606d">our recent blog post</a> covering the main areas and progress made in 2023.</p><p>➡️ LLMs in applied scientific tasks exhibit more promising, sometimes quite unexpected results: <strong>ChemCrow</strong> 🐦‍⬛ by <a href="https://arxiv.org/abs/2304.05376">Bran, Cox, et al</a> is an LLM agent equipped with tools that can perform tasks in organic chemistry, synthesis, and material design right in natural language (without fancy equivariant GNNs).
For example, with a query “<em>Find and synthesize a thiourea organocatalyst which accelerates a Diels-Alder reaction</em>” ChemCrow devises a sequence of actions starting from a basic SMILES string and ending up with instructions to a synthesis platform.</p><p>Similarly, <a href="https://openreview.net/forum?id=0r5DE2ZSwJ">Gruver et al</a> fine-tuned LLaMA-2 to generate 3D crystal structures as a plain text file with lattice parameters, atomic composition, and 3D coordinates and it is surprisingly competitive with SOTA geometric diffusion models like CDVAE.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ruE6PRORCjm8roCFA5h8fw.png" /><figcaption>Experimental validation. a) Example of the script run by a user to initiate ChemCrow. b) Query and synthesis of a thiourea organocatalyst. c) The IBM Research RoboRXN synthesis platform on which the experiments were executed (pictures reprinted courtesy of International Business Machines Corporation). d) Experimentally validated compounds. Source: <a href="https://arxiv.org/abs/2304.05376">Bran, Cox, et al</a></figcaption></figure><p>🔮 In 2024, scientific applications of LLMs are likely to expand both breadth-wise and depth-wise:</p><p>1️⃣ Reaching out to more AI4Science areas;</p><p>2️⃣ Integration with geometric foundation models (since multi-modality is the main LLM focus for the coming year);</p><p>3️⃣ Hot take: LLMs will solve the <em>quickselect</em> task in the CLRS-30 benchmark before GNNs do 🔥</p><h3>Cool GNN Applications</h3><p><em>Petar Veličković (Google DeepMind)</em></p><p>In my standard deck motivating the use of GNNs to a broader audience, I rely on a usual “arsenal” slide of impactful GNN applications over the years. 
With 2023 being significantly marked by LLM developments, I was wondering — can I meaningfully update this slide, but only using models released this year?</p><blockquote>“It was the middle of the year back then, and already I was in for a nice surprise;<em> I did not have enough space to list all the awesome things done with GNNs!” — </em><strong>Petar Veličković </strong>(Google DeepMind)</blockquote><p>💡 While it might have gone comparatively under the radar, I confidently claim that 2023 was the <strong>most exciting year</strong> for cool GNN applications! The rise of LLMs just made it very clear where the limits of text-based autoregressive models are, and that for most scientific problems coming from Nature, their graph structure cannot be ignored.</p><p>Here’s a handful of my personal favourite landmark results — all published in top-tier venues:</p><ul><li><a href="https://www.science.org/doi/10.1126/science.adi2336">GraphCast</a> provided us a landmark model for medium-range global weather forecasting ⛈️ and with it, more accurate foreshadowing of extreme events such as hurricanes. A highly well-deserved cover of Science!</li><li>In an outstanding development in materials science, <a href="https://www.nature.com/articles/s41586-023-06735-9">GNoME</a> uses a GNN-based model to discover <em>millions </em>of novel crystal structures 💎 — an <em>“order-of-magnitude expansion in stable materials known to humanity”</em>. Published in Nature.</li><li>We’ve been treated to not just <a href="https://www.nature.com/articles/s41589-023-01349-8">one</a>, but <a href="https://www.nature.com/articles/s41586-023-06887-8">two</a> new breakthroughs in antibiotic discovery 💊 using message passing neural networks — the latter being published in Nature!</li><li><a href="https://www.science.org/doi/10.1126/science.ade4401">GNNs can smell</a> 👃 by observing the molecular structure emitting an odour — a result that may well revolutionise many industries, including perfumes! 
Published in Science.</li><li>On the cover of Nature Machine Intelligence, <a href="https://www.nature.com/articles/s42256-023-00684-8">HYFA</a> 🍄 shows how to use hypergraph factorisation to make significant progress in gene expression imputation 🧬!</li><li>Last but not least, particle physics ⚛️ remains a natural stronghold of GNN applications. In this year’s Nature Physics Review, we have been treated to a <a href="https://www.nature.com/articles/s42254-023-00569-0">fascinating survey</a> elucidating the myriad of ways how graph neural networks are deployed for various data analysis tasks at the Large Hadron Collider ⚡.</li></ul><p>⚽ My own humble contribution to the space of GNN applications this year was <a href="https://arxiv.org/abs/2310.10553">TacticAI</a>, the <em>first full AI system giving useful tactical suggestions to (association) football coaches</em>, developed in partnership with our collaborators at Liverpool FC 🔴. TacticAI is capable of both predictive modelling (<em>“what will happen in this tactical scenario?”</em>), retrieving similar tactics, and conditional generative modelling (<em>“how to modify player positions to make a particular outcome happen?”</em>). In my opinion, the most satisfying part of this very fun collaboration was our user study with some of LFC’s top coaching staff — directly illustrating that the outputs of our model will be of use to coaches in their work 🏃.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SjqNMqxXmASGyOnnG7HHtw.png" /><figcaption>A “bird’s eye” overview of TacticAI. (A), how corner kick situations are converted to a graph representation. Each player is treated as a node in a graph, with node, edge and graph features extracted as detailed in the main text. Then, a graph neural network operates over this graph by performing message passing; each node’s representation is updated using the messages sent to it from its neighbouring nodes. (B), how TacticAI processes a given corner kick. 
To ensure that TacticAI’s answers are robust in the face of horizontal or vertical reflections, all possible combinations of reflections are applied to the input corner, and these four views are then fed to the core TacticAI model, where they are able to interact with each other to compute the final player representations — each “internal blue arrow” corresponds to a single message passing layer from (A). Once player representations are computed, they can be used to predict the corner’s receiver, whether a shot has been taken, as well as assistive adjustments to player positions and velocities, which increase or decrease the probability of a shot being taken. Source: <a href="https://arxiv.org/abs/2310.10553">Wang, Veličković, Hennes et al.</a></figcaption></figure><p>This is what I’m all about — AI systems that significantly augment human abilities. I can only hope that, in my home country, Partizan catches on to these methods before Red Star does! 😅</p><p>🔮 What will we see in 2024? Probably more of the same, just accelerated! ⏩</p><h3>Geometric Wall Street Bulletin 💸</h3><p><em>Nathan Benaich (AirStreet Capital)</em><strong><em>, </em></strong><em>Michael Bronstein (Oxford), and Luca Naef (VantAI)</em></p><p>2023 started with BioNTech (mostly known to the broad public for developing mRNA SARS-CoV-2 vaccines) <a href="https://www.instadeep.com/2023/01/biontech-to-acquire-instadeep-to-strengthen-pioneering-position-in-the-field-of-ai-powered-drug-discovery-design-and-development/">announcing the acquisition of InstaDeep</a>, a decade-old British company focused on AI-powered drug discovery, design and development. In May 2023, Recursion <a href="https://ir.recursion.com/news-releases/news-release-details/recursion-enters-agreements-acquire-cyclica-and-valence-bolster">acquired two startups</a>, Cyclica and Valence “to bolster chemistry and generative AI capabilities”. 
Valence ML team is well-known for multiple works in the geometric and graph ML and hosting the <strong>Graphs &amp; Geometry and Molecular Modeling</strong> &amp; <strong>Drug Discovery seminars</strong> on <a href="https://www.youtube.com/@valence_labs">YouTube</a>.</p><p><a href="https://apps.timwhitlock.info/emoji/tables/unicode#emoji-modal">💰</a>Isomorphic Labs started 2024 by announcing small molecule-focused <a href="https://www.isomorphiclabs.com/articles/isomorphic-labs-kicks-off-2024-with-two-pharmaceutical-collaborations">collaborations</a> with Eli Lilly and Novartis with upfront payments of $45M and $37.5M, respectively, with the potential worth of <strong>$3 billion</strong>.</p><p><a href="https://apps.timwhitlock.info/emoji/tables/unicode#emoji-modal">💰</a><a href="https://www.businesswire.com/news/home/20240108659035/en/VantAI-Secures-Renewed-Support-from-Blueprint-Medicines-to-Chart-New-Frontiers-in-Induced-Proximity-Drug-Discovery">VantAI partnered with Blueprint Medicines</a> on innovative proximity modulating therapeutics, including molecular glue and hetero-bifunctional candidates. The deal’s potential worth is $1.25 billion.</p><p><a href="https://apps.timwhitlock.info/emoji/tables/unicode#emoji-modal">💰</a>CHARM Therapeutics raised more funding <a href="https://www.businesswire.com/news/home/20230515005172/en/CHARM-Therapeutics-Receives-Investment-for-Deep-Learning-Enabled-Drug-Discovery-Research-from-NVIDIA">from NVIDIA</a> and <a href="https://www.businesswire.com/news/home/20230320005101/en/CHARM-Therapeutics-Announces-Collaboration-with-Bristol-Myers-Squibb-to-Enable-and-Accelerate-Small-Molecule-Drug-Discovery-Programs">from Bristol Myers Squibb</a> totalling the initial funding round to $70M. 
The company has developed DragonFold, its proprietary algorithm for protein-ligand co-folding.</p><p>💊 Monte Rosa <a href="https://ir.monterosatx.com/news-releases/news-release-details/monte-rosa-therapeutics-announces-interim-pkpd-and-clinical-data">announced a successful</a> Phase 1 study of MRT-2359 (orally bioavailable investigational molecular glue degrader) against MYC-driven tumors like lung cancer and neuroendocrine cancer. Monte Rosa is known to <a href="https://ir.monterosatx.com/static-files/8806793a-99fb-4df8-8eb7-3785b39cf210">use geometric deep learning </a>for proteins (<a href="https://www.nature.com/articles/s41592-019-0666-6">MaSIF</a>).</p><p><strong><em>Nathan Benaich (AirStreet Capital, author of </em></strong><a href="https://www.stateof.ai/"><strong><em>the State of AI Report</em></strong></a><strong><em>)</em></strong></p><blockquote>“I have long been optimistic about the potential of AI-first approaches to design problems in medicine, biotech, and materials science. Graph-based models had a great year in techbio in 2023.” — Nathan Benaich (AirStreet Capital)</blockquote><p><a href="https://www.nature.com/articles/s41586-023-06415-8">RFdiffusion</a> combines diffusion techniques with GNNs to predict protein structures. It denoises blurry or corrupted structures from the Protein Data Bank, while tapping into RoseTTAFold’s prediction capabilities. DeepMind have continued to further develop AlphaFold and build on top of it. Their <a href="https://www.science.org/doi/10.1126/science.adg7492">AlphaMissense </a>uses weak labels, language modeling, and AlphaFold to predict the pathogenicity of 71 million human variants. This is an important achievement, as most amino acid changes from genetic variation have unknown effects.</p><p>Beyond proteins, graph-based models have been improving our understanding of genetics. 
Stanford’s <a href="https://www.nature.com/articles/s41587-023-01905-6.pdf">GEARS</a> system integrates deep learning with a gene interaction knowledge graph to predict gene expression changes from combinatorial perturbations. By leveraging prior data on single and double perturbations, GEARS can predict outcomes for thousands of gene pairs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*A2Ftafm8dTxwfxFV" /><figcaption>GEARS can predict new biologically meaningful phenotypes. (a) Workflow for predicting all pairwise combinatorial perturbation outcomes of a set of genes. (b) Low-dimensional representation of postperturbation gene expression for 102 one-gene perturbations and 128 two-gene perturbations used to train GEARS. A random selection is labeled. (c) GEARS predicts postperturbation gene expression for all 5,151 pairwise combinations of the 102 single genes seen experimentally perturbed. Predicted postperturbation phenotypes (non-black symbols) are often different from phenotypes seen experimentally (black symbols). Colors indicate Leiden clusters labeled using marker gene expression. Source: <a href="https://www.nature.com/articles/s41587-023-01905-6">Roohani et al</a></figcaption></figure><p>🔮 In 2024, I put hope in two different developments.</p><p><strong>1️⃣</strong> We have seen the first two CRISPR-Cas9 therapies approved in the US and the UK. These genome editors were discovered through sequencing and random experimentation. 
I am excited about the use of AI models to design and create bespoke editors on demand.</p><p><strong>2️⃣ </strong>We have started to see multimodality come to the AI bio world — combining DNA, RNA, protein, cellular, and imaging data to give us a more holistic understanding of biology.</p><p><strong>Companies to watch in 2024</strong></p><ul><li><a href="https://www.profluent.bio/">Profluent</a> — LLMs for protein design</li><li><a href="https://inceptive.life/">Inceptive.bio</a> — founded by one of the authors of the Transformers paper.</li><li><a href="https://www.envedabio.com/">Enveda Biosciences</a></li><li><a href="https://orbitalmaterials.com/">Orbital Materials</a></li><li><a href="https://kumo.ai/">Kumo.AI</a></li><li><a href="https://www.vant.ai/">VantAI</a> — we are biased (Michael Bronstein is Vant’s Chief Scientist and Luca Naef is a founder and CTO), but this is a cool company focused on the rational design of molecular glues using a combination of ML and proprietary experimental technology, which we believe to be the right combination for success.</li><li><a href="https://www.futurehouse.org/articles/announcing-future-house">Future House</a> — a new Silicon Valley-based non-profit company in the AI4Science space funded by ex-Google CEO Eric Schmidt. Head of Science is Andrew White, known for his works on LLMs for chemistry. 
The self-described mission of the company is a “moonshot to build an AI scientist.”</li></ul><p><em>For additional articles about geometric and graph deep learning, see </em><a href="https://medium.com/@mgalkin"><em>Michael Galkin</em></a><em>’s and </em><a href="https://medium.com/@michael-bronstein"><em>Michael Bronstein</em></a><em>’s Medium posts and follow the two Michaels (</em><a href="https://twitter.com/michael_galkin"><em>Galkin</em></a><em> and </em><a href="https://twitter.com/mmbronstein"><em>Bronstein</em></a><em>) on Twitter.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1ed786f7bf63" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/graph-geometric-ml-in-2024-where-we-are-and-whats-next-part-ii-applications-1ed786f7bf63">Graph &amp; Geometric ML in 2024: Where We Are and What’s Next (Part II — Applications)</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph & Geometric ML in 2024: Where We Are and What’s Next (Part I — Theory & Architectures)]]></title>
            <link>https://medium.com/data-science/graph-geometric-ml-in-2024-where-we-are-and-whats-next-part-i-theory-architectures-3af5d38376e1?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/3af5d38376e1</guid>
            <category><![CDATA[graph-machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[geometric-deep-learning]]></category>
            <category><![CDATA[deep-dives]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Tue, 16 Jan 2024 00:02:09 GMT</pubDate>
            <atom:updated>2024-01-16T09:26:24.527Z</atom:updated>
            <content:encoded><![CDATA[<h4>State-of-the-Art Digest</h4><h3>Graph &amp; Geometric ML in 2024: Where We Are and What’s Next (Part I — Theory &amp; Architectures)</h3><h4>Following the tradition from previous years, we interviewed a cohort of distinguished and prolific academic and industrial experts in an attempt to summarise the highlights of the past year and predict what is in store for 2024. Past 2023 was so ripe with results that we had to break this post into two parts. This is Part I focusing on theory &amp; new architectures, see also <a href="https://medium.com/towards-data-science/graph-geometric-ml-in-2024-where-we-are-and-whats-next-part-ii-applications-1ed786f7bf63">Part II</a> on applications.</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Lz_A1l6i036AtJ-FBFOe2w.png" /><figcaption>Image by Authors with some help from DALL-E 3.</figcaption></figure><p><em>The post is written and edited by </em><a href="https://twitter.com/michael_galkin"><em>Michael Galkin</em></a><em> and </em><a href="https://twitter.com/mmbronstein"><em>Michael Bronstein</em></a><em> with significant contributions from </em><a href="https://twitter.com/jo_brandstetter"><em>Johannes Brandstetter</em></a><em>, </em><a href="https://twitter.com/ismaililkanc/"><em>İsmail İlkan Ceylan</em></a><em>, </em><a href="https://twitter.com/Francesco_dgv"><em>Francesco Di Giovanni</em></a><em>, </em><a href="https://twitter.com/benfinkelshtein"><em>Ben Finkelshtein</em></a><em>, </em><a href="https://twitter.com/KexinHuang5"><em>Kexin Huang</em></a><em>, </em><a href="https://twitter.com/chaitjo"><em>Chaitanya Joshi</em></a><em>, </em><a href="https://twitter.com/WillLin1028"><em>Chen Lin</em></a><em>, </em><a href="https://twitter.com/chrsmrrs"><em>Christopher Morris</em></a><em>, </em><a href="https://twitter.com/mathildepapillo"><em>Mathilde Papillon</em></a><em>, </em><a href="https://twitter.com/LProkhorenkova"><em>Liudmila Prokhorenkova</em></a><em>, 
</em><a href="https://twitter.com/Pseudomanifold"><em>Bastian Rieck</em></a><em>, </em><a href="https://twitter.com/djjruhe"><em>David Ruhe</em></a><em>, </em><a href="https://twitter.com/HannesStaerk"><em>Hannes Stärk</em></a><em>, and </em><a href="https://twitter.com/PetarV_93"><em>Petar Veličković</em></a><em>.</em></p><ol><li><a href="#79aa">Theory of Graph Neural Networks</a><br>1. <a href="#5903">Message passing neural networks and Graph Transformers</a><br>2. <a href="#a6d7">Graph components, biconnectivity &amp; planarity</a><br>3. <a href="#27e6">Aggregation functions &amp; uniform expressivity</a> <br>4. <a href="#645f">Convergence &amp; zero-one laws of GNNs</a><br>5.<a href="#c8ac"> Descriptive complexity of GNNs</a><br>6. <a href="#9b59">Fine-grained expressivity of GNNs</a><br>7. <a href="#06c2">Expressivity results for Subgraph GNNs</a><br>8. <a href="#ab19">Expressivity for Link Prediction and Knowledge Graphs</a><br>9. <a href="#c284">Over-squashing &amp; Expressivity</a><br>10. <a href="#4a32">Generalization and Extrapolation capabilities of GNNs</a><br>11. <a href="#4f30">Predictions time!</a></li><li><a href="#b09f">New and Exotic Message Passing</a></li><li><a href="#9a3b">Beyond Graphs</a><br>1. <a href="#efa6">Topology</a><br>2. <a href="#a368">Geometric Algebras</a><br>3. 
<a href="#5b67">PDEs</a></li><li><a href="#8171">Robustness &amp; Explainability</a></li><li><a href="#e7b4">Graph Transformers</a></li><li><a href="#cf16">New Datasets &amp; Benchmarks</a></li><li><a href="#926c">Conferences, Courses &amp; Community</a></li><li><a href="#f1d3">Memes of 2023</a></li></ol><p>The legend we will be using throughout the text:<br>💡 - year’s highlight<br>🏋️ - challenges <br> ➡️ - current/next developments<br>🔮- predictions/speculations</p><h3>Theory of Graph Neural Networks</h3><p><em>Michael Bronstein (Oxford), Francesco Di Giovanni (Oxford), İsmail İlkan Ceylan (Oxford), Chris Morris (RWTH Aachen)</em></p><h4><strong>Message Passing Neural Networks &amp; Graph Transformers</strong></h4><p>Graph Transformers are a relatively recent trend in graph ML, trying to extend the successes of Transformers from sequences to graphs. As far as traditional expressivity results go, these architectures do not offer any particular advantages. In fact, it is arguable that most of their benefits in terms of expressivity (see e.g. <a href="https://arxiv.org/abs/2106.03893">Kreuzer et al.</a>) come from powerful structural encodings rather than the architecture itself and such encodings can in principle be used with MPNNs.</p><p>In a recent paper, <a href="https://arxiv.org/abs/2301.11956">Cai et al. </a>investigate the connection between MPNNs and (graph) Transformers showing that an MPNN with a virtual node — an auxiliary node that is connected to all other nodes in a specific way — can simulate a (graph) Transformer. This architecture is<em> non-uniform</em>, i.e., the size and structure of the neural networks may depend on the size of the input graphs. 
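The virtual-node trick itself is, in practice, a one-line graph augmentation. Here is a minimal sketch in plain Python (the adjacency-list format and the toy path graph are our illustrative choices, not from the paper):

```python
def add_virtual_node(adj):
    """Append a virtual node wired to every existing node.

    adj: adjacency list (list of neighbour lists) of an undirected graph.
    After augmentation, every pair of original nodes is at most two hops
    apart, so one message-passing round through the virtual node acts as
    a cheap global-communication (attention-like) step.
    """
    n = len(adj)
    out = [list(nbrs) + [n] for nbrs in adj]  # each node also talks to node n
    out.append(list(range(n)))                # the virtual node talks to everyone
    return out

# A 4-node path graph: nodes 0 and 3 are 3 hops apart before augmentation,
# and only 2 hops apart (via the virtual node) afterwards.
path = [[1], [0, 2], [1, 3], [2]]
aug = add_virtual_node(path)
assert aug[0] == [1, 4] and aug[4] == [0, 1, 2, 3]
```

Any standard MPNN layer can then be run on `aug` unchanged; the virtual node simply participates in message passing like every other node.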
Interestingly, once we restrict our attention to linear Transformers (e.g., Performer), there is a <em>uniform</em> result: there exists a single MPNN using a virtual node that can approximate a linear transformer such as Performer on any input of any size.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZL9GXms6ewqatSrD" /><figcaption>Figure from <a href="https://arxiv.org/abs/2301.11956">Cai et al.</a>: (a) MPNN with a virtual node, (b) a Transformer.</figcaption></figure><p>This is related to the discussions on whether graph transformer architectures present advantages for capturing long-range dependencies when compared to MPNNs. Graph transformers are compared to MPNNs that include a global computation component through the use of virtual nodes, which is a common practice. <a href="https://arxiv.org/abs/2301.11956">Cai et al.</a> empirically show that MPNNs with virtual nodes can surpass the performance of graph transformers on the Long-Range Graph Benchmark (LRGB, <a href="https://arxiv.org/abs/2206.08164">Dwivedi et al.</a>). Moreover, <a href="https://arxiv.org/abs/2309.00367">Tönshoff et al.</a> re-evaluate MPNN baselines on the LRGB benchmark and find that the earlier reported performance gap in favor of graph transformers was overestimated due to suboptimal hyperparameter choices, essentially closing the gap between MPNNs and graph Transformers.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*kpj2Y7L_oohN34LX" /><figcaption>Figure from <a href="https://arxiv.org/abs/2202.13013">Lim et al.</a>: SignNet pipeline.</figcaption></figure><p>It is also well-known that common Laplacian positional encodings (e.g., LapPE) are not invariant to changes of sign and basis of the eigenvectors. The lack of invariance makes it easier to obtain (non-uniform) universality results, but as a consequence these models do not compute graph invariants.
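The sign ambiguity is easy to see concretely. In this hand-worked sketch (plain Python; the 3-node path graph is an illustrative choice), both v and -v satisfy the eigenvector equation, so an eigensolver may legitimately return either one as a positional encoding:

```python
# Unnormalized graph Laplacian L = D - A of the path graph 0 - 1 - 2.
L = [[ 1, -1,  0],
     [-1,  2, -1],
     [ 0, -1,  1]]

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]

# v is a unit-norm eigenvector of L with eigenvalue 1, but so is -v:
# if L v = lam * v, then L (-v) = lam * (-v) by linearity.
lam = 1.0
v = [2 ** -0.5, 0.0, -(2 ** -0.5)]
for cand in (v, [-x for x in v]):
    Lc = matvec(L, cand)
    assert all(abs(Lc[i] - lam * cand[i]) < 1e-12 for i in range(3))
```

A model f consuming such raw LapPE features is therefore only well-defined if f(v) equals f(-v), which is exactly the invariance SignNet enforces by construction (roughly, by using f(v) + f(-v)).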
This has motivated a body of work this year, including the study of sign and basis invariant networks (<a href="https://arxiv.org/abs/2202.13013">Lim et al., 2023a</a>) and sign equivariant networks (<a href="https://arxiv.org/abs/2312.02339">Lim et al., 2023b</a>). These findings suggest that more research is necessary to theoretically ground the claims commonly found in the literature regarding the comparisons of MPNNs and graph transformers.</p><h4><strong>Graph components, biconnectivity, and planarity</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*m6EY4_Z792AC0Gyc" /><figcaption>Figure originally by Zyqqh at <a href="https://commons.wikimedia.org/w/index.php?curid=19053091">Wikipedia</a>.</figcaption></figure><p><a href="https://arxiv.org/abs/2301.09505">Zhang et al. (2023a)</a> bring the study of graph biconnectivity to the attention of the graph ML community, presenting many results related to different biconnectivity metrics. It has been shown that standard MPNNs cannot detect graph biconnectivity, unlike many existing higher-order models (i.e., those that can match the power of 2-FWL). On the other hand, Graphormers with certain distance encodings and subgraph GNNs such as ESAN can detect graph biconnectivity.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*870VMUW2Vna8nacy" /><figcaption>Figure from <a href="https://arxiv.org/abs/2307.01180">Dimitrov et al. (2023)</a>: LHS shows the graph decompositions (A-C) and RHS shows the associated encoders (D-F) and the update equation (G).</figcaption></figure><p><a href="https://arxiv.org/abs/2307.01180">Dimitrov et al. (2023)</a> rely on graph decompositions to develop dedicated architectures for learning with planar graphs.
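The cut nodes and biconnected components that such decompositions build on are classical, linear-time graph theory. A self-contained sketch of Tarjan-style articulation-point (cut-node) detection, with an illustrative toy graph of two triangles glued at one node:

```python
def articulation_points(adj):
    """Cut nodes of an undirected simple graph via Tarjan's low-link DFS.

    adj: dict mapping node -> list of neighbours. A node u is a cut node
    iff removing it disconnects the graph, i.e. some DFS child's subtree
    has no back-edge climbing above u.
    """
    disc, low, cuts = {}, {}, set()
    timer = [0]

    def dfs(u, parent):
        disc[u] = low[u] = timer[0]
        timer[0] += 1
        children = 0
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])   # back-edge
            else:
                children += 1
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if parent is not None and low[v] >= disc[u]:
                    cuts.add(u)                 # u separates v's subtree
        if parent is None and children > 1:
            cuts.add(u)                         # root with >1 DFS children

    for u in adj:
        if u not in disc:
            dfs(u, None)
    return cuts

# Two triangles glued at node 2: removing node 2 disconnects the graph,
# so the biconnected components are {0, 1, 2} and {2, 3, 4}.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3, 4], 3: [2, 4], 4: [2, 3]}
assert articulation_points(adj) == {2}
```

Grouping edges by the components these cut nodes separate yields the Block-Cut tree used in the decompositions above.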
The idea is to align with a variation of the classical <a href="https://www.sciencedirect.com/science/article/pii/0020019071900196">Hopcroft &amp; Tarjan</a> algorithm for planar isomorphism testing. <a href="https://arxiv.org/abs/2307.01180">Dimitrov et al. (2023)</a> first decompose the graph into its biconnected and triconnected components, and afterwards learn representations for nodes, cut nodes, biconnected components, and triconnected components. This is achieved using the classical structures of Block-Cut Trees and SPQR Trees which can be computed in linear time. The resulting framework is called <a href="https://arxiv.org/abs/2307.01180">PlanE</a> and contains architectures such as <a href="https://arxiv.org/abs/2307.01180">BasePlanE</a>. BasePlanE computes <em>isomorphism-complete graph invariants</em> and hence it can distinguish any pair of planar graphs. The key contribution of this work is to design architectures for efficiently learning complete invariants of planar graphs while remaining practically scalable. It is worth noting that 3-FWL is known to be complete on planar graphs (<a href="https://dl.acm.org/doi/10.1145/3333003">Kiefer et al., 2019</a>), but this algorithm is not scalable.</p><h4><strong>Aggregation functions: A uniform expressiveness study</strong></h4><p>It was broadly argued that different aggregation functions have their place, but this had not been rigorously proven. In fact, in the non-uniform setup, sum aggregation with MLPs yields an injective mapping and as a result subsumes other aggregation functions (<a href="https://arxiv.org/abs/1810.00826">Xu et al., 2020</a>), which builds on earlier results (<a href="https://arxiv.org/abs/1703.06114">Zaheer et al., 2017</a>). The situation is different in the uniform setup, where one fixed model is required to work on <em>all</em> graphs. <a href="https://arxiv.org/abs/2302.11603">Rosenbluth et al. 
(2023)</a> show that sum aggregation does not always subsume other aggregations in the uniform setup. If, for example, we consider an unbounded feature domain, sum aggregation networks cannot even approximate mean aggregation networks. Interestingly, even for the positive results, where sum aggregation is shown to approximate other aggregations, the presented constructions generally require a large number of layers (growing with the inverse of the approximation error).</p><h4><strong>Convergence and zero-one laws of GNNs on random graphs</strong></h4><p>GNNs can in principle be applied to graphs of any size following training. This makes an asymptotic analysis in the size of the input graphs very appealing. Previous studies of the asymptotic behaviour of GNNs have focused on convergence to theoretical limit networks (<a href="https://arxiv.org/abs/2006.01868">Keriven et al., 2020</a>) and their stability under the perturbation of large graphs (<a href="https://arxiv.org/abs/1907.12972">Levie et al., 2021</a>).</p><p>In a recent study, <a href="https://arxiv.org/abs/2301.13060">Adam-Day et al. (2023)</a> proved a <em>zero-one law</em> for binary GNN classifiers. The question being tackled is the following: how do binary GNN classifiers behave as we draw Erdős–Rényi graphs of increasing size with random node features? The main finding is that the probability that such graphs are mapped to a particular output by a class of GNN classifiers tends either to zero or to one. That is, the model eventually maps either <em>all</em> graphs to zero or <em>all</em> graphs to one. This result applies to GCNs as well as to GNNs with sum and mean aggregation.</p><p>The principal import of this result is that it establishes a novel <em>uniform</em> upper bound on the expressive power of GNNs: any property of graphs which can be uniformly expressed by these GNN architectures must obey a zero-one law. 
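The zero-one behaviour is easy to observe numerically. Below is a small sketch (our illustration, not the paper's construction): one fixed, untrained mean-aggregation GNN applied to Erdős–Rényi graphs of growing size, whose graph-level outputs concentrate on a single value as n grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d)) / np.sqrt(d)    # one fixed, untrained classifier

def gnn_logit(n, p=0.5):
    """Mean-aggregation GNN layer + mean readout on one G(n, p) sample."""
    A = np.triu(rng.random((n, n)) < p, 1)
    A = (A | A.T).astype(float)                 # symmetric ER adjacency
    X = rng.random((n, d))                      # random node features ~ U[0, 1]
    deg = np.maximum(A.sum(1, keepdims=True), 1)
    H = np.tanh((A @ X) / deg @ W)              # mean aggregation over neighbours
    return H.mean()                             # scalar graph-level logit

# how spread out are the logits across random graphs, as n grows?
spread = {n: np.std([gnn_logit(n) for _ in range(30)]) for n in (10, 100, 1000)}
print(spread)   # the spread collapses with n: the fixed classifier eventually
                # maps (almost) all large graphs to the same side of a threshold
```

Because the logit concentrates around a single value, any fixed decision threshold eventually assigns the same class to almost every large random graph, which is exactly the zero-one behaviour.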
An example of a simple property which does not asymptotically tend to zero or one is that of having an even number of nodes.</p><h4><strong>The descriptive complexity of GNNs</strong></h4><p><a href="https://arxiv.org/abs/2303.04613">Grohe (2023)</a> recently analysed the descriptive complexity of GNNs in terms of Boolean circuit complexity. The specific circuit complexity class of interest is TC0. This class contains all languages which are decided by Boolean circuits with constant depth and polynomial size, using only AND, OR, NOT, and threshold (or <a href="https://en.wikipedia.org/wiki/Majority_gate">majority</a>) gates. <a href="https://arxiv.org/abs/2303.04613">Grohe (2023)</a> proves that the graph functions that can be computed by a polynomial-size, bounded-depth family of GNNs lie in the circuit complexity class TC0. Furthermore, if the class of GNNs is allowed to use random node initialization and global readout as in <a href="https://arxiv.org/abs/2010.01179">Abboud et al. (2020)</a>, then there is a matching lower bound in that they can compute exactly the same functions that can be expressed in TC0. This establishes an upper bound on the power of GNNs with random node features by requiring the class of models to be of bounded depth (a fixed number of layers) and of polynomial size. While this result is still non-uniform, it improves the result of <a href="https://arxiv.org/abs/2010.01179">Abboud et al. (2020)</a>, where the construction can be worst-case exponential.</p><h4><strong>A fine-grained expressivity study of GNNs</strong></h4><p>Numerous recent works have analyzed the expressive power of MPNNs, primarily utilizing combinatorial techniques such as the 1-WL for the graph isomorphism problem. However, the graph isomorphism objective is inherently binary, not giving insights into the degree of similarity between two given graphs. <a href="https://arxiv.org/abs/2306.03698">Böker et al. 
(2023)</a> resolve this issue by deriving continuous extensions of both 1-WL and MPNNs to graphons. Concretely, they show that the continuous variant of 1-WL delivers an accurate topological characterization of the expressive power of MPNNs on graphons, revealing which graphs these networks can distinguish and the difficulty level in separating them. They provide a theoretical framework for graph and graphon similarity, combining various topological variants of classical characterizations of the 1-WL. In particular, they characterize the expressive power of MPNNs in terms of the tree distance, which is a graph distance based on the concept of fractional isomorphisms, and substructure counts via tree homomorphisms, showing that these concepts have the same expressive power as the 1-WL and MPNNs on graphons. Interestingly, they also validate their theoretical findings by showing that randomly initialized MPNNs, without training, exhibit competitive performance compared to their trained counterparts.</p><h4><strong>Expressiveness results for Subgraph GNNs</strong></h4><p>Subgraph-based GNNs were already a big trend in 2022 (<a href="https://arxiv.org/abs/2110.02910">Bevilacqua et al., 2022</a>, <a href="https://arxiv.org/abs/2206.11168">Qian et al., 2022</a>). This year, <a href="https://arxiv.org/abs/2302.07090">Zhang et al. (2023b)</a> established more fine-grained expressivity results for such architectures. The paper investigates subgraph GNNs via the so-called Subgraph Weisfeiler-Leman Tests (SWL). Through this, they show a complete hierarchy of SWL with strictly growing expressivity. Concretely, they define equivalence classes for SWL-type algorithms and show that almost all existing subgraph GNNs fall in one of them. Moreover, the so-called SSWL achieves the maximal expressive power. Interestingly, they also relate SWL to several existing expressive GNN architectures. 
For example, they show that SWL has the same expressivity as the local versions of 2-WL (<a href="https://arxiv.org/abs/1904.01543">Morris et al., 2020</a>). In addition to theory, they also show that SWL-type architectures achieve good empirical results.</p><h4><strong>Expressive power of architectures for link prediction on KGs</strong></h4><p>The expressive power of architectures such as RGCN and CompGCN for link prediction on knowledge graphs has been studied by <a href="https://arxiv.org/abs/2211.17113">Barceló et al. (2022)</a>. This year, <a href="https://arxiv.org/abs/2302.02209">Huang et al. (2023)</a> generalized these results to characterize the expressive power of various other model architectures.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*zk0cTSL838STb9rx" /><figcaption>Figure from <a href="https://arxiv.org/abs/2302.02209">Huang et al. (2023)</a>: The figure compares the respective modes of operation in R-MPNNs and C-MPNNs.</figcaption></figure><p><a href="https://arxiv.org/abs/2302.02209">Huang et al. (2023)</a> introduced the framework of conditional message passing networks (<a href="https://arxiv.org/abs/2302.02209">C-MPNNs</a>) which includes architectures such as <a href="https://arxiv.org/abs/2106.06935">NBFNets</a>. Classical relational message passing networks (R-MPNNs) are unary encoders (i.e., encoding graph nodes) and rely on a binary decoder for the task of link prediction (<a href="https://arxiv.org/abs/2010.16103">Zhang, 2021</a>). On the other hand, C-MPNNs serve as binary encoders (i.e., encoding pairs of graph nodes) and, as a result, are more suitable for the binary task of link prediction. C-MPNNs are shown to align with a relational Weisfeiler-Leman algorithm that can be seen as a local approximation of 2-WL. These findings explain the superior performance of NBFNets and the like over, e.g., RGCNs. <a href="https://arxiv.org/abs/2302.02209">Huang et al. 
(2023)</a> also present uniform expressiveness results in terms of precise logical characterizations for the class of binary functions captured by C-MPNNs.</p><h4><strong>Over-squashing and expressivity</strong></h4><p>Over-squashing is a phenomenon originally described by <a href="https://arxiv.org/abs/2006.05205">Alon &amp; Yahav</a> in 2021 as the compression of exponentially-growing receptive fields into fixed-size vectors. Subsequent research (<a href="https://arxiv.org/abs/2111.14522">Topping et al., 2022</a>, <a href="https://arxiv.org/abs/2302.02941">Di Giovanni et al., 2023</a>, <a href="https://arxiv.org/abs/2302.06835">Black et al., 2023</a>, <a href="https://arxiv.org/abs/2211.15779">Nguyen et al., 2023</a>) has characterised over-squashing through sensitivity analysis, proving that the dependence of the output features on hidden representations from earlier layers is impaired by topological properties such as negative curvature or large commute time. Since the graph topology plays a crucial role in the formation of bottlenecks, <em>graph rewiring</em>, a paradigm shift elevating the graph connectivity to a design factor in GNNs, has been proposed as a key strategy for alleviating over-squashing (if you are interested, see the Section on <strong>Exotic Message Passing</strong> below).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/739/0*gACxaunkfXsbPBQ1" /><figcaption>For the given graph, the MPNN learns stronger mixing (tight springs) for nodes (v, u) and (u, w) since their commute time is small, while nodes (u, q) and (u, z), with high commute time, have weak mixing (loose springs). Source: <a href="https://arxiv.org/abs/2306.03589">Di Giovanni et al., 2023</a></figcaption></figure><p>Over-squashing is an obstruction to expressive power, for it causes GNNs to falter in tasks with long-range interactions. 
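The gap between commute time and shortest-path distance is easy to see numerically. In this small sketch (ours), the commute time C(u, v) = 2|E| · R_eff(u, v) is computed from the pseudoinverse of the graph Laplacian of a "barbell" graph, where it dwarfs the shortest-path distance across the bottleneck:

```python
import numpy as np

# barbell graph: two 5-cliques (nodes 0-4 and 5-9) joined by the edge (4, 5)
n = 10
A = np.zeros((n, n))
A[:5, :5] = 1
A[5:, 5:] = 1
np.fill_diagonal(A, 0)
A[4, 5] = A[5, 4] = 1

L = np.diag(A.sum(1)) - A     # graph Laplacian
Lp = np.linalg.pinv(L)        # its Moore-Penrose pseudoinverse
m = A.sum() / 2               # number of edges

def commute_time(u, v):
    # C(u, v) = 2|E| * effective resistance between u and v
    return 2 * m * (Lp[u, u] + Lp[v, v] - 2 * Lp[u, v])

print(commute_time(0, 1))   # within a clique: small
print(commute_time(0, 9))   # across the bottleneck: several times larger,
                            # even though the shortest path is only 3 hops
```

Nodes 0 and 9 sit three hops apart, yet their commute time is more than four times that of the adjacent pair (0, 1): exactly the kind of topology-induced bottleneck the sensitivity analyses above point to.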
To formally study this, <a href="https://arxiv.org/abs/2306.03589">Di Giovanni et al. (2023)</a> introduce a new metric of expressivity, referred to as “mixing”, which encodes the joint and nonlinear dependence of a graph function on pairs of nodes’ features: for a GNN to approximate a function with large mixing, a necessary condition is allowing “strong” message exchange between the relevant nodes. Hence, they propose to measure over-squashing through the mixing of a GNN prediction, and prove that the depth required by a GNN to induce enough mixing, <em>as required by the task</em>, grows with the commute time — typically much worse than the shortest-path distance. The results show how over-squashing hinders the expressivity of GNNs of “practical” size, and validate that it arises from the misalignment between the task (requiring strong mixing between nodes i and j) and the topology (inducing large commute time between i and j).</p><p>The “mixing” of a function pertains to the exchange of information between nodes, whatever this information is, and not to its capacity to separate node representations. In fact, these results also hold for GNNs more powerful than the 1-WL test. The analysis in <a href="https://arxiv.org/abs/2306.03589">Di Giovanni et al. (2023)</a> offers an alternative approach for studying the expressivity of GNNs, which easily extends to equivariant GNNs in 3D space and their ability to model interactions between nodes.</p><h4><strong>Generalization and extrapolation capabilities of GNNs</strong></h4><p>The expressive power of MPNNs has received a lot of attention in recent years through its connection to the WL test. 
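As a refresher on that connection, here is a minimal 1-WL colour refinement sketch (our illustration): the classic failure case of a 6-cycle versus two disjoint triangles, both 2-regular, which 1-WL, and hence any standard MPNN, cannot distinguish.

```python
from collections import Counter

def wl_histogram(adj, rounds=3):
    """1-WL colour refinement; returns the final multiset of node colours."""
    colors = {v: len(nbrs) for v, nbrs in adj.items()}   # initialise with degrees
    for _ in range(rounds):
        # new colour = (own colour, sorted multiset of neighbour colours)
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        relabel = {sig: i for i, sig in enumerate(sorted(set(sigs.values())))}
        colors = {v: relabel[sigs[v]] for v in adj}
    return Counter(colors.values())

cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}   # C6
triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],                # two disjoint C3
             3: [4, 5], 4: [3, 5], 5: [3, 4]}
path6 = {i: [j for j in (i - 1, i + 1) if 0 <= j < 6] for i in range(6)}

print(wl_histogram(cycle6) == wl_histogram(triangles))  # True: indistinguishable
print(wl_histogram(cycle6) == wl_histogram(path6))      # False: degrees already differ
```

Subgraph GNNs sidestep exactly this failure mode: running the same refinement on node-deleted subgraphs of the two graphs already yields different histograms.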
While this connection has led to significant advances in understanding and enhancing MPNNs’ expressive power (<a href="https://arxiv.org/abs/2301.11039">Morris et al., 2023a</a>), it does not provide insights into their generalization performance, i.e., their ability to make meaningful predictions beyond the training set. Surprisingly, only a few notable contributions study MPNNs’ generalization behaviors, e.g., <a href="https://arxiv.org/abs/2002.06157">Garg et al. (2020)</a>, <a href="https://www.ijcai.org/proceedings/2018/0325.pdf">Kriege et al. (2018)</a>, <a href="https://arxiv.org/abs/2012.07690">Liao et al. (2021)</a>, <a href="https://arxiv.org/abs/2202.00645">Maskey et al. (2022)</a>, <a href="https://pubmed.ncbi.nlm.nih.gov/30219742/">Scarselli et al. (2018)</a>. However, these approaches express MPNNs’ generalization ability using only classical graph parameters, e.g., maximum degree, number of vertices, or edges, which cannot fully capture the complex structure of real-world graphs. Further, most approaches study generalization in the non-uniform regime, i.e., assuming that the MPNNs operate on graphs of a pre-specified order.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*HbXOxwtBNm3256l3" /><figcaption>Figure from <a href="https://arxiv.org/abs/2301.11039">Morris et al. (2023b)</a>: Overview of the generalization capabilities of MPNNs and their link to the 1-WL.</figcaption></figure><p>Hence, <a href="https://arxiv.org/abs/2301.11039">Morris et al. (2023b)</a> showed a tight connection between the expressive power of the 1-WL and generalization performance. They investigate the influence of graph structure and the parameters’ encoding lengths on MPNNs’ generalization by tightly connecting 1-WL’s expressivity and MPNNs’ Vapnik–Chervonenkis (VC) dimension. 
To that end, they show several results.</p><p>1️⃣ First, in the non-uniform regime, they show that MPNNs’ VC dimension depends tightly on the number of equivalence classes computed by the 1-WL over a set of graphs. In addition, their results easily extend to the k-WL and many recent expressive MPNN extensions.</p><p>2️⃣ In the uniform regime, i.e., when graphs can have arbitrary order, they show that MPNNs’ VC dimension is lower and upper bounded by the largest bitlength of their weights. In both the uniform and non-uniform regimes, MPNNs’ VC dimension depends logarithmically on the number of colors computed by the 1-WL and polynomially on the number of parameters. Moreover, they also empirically show that their theoretical findings hold in practice to some extent.</p><h4>🔮 Predictions time!</h4><p><strong><em>Christopher Morris (RWTH Aachen)</em></strong></p><blockquote>“I believe that there is a pressing need for a better and more practical theory of generalization of GNNs.” — <strong>Christopher Morris</strong> (RWTH Aachen)</blockquote><p>➡️ For example, we need to understand how graph structure and various architectural parameters influence generalization. Moreover, the dynamics of SGD for training GNNs are currently understudied and not well understood, and more works will study this.</p><p><strong><em>İsmail İlkan Ceylan (Oxford)</em></strong></p><blockquote>“I hope to see more expressivity results in the uniform setting, where we fix the parameters of a neural network and examine its capabilities.” — <strong>İsmail İlkan Ceylan</strong> (Oxford)</blockquote><p>➡️ In this case, we can identify a better connection to generalization, because if a property cannot be expressed uniformly then the model cannot generalise to larger graph sizes.</p><p>➡️ This year, we may also see expressiveness studies that target graph regression or graph generation, which remain under-explored. 
There are good reasons to hope for learning algorithms which are isomorphism-complete on larger graph classes, strictly generalizing the results for planar graphs.</p><p>➡️ It is also time to develop a theory for learning with fully relational data (i.e., knowledge hypergraphs), which will unlock applications in relational databases!</p><p><strong><em>Francesco Di Giovanni (Oxford)</em></strong></p><p>In terms of future theoretical developments of GNNs, I can see two directions that deserve attention.</p><blockquote>“There is very little understanding of the dynamics of the weights of a GNN under gradient flow (or SGD); assessing the impact of the graph topology on the evolution of the weights is key to addressing questions about generalisation and hardness of a task.” — Francesco Di Giovanni (Oxford)</blockquote><p>➡️ Second, I believe it would be valuable to develop alternative paradigms of expressivity, which more directly focus on approximation power (of graph functions and their derivatives) and identify precisely the tasks which are hard to learn. 
The latter direction could also be particularly meaningful for characterising the power of equivariant GNNs in 3D space, where measurements of expressivity might need to be decoupled from the 2D case in order to be better aligned with tasks coming from the scientific domain.</p><p>At the end: a fun fact about where WL went in 2023</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*a4IHR94Y8g8YZXIT" /><figcaption>Portraits: Ihor Gorsky</figcaption></figure><h4><strong>Predictions from the 2023 post</strong></h4><p>(1) More efforts on creating time- and memory-efficient subgraph GNNs.<br>❌ not really</p><p>(2) Better understanding of generalization of GNNs<br>✅ yes, see the subsections on oversquashing and generalization</p><p>(3) Weisfeiler and Leman visit 10 new places!<br>❌ (4 so far) <a href="https://openreview.net/forum?id=eZneJ55mRO">Grammatical</a>, <a href="https://arxiv.org/abs/2311.01205">indifferent</a>, <a href="https://arxiv.org/abs/2307.05775">measurement modeling</a>, <a href="https://arxiv.org/abs/2308.06838">paths</a></p><h3>New and exotic message passing</h3><p><em>Ben Finkelshtein (Oxford), Francesco Di Giovanni (Oxford), Petar Veličković (Google DeepMind)</em></p><p><strong><em>Petar Veličković (Google DeepMind)</em></strong></p><p>Over the years, it has become part of common folklore that the development of message passing operators has saturated. What I find particularly exciting about the progress made in 2023 is that, from several independent research groups (including our own), a unified novel direction has emerged: let’s start considering the impact of <strong><em>time</em></strong> in the GNN ⏳.</p><blockquote>“I forecast that, in 2024, time will assume a central role in the development of novel GNN architectures.” — Petar Veličković (Google DeepMind)</blockquote><p>💡 Time has already been leveraged in GNN design when it is explicitly provided in the input (in spatiotemporal or fully dynamic graphs). 
This year, it has started to feature in research of GNN operators on <em>static</em> graph inputs. Several works are dropping the assumption of a unified, synchronised clock ⏱️ which forces all messages in a layer to be sent and received at once.</p><p>1️⃣ The first such work, <a href="https://openreview.net/forum?id=zffXH0sEJP">GwAC</a> 🥑, only played with rudimentary randomised message scheduling, but provided <strong>proofs</strong> for why such processing might yield significant improvements in expressive power. <a href="https://arxiv.org/abs/2310.01267">Co-GNNs</a> 🤝 carry the torch further, demonstrating a more elaborate and fine-tuned message scheduling mechanism which is node-centric, allowing each node to choose when to send 📨 or receive 📬 messages. Co-GNNs also provide a practical method for training such schedulers by gradient descent. While the development of such asynchronous GNN models is highly desirable, we must also acknowledge the associated scalability issues — our present frontier hardware is not designed to efficiently scale such sequential systems.</p><p>2️⃣ In our own work on <a href="https://openreview.net/forum?id=ba4bbZ4KoF">asynchronous algorithmic alignment</a>, we instead opt to design a <em>synchronous</em> GNN, but <strong>constrain</strong> its message, aggregation, and update functions such that the GNN would yield identical embeddings even if parts of its dataflow were made asynchronous. This led us to an exciting journey through monoids, 1-cocycles, and category theory, resulting in a scalable GNN model that achieves superior performance on many CLRS-30 tasks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*t6r827_csyRdovaiPRXNSg.png" /><figcaption>A possible execution trace of an asynchronous GNN. 
While traditional GNNs send and receive all messages synchronously, under our framework, at any step the GNN may choose to execute any number of possible operations (depicted here with a collection on the right side of the graph). Source: <a href="https://openreview.net/forum?id=ba4bbZ4KoF">Dudzik et al.</a></figcaption></figure><p>➡️ Lastly, it is worth noting that for certain special choices of message scheduling, we do not need to make modifications to synchronous GNNs’ architecture — and may instead resort to dynamic graph rewiring. <a href="https://arxiv.org/abs/2305.08018">DREW</a> and <a href="https://openreview.net/forum?id=lXczFIwQkv">Half-Hop</a> are two concurrently published papers at ICML’23 which embody the principle of using graph rewiring to <em>slow down</em> message passing 🐌. In DREW, a message from each node is actually sent to every other node, but it takes <em>k</em> layers before a message will reach a neighbour that is <em>k</em> hops away! Half-Hop, on the other hand, takes a more lenient approach, and just randomly decides whether or not to introduce a “slow node” which extends the path between any two nodes connected by an edge. Both approaches naturally alleviate the oversmoothing problem, as messages travelling longer distances will oversmooth less.</p><p>Whether it is used for message passing design, GNN dataflow or graph rewiring, in 2023 we have just started to grasp the importance of <em>time</em> — even when time variation is not explicitly present in our dataset.</p><p><strong><em>Ben Finkelshtein (Oxford)</em></strong></p><p>The time-dependent message passing paradigm presented in <a href="https://arxiv.org/abs/2310.01267">Co-GNNs</a> is a learnable generalisation of message passing, which allows each node to decide how to propagate information from or to its neighbours, thus enabling a more flexible flow of information. 
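A rough sketch of this action-gated propagation step may help (our simplification with hard, hand-set actions; Co-GNNs learn per-node action distributions end-to-end): a node receives messages only if it listens, and only from neighbours that broadcast.

```python
import numpy as np

STANDARD, LISTEN, BROADCAST, ISOLATE = range(4)

def cognn_step(A, X, actions):
    """One mean-aggregation step where per-node actions gate the message flow."""
    sends = np.isin(actions, [STANDARD, BROADCAST]).astype(float)
    receives = np.isin(actions, [STANDARD, LISTEN]).astype(float)
    M = A * sends[None, :] * receives[:, None]   # effective *directed* adjacency
    deg = M.sum(1, keepdims=True)
    # nodes with no incoming messages simply keep their own state
    return np.divide(M @ X, deg, out=X.copy(), where=deg > 0)

A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)  # a triangle
X = np.eye(3)

all_standard = cognn_step(A, X, np.full(3, STANDARD))  # classical mean aggregation
all_isolate = cognn_step(A, X, np.full(3, ISOLATE))    # DeepSets-like: no messages
print(all_isolate)                                     # identical to X
```

With every node playing the standard action this collapses to classical mean aggregation; with every node isolating it collapses to a DeepSets-style update, matching the spectrum in the figure below.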
The nodes are regarded as players that can either broadcast to neighbors that listen <em>and</em> listen to neighbors that broadcast (like in classical message-passing), only broadcast to neighbors that listen, only listen to neighbors that broadcast, or isolate (neither listen nor broadcast).</p><p>The interplay between these actions and the ability to change them locally and dynamically allows CoGNNs to determine a <strong>task-specific</strong> computational graph (which can be considered as a form of <strong>dynamic</strong> and <strong>directed rewiring</strong>) and to learn different action distributions for nodes with different node features (both <strong>feature-</strong> and <strong>structure-based</strong>). CoGNNs allow <strong>asynchronous</strong> updates across nodes and also yield unique node identifiers with high probability, which allows them to distinguish any pair of graphs (<strong>more expressive than 1-WL</strong>, at the expense of equivariance holding only in expectation).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0Omx4ZqhSKJB9_ok" /><figcaption>Left to right: classical MPNNs (all nodes broadcast &amp; listen), DeepSets (all nodes isolate), and generic CoGNNs. 
Figure from <a href="https://towardsdatascience.com/co-operative-graph-neural-networks-34c59bf6805e?gi=98ca39c38e41">blog post</a>.</figcaption></figure><p>Check the Medium post for more details:</p><p><a href="https://towardsdatascience.com/co-operative-graph-neural-networks-34c59bf6805e">Co-operative Graph Neural Networks</a></p><p><strong><em>Francesco Di Giovanni (Oxford)</em></strong></p><blockquote>“The understanding of over-squashing, arising when the task depends on the interaction between nodes with large commute time, acted as a catalyst for the emergence of graph rewiring as a valid approach for designing new GNNs.” — <strong>Francesco Di Giovanni</strong> (Oxford)</blockquote><p>💡 <em>Graph rewiring</em> broadly entails altering the connectivity of the input graph to facilitate the solution of the downstream task. Recently, this has often targeted bottlenecks in the graph, thereby adding (and removing) edges to improve the flow of information. While the emphasis has been on <strong>where</strong> messages are exchanged, recent works (discussed above) have shed light on the relevance of <strong>when</strong> messages should be exchanged as well. One rationale behind these approaches, albeit often implicit, is that the hidden representations built by the layers of a GNN provide the graph with an (artificially) <em>dynamic</em> component, even though the graph and input features are static. This perspective can be leveraged in several ways.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0b7jbYmoZGeh1_a4OywUSA.png" /><figcaption>In the classical MPNN setting, at every layer information only travels from a node to its immediate neighbours. In DRew, the graph changes based on the layer, with newly added edges connecting nodes at distance r from layer r − 1 onward. Finally, in νDRew, we also introduce a delay mechanism equivalent to skip-connections between different nodes based on their mutual distance. 
Source: <a href="https://arxiv.org/abs/2305.08018">Gutteridge et al.</a></figcaption></figure><p>➡️ One framework that has particularly embraced such an angle is <a href="https://arxiv.org/abs/2305.08018"><strong>DRew</strong></a>, which extends any message-passing model in two ways: (i) it connects nodes at distance <em>r</em> directly, but only from layer <em>r</em> onwards; (ii) when nodes are connected, a delay is applied to their message exchange, based on their mutual distance. As the figure above illustrates, (i) allows the network to better retain the inductive bias, as nodes that are closer, interact <em>earlier;</em> (ii) instead acts as <em>distance-aware</em> <em>skip connections, </em>thereby facilitating the propagation of gradients for the loss. Most likely, it is for this reason, and not prevention of over-smoothing (which hardly has an impact for graph-level tasks), that the framework significantly enhances the performance of standard GNNs at larger depths (more details can be found in this <a href="https://towardsdatascience.com/dynamically-rewired-delayed-message-passing-gnns-2d5ff18687c2">blog post</a>).</p><p><strong>🔮 Predictions: </strong>I believe that the deep implications of extending message-passing over the “time” component would start to emerge in the coming year. 
Works like DRew have only scratched the surface of why rewiring over time (beyond space) might benefit the training of GNNs, drastically affecting their accuracy response across different depth regimes.</p><p>➡️ More broadly, I hope that theoretical and practical developments of graph rewiring could be translated into scientific domains, where equivariant GNNs are often applied to 3D problems which either do not have a natural graph structure (making the question of “where” messages should be exchanged ever more relevant) or (and) exhibit natural temporal (multi-scale) properties (making the question of “when” messages should be exchanged likely to be key for reducing memory constraints and retaining the right inductive bias).</p><h3>Geometry, Topology, Geometric Algebras &amp; PDEs</h3><p><em>Johannes Brandstetter (JKU Linz), Michael Galkin (Intel), Mathilde Papillon (UC Santa Barbara), Bastian Rieck (Helmholtz &amp; TUM), and David Ruhe (U Amsterdam)</em></p><p>2023 brought the most comprehensive introduction to (and a survey of) Geometric GNNs covering the most basic and necessary concepts with a handful of examples: <strong>A Hitchhiker’s Guide to Geometric GNNs for 3D Atomic Systems </strong>(<a href="https://arxiv.org/abs/2312.07511">Duval, Mathis, Joshi, Schmidt, et al.</a>). If you ever wanted to learn from scratch the core architectures powering recent breakthroughs of graph ML in protein design, material discovery, molecular simulations, and more — this is what you need!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AYsGjZhbdr701OndCvnfng.png" /><figcaption>Timeline of key Geometric GNNs for 3D atomic systems, characterised by the type of intermediate representations within layers. 
Source: <a href="https://arxiv.org/abs/2312.07511">Duval, Mathis, Joshi, Schmidt, et al.</a></figcaption></figure><h4><strong>Topology</strong></h4><p>💡 Working with topological structures in 2023 has become much easier for both researchers and practitioners thanks to the amazing efforts of the <a href="https://github.com/pyt-team">PyT team</a> and their suite of resources: <strong>TopoNetX</strong>, <strong>TopoModelX</strong>, and <strong>TopoEmbedX</strong>. <a href="https://github.com/pyt-team/TopoNetX">TopoNetX</a> is pretty much the networkx for topological data. TopoNetX supports standard structures like cellular complexes, simplicial complexes, and combinatorial complexes. <a href="https://github.com/pyt-team/TopoModelX">TopoModelX</a> is a PyG-like library for deep learning on topological data and implements famous models like <a href="https://arxiv.org/abs/2103.03212">MPSN</a> and <a href="https://arxiv.org/abs/2106.12575">CIN</a> with a neat unified interface (the original PyG implementations are quite tangled). <a href="https://github.com/pyt-team/TopoEmbedX">TopoEmbedX</a> helps to train embedding models on topological data and supports core algorithms like <a href="https://arxiv.org/abs/2010.00743">Cell2Vec</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jrYDXng6bSsL-vstRWaT-A.png" /><figcaption>Domains: Nodes in blue, (hyper)edges in pink, and faces in dark red. Source: <a href="https://github.com/pyt-team/TopoNetX">TopoNetX</a>, <a href="https://arxiv.org/abs/2304.10031">Papillon et al</a></figcaption></figure><p>💡 A great headstart to the field and basic building blocks of those topological networks are the papers by <a href="https://arxiv.org/abs/2206.00606">Hajij et al</a> and by <a href="https://arxiv.org/abs/2304.10031">Papillon et al</a>. 
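To make the basic building blocks concrete, here is a small self-contained sketch in plain NumPy (ours; it does not use the TopoNetX/TopoModelX APIs) of the boundary matrices of a simplicial complex and the lower and upper adjacencies over which simplicial message passing schemes aggregate:

```python
import numpy as np

# a tiny simplicial complex: 4 vertices, 5 edges, and one filled triangle (0,1,2)
edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
triangles = [(0, 1, 2)]

# B1: node-to-edge boundary matrix (signed incidence)
B1 = np.zeros((4, len(edges)))
for j, (u, v) in enumerate(edges):
    B1[u, j], B1[v, j] = -1, 1

# B2: edge-to-triangle boundary matrix; boundary of (a,b,c) = (b,c) - (a,c) + (a,b)
B2 = np.zeros((len(edges), len(triangles)))
for k, (a, b, c) in enumerate(triangles):
    B2[edges.index((a, b)), k] = 1
    B2[edges.index((b, c)), k] = 1
    B2[edges.index((a, c)), k] = -1

# edge-to-edge adjacency via shared nodes (lower) and shared triangles (upper)
A_low = B1.T @ B1
A_up = B2 @ B2.T

# one message passing step on edge features, mixing both neighbourhood types
X = np.eye(len(edges))                   # one-hot edge features
W_low, W_up = np.eye(5), np.eye(5)       # identity "weights" for illustration
H = np.tanh(A_low @ X @ W_low + A_up @ X @ W_up)
print(H.shape)                           # (5, 5): updated features per edge
```

Note how the edge (1, 3) has an all-zero row in A_up: it belongs to no triangle, so it receives no upper-adjacent messages, which is exactly the extra structure such models exploit beyond graph message passing.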
A notable chunk of models was implemented by the members of the <a href="https://www.tagds.com/home">Topology, Algebra, and Geometry in Data Science</a> (TAG) community that regularly organizes topological workshops at ML conferences.</p><p><strong><em>Mathilde Papillon (UCSB)</em></strong></p><blockquote>“Until 2023, the field of topological deep learning featured a fractured landscape of enriched representations for relational data.” — Mathilde Papillon (UC Santa Barbara)</blockquote><p>➡️ Message-passing models were only built upon and benchmarked against other models of the same domain, e.g., the simplicial complex community remained insular to the hypergraph community. To make matters worse, most models adopted a unique mathematical notation. Deciding which model would be best suited to a given application seemed like a monumental task. A unification theory proposed by <a href="https://arxiv.org/abs/2206.00606">Hajij et al</a> offered a general scheme under which all models could be systematically described and classified. We applied this theory to the literature to produce a comprehensive yet concise <a href="https://arxiv.org/abs/2304.10031">survey of message passing in topological deep learning</a> that also serves as an accessible introduction to the field. We additionally provide a <a href="https://github.com/awesome-tnns/awesome-tnns">dictionary listing all the model architectures</a> in one unifying notation.</p><p>➡️ To further unify the field, we organized the first <a href="https://pyt-team.github.io/topomodelx/challenge/index.html">Topological Deep Learning Challenge</a>, hosted at the <a href="https://www.tagds.com/events/conference-workshops/tag-ml23">2023 ICML TAG workshop</a> and recorded via this white paper by <a href="https://proceedings.mlr.press/v221/papillon23a.html">Papillon et al</a>. The goal was to foster reproducible research by crowdsourcing the open-source implementation of neural networks on topological domains. 
As part of the challenge, participants from around the world contributed implementations of pre-existing topological deep learning models in <a href="https://github.com/pyt-team/TopoModelX">TopoModelX</a>. Each submission was rigorously unit-tested and included benchmark training on datasets loaded from <a href="https://github.com/pyt-team/TopoNetX">TopoNetX</a>. It is our hope that this one-stop-shop suite of consistently implemented models will help practitioners test-drive topological methods for new applications and developments in 2024.</p><p><strong><em>Bastian Rieck (Helmholtz &amp; TUM)</em></strong></p><p>2023 was an exciting year for topology-driven machine learning methods. On the one hand, we saw more integrations with geometrical concepts like curvature, thus demonstrating the versatility of hybrid geometrical-topological models. For instance, in <a href="https://arxiv.org/abs/2301.12906">‘Curvature Filtrations for Graph Generative Model Evaluation,’</a> we showed how to employ curvature as a way to select suitable graph generative models. Here, curvature serves as a ‘lens’ that we use to extract graph structure information, while we employ persistent homology, a topological method, to compare this information in a consistent fashion.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yAl2YK92aFGLIlXhz4yAtQ.png" /><figcaption>An overview of the pipeline for evaluating graph generative models using discrete curvature. The ordering on edges gives rise to a curvature filtration, followed by a corresponding persistence diagram and landscape. For graph generative models, we select a curvature, apply this framework element-wise, and evaluate the similarity of the generated and reference distributions by comparing their average landscapes. 
Source: <a href="https://arxiv.org/abs/2301.12906">Southern, Wayland, Bronstein, and Rieck.</a></figcaption></figure><p>➡️ Another direction that serves to underscore that topology-driven methods are becoming a staple in graph learning research uses topology to assess the expressivity of graph neural network models. Sometimes, as in a very fascinating work from NeurIPS 2023 by <a href="https://openreview.net/pdf?id=27TdrEvqLD">Immonen et al.</a>, this even leads to novel models that leverage both geometrical and topological aspects of graphs in tandem! My own research also aims to contribute to this facet by specifically analyzing the <a href="https://arxiv.org/abs/2302.09826">expressivity of persistent homology in graph learning</a>.</p><blockquote>“2023 also was the cusp of moving away — or beyond — persistent homology. Despite being rightfully seen as the paradigmatic algorithm for topology-driven machine learning, algebraic topology and differential topology offer an even richer fabric that can be used to analyse data.” — Bastian Rieck (Helmholtz &amp; TUM)</blockquote><p>➡️ With my great collaborators, we started looking at some alternatives very recently and came up with the concept of <a href="https://arxiv.org/abs/2312.08515">neural differential forms</a>. Differential forms permit us to elegantly build a bridge between geometry and topology by means of the <a href="https://en.wikipedia.org/wiki/De_Rham_cohomology">de Rham cohomology</a> — a way to link the integration of certain objects (differential forms), i.e. a fundamentally <em>geometric</em> operation, to topological characteristics of input data. With some additional constructions, the de Rham cohomology permits us to learn geometric descriptions of graphs (or higher-order combinatorial complexes) and solve learning tasks without having to rely on message passing. The upshot is models with fewer parameters that are potentially more effective at solving such tasks. 
There’s more to come here, since we have just started scratching the surface!</p><p>🔮My hopeful predictions for 2024 are that we will:</p><p>1️⃣ see many more diverse tools from algebraic and differential topology applied to graphs and combinatorial complexes,</p><p>2️⃣ better understand message passing on higher-order input data, and</p><p>3️⃣ finally obtain better parallel algorithms for persistent homology to truly unleash its power in a deep learning setting. A <a href="https://link.springer.com/article/10.1007/s00454-023-00549-2">recent paper on spectral sequences</a> by Torras-Casas reports some very exciting results that show the great prospects of this technique.</p><h4><strong>Geometric Algebras</strong></h4><p><em>Johannes Brandstetter (JKU Linz) and David Ruhe (U Amsterdam)</em></p><blockquote>“In 2023, we saw the subfield of deep learning on geometric algebras (also known as <strong>Clifford algebras</strong>) take off. Previously, neural network layers formulated as operations on Clifford algebra <em>multivectors</em> were introduced by <a href="https://arxiv.org/abs/2209.04934">Brandstetter et al.</a> This year, the ‘geometric’ in ‘geometric algebra’ was clearly put into action.” — Johannes Brandstetter (JKU Linz) and David Ruhe (U Amsterdam)</blockquote><p>➡️ First, <a href="https://arxiv.org/abs/2302.06594">Ruhe et al.</a> applied the quintessence of modern (plane-based) geometric algebra by introducing <strong>Geometric Clifford Algebra Networks (GCAN)</strong>, neural network templates that model symmetry transformations described by various geometric algebras. We saw an intriguing application thereof by <a href="https://openaccess.thecvf.com/content/WACV2024/papers/Pepe_CGAPoseNetGCAN_A_Geometric_Clifford_Algebra_Network_for_Geometry-Aware_Camera_Pose_WACV_2024_paper.pdf">Pepe et al.</a> in <strong>CGAPoseNet</strong>, building a geometry-aware pipeline for camera pose regression. 
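To make the multivector machinery concrete: in the 2D Clifford algebra Cl(2,0), a rotor R = cos(θ/2) − sin(θ/2)·e₁e₂ rotates a vector v via the sandwich product R v R̃. A minimal hand-rolled sketch (purely illustrative, not code from any of the papers above):

```python
import math

# Basis order: [1, e1, e2, e12];  e1^2 = e2^2 = 1, e12^2 = -1.
# TABLE[i][j] = (sign, index) of the geometric product of basis elements i and j.
TABLE = [
    [(1, 0), (1, 1), (1, 2), (1, 3)],
    [(1, 1), (1, 0), (1, 3), (1, 2)],
    [(1, 2), (-1, 3), (1, 0), (-1, 1)],
    [(1, 3), (-1, 2), (1, 1), (-1, 0)],
]

def gp(a, b):
    """Geometric product of two multivectors in Cl(2,0)."""
    out = [0.0] * 4
    for i in range(4):
        for j in range(4):
            sign, k = TABLE[i][j]
            out[k] += sign * a[i] * b[j]
    return out

def reverse(a):
    """Reversion: flips the sign of the bivector part."""
    return [a[0], a[1], a[2], -a[3]]

def rotate(v, theta):
    """Rotate a vector v = [0, x, y, 0] via the rotor sandwich R v R~."""
    r = [math.cos(theta / 2), 0.0, 0.0, -math.sin(theta / 2)]
    return gp(gp(r, v), reverse(r))

e1 = [0.0, 1.0, 0.0, 0.0]
print(rotate(e1, math.pi / 2))  # ~[0, 0, 1, 0]: e1 rotated by 90 degrees is e2
```

The libraries discussed here generalize exactly this structure: network layers act on (batches of) multivectors while respecting the algebra's grades, which is what makes them equivariant by construction.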
Next, <a href="https://arxiv.org/abs/2305.11141">Ruhe et al.</a> introduced <strong>Clifford Group Equivariant Neural Networks (CGENN)</strong>, building steerable O(n)- and E(n)-equivariant (graph) neural networks of any dimension via the Clifford group. <a href="https://openreview.net/forum?id=JNfpsiGS5E">Pepe et al.</a> apply CGENNs to a Protein Structure Prediction (PSP) pipeline, increasing prediction accuracies by up to 2.1%.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ce_jLcT0amTIaoAybZroHw.png" /><figcaption>CGENNs (represented with ϕ) are able to operate on multivectors (elements of the Clifford algebra) in an O(n)- or E(n)-equivariant way. Specifically, when an action ρ(w) of the Clifford group, representing an orthogonal transformation such as a rotation, is applied to the data, the model’s representations corotate. Multivectors can be decomposed into scalar, vector, bivector, trivector, and even higher-order components. These elements can represent geometric quantities such as (oriented) areas or volumes. The action ρ(w) is designed to respect these structures when acting on them. Source: <a href="https://arxiv.org/abs/2305.11141">Ruhe et al.</a></figcaption></figure><p>➡️ Coincidentally, <a href="https://arxiv.org/abs/2305.18415">Brehmer et al.</a> formulated the <strong>Geometric Algebra Transformer (GATr)</strong>, a scalable Transformer architecture that harnesses the benefits of representations provided by the projective geometric algebra and the scalability of Transformers to build E(3)-equivariant architectures. The GATr architecture was extended to other algebras by <a href="https://arxiv.org/abs/2311.04744">Haan et al.</a>, who also examine which flavor of geometric algebra is best suited for your E(3)-equivariant machine learning problem.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*DMWj0RcgzHZxam5jzd7e9A.png" /><figcaption>Overview of the GATr architecture. 
Boxes with solid lines are learnable components, those with dashed lines are fixed. Source: <a href="https://arxiv.org/abs/2305.18415">Brehmer et al.</a></figcaption></figure><p>🔮 In 2024, we can expect exciting new applications from these advancements. Some examples include the following.</p><p>1️⃣ We can expect explorations of their applicability to molecular data, drug design, neural physics emulations, crystals, etc. Other geometry-aware applications include 3D rendering, pose estimation, and planning for, e.g., robot arms.</p><p>2️⃣ We can expect the extension of geometric algebra-based networks to other neural network architectures, such as convolutional neural networks.</p><p>3️⃣ Next, the generality of the CGENN allows for explorations in other dimensions, e.g., 2D, but also in settings where data of various dimensionalities should be processed together. Further, they enable non-Euclidean geometries, which have several use cases in relativistic physics.</p><p>4️⃣ Finally, GATr and CGENN can be extended and applied to projective, conformal, hyperbolic, or elliptic geometries.</p><h4><strong>PDEs</strong></h4><p><em>Johannes Brandstetter (JKU Linz)</em></p><p>Concerning the landscape of neural PDE modelling, what topics have surfaced or gathered momentum through 2023?</p><p>1️⃣ To begin, there is a noticeable trend towards modelling PDEs on and within intricate geometries, necessitating a mesh-based discretization of space. This aligns with the overarching goal to address increasingly realistic real-world problems. For example, <a href="https://arxiv.org/abs/2309.00583">Li et al</a>. have introduced the <strong>Geometry-Informed Neural Operator (GINO)</strong> for large-scale 3D PDEs.</p><p>2️⃣ Secondly, the development of neural network surrogates for Lagrangian-based simulations is becoming increasingly intriguing. The Lagrangian discretization of space uses finite material points which are tracked as fluid parcels through space and time. 
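At the core of such particle methods is kernel density summation: each particle's density is a kernel-weighted sum over nearby particles, ρᵢ = Σⱼ mⱼ W(|xᵢ − xⱼ|, h). A toy 1D sketch with a normalized Gaussian kernel (purely illustrative, not LagrangeBench code; production SPH typically uses compactly supported kernels such as the cubic spline):

```python
import math

def gaussian_kernel(r, h):
    """Normalized 1D Gaussian smoothing kernel: integrates to 1 over the line."""
    return math.exp(-(r / h) ** 2) / (h * math.sqrt(math.pi))

def sph_density(positions, masses, h):
    """SPH density estimate rho_i = sum_j m_j * W(|x_i - x_j|, h)."""
    return [
        sum(m * gaussian_kernel(abs(xi - xj), h)
            for xj, m in zip(positions, masses))
        for xi in positions
    ]

# A cluster of particles near x=0 and one isolated particle at x=10.
xs = [-0.1, 0.0, 0.1, 10.0]
rho = sph_density(xs, [1.0] * 4, h=0.5)
print(rho)  # density is much higher inside the cluster than at x=10
```

GNN-based surrogates learn to predict the dynamics of exactly such particle systems, with the kernel neighborhood playing the role of the message-passing neighborhood.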
The most prominent Lagrangian discretization scheme is called smoothed particle hydrodynamics (SPH), which is the numerical baseline in the <strong>LagrangeBench</strong> benchmark dataset provided by <a href="https://arxiv.org/abs/2309.16342">Toshev et al.</a></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7l_3FzymQF6stilNiTNIBQ.png" /><figcaption>Time snapshots of our datasets, at the initial time (top), 40% (middle), and 95% (bottom) of the trajectory. Color temperature represents velocity magnitude. (a) Taylor Green vortex (2D and 3D), (b) Reverse Poiseuille flow (2D and 3D), (c) Lid-driven cavity (2D and 3D), (d) Dam break (2D). Source: LagrangeBench by <a href="https://arxiv.org/abs/2309.16342">Toshev et al.</a></figcaption></figure><p>3️⃣ Thirdly, diffusion-based modelling is not stopping for PDEs either. We roughly see two directions. The first direction recasts the iterative nature of the diffusion process into a refinement of a candidate state initialised from noise and conditioned on previous timesteps. This iterative refinement was introduced in <strong>PDE-Refiner</strong> (<a href="https://arxiv.org/abs/2308.05732">Lippe et al.</a>) and a variant thereof was already applied in <strong>GenCast</strong> (<a href="https://arxiv.org/abs/2312.15796">Price et al.</a>). The second direction exploits the probabilistic nature of diffusion models to model chaotic phenomena such as 3D turbulence. Examples of this can be found in <strong>Turbulent Flow Simulation</strong> (<a href="https://arxiv.org/abs/2309.01745">Kohl et al.</a>) and in <strong>From Zero To Turbulence</strong> (<a href="https://arxiv.org/abs/2306.01776">Lienen et al.</a>). Especially for 3D turbulence, there are a lot of interesting things that will happen in the near future.</p><blockquote>“Weather modelling has become a great success story over the last months. 
There is potentially much more exciting stuff to come, especially regarding weather forecasting directly from observational data or when building weather foundation models.” — Johannes Brandstetter (JKU Linz)</blockquote><p>🔮 <strong>What to expect in 2024</strong>:</p><p>1️⃣ More work regarding 3D turbulence modelling.</p><p>2️⃣ Multi-modality aspects of PDEs might emerge. This could include combining different PDEs, different resolutions, or different discretization schemes. We are already seeing a glimpse thereof in e.g. <a href="https://arxiv.org/abs/2310.02994">Multiple Physics Pretraining for Physical Surrogate Models</a> by McCabe et al.</p><p><strong>Predictions from the 2023 post</strong></p><p>(1) Neural PDEs and their applications are likely to expand to more physics-related AI4Science subfields; computational fluid dynamics (CFD) will potentially be influenced by GNNs.</p><p>✅ We are seeing 3D turbulence modelling, geometry-aware neural operators, particle-based neural surrogates, and a huge impact in e.g. weather forecasting.</p><p>(2) GNN-based surrogates might augment/replace traditional well-tried techniques.</p><p>✅ Weather forecasting has become a great success story. Neural network-based weather forecasts overtake traditional forecasts (medium-range and local forecasts), e.g., <a href="https://www.science.org/doi/full/10.1126/science.adi2336">GraphCast</a> by Lam et al. and <a href="https://arxiv.org/abs/2306.06079">MetNet-3</a> by Andrychowicz et al.</p><h3>Robustness and Explainability</h3><p><em>Kexin Huang (Stanford)</em></p><blockquote>“As GNNs are getting deployed in various domains, their reliability and robustness have become increasingly important, especially in safety-critical applications (e.g. 
scientific discovery) where the cost of errors is significant.” — Kexin Huang (Stanford)</blockquote><p>1️⃣ When discussing the reliability of GNNs, a key criterion is <strong>uncertainty quantification</strong> — quantifying how much the model knows about the prediction. There are numerous works on estimating and calibrating uncertainty, also designed specifically for GNNs (e.g. <a href="https://proceedings.neurips.cc/paper_files/paper/2022/hash/5975754c7650dfee0682e06e1fec0522-Abstract-Conference.html">GATS</a>). However, they fall short of achieving pre-defined target coverage (i.e. % of points falling into the prediction set) both theoretically and empirically. I want to emphasize that this notion of having a coverage guarantee is <strong>critical</strong> especially in ML deployment for scientific discovery since practitioners often trust a model with statistical guarantees.</p><p><strong>2️⃣ Conformal prediction</strong> is an exciting direction in statistics that provides finite-sample coverage guarantees and has been applied in many domains such as <a href="https://arxiv.org/abs/2107.07511">vision and NLP</a>. However, it was unclear whether it could be applied to graphs, since it is not obvious that the exchangeability assumption holds in graph settings. In 2023, we saw conformal prediction extended to graphs. Notably, <a href="https://arxiv.org/abs/2305.14535">CF-GNN</a> and <a href="https://proceedings.mlr.press/v202/h-zargarbashi23a/h-zargarbashi23a.pdf">DAPS</a> have derived theoretical conditions for conformal validity in the transductive node-level prediction setting and also developed methods to reduce the prediction set size for efficient downstream usage. 
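The split conformal recipe underlying these graph extensions fits in a few lines: compute nonconformity scores 1 − p(y) on a held-out calibration set, take the ⌈(n+1)(1−α)⌉-th smallest score as a threshold, and include every label whose score is within it. A generic sketch (not CF-GNN itself, which additionally learns a topology-aware correction):

```python
import math

def conformal_threshold(cal_scores, alpha):
    """q_hat = the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = min(math.ceil((n + 1) * (1 - alpha)), n)
    return sorted(cal_scores)[k - 1]

def prediction_set(probs, q_hat):
    """All labels whose nonconformity 1 - p(y) is within the threshold."""
    return [y for y, p in enumerate(probs) if 1 - p <= q_hat]

# Nonconformity scores 1 - p(true label) collected on a calibration set.
cal = [0.1, 0.3, 0.2, 0.4]
q = conformal_threshold(cal, alpha=0.2)    # ceil(5 * 0.8) = 4th smallest -> 0.4
print(prediction_set([0.7, 0.2, 0.1], q))  # [0]
```

Under exchangeability, the resulting sets contain the true label with probability at least 1 − α, which is exactly the guarantee the graph works above establish for node-level prediction.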
More recently, we have also seen conformal prediction extensions to <a href="https://arxiv.org/pdf/2306.14693v1.pdf">link prediction</a>, <a href="https://arxiv.org/abs/2306.07252">non-uniform split</a>, <a href="https://openreview.net/forum?id=homn1jOKI5">edge exchangeability</a>, and also adaptations for settings where exchangeability does not hold (such as the <a href="https://arxiv.org/abs/2211.14555">inductive setting</a>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*obPWmb-uDWytb9bVPX5zeQ.png" /><figcaption>Conformal prediction for graph-structured data. (1) A base GNN model (GNN) that produces prediction scores µ for node i. (2) Conformal correction. Since the training step is not aware of the conformal calibration step, the size/length of prediction sets/intervals (i.e. efficiency) is not optimized. We use a topology-aware correction model that takes µ as the input node feature and aggregates information from its local subgraph to produce an updated prediction µ˜. (3) Conformal prediction. We prove that in a transductive random split setting, graph exchangeability holds given permutation invariance. Thus, standard CP can be used to produce a prediction set/interval based on µ˜ that includes the true label with a pre-specified coverage rate 1-α. Source: <a href="https://arxiv.org/abs/2305.14535">Huang et al.</a></figcaption></figure><p>🔮 Looking ahead, we expect more extensions to cover a wide range of GNN deployment use cases. Overall, I think statistical guarantees for GNNs are valuable because they give practitioners the confidence to act on GNN predictions.</p><h3>Graph Transformers</h3><p><em>Chen Lin (Oxford)</em></p><p>💡 In 2023, we have seen the continued rise of Graph Transformers. 
It has become a <strong>common GNN design</strong>: e.g., in <a href="https://arxiv.org/abs/2305.18415">GATr</a>, the authors attribute its popularity to its <em>“favorable scaling properties, expressiveness, trainability, and versatility”</em>.</p><p>1️⃣ <strong>Expressiveness of GTs. </strong>As mentioned in the GNN Theory section, recent work from <a href="https://arxiv.org/abs/2301.11956">Cai et al. (2023)</a> shows the equivalence between MPNNs with a Virtual Node and GTs under a <em>non-uniform setting. </em>This raises the question of how powerful GTs are and where their representational power comes from. <a href="https://arxiv.org/abs/2301.09505">Zhang et al. (2023)</a> successfully combine their GTs with a new, powerful positional embedding (PE) to improve expressiveness, achieving expressiveness for the biconnectivity problem. This gives evidence of the importance of PEs to the expressiveness of GTs. A recent submission, <a href="https://openreview.net/pdf?id=JfjduOxrTY">GPNN</a>, provides a clearer view of the central role of the positional encoding: one can generalize the proof in <a href="https://arxiv.org/abs/2301.09505">Zhang et al. (2023)</a> to show how GTs’ expressiveness is determined by various positional encodings.</p><p><strong>2️⃣</strong> <strong>Positional (Structural) Encoding. </strong>Given the importance of PE/SE to GTs, we now turn to the design of these expressive features, usually derived from existing graph invariants. In 2022, <a href="https://arxiv.org/abs/2205.12454">GraphGPS</a> achieved huge empirical success by combining GTs with various (or even multiple) PE/SEs. In 2023, more powerful PE/SEs became available.</p><p><strong>Relative Random Walk PE (RRWP)</strong> proposed by <a href="https://arxiv.org/abs/2305.17589">Ma et al</a> generalizes the random walk structural encoding with a relational part. 
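The plain random-walk structural encoding that RRWP generalizes is simply the diagonal of successive powers of the random-walk matrix M = D⁻¹A: node i's k-th feature is the probability that a k-step walk from i returns to i. A toy numpy sketch (our own illustration, not the GRIT implementation):

```python
import numpy as np

def rw_structural_encoding(A, K):
    """(n, K) matrix whose (i, k) entry is the return probability of a
    (k+1)-step random walk starting at node i."""
    M = A / A.sum(axis=1, keepdims=True)  # random-walk matrix D^-1 A
    feats, P = [], np.eye(len(A))
    for _ in range(K):
        P = P @ M
        feats.append(np.diag(P))
    return np.stack(feats, axis=1)

# Triangle graph: return probabilities are 0 (1 step), 1/2 (2 steps), 1/4 (3 steps).
A = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], dtype=float)
enc = rw_structural_encoding(A, K=3)
print(enc)
```

RRWP keeps the full off-diagonal entries of these powers as well, turning them into relative (pairwise) encodings fed to the attention mechanism.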
Together with a new variant of the attention mechanism, <strong>GRIT</strong> achieves strong empirical performance compared with existing PE/SEs on property prediction benchmarks (SOTA on ZINC). Theoretically, RRWP can approximate the shortest-path distance, personalized PageRank, and heat kernel with a specific choice of parameters. With RRWP, GRIT is more expressive than SPD-WL.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*wDd37zUf83fvCZ3LyVhqhA.png" /><figcaption>RRWP visualization for the fluorescein molecule, up to the 4th power. Thicker and darker edges indicate higher edge weight. Probabilities for longer random walks reveal higher-order structures (e.g., the cliques evident in 3-RW and the star patterns in 4-RW). Source: <a href="https://arxiv.org/abs/2305.17589">Ma et al</a>.</figcaption></figure><p><a href="https://arxiv.org/abs/2302.11556">Puny et al</a> proposed a new theoretical framework for expressivity based on <strong>Equivariant Polynomials</strong>, where the expressivity of common GNNs can be improved by using polynomial features, computed with tensor contractions based on the equivariant basis, as positional encodings. The empirical results are surprising: GatedGCN improves from a test MAE of 0.265 to 0.106 with the d-expressive polynomials. It will be very interesting to see if someone combines this with GTs in the future.</p><p><strong>3️⃣ Efficient GTs. </strong>It remains challenging to apply GTs to large graphs due to their O(N²) complexity. In 2023, we saw more works trying to overcome this difficulty by lowering the computational complexity of GTs. 
<a href="https://arxiv.org/abs/2210.02997">Deac et al</a> used <a href="https://en.wikipedia.org/wiki/Expander_graph">expander graphs</a> (sparse yet well-connected graphs) for propagation.<strong> </strong><a href="https://arxiv.org/abs/2303.06147">Exphormer</a> extended this idea to GTs by combining expander graphs with local neighborhood aggregation and a virtual node. Exphormer allows graph transformers to scale to larger graphs (as large as <em>ogbn-arxiv</em> with 169K nodes). It also achieved strong empirical results and ranked top on several <a href="https://github.com/vijaydwivedi75/lrgb">Long-Range Graph Benchmark</a> tasks.</p><p>🔮 <strong>Moving forward to 2024:</strong></p><ol><li>A better understanding of self-attention’s benefits beyond expressiveness.</li><li>Big open-source pre-trained equivariant GTs in 2024!</li><li>More powerful positional encodings.</li></ol><h3>New Datasets &amp; Benchmarks</h3><p><strong>Structural biology:</strong> Pinder from VantAI, <a href="https://arxiv.org/abs/2308.05777">PoseBusters</a> from Oxford, <a href="https://arxiv.org/abs/2308.07413">PoseCheck</a> from The Other Place, <a href="https://openreview.net/forum?id=UfBIxpTK10">DockGen</a>, and the LargeMix and UltraLarge datasets <a href="https://arxiv.org/abs/2310.04292">from Valence Labs</a></p><p><a href="http://tgb.mila.quebec/"><strong>Temporal Graph Benchmark</strong></a> (TGB): Until now, progress in temporal graph learning has been held back by the lack of large, high-quality datasets, as well as the lack of proper evaluation, leading to over-optimistic performance. TGB addresses this by introducing a collection of seven realistic, large-scale and diverse benchmarks for learning on temporal graphs, including both node-wise and link-wise tasks. 
Inspired by the success of OGB, TGB automates dataset downloading and processing as well as evaluation protocols, and allows users to compare model performance on a <a href="https://tgb-website.pages.dev/docs/leader_linkprop/">leaderboard</a>. Check out the <a href="https://towardsdatascience.com/temporal-graph-benchmark-bb5cc26fcf11">associated blog post</a> for more details.</p><p><a href="https://github.com/google-research-datasets/tpu_graphs"><strong>TpuGraphs</strong></a> from Google Research: a graph property prediction dataset of TPU computational graphs. The dataset provides 25x more graphs than the largest graph property prediction dataset (with comparable graph sizes), and 770x larger graphs on average compared to existing performance prediction datasets on machine learning programs. Google ran a <a href="https://www.kaggle.com/competitions/predict-ai-model-runtime">Kaggle competition</a> based on TpuGraphs!</p><p><a href="https://github.com/tumaer/lagrangebench"><strong>LagrangeBench</strong></a>: A Lagrangian Fluid Mechanics Benchmarking Suite — where you can evaluate your favorite GNN-based simulator in a JAX-based environment (for JAX aficionados).</p><p><a href="https://relbench.stanford.edu/"><strong>RelBench</strong></a>: the Relational Deep Learning Benchmark from Stanford and Kumo.AI: make time-based predictions over relational databases (which you can model as graphs or hypergraphs).</p><p><a href="https://github.com/google-deepmind/materials_discovery?tab=readme-ov-file#dataset"><strong>The GNoME dataset</strong></a> from Google DeepMind: 381k novel stable materials for your materials discovery and ML potential models!</p><h3>Conferences, Courses &amp; Community</h3><p>The main events in the graph and geometric learning world (apart from big ML conferences) grow larger and more mature: <a href="https://logconference.org/">The Learning on Graphs Conference (LoG)</a>, <a href="https://www.moml.mit.edu/">Molecular ML</a> (MoML), and the <a 
href="https://snap.stanford.edu/graphlearning-workshop-2023/">Stanford Graph Learning Workshop</a>. The LoG conference features a cool format with the remote-first conference and dozens of local meetups organized by community members spanning the whole globe from China to UK &amp; Europe to the US West Coast 🌏🌍🌎 .</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*e-XqMpOmfsAXPErb" /><figcaption>The LoG meetups in Amsterdam, Paris, Tromsø, and Shanghai. Source: Slack of the LoG community</figcaption></figure><h4>Courses, books, and educational resources</h4><ul><li><a href="https://github.com/chaitjo/geometric-gnn-dojo">Geometric GNN Dojo</a> — a pedagogical resource for beginners and experts to explore the design space of GNNs for geometric graphs (pairs best with the recent Hitchhiker’s Guide to Geometric GNNs).</li><li><a href="https://github.com/atong01/conditional-flow-matching">TorchCFM</a> — the main entrypoint to the world of flow matching.</li><li>The <a href="https://github.com/pyt-team">PyT team</a> maintains TopoNetX, TopoModelX, and TopoEmbedX — the most hands-on libraries to jump into topological deep learning.</li><li>The book on <a href="https://maurice-weiler.gitlab.io/#cnn_book">Equivariant and Coordinate Independent Convolutional Networks: A Gauge Field Theory of Neural Networks</a> by Maurice Weiler, Patrick Forré, Erik Verlinde, and Max Welling — brings together the findings on the representation theory and differential geometry of equivariant CNNs</li></ul><h4>Surveys</h4><ul><li><strong>ML for Science in Quantum, Atomistic, and Continuum systems</strong> by well over 60 authors from 23 institutions (<a href="https://arxiv.org/abs/2307.08423">Zhang, Wang, Helwig, Luo, Fu, Xie et al.</a>)</li><li><strong>Scientific discovery in the age of artificial intelligence</strong> by <a href="https://www.nature.com/articles/s41586-023-06221-2">Wang et al</a> published in Nature.</li></ul><h4>Prominent seminar series</h4><ul><li><a 
href="https://portal.valencelabs.com/logg">Learning on Graphs &amp; Geometry</a></li><li><a href="https://portal.valencelabs.com/m2d2">Molecular Modeling and Drug Discovery (M2D2)</a></li><li><a href="https://www.youtube.com/@Vant_AI">VantAI reading group</a></li><li><a href="https://log-2.github.io/">Oxford LoG2 seminar series</a></li></ul><h4>Slack communities</h4><ul><li><a href="https://join.slack.com/t/logag/shared_invite/zt-22y7n3k7a-FHwX31gc85yZCa0uF8BU7w">LoGaG</a></li><li><a href="https://join.slack.com/t/logconference/shared_invite/zt-27nv8ba1y-pXspnAzgLOMdDzfKgpOafg">LOG conference</a></li><li><a href="https://data.pyg.org/slack.html">PyG</a></li></ul><h3>Memes of 2023</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*jpxlRK1BMu_kEXh5GZJpxA.png" /><figcaption>Commemorating the successes of flow matching in 2023 in the meme and unique t-shirts brought to NeurIPS’23. Right: Hannes Stärk and Michael Galkin are making a statement at NeurIPS’23. Images by Michael Galkin</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/495/1*dPvq5YYXyIWMeu0BVFv2vA.jpeg" /><figcaption>GNN aggregation functions are actually portals to category theory (Created by Petar Veličković)</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*iKgI0-kYQPGTw3EaF33-Pw.png" /><figcaption>Michael Bronstein continues to harass Google by demanding his <a href="https://www.cs.ox.ac.uk/news/1996-full.html">DeepMind chair</a> at every ML conference, but so far, he has only been offered stools (photo credits: Jelani Nelson and Thomas Kipf).</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qr84KeOKOpcW2pm3" /><figcaption>The authors of this blog post congratulate you upon completing the long read. 
Michael Galkin and Michael Bronstein with the Meme of 2022 at ICML 2023 in Hawaii (Photo credit: Ben Finkelshtein)</figcaption></figure><p><em>For additional articles about geometric and graph deep learning, see </em><a href="https://medium.com/@mgalkin"><em>Michael Galkin</em></a><em>’s and </em><a href="https://medium.com/@michael-bronstein"><em>Michael Bronstein</em></a><em>’s Medium posts and follow the two Michaels (</em><a href="https://twitter.com/michael_galkin"><em>Galkin</em></a><em> and </em><a href="https://twitter.com/mmbronstein"><em>Bronstein</em></a><em>) on Twitter.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=3af5d38376e1" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/graph-geometric-ml-in-2024-where-we-are-and-whats-next-part-i-theory-architectures-3af5d38376e1">Graph &amp; Geometric ML in 2024: Where We Are and What’s Next (Part I — Theory &amp; Architectures)</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[ULTRA: Foundation Models for Knowledge Graph Reasoning]]></title>
            <link>https://medium.com/data-science/ultra-foundation-models-for-knowledge-graph-reasoning-9f8f4a0d7f09?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/9f8f4a0d7f09</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[thoughts-and-theory]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[graph-machine-learning]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Fri, 03 Nov 2023 17:03:49 GMT</pubDate>
            <atom:updated>2023-11-03T17:19:37.114Z</atom:updated>
<content:encoded><![CDATA[<h4>What’s new in Graph ML?</h4><h4>One model to rule them all</h4><p>Training a single generic model for solving arbitrary datasets has long been a dream of ML researchers, especially in the era of foundation models. While such dreams have been realized in perception domains like images or natural language, whether they can be reproduced in reasoning domains (like graphs) remains an open challenge.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*o1UKqk6Eb8HnNZEBVckGJw.png" /><figcaption>Image by Authors edited from the output of DALL-E 3.</figcaption></figure><p>In this blog post, we prove that such a generic reasoning model exists, at least for knowledge graphs (KGs). We create <strong>ULTRA</strong>, a single pre-trained reasoning model that generalizes to new KGs with arbitrary entity and relation vocabularies, and serves as a default solution for any KG reasoning problem.</p><p><em>This post is based on our recent paper (</em><a href="https://arxiv.org/abs/2310.04562"><em>preprint</em></a><em>) and was written together with </em><a href="https://github.com/KatarinaYuan"><em>Xinyu Yuan</em></a><em> (Mila), </em><a href="https://kiddozhu.github.io/"><em>Zhaocheng Zhu</em></a><em> (Mila), and </em><a href="https://www.cs.purdue.edu/homes/ribeirob/"><em>Bruno Ribeiro</em></a><em> (Purdue / Stanford). 
Follow </em><a href="https://twitter.com/michael_galkin"><em>Michael</em></a><em>, </em><a href="https://twitter.com/XinyuYuan402"><em>Xinyu</em></a><em>, </em><a href="https://twitter.com/zhu_zhaocheng"><em>Zhaocheng</em></a><em>, and </em><a href="https://twitter.com/brunofmr"><em>Bruno</em></a><em> on Twitter for more Graph ML content.</em></p><h3>Outline</h3><ol><li><a href="#974c">Why KG representation learning is stuck in 2018</a></li><li><a href="#062b">Theory: What makes a model inductive and transferable?</a></li><li><a href="#fbb0">Theory: Equivariance in multi-relational graphs</a></li><li><a href="#86f8">ULTRA: A Foundation Model for KG Reasoning</a></li><li><a href="#2517">Experiments: Best even in the zero-shot inference, Scaling behavior</a></li><li><a href="#71ab">Code, Data, Checkpoints</a></li></ol><h3>Why KG representation learning is stuck in 2018</h3><p>The pretrain-finetune paradigm has been with us since 2018, when <a href="https://arxiv.org/abs/1802.05365">ELMo</a> and <a href="https://arxiv.org/abs/1801.06146">ULMFit</a> showed the first promising results, later cemented by <a href="https://arxiv.org/abs/1810.04805">BERT</a> and <a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">GPT</a>.</p><p>In the era of <em>large language models</em> (LLM) and more general <em>foundation models</em> (FMs), we often have a single model (like GPT-4 or Llama-2) pre-trained on enormous amounts of data and capable of performing a wide variety of language tasks in a zero-shot manner (or at least capable of being fine-tuned on a specific dataset). These days, multimodal FMs even support language, vision, audio, and other modalities in one and the same model.</p><p>Things work a little differently in Graph ML. 
Particularly, <strong>what’s up with representation learning on KGs at the end of 2023?</strong> The main tasks here are edge-level:</p><ul><li>Entity prediction (or knowledge graph completion) (h,r,?): given a head node and relation, rank all nodes in the graph that can potentially be true tails.</li><li>Relation prediction (h,?,t): given two nodes, predict a relation type between them.</li></ul><p>It turns out that, until now, the field has been stuck somewhere pre-2018. The key problem is:</p><blockquote>Each KG has its own set of entities and relations; there is no single pre-trained model that would transfer to any graph.</blockquote><p>For example, if we look at Freebase (the KG behind the Google Knowledge Graph) and Wikidata (the largest open-source KG), they have absolutely different sets of entities (86M vs 100M) and relations (1500 vs 6000). Is there any hope for current KG representation learning methods to be trained on one graph and transfer to another?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pb3UPyOR9Shbu5_-" /><figcaption>Different vocabularies of Freebase and Wikidata. Image by Authors.</figcaption></figure><p>❌ Classical transductive methods like TransE, ComplEx, RotatE, and hundreds of other embedding-based methods learn a <strong>fixed set of entities and relation types</strong> from the training graph and cannot even support new nodes added to the same graph. Shallow embedding-based methods do not transfer (in fact, we believe there is no point in developing such methods anymore except for some student project exercises).</p><p>🟡 Inductive entity methods like <a href="https://openreview.net/forum?id=xMJWUKJnFSw">NodePiece</a> and <a href="https://arxiv.org/pdf/2106.06935.pdf">Neural Bellman-Ford Nets</a> do not learn entity embeddings. Instead, they parameterize training (seen) and new inference (unseen) nodes as a function of fixed relations. 
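For contrast, the shallow transductive paradigm above can be sketched with TransE, which scores a triple (h, r, t) as −‖h + r − t‖ over per-entity embedding vectors. A toy sketch with hand-set embeddings (real models learn them by gradient descent):

```python
import numpy as np

def transe_score(h, r, t):
    """TransE plausibility score: higher (closer to 0) is better."""
    return -np.linalg.norm(h + r - t)

# Hand-set toy embeddings. Every key below is tied to THIS graph's vocabulary,
# which is exactly why such a model cannot transfer to a new KG.
ent = {"montreal": np.array([0.0, 0.0]),
       "canada":   np.array([1.0, 1.0]),
       "paris":    np.array([3.0, 0.0])}
rel = {"located_in": np.array([1.0, 1.0])}

# Entity prediction for the query (montreal, located_in, ?): rank all tails.
scores = {e: transe_score(ent["montreal"], rel["located_in"], ent[e])
          for e in ent}
best = max(scores, key=scores.get)
print(best)  # canada: montreal + located_in lands exactly on canada
```

A graph with a different entity or relation vocabulary has no rows in these lookup tables at all, which is the transfer barrier the rest of this post is about.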
Since they <strong>learn only relation embeddings</strong>, they can transfer to graphs with new nodes, but transfer to new graphs with different relations (like Freebase to Wikidata) is still beyond reach.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1021/0*AulBL23t9Xskv_Qt" /><figcaption>Relative entity representations enable inductive GNNs. Image by Authors.</figcaption></figure><p>What to do if you have <strong>both</strong> new entities and relations at inference time (a completely new graph)? If you don’t learn entity or relation embeddings, is the transfer theoretically possible? Let’s look into the theory then.</p><h3>Theory: What makes a model inductive and transferable?</h3><p>Let’s define the setup more formally:</p><ul><li>KGs are directed, multi-relational graphs with arbitrary sets of nodes and relation types.</li><li>Graphs arrive <strong>without features</strong>, that is, we don’t assume the existence of textual descriptions (nor pre-computed feature vectors) of entities and relations.</li><li>Given a query (head, relation, ?), we want to rank all nodes in the underlying graph (inference graph) and maximize the probability of returning a true tail.</li><li><em>Transductive</em> setup: the set of entities and relations is the same at training and inference time.</li><li><em>Inductive</em> (entity) setup: the set of relations has to be fixed at training time, but nodes can be different at training and inference.</li><li><em>Inductive</em> (entity and relation) setup: both new unseen entities and relations are allowed at inference.</li></ul><p>What do neural networks learn to be able to generalize to new data? The primary reference — the book on <a href="https://geometricdeeplearning.com/">Geometric Deep Learning by Bronstein, Bruna, Cohen, and Veličković</a> — posits that it is a question of <em>symmetries and invariances</em>.</p><p>What are the learnable invariances in foundation models? 
LLMs are trained on a fixed vocabulary of tokens (sub-word units, bytes, or even randomly initialized vectors as in <a href="https://arxiv.org/abs/2305.16349">Lexinvariant LLMs</a>), vision models learn functions to project image patches, and audio models learn to project audio patches.</p><blockquote>What are the learnable invariances for multi-relational graphs?</blockquote><p>First, we will introduce the invariances (equivariances) in standard <strong>homogeneous</strong> graphs.</p><p><em>Standard (single) permutation equivariant graph models:</em> A great leap in graph ML came when early GNN work (<a href="https://ro.uow.edu.au/cgi/viewcontent.cgi?article=10501&amp;context=infopapers">Scarselli et al. 2008</a>, <a href="https://arxiv.org/abs/1810.00826">Xu et al. 2018</a>, <a href="https://ojs.aaai.org/index.php/AAAI/article/view/4384">Morris et al. 2018</a>) showed that inductive tasks on graphs benefit enormously from assuming that vertex IDs are arbitrary, such that the predictions of a graph model should not change if we reassign vertex IDs. This is known as <em>permutation equivariance</em> of the neural network on node IDs. This realization has created great excitement and a profusion of novel graph representation methods since: as long as the neural network is equivariant to node ID permutations, we can call it a graph model.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*A3zyGKK779PYm6tn" /><figcaption><em>Single-relational graphs. GNNs are equivariant to node permutations: Michael Jackson’s node vector will have the same value even after re-labeling node IDs. Image by Authors.</em></figcaption></figure><p>The permutation equivariance on node IDs allows GNNs to inductively (zero-shot) transfer the patterns learned from a training graph to another (different) test graph. This is a consequence of the equivariance: since the neural network cannot use node IDs to produce embeddings, it must use the graph structure. 
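This property is easy to check numerically; below is a minimal sketch (plain Python, sum aggregation) showing that one message passing step commutes with a relabeling of node IDs:

```python
# Quick numeric check of permutation equivariance: one round of
# sum-aggregation message passing, before and after relabeling node IDs.
# The graph is a 3-node path 0-1-2 with arbitrary scalar features.
def message_pass(edges, feat):
    out = {v: 0.0 for v in feat}
    for u, v in edges:
        out[u] += feat[v]  # each endpoint sums its neighbors' features
        out[v] += feat[u]
    return out

edges = [(0, 1), (1, 2)]
feat = {0: 1.0, 1: 2.0, 2: 3.0}
h = message_pass(edges, feat)

# relabel node IDs with a permutation pi and rerun
pi = {0: 2, 1: 0, 2: 1}
h_perm = message_pass([(pi[u], pi[v]) for u, v in edges],
                      {pi[v]: x for v, x in feat.items()})

# outputs are the same up to the same permutation of node IDs
assert all(h_perm[pi[v]] == h[v] for v in feat)
```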
This creates what we know as <em>structural representations</em> in graphs (see <a href="https://iclr.cc/virtual_2020/poster_SJxzFySKwH.html">Srinivasan &amp; Ribeiro (ICLR 2020)</a>).</p><h3>Equivariance in multi-relational graphs</h3><p>Now edges in the graphs might have different relation types — is there any GNN theory for such graphs?</p><p>1️⃣ In our previous work, <a href="https://arxiv.org/abs/2211.17113">Weisfeiler and Leman Go Relational</a> (with Pablo Barceló, Christopher Morris, and Miguel Romero Orth, LoG 2022), we derived Relational WL — a WL expressiveness hierarchy for multi-relational graphs focusing more on node-level tasks. The great<a href="https://arxiv.org/abs/2302.02209"> follow-up work by Huang et al (NeurIPS 2023)</a> extended the theory to link prediction, formalized <em>conditional message passing,</em> and logical expressiveness using Relational WL. ✍️ Let’s remember <strong>conditional message passing</strong> — we’ll need it later — it provably improves link prediction performance.</p><p>The proposed addition of a global readout vector induced by incoming/outgoing edge direction resembles the <a href="https://arxiv.org/abs/2305.10498">recent work of Emanuele Rossi et al</a> on studying directionality in homogeneous MPNNs (read <a href="https://towardsdatascience.com/direction-improves-graph-learning-170e797e94fe">the blog post on Medium</a> for more details). Still, those works do not envision the case when even relations at test time are unseen.</p><p><em>2️⃣ Double permutation equivariant (multi-relational) graph models:</em> Recently, <a href="https://arxiv.org/abs/2302.01313">Gao et al. 2023</a> proposed the concept of <strong>double equivariance</strong> for multi-relational graphs. Double equivariance forces the neural network to be equivariant to the joint permutations of both node IDs and relation IDs. 
This ensures the neural network learns structural patterns between nodes and relations, which allows it to inductively (zero-shot) transfer the learned patterns to another graph with new nodes and new relations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*OlG4XP85UM_CtLFc" /><figcaption><em>Double equivariance in multi-relational graphs. Permuting both node IDs and relation IDs does not change the relational structure. Hence, the output node states should be the same (but permuted). Image by Authors.</em></figcaption></figure><p>➡️ In our work, we find<em> the invariance of relation interactions</em>, that is, even if relation identities are different, their fundamental interactions remain the same, and those fundamental interactions can be captured by a <strong>graph of relations. </strong>In the graph of relations, each node is a relation type from the original graph. Two nodes in this graph will be connected if edges with those relation types in the original graph are incident (that is, they share a head or tail node). Depending on the incidence, we distinguish<strong> 4 edge types</strong> in the graph of relations:</p><ul><li><em>Head-to-head (h2h)</em> — two relations can start from the same head entity;</li><li><em>Tail-to-head (t2h)</em> — tail entity of one relation can be a head of another relation;</li><li><em>Head-to-tail (h2t)</em> — head entity of one relation can be a tail of another relation;</li><li><em>Tail-to-tail (t2t)</em> — two relations can have the same tail entity.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ZkU03Xlbt7EkK_CV" /><figcaption><em>Different incidence patterns in the original graph produce different interactions in the graph of relations. The right-most: the example relation graph (inverse edges are omitted for clarity). 
Image by Authors</em></figcaption></figure><p>A few nice properties of the relation graph:</p><ul><li>It can be built from absolutely any multi-relational graph (with simple sparse matrix multiplications).</li><li>The 4 fundamental interactions never change because they just encode the basic topology — in directed graphs there will always be head and tail nodes, and relations will always exhibit those incidence patterns.</li></ul><blockquote>Essentially, learning representations over the relation graph can transfer to any multi-relational graph! This is the <em>learnable invariance</em>.</blockquote><p>In fact, it can be shown (we are already working on the formal proofs, which will be available in an upcoming work 😉) that representing relations via their interactions in a graph of relations yields a double equivariant model! This means that learned relational representations do not depend on relation identities but rather on the joint interactions between relations, nodes, and nodes &amp; relations.</p><h3>ULTRA: A Foundation Model for KG Reasoning</h3><p>With all the theoretical foundations backing us up, we are now ready to introduce ULTRA.</p><p>ULTRA is a method for unified, learnable, and transferable graph representations. ULTRA leverages the invariances (and equivariances) of the <strong>graph of relations</strong> with its fundamental interactions and applies <strong>conditional message passing</strong> to get relative relational representations. Perhaps the coolest fact is that</p><blockquote>a single pre-trained ULTRA model can run 0-shot inference on any possible multi-relational graph and be fine-tuned on any graph.</blockquote><p>In other words, ULTRA is pretty much a foundation model that can run inference on any graph input (with already good performance) and be fine-tuned on any target graph of interest.</p><p>The crucial component of ULTRA is its <em>relative</em> relation representations constructed from the graph of relations. 
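The graph-of-relations construction described above can be sketched in a few lines. This toy version uses Python sets instead of the sparse matrix multiplications an efficient implementation would use:

```python
from collections import defaultdict

# Build the graph of relations from (head, relation, tail) triples.
# Relations r1, r2 are linked by one of the 4 fundamental edge types
# depending on how their edges touch in the original graph.
def relation_graph(triples):
    heads, tails = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        heads[r].add(h)
        tails[r].add(t)
    rels = set(heads) | set(tails)
    edges = set()
    for r1 in rels:
        for r2 in rels:
            if heads[r1] & heads[r2]:
                edges.add((r1, "h2h", r2))  # shared head entity
            if tails[r1] & heads[r2]:
                edges.add((r1, "t2h", r2))  # a tail of r1 is a head of r2
            if heads[r1] & tails[r2]:
                edges.add((r1, "h2t", r2))  # a head of r1 is a tail of r2
            if tails[r1] & tails[r2]:
                edges.add((r1, "t2t", r2))  # shared tail entity
    return edges

triples = [("MichaelJackson", "genre", "Pop"),
           ("MichaelJackson", "authored", "Thriller"),
           ("Thriller", "genre", "Pop")]
rg = relation_graph(triples)
print(("genre", "h2h", "authored") in rg)  # both can start at the same head
```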
Given a query (Michael Jackson, genre, ?), we first initialize the genre node in the graph of relations with the all-ones vector (all other nodes are initialized with zeros). After running a GNN, the resulting node embeddings of the relation graph are conditioned on the genre node — this means that each query relation gets its own matrix of relational features, which is very helpful from both theoretical and practical standpoints!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RszqzszKboQw995B" /><figcaption><em>ULTRA employs relative relation representations (a labeling trick over the graph of relations) such that each relation (e.g., “genre”) has its own unique matrix of all relation representations. Image by Authors.</em></figcaption></figure><p>Practically, given an input KG and a (h, r, ?) query, ULTRA executes the following actions:</p><ol><li>Construct the graph of relations;</li><li>Get relation features from the conditional message passing GNN on the graph of relations (conditioned on the initialized query relation r);</li><li>Use the obtained relational representations for the inductive link predictor GNN conditioned on the initialized head node h.</li></ol><p>Steps 2 and 3 are implemented via slightly different modifications of the <a href="https://arxiv.org/pdf/2106.06935.pdf">Neural Bellman-Ford net (NBFNet)</a>. ULTRA only learns embeddings of the 4 fundamental interactions (h2t, t2t, t2h, h2h) and GNN weights — pretty small overall. The main model we experimented with has only 177k parameters.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BPXu-a57qFKky6Lq" /><figcaption><em>Three main steps taken by ULTRA: (1) building a relation graph; (2) running conditional message passing over the relation graph to get relative relation representations; (3) using those representations for the inductive link predictor GNN on the entity level. 
Image by Authors.</em></figcaption></figure><h3>Experiments: Best even in zero-shot inference, and Fine-tuning</h3><p>We pre-trained ULTRA on 3 standard KGs based on Freebase, Wikidata, and WordNet, and ran 0-shot link prediction on 50+ other KGs of various sizes, ranging from 1k to 120k nodes and from 2k to 1.1M edges.</p><p>Averaged across the datasets with known SOTA, a single pre-trained ULTRA model is <strong>better in the 0-shot inference mode</strong> than existing SOTA models trained specifically on each graph. 🚀 Fine-tuning improves the performance by a further 10%. It’s particularly amazing that a single trained ULTRA model can scale to graphs of such different sizes (a 100x difference in the number of nodes and 500x in the number of edges) whereas GNNs are known to suffer from size generalization issues (see the prominent works by <a href="https://arxiv.org/abs/2010.08853">Yehudai et al, ICML 2021</a> and <a href="https://arxiv.org/abs/2205.15117">Zhou et al, NeurIPS 2022</a>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RFmXmyupmLpMWxi_" /><figcaption>A single pre-trained ULTRA is better even in the 0-shot inference mode than supervised SOTA models trained end-to-end on specific graphs (look at the Average column). Fine-tuning improves the performance even further. Image by Authors</figcaption></figure><p>🙃 In fact, with 57 tested graphs, we almost ran out of KGs to test ULTRA on. 
So if you have a fresh new benchmark hidden somewhere — let us know!</p><h3>Scaling Behavior</h3><p>We can bump the zero-shot performance even more by adding more graphs to the pre-training mixture, although we do observe some performance saturation after training on 4+ graphs.</p><p>The church of <a href="https://arxiv.org/abs/2001.08361">Scaling Laws</a> predicts even better performance with bigger models trained on more high-quality data, so it’s definitely on our agenda.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xTouwTJjPII4XM0r" /><figcaption>Zero-shot performance increases with more diverse graphs in the pre-training mix. Image by Authors.</figcaption></figure><h3>Conclusion: Code, Data, Checkpoints</h3><p>So foundation models for KG reasoning are finally here; we are past that 2018 threshold! A single pre-trained ULTRA model can perform link prediction on any KG (multi-relational graph) from any domain. You really just need a graph with more than 1 edge type to get going.</p><p>📈 Practically, ULTRA demonstrates very promising performance on a variety of KG benchmarks already in the 0-shot mode, but you can bump the performance even further with a short fine-tuning.</p><p>We make all the code, training data, and pre-trained model checkpoints available on GitHub so you can start running ULTRA on your data right away!</p><p>📜 Preprint: <a href="https://arxiv.org/abs/2310.04562">arXiv</a></p><p>🛠️ Code, data: <a href="https://github.com/DeepGraphLearning/ULTRA">GitHub repo</a></p><p>🍪 Checkpoints: 2 checkpoints (2 MB each) in the <a href="https://github.com/DeepGraphLearning/ULTRA">GitHub repo</a></p><p>🌎 Project website: <a href="https://deepgraphlearning.github.io/project/ultra">here</a></p><p>As a closing remark, KG reasoning represents just a fraction of the many interesting problems in the reasoning domain, and the majority still don’t have a generic solution. 
We believe the success of KG reasoning will bring more breakthroughs in other reasoning domains (for example, we recently found that <a href="https://arxiv.org/abs/2310.07064">LLMs can actually learn and employ textual rules</a>). Let’s stay optimistic about the future of reasoning!</p><hr><p><a href="https://medium.com/data-science/ultra-foundation-models-for-knowledge-graph-reasoning-9f8f4a0d7f09">ULTRA: Foundation Models for Knowledge Graph Reasoning</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph Machine Learning @ ICML 2023]]></title>
            <link>https://medium.com/data-science/graph-machine-learning-icml-2023-9b5e4306a1cc?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/9b5e4306a1cc</guid>
            <category><![CDATA[graph-machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[ai]]></category>
            <category><![CDATA[editors-pick]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Sun, 06 Aug 2023 02:07:28 GMT</pubDate>
            <atom:updated>2023-08-06T14:15:32.522Z</atom:updated>
            <content:encoded><![CDATA[<h4>What’s new in Graph ML?</h4><h4>Recent advancements and hot trends, August 2023 edition</h4><p>Magnificent beaches and tropical Hawaiian landscapes 🌴 did not turn brave scientists away from attending the <a href="https://icml.cc/Conferences/2023">International Conference on Machine Learning</a> in Honolulu and presenting their recent work! Let’s see what’s new in our favorite Graph Machine Learning area.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*38Um1nDjooXaqvaxVeHpow.jpeg" /><figcaption>Image By Author.</figcaption></figure><p><em>Thanks to Santiago Miret for proofreading the post.</em></p><p>To make a post about papers less boring, I threw in some photos taken around Honolulu 📷</p><h3>Table of contents (clickable):</h3><ol><li><a href="#8d41">Graph Transformers: Sparser, Faster, and Directed</a></li><li><a href="#0d40">Theory: VC dimension of GNNs, deep dive into over-squashing</a></li><li><a href="#c5be">New GNN architectures: delays and half-hops</a></li><li><a href="#7e7c">Generative Models — Stable Diffusion for Molecules, Discrete diffusion</a></li><li><a href="#b0d0">Geometric Learning: Geometric WL, Clifford Algebras</a></li><li><a href="#32a5">Molecules: 2D-3D pretraining, Uncertainty Estimation in MD</a></li><li><a href="#1ff6">Materials &amp; Proteins: CLIP for proteins, Ewald Message Passing, Equivariant Augmentations</a></li><li><a href="#1891">Cool Applications: Algorithmic reasoning, Inductive KG completion, GNNs for mass spectra</a></li><li><a href="#5eb2">The Concluding Meme Part</a></li></ol><h3><strong>Graph Transformers: Sparser, Faster, and Directed</strong></h3><p>We <a href="https://towardsdatascience.com/graphgps-navigating-graph-transformers-c2cc223a051c">presented</a> <strong>GraphGPS</strong> about a year ago, and it is pleasing to see many ICML papers building upon our framework and expanding GT capabilities even further.</p><p><strong>➡️ Exphormer</strong> by <a 
href="https://openreview.net/forum?id=3Ge74dgjjU">Shirzad, Velingker, Venkatachalam et al</a> adds a missing piece of graph-motivated sparse attention to GTs: instead of BigBird or Performer (originally designed for sequences), Exphormer’s attention builds upon 1-hop edges, virtual nodes (connected to all nodes in a graph), and a neat idea of <a href="https://en.wikipedia.org/wiki/Expander_graph">expander edges</a>. Expander graphs have a constant degree and are shown to approximate fully-connected graphs. All components combined, attention costs <em>O(V+E)</em> instead of <em>O(V²)</em>. This allows Exphormer to outperform GraphGPS almost everywhere and scale to really large graphs of up to 160k nodes. Amazing work, with every chance of making Exphormer the standard sparse attention mechanism in GTs 👏.</p><p><strong>➡️ </strong>Concurrently with graph transformers, expander graphs can already be used to enhance the performance of any MPNN architecture, as shown in <a href="https://arxiv.org/abs/2210.02997">Expander Graph Propagation</a> by <em>Deac, Lackenby, and Veličković</em>.</p><p>In a similar vein, <a href="https://openreview.net/forum?id=1EuHYKFPgA">Cai et al</a> show that MPNNs with virtual nodes can approximate linear Performer-like attention, such that even classic GCN and GatedGCN imbued with virtual nodes show pretty much SOTA performance in long-range graph tasks (we <a href="https://towardsdatascience.com/lrgb-long-range-graph-benchmark-909a6818f02c">released</a> the <a href="https://github.com/vijaydwivedi75/lrgb">LRGB benchmark</a> last year exactly for measuring the long-range capabilities of GNNs and GTs).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/970/0*ou8TLbw-oV5Zt4wr" /><figcaption>Source: <a href="https://openreview.net/forum?id=3Ge74dgjjU">Shirzad, Velingker, Venkatachalam et al</a></figcaption></figure><p><strong>➡️ </strong>A few <strong>patch-based</strong> subsampling approaches for GTs inspired by vision models: <a 
href="https://openreview.net/forum?id=l7yTbEWuOQ"><strong>“A Generalization of ViT/MLP-Mixer to Graphs”</strong></a> by <em>He et al</em> split the input into several patches, encode each patch with a GNN into a token, and run a transformer over those tokens.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*H9xtpfGa7ot0CJTt" /><figcaption>Source: <a href="https://openreview.net/forum?id=l7yTbEWuOQ">“A Generalization of ViT/MLP-Mixer to Graphs”</a> by He et al</figcaption></figure><p>In <strong>GOAT</strong> by <a href="https://openreview.net/forum?id=Le2dVIoQun">Kong et al</a>, node features are projected into a codebook of K clusters with K-Means, and a sampled 3-hop neighborhood of each node attends to the codebook. GOAT is a 1-layer model and scales to graphs of millions of nodes.</p><p><strong>➡️ Directed graphs</strong> got some transformer love as well 💗. <a href="https://openreview.net/forum?id=a7PVyayyfp"><strong>“Transformers Meet Directed Graphs”</strong></a> by <em>Geisler et al </em>introduces Magnetic Laplacian — a generalization of a Laplacian for directed graphs with a non-symmetric adjacency matrix. Eigenvectors of the Magnetic Laplacian paired with directed random walks are strong input features for the transformer that enable setting a new SOTA on the <a href="https://ogb.stanford.edu/docs/leader_graphprop/#ogbg-code2">OGB Code2</a> graph property prediction dataset by a good margin!</p><p>🏅 Last but not least, we have a new SOTA GT on the community standard ZINC dataset — <strong>GRIT</strong> by <a href="https://openreview.net/forum?id=HjMdlNgybR">Ma, Lin, et al</a> incorporates the full <em>d</em>-dimensional random walk matrix, coined as relative random walk probabilities (RRWP), as edge features to the attention computation (for comparison, popular <a href="https://openreview.net/forum?id=wTTjnvGphYj">RWSE</a> features are just the diagonal elements of this matrix). 
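To build intuition for these features, here is a small toy sketch (my own pure-Python illustration) computing d-dimensional RWSE as the diagonals of powers of the random walk matrix, the matrices whose full entries RRWP keeps:

```python
# RWSE positional encodings: diagonals of powers of the random-walk matrix
# M = D^-1 A. RRWP (GRIT) keeps the full matrices; RWSE only their diagonals.
def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def rwse(adj, d):
    n = len(adj)
    deg = [max(1, sum(row)) for row in adj]
    M = [[adj[i][j] / deg[i] for j in range(n)] for i in range(n)]
    P, diag = M, []
    for _ in range(d):
        # diag[k][i] = probability that a (k+1)-step walk from i returns to i
        diag.append([P[i][i] for i in range(n)])
        P = matmul(P, M)
    return [[diag[k][i] for k in range(d)] for i in range(n)]

triangle = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
print(rwse(triangle, 2))  # by symmetry, every node of a triangle looks alike
```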
RRWP features are provably more powerful than shortest path distance features and set a record-low 0.059 MAE on ZINC (down from 0.070 by GraphGPS). GRIT often outperforms GPS in other benchmarks as well 👏. In a similar vein, <a href="https://openreview.net/forum?id=1Nx2n1lk5T">Eliasof et al</a> propose a neat idea to combine random and spectral features as positional encodings that outperform RWSE but were not tried with GTs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0IfBXldqPt2lEvRyGYnX2A.jpeg" /><figcaption>Image by Author.</figcaption></figure><h3><strong>Theory: VC dimension of GNNs, deep dive into over-squashing</strong></h3><p><strong>➡️ </strong><a href="https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_dimension">VC dimension</a> measures model capacity and expressiveness. It is well studied for classical ML algorithms but, surprisingly, has never been applied to study GNNs. In <a href="https://openreview.net/forum?id=rZN3mc5m3C"><strong>“WL meet VC”</strong></a> by <em>Morris et al</em>, the connection between the WL test and the VC dimension is finally uncovered — it turns out the VC dimension can be bounded by the bitlength of GNN weights, i.e., float32 weights would imply a VC dimension of 32. Furthermore, the VC dimension depends logarithmically on the number of unique WL colors in the given task and polynomially on the depth and number of layers. This is a great theoretical result and I’d encourage you to have a look!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*D-gLnhp30lQJwBav" /><figcaption>Source: <a href="https://openreview.net/forum?id=rZN3mc5m3C">“WL meet VC”</a> by <em>Morris et al</em></figcaption></figure><p>🍊🖐️ The over-squashing effect — information loss when you try to stuff messages from too many neighboring nodes into a single fixed-size vector — is another common problem of MPNNs, and we don’t fully understand how to properly deal with it. This year, there were 3 papers dedicated to this topic. 
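Since WL colors are central to the bound in “WL meet VC”, here is a minimal sketch of 1-WL color refinement, the procedure that produces those colors, on a 3-node path:

```python
# 1-WL color refinement: repeatedly combine each node's color with the
# multiset of its neighbors' colors, then compress into a fresh palette.
def wl_colors(adj, rounds=3):
    colors = {v: 0 for v in adj}
    for _ in range(rounds):
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
               for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colors = {v: palette[sig[v]] for v in adj}
    return colors

# path graph 0-1-2: the two endpoints are indistinguishable, the middle is not
path = {0: [1], 1: [0, 2], 2: [1]}
print(wl_colors(path))
```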
Perhaps the most foundational is the work by <a href="https://openreview.net/forum?id=t2tTfWwAEl"><strong>Di Giovanni et al</strong></a> that explains how MPNN width, depth, and graph topology affect over-squashing.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/806/0*V_c6Qe1rq60ifMZz" /><figcaption>Source: <a href="https://openreview.net/forum?id=t2tTfWwAEl"><strong>Di Giovanni et al</strong></a></figcaption></figure><p>It turns out that <strong>width</strong> might help (but with generalization issues), <strong>depth</strong> does <strong>not</strong> really help, and <strong>graph topology</strong> (characterized by the commute time between nodes) plays the most important role. We can reduce the commute time by various <em>graph rewiring</em> strategies (adding and removing edges based on spatial or spectral properties), and there are many of them (you might have heard about the <a href="https://openreview.net/forum?id=7UmjRGzp-A">Ricci flow-based rewiring</a> that took home the Outstanding Paper award at ICLR 2022). In fact, there is a <a href="https://arxiv.org/abs/2306.03589">follow-up work</a> to this study that goes even deeper and derives some impossibility statements w.r.t. over-squashing and some MPNN properties — I’d highly encourage you to read it as well!</p><p><strong>➡️ </strong>Effective resistance is one example of spatial rewiring strategies, and <a href="https://openreview.net/forum?id=50SO1LwcYU"><strong>Black et al</strong></a> study it in great detail. 
The Ricci flow-based rewiring works with graph curvature and is studied further in the work by <a href="https://openreview.net/forum?id=eWAvwKajx2">Nguyen et al</a>.</p><p><strong>➡️ </strong>Subgraph GNNs continue to be in the spotlight: two works (<a href="https://openreview.net/forum?id=2Hp7U3k5Ph"><strong>Zhang, Feng, Du, et al</strong></a> and <a href="https://openreview.net/forum?id=K07XAlzh5i"><strong>Zhou, Wang, Zhang</strong></a>) concurrently derive expressiveness hierarchies of the recently proposed subgraph GNNs and their relationship to the 1- and higher-order WL tests.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fImGk-McVtr5SBfl-K1EsQ.jpeg" /><figcaption>Image By Author.</figcaption></figure><h3><strong>New GNN architectures: Delays and Half-hops</strong></h3><p>If you are tired of yet another variation of GCN or GAT, here are some fresh ideas that can work with any GNN of your choice:</p><p>⏳ As we know from the <strong>Theory</strong> section, rewiring helps combat over-squashing. <a href="https://openreview.net/forum?id=WEgjbJ6IDN"><strong>Gutteridge et al</strong></a> introduce <em>“DRew: Dynamically Rewired Message Passing with Delay”</em>, which gradually densifies the graph in later GNN layers so that long-distance nodes see the original states of earlier nodes (the original DRew), or adds those skip-connections with a <em>delay</em> that depends on the distance between two nodes (the vDRew version). For example (🖼️👇), in vDRew delayed message passing, a starting node from layer 0 will show its state to 2-hop neighbors at layer 1, and to a 3-hop neighbor at layer 2. 
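Here is my toy reading of the gradual densification idea (an illustration of the principle, not the authors' implementation): at layer k, a node additionally receives messages from all nodes within k + 1 hops, so distant pairs get connected only in later layers:

```python
from collections import deque

def bfs_dist(adj, s):
    # shortest-path distances from s via breadth-first search
    dist, q = {s: 0}, deque([s])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def drew_edges(adj, layer):
    # edges available at a given layer: ordered pairs within distance layer + 1
    return {(u, v) for u in adj for v, d in bfs_dist(adj, u).items()
            if 0 < d <= layer + 1}

path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print((0, 2) in drew_edges(path, 0))  # layer 0: only 1-hop edges
print((0, 2) in drew_edges(path, 1))  # layer 1: 2-hop pairs appear
```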
<strong>DRew</strong> significantly improves the ability of vanilla GNNs to perform long-range tasks — in fact, a DRew-enabled GCN is the current <a href="https://github.com/vijaydwivedi75/lrgb">SOTA</a> on the Peptides-func dataset from the <a href="https://github.com/vijaydwivedi75/lrgb">Long Range Graph Benchmark</a> 👀</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*hRwGmN3SxlSWvwrw" /><figcaption>Source: <a href="https://openreview.net/forum?id=WEgjbJ6IDN"><strong>Gutteridge et al</strong></a></figcaption></figure><p>🦘 Another neat idea by <a href="https://openreview.net/forum?id=lXczFIwQkv"><strong>Azabou et al</strong></a> is to slow down message passing by inserting new, <em>slow nodes</em> at each edge with a special connectivity pattern — only an incoming connection from the starting node and a symmetric edge with the destination node. Slow nodes improve the performance of vanilla GNNs on heterophilic benchmarks by a large margin, and it is also possible to use slow nodes for self-supervised learning by creating views with different locations of slow nodes for the same original graph. <strong>HalfHop</strong> is a no-brainer-to-include SSL component that boosts performance and should be in a standard suite of many GNN libraries 👍.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*sFksdRzaPbfCTGcv" /><figcaption>Source: <a href="https://openreview.net/forum?id=lXczFIwQkv"><strong>Azabou et al</strong></a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kT7uap0DkYhJcVDVld1Mcw.jpeg" /><figcaption>Image By Author.</figcaption></figure><h3><strong>Generative Models — Stable Diffusion for Molecules, Discrete Diffusion</strong></h3><p><strong>➡️ </strong>Diffusion models might work in the <strong>feature</strong> space (e.g., pixel space in image generation like the original DDPM) or in the <strong>latent</strong> space (like Stable Diffusion). 
In the feature space, you have to design the noising process to respect the symmetries and equivariances of your feature space. In the latent space, you can just add Gaussian noise to the features produced by a (pre-trained) encoder. Most 3D molecule generation models work in the feature space (like the pioneering <a href="https://arxiv.org/abs/2203.17003">EDM</a>), and the new <strong>GeoLDM </strong>model by <a href="https://openreview.net/forum?id=sLfHWWrfe2">Xu et al</a> (authors of the prominent <a href="https://arxiv.org/abs/2203.02923">GeoDiff</a>) is the first to define <strong>latent</strong> diffusion for 3D molecule generation. That is, after training an EGNN autoencoder, GeoLDM is trained on the denoising objective where noise is sampled from a standard Gaussian. GeoLDM brings significant improvements over EDM and other non-latent diffusion approaches 👏.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*__ao71NOClVDCVNX" /><figcaption>GeoLDM. Source: <a href="https://openreview.net/forum?id=sLfHWWrfe2">Xu et al</a></figcaption></figure><p><strong>➡️ </strong>In the realm of non-geometric graphs (just with an adjacency matrix and perhaps categorical node features), discrete graph diffusion pioneered by <a href="https://openreview.net/forum?id=UaAD-Nu86WX">DiGress</a> (ICLR’23) seems the most applicable option. <a href="https://openreview.net/forum?id=vn9O1N5ZOw">Chen et al</a> propose <strong>EDGE, </strong>a discrete diffusion model guided by the node degree distribution. In contrast to DiGress, the final target graph in EDGE is a disconnected graph without edges, the forward noising model removes edges through a Bernoulli distribution, and the reverse process adds edges to the most recent <em>active</em> nodes (active nodes are those whose degrees changed in the previous step). 
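The forward direction of such a discrete, edge-removing process can be sketched as follows (an illustrative toy in the spirit of EDGE, not the paper's exact parameterization):

```python
import random

# Forward noising that only removes edges: at every step, each surviving
# edge is independently dropped with probability p, so the trajectory moves
# toward the all-disconnected target graph.
def forward_noising(edges, p, steps, seed=0):
    rng = random.Random(seed)
    traj = [set(edges)]
    for _ in range(steps):
        traj.append({e for e in traj[-1] if rng.random() > p})
    return traj

traj = forward_noising({(0, 1), (1, 2), (2, 3)}, p=0.5, steps=8)
print([len(s) for s in traj])  # edge counts only ever shrink
```

A reverse (generative) model would learn to run this trajectory backwards, adding edges step by step.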
Thanks to the sparsity introduced by the degree guidance, EDGE can generate pretty large graphs of up to 4k nodes and 40k edges!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*BfeJQ_1dEXVjHFUI" /><figcaption>Graph Generation with EDGE. Source: <a href="https://openreview.net/forum?id=vn9O1N5ZOw">Chen et al</a></figcaption></figure><p><strong>➡️</strong> Finally, <a href="https://openreview.net/forum?id=24wzmwrldX"><strong>“Graphically Structured Diffusion Models”</strong></a> by <em>Weilbach et al</em> bridges the gap between continuous generative models and probabilistic graphical models that induce a certain structure in the problem of interest — often such problems have a combinatorial nature. The central idea is to encode the problem’s structure as an attention mask that respects permutation invariances and to use this mask in the attention computation of the Transformer encoder (which by definition is equivariant to input token permutations unless you use positional embeddings). <strong>GSDM</strong> can tackle binary continuous matrix factorization and boolean circuits, generate sudokus, and perform sorting. Particularly enjoyable is the pinch of irony with which the paper is written 🙃.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*fF-J_X5GIwsNzvBn" /><figcaption>GSDM task-to-attention-bias. Source: <a href="https://openreview.net/forum?id=24wzmwrldX"><strong>“Graphically Structured Diffusion Models”</strong></a> by <em>Weilbach et al</em></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*y0Z_QtNemxQYlc8c7Rw_lQ.jpeg" /><figcaption>Image By Author</figcaption></figure><h3><strong>Geometric Learning: Geometric WL, Clifford Algebras</strong></h3><p>Geometric Deep Learning thrives! 
So many interesting papers were presented that covering them all would take pretty much the whole post, so I’ll highlight only a few.</p><p><strong>➡️ Geometric WL</strong> has finally arrived in the work by <a href="https://openreview.net/forum?id=6Ed3gchl9L">Joshi, Bodnar, et al</a>. Geometric WL extends the notion of the WL test with geometric features (e.g., coordinates or velocity) and derives the expressiveness hierarchy up to k-order GWL. Key takeaways: 1️⃣ <strong>equivariant</strong> models are more expressive than <strong>invariant </strong>ones (with a note that in fully connected graphs the difference disappears), 2️⃣ <strong>tensor order</strong> of features improves expressiveness, 3️⃣ <strong>body order</strong> of features improves expressiveness (see the image 👇). That is, <em>spherical &gt; cartesian &gt; scalars</em>, and <em>many-body interactions &gt; just distances</em>. The paper also features the amazing learning resource <a href="https://github.com/chaitjo/geometric-gnn-dojo">Geometric GNN Dojo </a>where you can derive and implement most SOTA models from first principles!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/812/0*n-hGGDsinOYN6XJ2" /><figcaption>Source: <a href="https://openreview.net/forum?id=6Ed3gchl9L">Joshi, Bodnar, et al</a></figcaption></figure><p><strong>➡️ </strong>Going beyond vectors to Clifford algebras, <a href="https://openreview.net/forum?id=DNAJdkHPQ5">Ruhe et al</a> derive <strong>Geometric Clifford Algebra Networks </strong>(GCANs). Clifford algebras naturally support higher-order interactions by means of bivectors, trivectors, and (in general) multivectors. The key idea is the <a href="https://en.wikipedia.org/wiki/Cartan%E2%80%93Dieudonn%C3%A9_theorem">Cartan-Dieudonné theorem</a>, which states that every orthogonal transformation can be decomposed into <em>reflections</em> in hyperplanes; geometric algebras accordingly represent data as elements of the <em>Pin(p,q,r)</em> group.
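The Cartan-Dieudonné statement is easy to sanity-check numerically. A minimal sketch (a toy example of mine, not from the paper): composing two Householder reflections yields a proper rotation, in 2D by twice the angle between the reflection hyperplanes:

```python
import numpy as np

def reflection(n):
    """Householder reflection across the hyperplane with unit normal n."""
    n = np.asarray(n, dtype=float)
    n = n / np.linalg.norm(n)
    return np.eye(len(n)) - 2.0 * np.outer(n, n)

# Compose reflections with normals at 0 and 45 degrees: the result is a
# proper rotation (det = +1) by twice the angle between the hyperplanes.
R = reflection([np.cos(np.pi / 4), np.sin(np.pi / 4)]) @ reflection([1.0, 0.0])
assert np.allclose(R, [[0.0, -1.0], [1.0, 0.0]])  # rotation by 90 degrees
assert np.isclose(np.linalg.det(R), 1.0)          # orientation-preserving
```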
GCANs introduce notions of linear layers, normalizations, and non-linearities, and show how they can be parameterized with neural networks. Experiments include modeling fluid dynamics and the Navier-Stokes equations.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/650/0*kyUqSgPTIVUUm8uw" /><figcaption>Source: <a href="https://openreview.net/forum?id=DNAJdkHPQ5">Ruhe et al</a></figcaption></figure><p>In fact, there is already a <a href="https://arxiv.org/abs/2305.11141">follow-up work</a> introducing equivariant Clifford NNs — you can learn more about the foundations of Clifford algebras and the most recent papers on <a href="https://microsoft.github.io/cliffordlayers/">CliffordLayers</a>, supported by Microsoft Research.</p><p>💊 <a href="http://proceedings.mlr.press/v139/satorras21a/satorras21a.pdf">Equivariant GNN</a> (EGNN) is the Aspirin of Geometric DL: it gets applied to almost every task and has seen quite a number of improvements. <a href="https://openreview.net/forum?id=hF65aKF8Bf"><strong>Eijkelboom et al</strong></a> marry EGNN with <a href="https://arxiv.org/abs/2103.03212">Simplicial networks</a> that operate on higher-order structures (namely, simplicial complexes), yielding <strong>EMPSN</strong>. This is one of the first examples of combining geometric and topological features and has great improvement potential! Finally, <a href="https://openreview.net/forum?id=QIejMwU0r9"><strong>Passaro and Zitnick</strong></a> derive a neat trick to reduce SO(3) convolutions to SO(2), bringing the complexity down from O(L⁶) to O(L³) with mathematical equivalence guarantees 👀.
This finding makes it possible to scale up geometric models to larger datasets like OpenCatalyst, and it has already made it into <a href="https://arxiv.org/abs/2306.12059">Equiformer V2</a> — expect it soon in many other libraries for geometric models 😉</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*2_bpmLIX0tw2-Vdc1ae99A.jpeg" /><figcaption>Image By Author.</figcaption></figure><h3><strong>Molecules: 2D-3D pretraining, Uncertainty Estimation in MD</strong></h3><p><strong>➡️ </strong><a href="https://openreview.net/forum?id=mPEVwu50th">Liu, Du, et al</a> propose <strong>MoleculeSDE</strong>, a new framework for joint 2D-3D pretraining on molecular data. In addition to a standard contrastive loss, the authors add two <strong>generative</strong> components: reconstructing 2D -&gt; 3D and 3D -&gt; 2D inputs through score-based diffusion generation. Using standard GIN and SchNet as 2D and 3D models, MoleculeSDE is pre-trained on PCQM4M v2 and performs well on downstream fine-tuning tasks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*u3joJ4vJRpFYvsyE" /><figcaption>Source: <a href="https://github.com/chao1224/MoleculeSDE">MoleculeSDE Github repo</a></figcaption></figure><p><strong>➡️ </strong><a href="https://openreview.net/forum?id=DjwMRloMCO">Wollschläger et al</a> perform a comprehensive study of Uncertainty Estimation in GNNs for molecular dynamics and force fields. Identifying key physics-informed and application-focused principles, the authors propose a <strong>Localized Neural Kernel</strong>, a Gaussian Process-based extension to any geometric GNN that works on invariant and equivariant quantities (tried on SchNet, DimeNet, and NequIP).
In many cases, LNK’s estimates from a single model are on par with or better than costly ensembling, which would require training several models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*aivuPyYtR7cISIgU" /><figcaption>Source: <a href="https://openreview.net/forum?id=DjwMRloMCO">Wollschläger et al</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*-W3DZTEpB90bBJA9iWHCJw.jpeg" /><figcaption>Image By Author.</figcaption></figure><h3><strong>Materials &amp; Proteins: CLIP for proteins, Ewald Message Passing, Equivariant Augmentations</strong></h3><p>CLIP and its descendants have become a staple of text-to-image models. Can we do the same but for text-to-protein? Yes!</p><p><strong>➡️</strong> <a href="https://openreview.net/forum?id=ZOOwHgxfR4">Xu, Yuan, et al</a> present <strong>ProtST</strong>, a framework for learning joint representations of textual protein descriptions (via PubMedBERT) and protein sequences (via ESM). In addition to a contrastive loss, ProtST has a multimodal mask prediction objective (masking 15% of the tokens in the text and the protein sequence and predicting them jointly from the latent representations) as well as mask prediction losses based on the sequence or the language modality alone. Additionally, the authors design a novel <strong>ProtDescribe</strong> dataset with 550K aligned protein sequence-description pairs. <strong>ProtST</strong> excels across many protein modeling tasks in the <a href="https://github.com/DeepGraphLearning/PEER_Benchmark"><strong>PEER</strong></a> benchmark, including protein function annotation and localization, but also allows for zero-shot protein retrieval right from the textual description (see an example below).
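The contrastive part of this text-protein alignment is conceptually the CLIP objective. Here is a hedged NumPy sketch of a symmetric InfoNCE loss (a toy illustration, not the actual ProtST code), where row i of each matrix is assumed to embed the i-th aligned text-protein pair:

```python
import numpy as np

def clip_style_loss(text_emb, prot_emb, temperature=0.07):
    """Symmetric InfoNCE loss aligning two modalities: matching pairs sit on
    the diagonal of the similarity matrix and should outscore all mismatches."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    p = prot_emb / np.linalg.norm(prot_emb, axis=1, keepdims=True)
    logits = t @ p.T / temperature
    idx = np.arange(len(t))

    def xent(z):  # cross-entropy with the diagonal as the correct class
        z = z - z.max(axis=1, keepdims=True)
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()

    # both directions: text -> protein and protein -> text
    return 0.5 * (xent(logits) + xent(logits.T))

aligned = clip_style_loss(np.eye(3), np.eye(3))
shuffled = clip_style_loss(np.eye(3), np.eye(3)[[1, 2, 0]])
assert aligned < shuffled  # mismatched pairs are penalized
```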
Looks like <strong>ProtST</strong> has a bright future as a backbone behind many protein generative models 😉</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pgmc_mGxaf9DPudO" /><figcaption>Source: <a href="https://openreview.net/forum?id=ZOOwHgxfR4">Xu, Yuan, et al</a></figcaption></figure><p>Actually, ICML features several protein generation works like <strong>GENIE</strong> by <a href="https://openreview.net/forum?id=4Kw5hKY8u8">Lin and AlQuraishi</a> and <strong>FrameDiff</strong> by <a href="https://openreview.net/forum?id=m8OUBymxwv">Yim, Trippe, De Bortoli, Mathieu, et al</a> — those are not yet conditioned on textual descriptions, so incorporating ProtST there looks like a no-brainer performance boost 📈.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/655/0*5PaTCWvaqmnO_lOM" /><figcaption>Gif Source: <a href="https://github.com/jasonkyuyim/se3_diffusion">SE(3) Diffusion Github</a></figcaption></figure><p>⚛️ MPNNs on molecules have a strict locality bias that inhibits modeling long-range interactions. <a href="https://openreview.net/forum?id=vd5JYAml0A">Kosmala et al</a> derive <strong>Ewald Message Passing</strong>, applying the idea of <a href="https://en.wikipedia.org/wiki/Ewald_summation">Ewald summation</a>, which breaks down the interaction potential into short-range and long-range terms. The short-range interaction is modeled by any GNN, while the novel long-range part is modeled with a <strong>3D Fourier transform</strong> and message passing over Fourier frequencies. Turns out this long-range term is pretty flexible and can be applied to any network modeling periodic or aperiodic systems (like crystals or molecules), such as SchNet, DimeNet, or GemNet. The model was evaluated on the OC20 and OE62 datasets.
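The core Ewald trick is easy to verify numerically. A minimal sketch of the standard decomposition (textbook Ewald math, not the authors' code): the 1/r interaction splits exactly into a fast-decaying short-range term and a smooth long-range term via the error function:

```python
import math

def ewald_split(r, alpha=1.0):
    """Ewald-style decomposition 1/r = erfc(alpha*r)/r + erf(alpha*r)/r.
    The short-range term decays fast (local message passing territory), while
    the smooth long-range term is what Ewald MP treats with a 3D Fourier transform."""
    short = math.erfc(alpha * r) / r
    long_range = math.erf(alpha * r) / r
    return short, long_range

s, l = ewald_split(2.5)
assert abs((s + l) - 1 / 2.5) < 1e-12  # the decomposition is exact
assert s < 1e-3                        # short-range part is negligible at large r
```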
If you are interested in more details, check out the <a href="https://www.youtube.com/watch?v=Ip8EGde5SUQ">1-hour talk by Arthur Kosmala</a> at the LOG2 Reading Group!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/964/0*L4cdBaxl24Pmf01A" /><figcaption>Source: <a href="https://openreview.net/forum?id=vd5JYAml0A">Kosmala et al</a></figcaption></figure><p>A similar idea, applying Ewald summation to 3D crystals, is used in <strong>PotNet</strong> by <a href="https://openreview.net/forum?id=jxI4CulNr1">Lin et al</a>, where the long-range interaction is modeled with incomplete Bessel functions. PotNet was evaluated on the Materials Project dataset and JARVIS — so after reading those two papers you will have a good understanding of the benefits Ewald summation brings to many crystal-related tasks 😉</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*KEUAd7BKENXAR4s9" /><figcaption>Source: <a href="https://openreview.net/forum?id=jxI4CulNr1">Lin et al</a></figcaption></figure><p><strong>➡️ </strong>Another look at imbuing <em>any</em> GNN with equivariance for crystals and molecules is given by <a href="https://openreview.net/forum?id=HRDRZNxQXc">Duval, Schmidt, et al</a> in <strong>FAENet</strong>. A standard way is to bake certain symmetries and equivariances right into the GNN architecture (as in EGNN, GemNet, and Ewald Message Passing) — this is safe but computationally expensive (especially when it comes to spherical harmonics and tensor products). Another option, often used in vision, is to show the model many augmentations of the same input so that it eventually learns the invariances from the augmentations. The authors go for the second path and design a rigorous way to sample invariant or equivariant augmentations of 2D / 3D data (e.g., for energies or forces, respectively), all with fancy proofs ✍️.
For that, the data augmentation pipeline includes projecting 2D / 3D inputs to a canonical representation (based on PCA of the covariance matrix of distances) from which we can uniformly sample rotations.</p><p>The proposed FAENet is a simple model that uses only distances but shows very good performance with the stochastic frame averaging data augmentation while being 6–20 times faster. Works for crystal structures as well!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*YbuNYIce-QQQIxr1" /><figcaption>Augmentations and Stochastic Frame Averaging. Source: <a href="https://openreview.net/forum?id=HRDRZNxQXc">Duval, Schmidt, et al</a></figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*OfI1perhAGhr8Ds2SQXObg.jpeg" /><figcaption>Image By Author.</figcaption></figure><h3><strong>Cool Applications: Algorithmic Reasoning, Inductive KG Completion, GNNs for Mass Spectra</strong></h3><p>A few papers in this section do not belong to any of the categories above but are still worthy of your attention.</p><p><strong>➡️ </strong><a href="https://openreview.net/forum?id=kP2p67F4G7"><strong>”Neural Algorithmic Reasoning with Causal Regularisation”</strong></a> by <em>Bevilacqua et al</em> tackles a common issue in graph learning — OOD generalization to larger inputs at test time. Studying OOD generalization in algorithmic reasoning problems, the authors observe that many different inputs lead to identical computations at a certain step. In other words, some subset of the input does not (and should not) affect the prediction result. This observation allows the authors to design a self-supervised objective (termed <strong>Hint-ReLIC</strong>) that prefers a “meaningful” step over a bunch of steps that do not affect the prediction result. The new objective significantly bumps the performance on many CLRS-30 tasks to 90+% micro-F1.
It is an interesting question whether we could leverage the same principle in general message passing and improve OOD transfer in other graph learning tasks 🤔</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*M5SClb3OOdATfJfg" /><figcaption>Source: <a href="https://openreview.net/forum?id=kP2p67F4G7"><strong>”Neural Algorithmic Reasoning with Causal Regularisation”</strong></a> by <em>Bevilacqua et al</em></figcaption></figure><p>If you are further interested in neural algorithmic reasoning, check out the proceedings of the <a href="https://klr-icml2023.github.io/papers.html">Knowledge and Logical Reasoning workshop</a> which has even more works on that topic.</p><p><strong>➡️</strong> <a href="https://openreview.net/forum?id=OoOpO0u4Xd"><strong>“InGram: Inductive Knowledge Graph Embedding via Relation Graphs”</strong></a> by <em>Lee et al</em> seems to be one of the very few knowledge graph papers at ICML’23 (to the best of my search). <strong>InGram</strong> is one of the first approaches that can inductively generalize to both unseen entities and <strong>unseen relations</strong> at test time. Previously, inductive KG models needed to learn at least relation embeddings in some form to generalize to new nodes, and in this paradigm, new unseen relations are non-trivial to model. InGram builds a relation graph on top of the original multi-relational graph, that is, a graph of relation types, and learns representations of relations based on this graph by running a GAT. Entity representations are obtained from the random initialization and a GNN encoder. Having both entity and relation representations, a DistMult decoder is applied as a scoring function. 
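For reference, the DistMult decoder mentioned above is just a trilinear dot product. A minimal sketch with toy vectors (in InGram, the entity and relation vectors would come from the GNN encoders):

```python
import numpy as np

def distmult_score(h, r, t):
    """DistMult scoring <h, r, t> = sum_i h_i * r_i * t_i.
    Higher scores mean the triple (head, relation, tail) is more plausible."""
    return float(np.sum(h * r * t))

# Toy check: with a relation vector of ones the score reduces to <h, t>,
# so a tail aligned with the head outscores an orthogonal one.
h, r = np.array([1.0, 0.0, 1.0]), np.ones(3)
assert distmult_score(h, r, np.array([1.0, 0.0, 1.0])) > distmult_score(h, r, np.array([0.0, 1.0, 0.0]))
```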
There is a good chance that InGram will be as influential for unseen relations as <a href="http://proceedings.mlr.press/v119/teru20a/teru20a.pdf">GraIL (ICML 2020)</a> has been for unseen entities 😉.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*ULJqTgk7Lny16lyx" /><figcaption>Source: <a href="https://openreview.net/forum?id=OoOpO0u4Xd"><strong>“InGram: Inductive Knowledge Graph Embedding via Relation Graphs”</strong></a> by <em>Lee et al</em></figcaption></figure><p>🌈 <a href="https://openreview.net/forum?id=81RIPI742h"><strong>”Efficiently predicting high resolution mass spectra with graph neural networks”</strong></a> by <em>Murphy et al</em> is a cool application of GNNs to the real physics problem of predicting mass spectra. The main finding is that most of the signal in mass spectra is explained by a small number of components (product ion and neutral loss <em>formulas</em>), and it is possible to mine a vocabulary of those <em>formulas</em> from the training data. The problem can thus be framed as graph classification (or graph property prediction) where, given a molecular graph, we predict tokens from a vocabulary that correspond to certain mass spectrum values. The approach, <strong>GRAFF-MS</strong>, builds a molecular graph representation through GIN with edge features and Laplacian features (via SignNet), pooled with covariate features.
GRAFF-MS performs inference in ~19 minutes versus 126 hours for the baseline CFM-ID, while reaching much higher performance 👀.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/808/0*FU9DgD5kmDyuQGF_" /><figcaption>Source: <a href="https://openreview.net/forum?id=81RIPI742h"><strong>”Efficiently predicting high resolution mass spectra with graph neural networks”</strong></a> by <em>Murphy et al</em></figcaption></figure><h3>The Concluding Meme Part</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*nS0x2VdufT_dkY1Xesrrtw.jpeg" /><figcaption>Four Michaels (+ epsilon in the background) on the same photo!</figcaption></figure><p>The meme of 2022 has finally converged to <a href="https://michael-bronstein.medium.com/">Michael Bronstein</a>!</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9b5e4306a1cc" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/graph-machine-learning-icml-2023-9b5e4306a1cc">Graph Machine Learning @ ICML 2023</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Neural Graph Databases]]></title>
            <link>https://medium.com/data-science/neural-graph-databases-cc35c9e1d04f?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/cc35c9e1d04f</guid>
            <category><![CDATA[graph-machine-learning]]></category>
            <category><![CDATA[editors-pick]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Tue, 28 Mar 2023 03:18:26 GMT</pubDate>
            <atom:updated>2023-03-28T14:08:15.825Z</atom:updated>
            <content:encoded><![CDATA[<h4>What’s New in Graph ML?</h4><h4>A new milestone in graph data management</h4><p>We introduce the concept of Neural Graph Databases as the next step in the evolution of graph databases. Tailored for large incomplete graphs and on-the-fly inference of missing edges using graph representation learning, neural reasoning maintains high expressiveness and supports complex logical queries similar to standard graph query languages.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PCbnNVWsqdVx-XWdeAphZA.png" /><figcaption>Image by Authors, assisted by Stable Diffusion.</figcaption></figure><p><em>This post was written together with </em><a href="http://hyren.me/"><em>Hongyu Ren</em></a><em>, </em><a href="https://www.cochez.nl/"><em>Michael Cochez</em></a><em>, and </em><a href="https://kiddozhu.github.io/"><em>Zhaocheng Zhu</em></a><em> based on our newest paper </em><a href="https://arxiv.org/abs/2303.14617"><em>Neural Graph Reasoning: Complex Logical Query Answering Meets Graph Databases</em></a><em>. You can also follow </em><a href="https://twitter.com/michael_galkin"><em>me</em></a><em>, </em><a href="https://twitter.com/ren_hongyu"><em>Hongyu</em></a><em>, </em><a href="https://twitter.com/michaelcochez"><em>Michael</em></a><em>, and </em><a href="https://twitter.com/zhu_zhaocheng"><em>Zhaocheng</em></a><em> on Twitter. 
Check our </em><a href="https://www.ngdb.org/"><em>project website</em></a><em> for more materials.</em></p><h3><strong>Outline</strong>:</h3><ol><li>Neural Graph Databases: What and Why?</li><li>The blueprint of NGDBs</li><li>Neural Graph Storage</li><li>Neural Query Engine</li><li>Neural Graph Reasoning for Query Engines</li><li>Open Challenges for NGDBs</li><li>Learn More</li></ol><h3>Neural Graph Databases: What and Why?</h3><p>🍨Vanilla graph databases are pretty much everywhere thanks to the ever-growing graphs in production, flexible graph data models, and expressive query languages. Classical, symbolic graph DBs are fast and cool under one important assumption:</p><blockquote>Completeness. Query engines assume that graphs in classical graph DBs are complete.</blockquote><p>Under the completeness assumption, we can build indexes, store the graphs in a variety of read/write-optimized formats and expect the DB would return <strong>what is there</strong>.</p><p>But this assumption does not often hold in practice (we’d say, doesn’t hold way too often). If we look at some prominent knowledge graphs (KGs): in Freebase, 93.8% of people have no place of birth and <a href="https://aclanthology.org/P09-1113.pdf">78.5% have no nationality</a>, about 68% of people <a href="https://dl.acm.org/doi/abs/10.1145/2566486.2568032">do not have any profession</a>, while in Wikidata, about <a href="https://arxiv.org/abs/2207.00143">50% of artists have no date of birth</a>, and only <a href="https://dl.acm.org/doi/abs/10.1145/3485447.3511932">0.4% of known buildings have information about height</a>. And that’s for the largest KG openly curated by hundreds of enthusiasts. 
Surely, 100M nodes and 1B statements are not the largest ever graph in the industry, so you can imagine the degree of incompleteness there.</p><p>Clearly, to account for incompleteness, in addition to <strong>“what is there?”</strong> we have to also ask <strong>“what is missing?” </strong>(or “what can be there?”). Let’s look at the example:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QWL4YTqmNdpalZlq" /><figcaption>(a) - input query; (b) — incomplete graph with predicted edges (dashed lines); (c) — a SPARQL query returning one answer (UofT) via graph traversal; (d) — neural execution that recovers missing edges and returns two new answers (UdeM, NYU). Image by Authors.</figcaption></figure><p>Here, given an incomplete graph (edges (Turing Award, win, Bengio) and (Deep Learning, field, LeCun) are missing) and a query <em>“At what universities do the Turing Award winners in the field of Deep Learning work?”</em> (expressed in a logical form or in some language like SPARQL), a symbolic graph DB would return only one answer <strong>UofT</strong> reachable by graph traversal. We refer to such answers as <em>easy</em> answers, or existing answers. Accounting for missing edges, we would recover two more answers <strong>UdeM</strong> and <strong>NYU</strong> (<em>hard</em> answers, or inferred answers).</p><p>How to infer missing edges?</p><ul><li>In classical DBs, we don’t have much choice. RDF-based databases have some formal semantics and can be backed by hefty OWL ontologies but, depending on graph size and complexity of inference, it might take an infinite amount of time to complete the inference in <a href="https://www.w3.org/TR/sparql11-entailment/">SPARQL entailment regimes</a>. Labeled Property Graph (LPG) graph databases do not have built-in means for inferring missing edges at all.</li><li>Thanks to the advances in Graph Machine Learning, we can often perform link prediction in a latent (embedding) space in linear time! 
We can then extend this mechanism to executing complex, database-like queries right in the embedding space.</li></ul><blockquote>Neural Graph Databases combine the advantages of traditional graph DBs with modern graph machine learning.</blockquote><p>That is, DB principles like (1) graphs as a first-class citizen, (2) efficient storage, and (3) a uniform querying interface are now backed by Graph ML techniques such as (1) geometric representations, (2) robustness to noisy inputs, and (3) large-scale pretraining and fine-tuning in order to bridge the incompleteness gap and enable neural graph reasoning and inference.</p><p>In general, the design principles for NGDBs are:</p><ul><li>The <strong>data incompleteness assumption</strong> — the underlying data might have missing information on the node, link, and graph levels, which we would like to infer and leverage in query answering;</li><li><strong>Inductiveness and updatability</strong> — similar to traditional databases that allow updates and instant querying, representation learning algorithms for building graph latents have to be inductive and generalize to unseen data (new entities and relations at inference time) in a zero-shot (or few-shot) manner to prevent costly re-training (for instance, of shallow node embeddings);</li><li><strong>Expressiveness</strong> — the ability of latent representations to encode logical and semantic relations in the data akin to FOL (or its fragments) and leverage them in query answering.
Practically, the set of supported logical operators for neural reasoning should be close or equivalent to that of standard graph query languages like SPARQL or Cypher;</li><li><strong>Multimodality</strong> beyond knowledge graphs — any graph-structured data that can be stored as a node or record in classical databases (consisting, for example, of images, texts, molecular graphs, or timestamped sequences) and can be imbued with a vector representation is a valid source for the Neural Graph Storage and Neural Query Engine.</li></ul><p>The key methods to address the NGDB principles are:</p><ul><li><strong>Vector representation as the atomic element</strong> — while traditional graph DBs hash the adjacency matrix (or edge list) in many indexes, the incompleteness assumption implies that both given edges <strong>and</strong> graph latents (vector representations) become the <em>sources of truth</em> in the <em>Neural Graph Storage</em>;</li><li><strong>Neural query execution in the latent space</strong> — basic operations such as edge traversal cannot be performed solely symbolically due to the incompleteness assumption. Instead, the <em>Neural Query Engine</em> operates on both the adjacency and graph latents to incorporate possibly missing data into query answering.</li></ul><p>In fact, by answering queries in the latent space (and not sacrificing traversal performance) we can ditch symbolic database indexes altogether.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7xl5Q_4mlU0ApsWH" /><figcaption>The main difference between symbolic graph DBs and neural graph DBs: traditional DBs answer the question “What is there?” by edge traversal while neural graph DBs also answer “What is missing?”. Image by Authors.</figcaption></figure><h3>The Blueprint of NGDBs</h3><p>Before diving into NGDBs, let’s take a look at <strong>neural databases</strong> in general — it turns out they have been around for a while, and you might have noticed that.
Many machine learning systems already operate in this paradigm when data is encoded into model parameters and querying is equivalent to a forward pass that can output a new representation or prediction for a downstream task.</p><h4><strong>Neural Databases: Overview</strong></h4><p>What is the current state of neural databases? What are the differences between its kinds and what’s special about NGDBs?</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*8p7deom_uieSxICU" /><figcaption>Differences between Vector DBs, natural language DBs, and neural graph DBs. Image by Authors</figcaption></figure><ol><li><strong>Vector databases</strong> belong to the family of storage-oriented systems commonly built around approximate nearest neighbor libraries (ANN) like <a href="https://github.com/facebookresearch/faiss">Faiss</a> or <a href="https://github.com/google-research/google-research/tree/master/scann">ScaNN</a> (or custom solutions) to answer distance-based queries using Maximum Inner-Product Search (MIPS), L1, L2, or other distances. Being encoder-independent (that is, any encoder yielding vector representations can be a source like a ResNet or BERT), vector databases are fast but lack complex query answering capabilities.</li><li>With the recent rise of large-scale pretrained models — or, <a href="https://en.wikipedia.org/wiki/Foundation_models">foundation models</a> — we have witnessed their huge success in natural language processing and computer vision tasks. We argue that such foundation models are also a prominent example of neural databases. There, the <em>storage module</em> might be presented directly with model parameters or outsourced to an external index often used in <a href="https://arxiv.org/abs/2002.08909">retrieval-augmented models</a> since encoding all world knowledge even into billions of model parameters is hard. 
The <em>query module</em> performs in-context learning either via filling in the blanks in encoder models (BERT or T5 style) or via prompts in decoder-only models (GPT-style) that can span multiple modalities, e.g., <a href="https://arxiv.org/abs/2205.10337">learnable tokens for vision applications</a> or even <a href="https://arxiv.org/abs/2302.07842">calling external tools</a>.</li><li><strong>Natural Language Databases (NLDB)</strong>, introduced by <a href="https://arxiv.org/abs/2106.01074">Thorne et al</a>, model atomic elements as textual facts encoded to a vector via a pre-trained language model (LM). Queries to an NLDB are sent as natural language utterances that get encoded to vectors, and query processing employs the <em>retriever-reader</em> approach.</li></ol><p>“Neural Graph Databases” is not a novel term — many graph ML approaches have tried to combine graph embeddings with database indexes; <a href="http://rdf2vec.org/">RDF2Vec</a> and <a href="https://openreview.net/forum?id=p0sMj8oH2O">LPG2Vec</a> are perhaps the most prominent examples of how embeddings can be plugged into <strong>existing</strong> graph DBs and run on top of symbolic indexes.</p><p>In contrast, we posit that NGDBs can <strong>work without symbolic indexes</strong> right in the latent space. As we show below, there exist ML algorithms that can simulate exact edge traversal-like behavior in embedding space to retrieve “<strong>what is there</strong>” as well as perform neural reasoning to answer “<strong>what is missing</strong>”.</p><h4><strong>Neural Graph Databases: Architecture</strong></h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GUH8w_Djovcv-32OHLpoJQ.png" /><figcaption>A conceptual scheme of Neural Graph Databases. An input query is processed by the Neural Query Engine, where the Planner derives a computation graph of the query and the Executor executes the query in the latent space.
The Neural Graph Storage employs the Graph Store and Feature Store to obtain latent representations in the Embedding Store. The Executor communicates with the embedding store to retrieve and return results. Image by Authors</figcaption></figure><p>On a higher level, an NGDB contains two main components: the <strong>Neural Graph Storage</strong> and the <strong>Neural Query Engine</strong>. The query answering pipeline starts with the query sent by some application or downstream task already in a structured format (obtained, for example, via <a href="https://arxiv.org/abs/2209.15003">semantic parsing</a> if the initial query is in natural language).</p><p>The query first arrives at the Neural Query Engine, and, in particular, at the <em>Query Planner </em>module. The task of the Query Planner is to derive an efficient computation graph of atomic operations (projections and logical operations) with respect to the query complexity, prediction tasks, and underlying data storage such as possible graph partitioning.</p><p>The derived plan is then sent to the <em>Query Executor</em>, which encodes the query in a latent space, executes the atomic operations over the underlying graph and its latent representations, and aggregates their results into a final answer set.
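To make the Planner/Executor split concrete, here is a deliberately tiny sketch of executing a two-step query plan in latent space. Everything here is a toy assumption: 2D embeddings, a TransE-style translation as the projection operator, and a mean as the intersection operator; real executors use trained neural operators:

```python
import numpy as np

# Toy embeddings; in a real NGDB they come from a trained (ideally inductive) encoder.
entities = {"Turing Award": np.array([0.0, 0.0]),
            "Bengio":       np.array([1.0, 0.0]),
            "UdeM":         np.array([1.0, 1.0])}
relations = {"win":  np.array([1.0, 0.0]),
             "work": np.array([0.0, 1.0])}

def project(q, rel):
    """Atomic relation projection as translation in latent space (TransE-style)."""
    return q + relations[rel]

def intersect(*qs):
    """Atomic intersection as a permutation-invariant aggregation (here: mean)."""
    return np.mean(qs, axis=0)

def answers(q, k=1):
    """Final retrieval step: the nearest entity embeddings to the query embedding."""
    return sorted(entities, key=lambda e: np.linalg.norm(entities[e] - q))[:k]

# Executing the plan for "where do Turing Award winners work?":
q = project(project(entities["Turing Award"], "win"), "work")
print(answers(q))  # -> ['UdeM']
```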
The execution is done via the <em>Retrieval</em> module, which communicates with the <em>Neural Graph Storage</em>.</p><p>The storage layer consists of:</p><p>1️⃣ the <em>Graph Store</em> for keeping the multi-relational adjacency matrix in a space- and time-efficient manner (e.g., in sparse formats like COO and CSR);</p><p>2️⃣ the <em>Feature Store</em> for keeping node- and edge-level multimodal features associated with the underlying graph;</p><p>3️⃣ the <em>Embedding Store</em>, which leverages an Encoder module to produce graph representations in a latent space based on the underlying adjacency and associated features.</p><p>The Retrieval module queries the encoded graph representations to build a distribution of potential answers to atomic operations.</p><h3><strong>Neural Graph Storage</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*hyFody8Wft9I5097NGAfQQ.png" /><figcaption>In traditional graph DBs (right), queries are optimized into a plan (often, a tree of join operators) and executed against the storage of DB indexes. In Neural Graph DBs (left), we encode the query (or its steps) in a latent space and execute it against the latent space of the underlying graph. Image by Authors.</figcaption></figure><p>In traditional graph DBs, storage design often depends on the graph modeling paradigm.</p><p>The two most popular paradigms are Resource Description Framework (RDF) graphs and Labeled Property Graphs (LPG). We posit, however, that the new <a href="https://w3c.github.io/rdf-star/cg-spec/editors_draft.html">RDF-star</a> (and the accompanying SPARQL-star) is going to unify the two, merging the logical expressiveness of RDF graphs with the attributed nature of LPGs.
Many existing KGs already follow an RDF-star(-like) paradigm, e.g., <a href="https://towardsdatascience.com/representation-learning-on-rdf-and-lpg-knowledge-graphs-6a92f2660241">hyper-relational KGs</a> and the <a href="https://www.wikidata.org/wiki/Help:Statements">Wikidata Statement Model</a>.</p><blockquote>If we are to envision the backbone graph modeling paradigm of the next years, we’d go for RDF-star.</blockquote><p>In the Neural Graph Storage, both the input graph and its vector representations are sources of truth. For answering queries in the latent space, we need:</p><ul><li>Query Encoder</li><li>Graph Encoder</li><li>Retrieval mechanism to match the query representation against the graph representation</li></ul><p>The graph encoding (embedding) process can be viewed as a compression step that preserves the semantic and structural similarity of entities/relations. The distance between entities/relations in the embedding space should be positively correlated with their semantic/structural similarity. There are many options for the architecture of the encoder — and we recommend sticking to <strong>inductive</strong> ones to adhere to the NGDB design principles. In our recent <a href="https://arxiv.org/abs/2210.08008">NeurIPS 2022 work</a>, we presented two such inductive models.</p><p>Query encoding is usually matched to the nature of the graph encoding so that both live in the same latent space. 
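Once query and graph encodings share a space, retrieval reduces to nearest-neighbor search with distance-based scores. A minimal sketch (the entity names, 2-D vectors, and the softmax-style confidence are all illustrative choices, not the exact formulation of any particular model):

```python
import math

# Toy Embedding Store: entity -> 2-D vector (made-up values).
entity_emb = {
    "paris":  [0.9, 0.1],
    "berlin": [0.8, 0.2],
    "banana": [0.0, 1.0],
}

def l2(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def retrieve(query_vec, k=2):
    """Nearest-neighbor search in the embedding space; exponentiated
    negative distances are normalized into per-answer confidence scores."""
    ranked = sorted(entity_emb, key=lambda e: l2(query_vec, entity_emb[e]))[:k]
    weights = [math.exp(-l2(query_vec, entity_emb[e])) for e in ranked]
    z = sum(weights)
    return [(e, w / z) for e, w in zip(ranked, weights)]

answers = retrieve([1.0, 0.0])  # query vector produced by the query encoder
```

Swapping the distance function (e.g., to a hyperbolic one) changes only `l2` here, which is exactly the flexibility the retrieval benefits below refer to.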
Once we have latent representations, the Retrieval module kicks in to extract relevant answers.</p><p>The retrieval process can be seen as a nearest neighbor search of the input vector in the embedding space and has 3 direct benefits:</p><ol><li>Confidence scores for each retrieved item — thanks to a predefined distance function in the embedding space</li><li>Different definitions of the latent space and the distance function — catering for different graphs, e.g., tree-like graphs are easier to work with in hyperbolic spaces</li><li>Efficiency and scalability — retrieval scales to extremely large graphs with billions of nodes and edges</li></ol><h3><strong>Neural Query Engine</strong></h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GE2qmJuYDPt1E8wz85xHdQ.png" /><figcaption>Query planning in NGDBs (left) and traditional graph DBs (right). The NGDB planning (assuming incomplete graphs) can be performed autoregressively step-by-step (1) or generated entirely in one step (2). Traditional DB planning is cost-based and relies on metadata extracted from the graph (assumed to be complete), such as the number of intermediate answers, to build a tree of join operators. Image by Authors</figcaption></figure><p>In traditional DBs, a typical query engine performs three major operations: (1) <strong>Query parsing</strong> to verify syntax correctness (often enriched with a deeper semantic analysis of query terms); (2) <strong>Query planning</strong> and optimization to derive an efficient query plan (usually, a tree of relational operators) that minimizes computational costs; (3) <strong>Query execution</strong> that scans the storage and processes intermediate results according to the query plan.</p><p>It is rather straightforward to extend those operations to NGDBs.</p><p>1️⃣ Query Parsing can be achieved via semantic parsing to a structured query format. 
We intentionally leave the discussion on a query language for NGDBs to future work and heated public discussions 😉</p><p>2️⃣ The Query Planner derives an efficient query plan of atomic operations (projections and logical operators) maximizing completeness (all answers over existing edges must be returned) and inference (of missing edges predicted on the fly), taking into account the query complexity and the underlying graph.</p><p>3️⃣ Once the query plan is finalized, the Query Executor encodes the query (or its parts) into a latent space, communicates with the Graph Storage and its Retrieval module, and aggregates intermediate results into the final answer set. There exist two common mechanisms for query execution:</p><ul><li><em>Atomic</em>, resembling traditional DBs, when a query plan is executed sequentially by encoding atomic patterns, retrieving their answers, and executing logical operators as intermediate steps;</li><li><em>Global</em>, when the entire query graph is encoded and executed in a latent space in one step.</li></ul><p>The main challenge for neural query execution is matching the expressiveness of symbolic languages like SPARQL or Cypher — so far, neural methods can execute queries close to First-Order Logic expressiveness, but we are only about halfway to full symbolic query languages.</p><h3>A Taxonomy of Neural Graph Reasoning for Query Engines</h3><p>The literature on neural methods for complex logical query answering (aka <em>query embedding</em>) has been growing since 2018 and the seminal NeurIPS work of <a href="https://proceedings.neurips.cc/paper/7473-embedding-logical-queries-on-knowledge-graphs">Hamilton et al.</a> on <strong>Graph Query Embedding</strong> (GQE). 
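The atomic execution mode described above can be sketched end-to-end for one small conjunctive query: encode each atomic pattern as a membership-score vector over nodes, then apply a logical operator as an intermediate step (here, intersection as an element-wise `min`, the Goedel t-norm). The graph, relation, and crisp 0/1 scores are toy stand-ins for what would be learned, soft scores in a real neuro-symbolic processor:

```python
NODES = ["a", "b", "c", "d"]
EDGES = {  # relation -> list of (head, tail) pairs, invented for illustration
    "cites": [("a", "c"), ("b", "c"), ("b", "d")],
}

def project(scores, relation):
    """Atomic projection: push membership scores from heads to tails."""
    out = {n: 0.0 for n in NODES}
    for h, t in EDGES[relation]:
        out[t] = max(out[t], scores.get(h, 0.0))
    return out

def intersect(s1, s2):
    """Logical AND as an element-wise minimum (Goedel t-norm)."""
    return {n: min(s1[n], s2[n]) for n in NODES}

# Query: find ?x such that (a, cites, ?x) AND (b, cites, ?x).
answers = intersect(project({"a": 1.0}, "cites"),
                    project({"b": 1.0}, "cites"))
```

In the global mode, by contrast, the whole query graph would be encoded into a single vector and matched against the storage in one retrieval step.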
GQE was able to answer conjunctive queries with intersections and predict missing links on the fly.</p><blockquote>GQE can be considered as the first take on Neural Query Engines for NGDBs.</blockquote><p>GQE started a whole subfield of Graph Machine Learning, followed by prominent examples like <a href="https://openreview.net/forum?id=BJgr4kSFDS">Query2Box (ICLR 2020)</a> and <a href="https://openreview.net/forum?id=Mos9F9kDwkz">Continuous Query Decomposition (ICLR 2021)</a>. We undertook a major effort categorizing all those (about 50) works along 3 main directions:</p><p>⚛️ <strong>Graphs</strong> — what is the underlying structure against which we answer queries;<br>🛠️ <strong>Modeling</strong> — how we answer queries and which inductive biases are employed;<br>🗣️ <strong>Queries</strong> — what we answer, what the query structures are, and what the expected answers are.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*C-u4BxyrHFnq_GQhZTtH2w.png" /><figcaption>The taxonomy of neural approaches for complex logical query answering. See the paper for more details. Image by Authors</figcaption></figure><p>⚛️ Talking about <strong>Graphs</strong>, we further break them down into <strong>Modality</strong> (classic triple-only graphs, hyper-relational graphs, hypergraphs, and more), <strong>Reasoning Domain</strong> (discrete entities or including continuous outputs), and <strong>Semantics</strong> (how neural encoders capture higher-order relationships like OWL ontologies).</p><p>🛠️ In <strong>Modeling</strong>, we follow the Encoder-Processor-Decoder paradigm, classifying inductive biases of existing models, e.g., transductive or inductive encoders with neural or neuro-symbolic processors.</p><p>🗣️ In <strong>Queries</strong>, we aim at mapping the set of queries answerable by neural methods to that of symbolic graph query languages. 
We talk about <strong>Query Operators</strong> (going beyond standard And/Or/Not), <strong>Query Patterns</strong> (from chain-like queries to DAGs and cyclic patterns), and <strong>Projected Variables</strong> (your favorite relational algebra).</p><h3>Open Challenges for NGDBs</h3><p>Analyzing the taxonomy, we find that there is no silver bullet at the moment, e.g., most processors can only work in discrete mode with tree-based queries. But it also means there is a lot of room for future work — possibly your contribution!</p><p>To be more precise, here are the main NGDB challenges for the following years.</p><p>Along the <strong>Graph</strong> branch:</p><ul><li><strong>Modality</strong>: Supporting more graph modalities: from classic triple-only graphs to hyper-relational graphs, hypergraphs, and multimodal sources combining graphs, texts, images, and more.</li><li><strong>Reasoning Domain</strong>: Supporting logical reasoning and neural query answering over temporal and continuous (textual and numerical) data — literals constitute a major portion of graphs, as do relevant queries over literals.</li><li><strong>Background Semantics</strong>: Supporting complex axioms and formal semantics that encode higher-order relationships between (latent) classes of entities and their hierarchies, e.g., enabling neural reasoning over description logics and OWL fragments.</li></ul><p>In the <strong>Modeling</strong> branch:</p><ul><li><strong>Encoder</strong>: Inductive encoders supporting unseen relations at inference time — this is key for (1) <em>updatability</em> of neural databases without the need for retraining; (2) enabling the <em>pretrain-finetune</em> strategy generalizing query answering to custom graphs with a custom relational schema.</li><li><strong>Processor</strong>: Expressive processor networks able to effectively and efficiently execute complex query operators akin to SPARQL and Cypher operators. 
Improving the sample efficiency of neural processors is crucial for the <em>training time vs quality</em> tradeoff — reducing training time while maintaining high predictive quality.</li><li><strong>Decoder</strong>: So far, all neural query answering decoders operate exclusively on discrete nodes. Extending the range of answers to continuous outputs is crucial for answering real-world queries.</li><li><strong>Complexity</strong>: As the main computational bottleneck of processor networks is the dimensionality of the embedding space (for purely neural models) and/or the number of nodes (for neuro-symbolic ones), new efficient algorithms for neural logical operators and retrieval methods are the key to scaling NGDBs to billions of nodes and trillions of edges.</li></ul><p>In <strong>Queries</strong>:</p><ul><li><strong>Operators</strong>: Neuralizing more complex query operators matching the expressiveness of declarative graph query languages, e.g., supporting Kleene plus and star, property paths, and filters.</li><li><strong>Patterns</strong>: Answering more complex patterns beyond tree-like queries. 
This includes DAGs and cyclic graphs.</li><li><strong>Projected Variables</strong>: Projecting more than the final leaf node entity, that is, returning intermediate variables, relations, and multiple variables organized in tuples (bindings).</li><li><strong>Expressiveness</strong>: Answering queries outside the simple EPFO and EFO fragments and aiming for the expressiveness of database languages.</li></ul><p>Finally, in <strong>Datasets</strong> and <strong>Evaluation</strong>:</p><ul><li>The need for larger and more <strong>diverse benchmarks</strong> covering more graph modalities, more expressive query semantics, more query operators, and more query patterns.</li><li>As the existing evaluation protocol appears to be limited (focusing only on inferring <em>hard</em> answers), there is a need for a more <strong>principled evaluation framework and metrics</strong> covering various aspects of the query answering workflow.</li></ul><p>Pertaining to the Neural Graph Storage and NGDBs in general, we identify the following challenges:</p><ul><li>The need for a <strong>scalable retrieval</strong> mechanism to scale neural reasoning to graphs of billions of nodes. Retrieval is tightly connected to the Query Processor and its modeling priors. Existing scalable ANN libraries can only work with basic L1, L2, and cosine distances, which limits the space of possible processors in the neural query engine.</li><li>Currently, all complex query datasets provide a hardcoded query execution plan that might not be optimal. 
There is a need for a <strong>neural query planner </strong>that would transform an input query into an optimal execution sequence, taking into account prediction tasks, query complexity, the type of the neural processor, and the configuration of the Storage layer.</li></ul><p>Since encoders are inductive and the database is updatable without retraining, there is also a need to support <strong>continual learning</strong>, alleviate <strong>catastrophic forgetting</strong>, and handle <strong>size generalization</strong> when running inference on graphs much larger than the training ones.</p><h3>Learn More</h3><p>NGDB is still an emerging concept with many open challenges for future research. If you want to learn more about NGDBs, feel free to check out</p><ul><li>📜 our paper (<a href="https://arxiv.org/abs/2303.14617">arxiv</a>),</li><li>🌐 <a href="https://www.ngdb.org/">our website</a>,</li><li>🔧 our <a href="https://github.com/neuralgraphdatabases/awesome-logical-query">GitHub repo</a> with the most up-to-date list of relevant papers, datasets, and categorization; feel free to open issues and PRs.</li></ul><p>We will also be organizing workshops; stay tuned for updates!</p><hr><p><a href="https://medium.com/data-science/neural-graph-databases-cc35c9e1d04f">Neural Graph Databases</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph ML in 2023: The State of Affairs]]></title>
            <link>https://medium.com/data-science/graph-ml-in-2023-the-state-of-affairs-1ba920cb9232?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/1ba920cb9232</guid>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[editors-pick]]></category>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[graph-machine-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Sun, 01 Jan 2023 17:58:15 GMT</pubDate>
            <atom:updated>2023-01-02T15:00:16.979Z</atom:updated>
            <content:encoded><![CDATA[<h4>STATE OF THE ART DIGEST</h4><h4>Hot trends and major advancements</h4><p>2022 comes to an end and it is about time to sit down and reflect upon the achievements made in Graph ML as well as to hypothesize about possible breakthroughs in 2023. Tune in 🎄☕</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*84YLhhHBsT6blINloyhyww.png" /><figcaption>Background image generated by <a href="https://openai.com/dall-e-2/">DALL-E 2</a>, text added by Author.</figcaption></figure><p><em>The article is written together with </em><a href="http://hyren.me/"><em>Hongyu Ren</em></a><em> (Stanford University), </em><a href="https://kiddozhu.github.io/"><em>Zhaocheng Zhu</em></a><em> (Mila &amp; University of Montreal). We thank </em><a href="https://chrsmrrs.github.io/"><em>Christopher Morris</em></a><em> and </em><a href="https://www.microsoft.com/en-us/research/people/johannesb/"><em>Johannes Brandstetter</em></a><em> for the feedback and helping with the Theory and PDE sections, respectively. 
Follow </em><a href="https://twitter.com/michael_galkin"><em>Michael</em></a><em>, </em><a href="https://twitter.com/ren_hongyu"><em>Hongyu</em></a><em>, </em><a href="https://twitter.com/zhu_zhaocheng"><em>Zhaocheng</em></a>, <a href="https://twitter.com/chrsmrrs"><em>Christopher</em></a><em>, and </em><a href="https://twitter.com/jo_brandstetter"><em>Johannes</em></a> <em>here on Medium and Twitter for more graph ml-related discussions.</em></p><p><strong>Table of Contents:</strong></p><ol><li><a href="#48f6">Generative Models: Denoising Diffusion for Molecules and Proteins</a></li><li><a href="#d2e8">DFTs, ML Force Fields, Materials, and Weather Simulations</a></li><li><a href="#6d20">Geometry &amp; Topology &amp; PDEs</a></li><li><a href="#8e6c">Graph Transformers</a></li><li><a href="#ca19">BIG Graphs</a></li><li><a href="#7986">GNN Theory: Weisfeiler and Leman Go Places, Subgraph GNNs</a></li><li><a href="#e5e6">Knowledge Graphs: Inductive Reasoning Takes Over</a></li><li><a href="#b2f5">Algorithmic Reasoning and Alignment</a></li><li><a href="#0de4">Cool GNN Applications</a></li><li><a href="#b813">Hardware: IPUs and Graphcore win OGB LSC 2022</a></li><li><a href="#9b59">New Conferences: LoG and Molecular ML</a></li><li><a href="#41dc">Courses and Educational Materials</a></li><li><a href="#3e6d">New Datasets, Benchmarks, and Challenges</a></li><li><a href="#463f">Software Libraries and Open Source</a></li><li><a href="#1b30">Join the Community</a></li><li><a href="#7593">The Meme of 2022</a></li></ol><h3>Generative Models: Denoising Diffusion for Molecules and Proteins</h3><p>Generative diffusion models in the vision-language domain were the headline topic in the Deep Learning world in 2022. 
While generating images and videos is definitely a cool playground to try out different models and sampling techniques, we’d argue that</p><blockquote>the most <em>useful</em> applications of diffusion models in 2022 were actually created in the Geometric Deep Learning area focusing on molecules and proteins</blockquote><p>In our recent article, we were pondering whether <a href="https://towardsdatascience.com/denoising-diffusion-generative-models-in-graph-ml-c496af5811c5">“Denoising Diffusion Is All You Need?”</a>.</p><p><a href="https://towardsdatascience.com/denoising-diffusion-generative-models-in-graph-ml-c496af5811c5">Denoising Diffusion Generative Models in Graph ML</a></p><p>There, we reviewed the newest generative models for <em>graph generation </em>(DiGress), <em>molecular conformer generation</em> (EDM, GeoDiff, Torsional Diffusion), <em>molecular docking</em> (DiffDock), <em>molecular linking</em> (DiffLinker), and <em>ligand generation</em> (DiffSBDD). As soon as the post went public, several amazing protein generation models were released:</p><p><a href="https://www.generatebiomedicines.com/chroma"><strong>Chroma</strong></a> from Generate Biomedicines allows users to impose functional and geometric constraints, and even to use natural language queries like “Generate a protein with CHAD domain” thanks to a small GPT-Neo trained on protein captioning;</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*i89iIQ_WEmvxzrWk" /><figcaption><em>Chroma protein generation. 
Source: </em><a href="https://www.generatebiomedicines.com/chroma"><em>Generate Biomedicines</em></a></figcaption></figure><p><a href="https://www.bakerlab.org/2022/11/30/diffusion-model-for-protein-design/"><strong>RoseTTaFold Diffusion</strong></a> (RF Diffusion) from the Baker Lab and MIT is packed with similar functionality, also allowing for text prompts like “Generate a protein that binds to X” as well as being capable of functional motif scaffolding, scaffolding enzyme active sites, and <em>de novo</em> protein design. Strong point: 1000 designs generated with RF Diffusion were experimentally <a href="https://twitter.com/DaveJuergens/status/1601675072175239170">synthesized and tested</a> in the lab!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GlqRk3ixfMQbByvJ" /><figcaption><em>RF Diffusion. Source: </em><a href="https://www.biorxiv.org/content/10.1101/2022.12.09.519842v1"><em>Watson et al.</em></a><em> BakerLab</em></figcaption></figure><p>The Meta AI FAIR team made amazing progress in protein design purely with language models: mid-2022, <a href="https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1"><strong>ESM-2</strong></a> was released, a protein LM trained solely on protein sequences that outperforms ESM-1 and other baselines by a huge margin. Moreover, it was then shown that encoded LM representations are a very good starting point for obtaining the actual geometric configuration of a protein without the need for Multiple Sequence Alignments (MSAs) — this is done via <a href="https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1"><strong>ESMFold</strong></a>. 
A big shoutout to Meta AI and FAIR for publishing the model and the weights: it is available in the <a href="https://github.com/facebookresearch/esm">official GitHub repo</a> and <a href="https://huggingface.co/models?other=esm">on HuggingFace</a> as well!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*t13I9JF7RwJFQnP6" /><figcaption>Scaling ESM-2 leads to better folding prediction. Source: <a href="https://www.biorxiv.org/content/10.1101/2022.07.20.500902v1">Lin, Akin, Rao, Hie et al</a></figcaption></figure><p>🍭 Later on, even more goodies arrived from the ESM team: <a href="https://www.biorxiv.org/content/10.1101/2022.12.21.521521v1">Verkuil et al.</a> find that ESM-2 can generate <em>de novo</em> protein sequences that can actually be synthesized in the lab and, more importantly, do not have any match among known natural proteins. <a href="https://www.biorxiv.org/content/10.1101/2022.12.21.521526v1">Hie et al.</a> propose pretty much a new programming language for protein designers (think of it as a query language for ESMFold) — production rules organized in a syntax tree with constraint functions. Then, each program is “compiled” into an energy function that governs the generative process. Meta AI also released the biggest <a href="https://esmatlas.com/">Metagenomic Atlas</a>, but more on that in the <strong>Datasets</strong> section of this article.</p><p>In the antibody design area, a similar LM-based approach is taken by <strong>IgLM</strong> by <a href="https://www.biorxiv.org/content/10.1101/2021.12.13.472419v2">Shuai, Ruffolo, and Gray</a>. IGLM generates antibody sequences conditioned on chain and species id tags.</p><p>Finally, we’d highlight a few works from Jian Tang’s lab at Mila. <strong>MoleculeSTM</strong> by <a href="https://arxiv.org/abs/2212.10789">Liu et al.</a> is a CLIP-like text-to-molecule model (plus a new large pre-training dataset). 
MoleculeSTM can do 2 impressive things: (1) retrieve molecules by text description like “triazole derivatives” and retrieve text description from a given molecule in SMILES, (2) molecule editing from text prompts like “make the molecule soluble in water with low permeability” — and the model edits the molecular graph according to the description, mindblowing 🤯</p><p>Then, <strong>ProtSEED</strong> by <a href="https://arxiv.org/abs/2210.08761">Shi et al.</a> is a generative model for protein sequence <em>and</em> structure simultaneously (for example, most existing diffusion models for proteins can do only one of those at a time). ProtSEED can be conditioned on residue features or pairs of residues. Model-wise, it is an equivariant iterative model with improved triangular attention. ProtSEED was evaluated on Antibody CDR co-design, Protein sequence-structure co-design, and Fixed backbone sequence design.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*KF8K16TQmLpgoCM0" /><figcaption>Molecule editing from text inputs. Source: <a href="https://arxiv.org/abs/2212.10789">Liu et al.</a></figcaption></figure><p>Besides generating the protein structures, there are also some works for generating protein sequences from structures, known as inverse folding. 
Don’t forget to check out the <a href="https://www.biorxiv.org/content/10.1101/2022.04.10.487779v2">ESM-IF1</a> from Meta and the <a href="https://www.science.org/doi/full/10.1126/science.add2187">ProteinMPNN</a> from the Baker Lab.</p><blockquote><strong>What to expect in 2023</strong>: (1) performance improvements of diffusion models such as faster sampling and more efficient solvers; (2) more powerful conditional protein generation models; (3) more successful applications of <a href="https://arxiv.org/abs/2111.09266">Generative Flow Networks</a> (GFlowNets, check out the <a href="https://milayb.notion.site/The-GFlowNet-Tutorial-95434ef0e2d94c24aab90e69b30be9b3">tutorial</a>) to molecules and proteins.</blockquote><h3><strong>DFTs, ML Force Fields, Materials, and Weather Simulations</strong></h3><p>AI4Science becomes the frontier of equivariant GNN research and its applications. Pairing GNNs with PDEs, we can now tackle much more complex prediction tasks.</p><blockquote>In 2022, this frontier expanded to ML-based <strong>Density Functional Theory</strong> (DFT) and <strong>Force fields</strong> approximations used for <strong>molecular dynamics</strong> and <strong>material discovery.</strong> The other growing field is <strong>Weather simulations</strong>.</blockquote><p>We would recommend the <a href="https://www.youtube.com/watch?v=t7q_ZNrBghY">talk</a> by Max Welling for a broader overview of AI4Science and what is now enabled by using Deep Learning in science.</p><p>Starting with models, 2022 has seen a surge in equivariant GNNs for molecular dynamics and simulations, e.g., building upon <a href="https://arxiv.org/abs/2101.03164">NequIP</a>, <strong>Allegro</strong> by <a href="https://arxiv.org/abs/2204.05249">Musaelian, Batzner, et al.</a> or <strong>MACE</strong> by <a href="https://arxiv.org/abs/2206.07697">Batatia et al.</a> The design space for such models is very large, so refer to the recent survey by <a 
href="https://arxiv.org/abs/2205.06643">Batatia, Batzner, et al.</a> for an overview. A crucial component for most of them is the <a href="https://github.com/e3nn/e3nn"><strong>e3nn</strong></a> library (paper by <a href="https://arxiv.org/abs/2207.09453">Geiger and Smidt</a>) and the notion of tensor product. We highly recommend a great <a href="https://uvagedl.github.io/">new course</a> by Erik Bekkers on Group Equivariant Deep Learning to understand the mathematical foundations and catch up with the recent papers.</p><p>⚛️ <strong>Density Functional Theory</strong> (DFT) calculations are one of the main workhorses of molecular dynamics (and account for a great deal of computing time in big clusters). DFT scales as O(n³) with the input size, though, so can ML help here? In <em>Learned Force Fields Are Ready For Ground State Catalyst Discovery,</em> <a href="https://arxiv.org/abs/2209.12466">Schaarschmidt et al.</a> present an experimental study of learned potential models — it turns out GNNs can do a very good job in linear O(n) time! The <strong>Easy Potentials</strong> approach (trained on Open Catalyst data) turns out to be quite a good predictor, especially when paired with a postprocessing step. Model-wise, it is an MPNN with the <a href="https://arxiv.org/abs/2106.07971">Noisy Nodes</a> self-supervised objective.</p><p>In <strong>Forces are not Enough</strong>, <a href="https://arxiv.org/abs/2210.07237">Fu et al.</a> introduce a new benchmark for molecular dynamics — in addition to MD17, the authors add datasets on modeling liquids (Water), peptides (Alanine dipeptide), and solid-state materials (LiPS). More importantly, the authors consider a wide range of physical properties like stability of simulations, diffusivity, and radial distribution functions. 
Most SOTA molecular dynamics models were probed, including SchNet, ForceNet, DimeNet, GemNet (-T and -dT), and NequIP.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qnxr3F6pBTefNkOD" /><figcaption>Source: <a href="https://arxiv.org/abs/2210.07237">Fu et al.</a></figcaption></figure><p>In crystal structure modeling, we’d highlight <strong>Equivariant Crystal Networks</strong> by <a href="https://openreview.net/forum?id=0Dh8dz4snu">Kaba and Ravanbakhsh</a> — a neat way to build representations of periodic structures with crystalline symmetries. Crystals can be described with <em>lattices</em> and <em>unit cells</em> with basis vectors that are subject to group transformations. Conceptually, ECN creates edge index masks corresponding to symmetry groups, performs message passing over this masked index, and aggregates the results of many symmetry groups.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*yaL_yc3yHKGTh4NS" /><figcaption>Source: <a href="https://openreview.net/forum?id=0Dh8dz4snu">Kaba and Ravanbakhsh</a></figcaption></figure><p>Even more news on material discovery can be found in the proceedings of the recent <a href="https://sites.google.com/view/ai4mat">AI4Mat NeurIPS workshop</a>!</p><p>☂️ ML-based weather forecasting made huge progress as well. In particular, <a href="https://arxiv.org/abs/2212.12794"><strong>GraphCast</strong></a> by DeepMind and <a href="https://arxiv.org/abs/2211.02556"><strong>Pangu-Weather</strong></a> by Huawei demonstrated exceptionally good results outperforming traditional models by a large margin. While Pangu-Weather leverages 3D/visual inputs and Visual Transformers, GraphCast employs a mesh MPNN where Earth is split into several hierarchy levels of meshes. The deepest level has about 40K nodes with 474 input features and the model outputs 227 predicted variables. The MPNN follows the “encoder-processor-decoder” scheme and has 16 layers. GraphCast is an autoregressive model w.r.t. 
the next timestep prediction, that is, it takes the previous two states and predicts the next one. GraphCast can build a 10-day forecast in &lt;60 seconds on a single TPUv4 and is much more accurate than non-ML forecasting models. 👏</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*xwM2nBv3OGjw-OhY" /><figcaption>Encoder-Processor-Decoder mesh MPNN in GraphCast. Source: <a href="https://arxiv.org/abs/2212.12794">Lam, Sanchez-Gonzalez, Willson, Wirnsberger, Fortunato, Pritzel, et al.</a></figcaption></figure><blockquote><strong>What to expect in 2023</strong>: We expect to see a lot more focus on computational efficiency and scalability of GNNs. Current GNN-based force-fields are obtaining remarkable accuracy, but are still 2–3 orders of magnitude slower than classical force-fields and are typically only deployed on a few hundred atoms. For GNNs to truly have a transformative impact on materials science and drug discovery, we will see many folks tackling this issue, be it through architectural advances or smarter sampling.</blockquote><h3>Geometry &amp; Topology &amp; PDEs</h3><p>In 2022, 1️⃣ we got a better understanding of oversmoothing and oversquashing phenomena in GNNs and their connections to algebraic topology; 2️⃣ using GNNs for PDE modeling is now mainstream.</p><p>1️⃣ Michael Bronstein’s lab made huge contributions to this problem — check out these excellent posts on Neural Sheaf Diffusion and on framing GNNs as gradient flows:</p><p><a href="https://towardsdatascience.com/neural-sheaf-diffusion-for-deep-learning-on-graphs-bfa200e6afa6">Neural Sheaf Diffusion for deep learning on graphs</a></p><p>And on GNNs as gradient flows:</p><p><a href="https://towardsdatascience.com/graph-neural-networks-as-gradient-flows-4dae41fb2e8a">Graph Neural Networks as gradient flows</a></p><p>2️⃣ Using GNNs for PDE modeling became a mainstream topic. 
Some papers require the 🤯 <strong>math alert</strong> 🤯 warning, but if you are familiar with the basics of ODEs and PDEs it should be much easier.</p><p><em>Message Passing Neural PDE Solvers</em> by <a href="https://openreview.net/forum?id=vSix3HPYKSU">Brandstetter, Worrall, and Welling</a> describes how message passing can help solve PDEs, generalize better, and get rid of manual heuristics. Furthermore, MP-PDEs representationally contain classic solvers like finite differences.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*FtAIW7ScbxGnUVyJ" /><figcaption>Source: <a href="https://openreview.net/forum?id=vSix3HPYKSU">Brandstetter, Worrall, and Welling</a></figcaption></figure><p>The topic was developed further by many recent works including continuous forecasting with implicit neural representations (<a href="https://arxiv.org/abs/2209.14855">Yin et al.</a>), supporting mixed boundary conditions (<a href="https://openreview.net/forum?id=B3TOg-YCtzo">Horie and Mitsume</a>), or latent evolution of PDEs (<a href="https://arxiv.org/abs/2206.07681">Wu et al.</a>).</p><blockquote><strong>What to expect in 2023</strong>: Neural PDEs and their applications are likely to expand to more physics-related AI4Science subfields, where computational fluid dynamics (CFD) in particular will likely be influenced by GNN-based surrogates in the coming months. Classical CFD is applied to a wide range of research and engineering problems in many fields of study, including aerodynamics, hypersonic and environmental engineering, fluid flows, visual effects in video games, or weather simulations as discussed above. 
GNN-based surrogates might augment or replace traditional, well-tried techniques such as finite element methods (<a href="https://arxiv.org/abs/2203.08852">Lienen et al.</a>), remeshing algorithms (<a href="https://arxiv.org/abs/2204.11188">Song et al.</a>), boundary value problems (<a href="https://arxiv.org/abs/2206.14092">Loetsch et al.</a>), or interactions with triangularized boundary geometries (<a href="https://arxiv.org/abs/2106.11299">Mayr et al.</a>).</blockquote><blockquote>The neural PDE community is starting to build strong and commonly used baselines and frameworks, which will in turn help accelerate progress, e.g. <strong>PDEBench</strong> (<a href="https://arxiv.org/abs/2210.07182">Takamoto et al.</a>) or <strong>PDEArena</strong> (<a href="https://arxiv.org/abs/2209.15616">Gupta et al.</a>).</blockquote><h3>Graph Transformers</h3><p>Definitely one of the main community drivers in 2022, <strong>graph transformers</strong> (GTs) evolved a lot towards higher effectiveness and better scalability. Several outstanding models were published in 2022:</p><p><strong>👑 GraphGPS</strong> by <a href="https://arxiv.org/abs/2205.12454">Rampášek et al.</a> takes the title of <strong>“GT of 2022”</strong> thanks to combining local message passing, global attention (optionally linear for higher efficiency), and positional encodings, which led to setting a new SOTA on ZINC and many other benchmarks. Check out a dedicated article on GraphGPS:</p><p><a href="https://towardsdatascience.com/graphgps-navigating-graph-transformers-c2cc223a051c">GraphGPS: Navigating Graph Transformers</a></p><p>GraphGPS served as the backbone of <strong>GPS++,</strong> the <a href="https://ogb.stanford.edu/neurips2022/results/#winners_pcqm4mv2">winning</a> OGB Large Scale Challenge 2022 model on PCQM4M v2 (graph regression).
<strong>GPS++</strong>, <a href="https://arxiv.org/abs/2212.02229">created by</a> Graphcore, Valence Discovery, and Mila, incorporates more features, including 3D coordinates, and leverages sparse-optimized IPU hardware (more on that in the following section). GPS++ weights are already <a href="https://github.com/graphcore/ogb-lsc-pcqm4mv2">available</a> on GitHub!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jdDh-eGvvW8trjJP" /><figcaption>GraphGPS intuition. Source: <a href="https://arxiv.org/abs/2205.12454">Rampášek et al</a></figcaption></figure><p><strong>Transformer-M</strong> by <a href="https://arxiv.org/abs/2210.01765">Luo et al.</a> inspired many top OGB LSC models as well. Transformer-M adds 3D coordinates to a neat joint 2D-3D pre-training mix. At inference time, when 3D info is not known, the model infers a glimpse of 3D knowledge, which improves the performance on PCQM4Mv2 by a good margin. Code is <a href="https://github.com/lsj2408/Transformer-M">available</a> as well.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*H-WW5aJ5kcQkgHzH" /><figcaption><em>Transformer-M joint 2D-3D pre-training scheme. Source: </em><a href="https://arxiv.org/abs/2210.01765"><em>Luo et al.</em></a></figcaption></figure><p><strong>TokenGT </strong>by <a href="https://arxiv.org/abs/2207.02505">Kim et al</a> takes an even more explicit route and adds all edges of the input graph (in addition to all nodes) to the sequence fed to the Transformer. With those inputs, the encoder needs additional token types to distinguish nodes from edges. The authors prove several nice theoretical properties (although at the cost of higher computational complexity O((V+E)²) that can get to the 4th power in the worst case of a fully-connected graph).
Code is <a href="https://github.com/jw9730/tokengt">available</a>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*zBtdEqj9_J67pzZv" /><figcaption>TokenGT adds both nodes and edges to the input sequence. Source: <a href="https://arxiv.org/abs/2207.02505">Kim et al</a></figcaption></figure><blockquote><strong>What to expect in 2023</strong>: for the coming year, we’d expect 1️⃣ GTs to scale up along the axes of both data and model parameters, from molecules of &lt;50 nodes to graphs of millions of nodes, in order to witness the emergent behavior as in text &amp; vision foundation models 2️⃣ similar to <a href="https://huggingface.co/bigscience/bloom">BLOOM</a> by the BigScience Initiative, a big open-source pre-trained equivariant GT for molecular data, perhaps within the <a href="https://m2d2.io/opendrugdiscovery/">Open Drug Discovery</a> project.</blockquote><h3>BIG Graphs</h3><p>🔥 One of our favorites in 2022 is <em>“Graph Neural Networks for Link Prediction with Subgraph Sketching</em>” by <a href="https://arxiv.org/abs/2209.15486">Chamberlain, Shirobokov et al.</a> — this is a neat combination of algorithms + ML techniques. It is known that <a href="https://arxiv.org/pdf/2010.16103.pdf">SEAL</a>-like labeling tricks dramatically improve link prediction performance compared to standard GNN encoders but suffer from big computation/memory overhead. In this work, the authors find that obtaining distances from two nodes of a query edge can be efficiently done with hashing (<a href="https://en.wikipedia.org/wiki/MinHash">MinHashing</a>) and cardinality estimation (<a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog</a>) algorithms. Essentially, message passing is done over <em>minhashing</em> and <em>hyperloglog</em> initial sketches of single nodes (<em>min</em> aggregation for minhash, <em>max</em> for hyperloglog sketches) — this is the core of the <strong>ELPH</strong> link prediction model (with a simple MLP decoder). 
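To make the sketching idea concrete, here is a toy MinHash example (an illustration of the underlying algorithm, not the authors’ implementation): signatures of two node neighbourhoods let us estimate their Jaccard overlap without ever materializing the intersection.

```python
import random

def minhash_signature(items, num_hashes=256, seed=0):
    """MinHash sketch: for each of num_hashes random hash functions,
    keep the minimum hash value over the set's items."""
    rng = random.Random(seed)
    p = (1 << 61) - 1  # a large Mersenne prime for h(x) = (a*x + b) mod p
    params = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % p for x in items) for a, b in params]

def jaccard_estimate(sig1, sig2):
    """Fraction of hash functions on which the two sketches agree."""
    return sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / len(sig1)

# two overlapping node neighbourhoods
n_u = set(range(0, 20))    # neighbours of node u
n_v = set(range(10, 30))   # neighbours of node v: true Jaccard = 10/30
est = jaccard_estimate(minhash_signature(n_u), minhash_signature(n_v))
assert abs(est - 10 / 30) < 0.2   # the estimate is close to the true overlap
```

ELPH's trick is then to message-pass over such per-node sketches (min-aggregation for MinHash, max for HyperLogLog) instead of explicit subgraphs.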
The authors then design a more scalable <strong>BUDDY</strong> model where k-hop hash propagation can be precomputed before training. Experimentally, ELPH and BUDDY scale to large graphs that were previously too large or too resource-hungry for labeling-trick approaches. Great work and definitely a solid baseline for all future link prediction models! 👏</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*97QlEb5fjGMQX_y3" /><figcaption>The motivation behind computing subgraph hashes to estimate cardinalities of neighborhoods and intersections. Source: <a href="https://arxiv.org/abs/2209.15486">Chamberlain, Shirobokov et al.</a></figcaption></figure><p>On the graph sampling and minibatching side, <a href="https://openreview.net/forum?id=b9g0vxzYa_">Gasteiger, Qian, and Günnemann</a> design <a href="https://github.com/tum-daml/ibmb"><strong>Influence-based Mini-Batching (IBMB)</strong></a>, a good example of how Personalized PageRank (PPR) can solve even graph batching! IBMB aims at creating the smallest minibatches whose nodes have the maximum influence on the node classification task. In fact, the influence score is equivalent to PPR. Practically, given a set of target nodes, IBMB (1) partitions the graph into permanent clusters and (2) runs PPR within each batch to select the top-PPR nodes that form the final subgraph minibatch. The resulting minibatches can be sent to any GNN encoder. IBMB is essentially <strong>constant</strong>, O(1), with respect to the graph size, since partitioning and PPRs can be precomputed at the pre-processing stage.</p><p>Although the resulting batches are fixed and do not change over training (not stochastic enough), the authors design momentum-like optimization terms to mitigate this non-stochasticity.
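The influence-based selection can be sketched in a few lines (a toy version under simplifying assumptions, not the IBMB code): compute PPR with the batch's target nodes as the teleport set and keep the top-k scoring nodes.

```python
import numpy as np

def personalized_pagerank(adj, seeds, alpha=0.15, iters=50):
    """Power iteration for PPR: teleport back to the seed nodes with
    probability alpha, otherwise follow a random out-edge."""
    n = adj.shape[0]
    deg = adj.sum(axis=1, keepdims=True)
    trans = adj / np.maximum(deg, 1)          # row-stochastic transition matrix
    restart = np.zeros(n)
    restart[list(seeds)] = 1 / len(seeds)
    ppr = restart.copy()
    for _ in range(iters):
        ppr = alpha * restart + (1 - alpha) * ppr @ trans
    return ppr

def ppr_minibatch(adj, seeds, k):
    """IBMB-style selection (sketch): the minibatch is the k nodes with
    the highest PPR / influence score w.r.t. the target nodes."""
    ppr = personalized_pagerank(adj, seeds)
    return set(np.argsort(-ppr)[:k])

# a path graph 0-1-2-3-4-5: nodes closest to the seed are most influential
A = np.zeros((6, 6))
for i in range(5):
    A[i, i + 1] = A[i + 1, i] = 1
batch = ppr_minibatch(A, seeds=[0], k=3)
assert 0 in batch and 1 in batch   # the seed and its nearest neighbour get picked
```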
IBMB can be used both in training and inference with massive speedups — up to 17x and 130x, respectively 🚀</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*o8nHPFJ7qlbHG0T7" /><figcaption>Influence-based mini-batching. Source: <a href="https://openreview.net/forum?id=b9g0vxzYa_">Gasteiger, Qian, and Günnemann</a></figcaption></figure><p>The subtitle of this subsection could be “<em>brought to you by Google</em>” since the majority of the papers have authors from Google ;)</p><p><a href="https://openreview.net/pdf?id=q5h7Ywx-sS">Carey et al.</a> created <strong><em>Stars</em></strong>, a method for building sparse similarity graphs at the scale of <strong>tens of trillions</strong> of edges 🤯. Pairwise N² comparisons would obviously not work here — Stars employs two-hop <a href="https://en.wikipedia.org/wiki/Geometric_spanner">spanner graphs</a> (graphs where similar points are connected by at most two hops) and <a href="http://infolab.stanford.edu/~bawa/Pub/similarity.pdf">SortingLSH</a>, which together enable almost linear time complexity and high sparsity.</p><p><a href="https://openreview.net/pdf?id=LpgG0C6Y75">Dhulipala et al.</a> created <strong>ParHAC</strong>, an approximate (1+𝝐) parallel algorithm for hierarchical agglomerative clustering (HAC) on very large graphs, together with extensive theoretical foundations of the algorithm. ParHAC has O(V+E) complexity and poly-log depth, and runs up to 60x faster than baselines on graphs with <strong>hundreds of billions</strong> of edges (here it is the Hyperlink graph with 1.7B nodes and 125B edges).</p><p><a href="https://openreview.net/pdf?id=ldl2V3vLZ5">Devvrit et al.</a> created <strong>S³GC</strong>, a scalable self-supervised graph clustering algorithm with a one-layer GNN and a contrastive training objective.
S³GC uses both graph structure and node features and scales to graphs of up to 1.6B edges.</p><p>Finally, <a href="https://openreview.net/forum?id=Fhty8PgFkDo">Epasto et al.</a> created a differentially-private modification of PageRank!</p><p>LoG 2022 featured two tutorials on large-scale GNNs: <a href="https://www.youtube.com/watch?v=HRC4hZKiUWU">Scaling GNNs in Production</a> by Da Zheng, Vassilis N. Ioannidis, and Soji Adeshina and <a href="https://www.youtube.com/watch?v=e2jJU7u7si0">Parallel and Distributed GNNs</a> by Torsten Hoefler and Maciej Besta.</p><blockquote><strong>What to expect in 2023</strong>: further reduction in compute costs and inference time for very large graphs. Perhaps models for OGB LSC graphs could run on commodity machines instead of huge clusters?</blockquote><h3>GNN Theory: Weisfeiler and Leman Go Places, Subgraph GNNs</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nETwu-WX5ejpbS4q" /><figcaption>Tourists of the year! Source of the original portraits: <a href="https://towardsdatascience.com/towards-geometric-deep-learning-iv-chemical-precursors-of-gnns-11273d74125">Towards Geometric Deep Learning IV: Chemical Precursors of GNNs</a> by Michael Bronstein</figcaption></figure><p>🏖 🌄 Weisfeiler and Leman, grandfathers of Graph ML and GNN theory, had a very prolific traveling year! 
After visiting <a href="https://ojs.aaai.org/index.php/AAAI/article/view/4384">Neural</a>, <a href="https://proceedings.neurips.cc/paper/2020/file/f81dee42585b3814de199b2e88757f5c-Paper.pdf">Sparse</a>, <a href="http://proceedings.mlr.press/v139/bodnar21a/bodnar21a.pdf">Topological</a>, and <a href="https://proceedings.neurips.cc/paper/2021/file/157792e4abb490f99dbd738483e0d2d4-Paper.pdf">Cellular</a> places in previous years, in 2022 we saw them in several new places:</p><ul><li>WL Go <strong>Machine Learning</strong> — a comprehensive survey by <a href="https://arxiv.org/abs/2112.09992">Morris et al</a> on the basics of the WL test, terminology, and various applications;</li><li>WL Go <strong>Relational</strong> — the first attempt by <a href="https://arxiv.org/abs/2211.17113">Barcelo et al</a> to study the expressiveness of relational GNNs used in multi-relational graphs and KGs. It turns out that R-GCN and CompGCN are equally expressive and are bounded by the Relational 1-WL test, and that the most expressive message function (aggregating entity-relation representations) is the Hadamard product;</li><li><a href="https://arxiv.org/abs/2205.10914">WL Go Walking by Niels M. Kriege</a> studies the expressiveness of random walk kernels and finds that the RW kernel (with a small modification) is as expressive as the WL subtree kernel;</li><li>WL Go <strong>Geometric</strong>: <a href="https://openreview.net/forum?id=kXe4Y0c4VqT">Joshi, Bodnar et al</a> propose the Geometric WL test (GWL) to study the expressiveness of equivariant and invariant GNNs (to certain symmetries: translation, rotation, reflection, permutation).
Turns out, equivariant GNNs (such as <a href="https://arxiv.org/abs/2102.09844">E-GNN</a>, <a href="https://arxiv.org/abs/2101.03164">NequIP</a> or <a href="https://arxiv.org/abs/2206.07697">MACE</a>) are provably more powerful than invariant GNNs (such as <a href="https://proceedings.neurips.cc/paper/2017/file/303ed4c69846ab36c2904d3ba8573050-Paper.pdf">SchNet</a> or <a href="https://arxiv.org/abs/2011.14115">DimeNet</a>);</li><li>WL Go <strong>Temporal</strong>: <a href="https://openreview.net/pdf?id=MwSXgQSxL5s">Souza et al</a> propose the Temporal WL test to study the expressiveness of temporal GNNs. The authors then propose a novel injective aggregation function (and the PINT model) that should be the most expressive;</li><li>WL Go <strong>Gradual</strong>: <a href="https://openreview.net/forum?id=fe1DEN1nds">Bause and Kriege</a> propose to modify the original WL color refinement with a non-injective function where different multi-sets <em>might</em> get assigned the same color (under certain conditions). This enables a more gradual color refinement and slower convergence to a stable coloring that eventually retains the expressiveness of 1-WL but gains a few distinguishing properties along the way.</li><li>WL Go <strong>Infinite</strong>: <a href="https://arxiv.org/abs/2201.13410">Feldman et al</a> propose to change the initial node coloring to spectral features derived from the heat kernel of the Laplacian or from the k smallest eigenvectors of the Laplacian (for large graphs), which is quite close to Laplacian Positional Encodings (LPEs).</li><li>WL Go <strong>Hyperbolic</strong>: <a href="https://arxiv.org/abs/2211.02501">Nikolentzos et al</a> note that the color refinement procedure of the WL test produces a tree hierarchy of colors. In order to preserve the relative distances of nodes encoded by those colors, the authors propose to map the output states of each layer/iteration into a hyperbolic space and update them after each subsequent layer.
The final embeddings are supposed to retain the notion of node distances.</li></ul><p>📈 In the realm of more expressive (than 1-WL) architectures, subgraph GNNs are the biggest trend. Among those, four approaches stand out: 1️⃣ <strong>Subgraph Union Networks</strong> (SUN) by <a href="https://arxiv.org/abs/2206.11140">Frasca, Bevilacqua, et al.</a> provide a comprehensive analysis of the subgraph GNN design space and expressiveness, showing they are bounded by 3-WL; 2️⃣ <strong>Ordered Subgraph Aggregation Networks </strong>(OSAN) by <a href="https://arxiv.org/abs/2206.11168">Qian, Rattan, et al</a> devise a hierarchy of subgraph-enhanced GNNs (k-OSAN) and find that k-OSAN are incomparable to k-WL but are strictly limited by (k+1)-WL. One particularly cool part of OSAN is using <a href="https://arxiv.org/abs/2106.01798">Implicit MLE</a> (NeurIPS’21), a differentiable discrete sampling technique, for sampling ordered subgraphs. <strong>️3️⃣ SpeqNets </strong>by <a href="https://arxiv.org/abs/2203.13913">Morris et al.</a> devise a permutation-equivariant hierarchy of graph networks that balances between scalability and expressivity. 4️⃣<strong> GraphSNN</strong> by <a href="https://openreview.net/pdf?id=uxgg9o7bI_3">Wijesinghe and Wang</a> derives expressive models based on the overlap of <em>subgraph</em> isomorphisms and <em>subtree</em> isomorphisms.</p><p>🤔 A few works rethink the WL framework as a general means for analyzing GNN expressiveness. <a href="https://openreview.net/pdf?id=wIzUeM3TAU">Geerts and Reutter</a> define <strong>k-order MPNNs</strong> that can be characterized with Tensor Languages (with a mapping between WL and <strong>Tensor Languages</strong>).
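Since almost every result in this section is stated relative to the 1-WL test, here is a compact sketch of colour refinement (our own illustration, not from any of the papers): it cannot distinguish a 6-cycle from two disjoint triangles, because both graphs are 2-regular and every node always ends up with the same colour.

```python
from collections import Counter

def wl_colors(adj_list, rounds=3):
    """1-WL colour refinement: a node's new colour is a hash of its current
    colour together with the multiset of its neighbours' colours."""
    colors = {v: 0 for v in adj_list}          # uniform initial colouring
    for _ in range(rounds):
        colors = {v: hash((colors[v], tuple(sorted(colors[u] for u in adj_list[v]))))
                  for v in adj_list}
    return Counter(colors.values())            # final colour histogram

cycle6 = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
two_triangles = {0: [1, 2], 1: [0, 2], 2: [0, 1],
                 3: [4, 5], 4: [3, 5], 5: [3, 4]}
# both graphs are 2-regular, so 1-WL produces identical colour histograms
# even though the graphs are not isomorphic
assert wl_colors(cycle6) == wl_colors(two_triangles)
```

Subgraph GNNs and the other hierarchies above exist precisely to break such ties that plain 1-WL message passing cannot.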
A new <a href="https://openreview.net/forum?id=r9hNv76KoT3">anonymous ICLR’23 submission</a> proposes to leverage <a href="https://en.wikipedia.org/wiki/Biconnected_component">graph biconnectivity</a> and defines a <strong>Generalized Distance WL</strong> algorithm.</p><p>If you’d like to study the topic even deeper, check out a wonderful <a href="https://www.youtube.com/watch?v=ASQYjbUBYzs&amp;list=PL2iNJC54likoqgKwpFnbBik8Im1sZ27Hm&amp;index=7">LOG 2022 tutorial</a> by Fabrizio Frasca, Beatrice Bevilacqua, and Haggai Maron with practical examples!</p><blockquote><strong>What to expect in 2023</strong>: <em>1️⃣</em> More efforts on creating time- and memory-efficient subgraph GNNs. <em>2️⃣</em> Better understanding of the generalization of GNNs. <em>3️⃣</em> Weisfeiler and Leman visit 10 new places!</blockquote><h3>Knowledge Graphs: Inductive Reasoning Takes Over</h3><p>Last year, we observed a major shift in KG representation learning: transductive-only approaches are being actively retired in favor of inductive models that can build meaningful representations for new, unseen nodes and perform node classification and link prediction.</p><p>In 2022, the field was expanding along two main axes: 1️⃣ inductive link prediction (LP) and 2️⃣ inductive (multi-hop) query answering, which extends link prediction to much more complex prediction tasks.</p><p>1️⃣ In link prediction, the majority of inductive models (like <a href="https://arxiv.org/abs/2106.06935"><strong>NBFNet</strong></a> or <a href="https://arxiv.org/abs/2106.12144"><strong>NodePiece</strong></a>) transfer to unseen nodes at inference time by assuming that the set of relation types is fixed during training and does not change over time, so they can learn relation embeddings. What happens when the set of relations changes as well?
In the hardest case, we’d want to transfer to KGs with completely different nodes <strong>and</strong> relation types.</p><p>So far, all such models supporting unseen relations resort to meta-learning, which is slow and resource-hungry. In 2022, for the first time, <a href="https://openreview.net/forum?id=LvW71lgly25">Huang, Ren, and Leskovec</a> proposed the Connected Subgraph Reasoner (<strong>CSR</strong>) framework that is inductive along <strong>both</strong> entities and relation types <strong>and</strong> does not need any meta-learning! 👀 Generally, for new relations at inference time, models see at least <em>k</em> example triples with this relation (hence, a k-shot learning scenario). Conceptually, CSR extracts subgraphs around each example, trying to learn common relational patterns (i.e., optimizing edge masks), and then applies the mask to the query subgraph (with the missing target link to predict).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*jfJWjM8iNDjX_1GP" /><figcaption>Inductive CSR that supports KGs with unseen entities and relation types. Source: <a href="https://openreview.net/forum?id=LvW71lgly25">Huang, Ren, and Leskovec</a></figcaption></figure><p><strong>ReFactor GNNs </strong>by <a href="https://openreview.net/forum?id=81LQV4k7a7X">Chen et al.</a> is another insightful work on the inductive qualities of shallow KG embedding models — in particular, the authors find that shallow factorization models like DistMult resemble infinitely deep GNNs when viewed through the lens of backpropagation, i.e., how nodes update their representations from neighboring and non-neighboring nodes. It turns out that, theoretically, any factorization model can be turned into an inductive model!</p><p>2️⃣ Inductive representation learning arrived in the area of complex logical query answering as well.
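A quick primer before the details below: neural query executors typically replace the set intersection and union of query answers with differentiable fuzzy-logic operations over per-entity scores. A minimal sketch with product t-norms (illustrative scores, not any particular model's code):

```python
import numpy as np

def t_norm(a, b):
    """Product t-norm: fuzzy conjunction (intersection of answer sets)."""
    return a * b

def t_conorm(a, b):
    """Dual t-conorm: fuzzy disjunction (union of answer sets)."""
    return a + b - a * b

def negation(a):
    """Fuzzy negation."""
    return 1 - a

# fuzzy answer scores for two subqueries over a 5-entity vocabulary
q1 = np.array([0.9, 0.8, 0.1, 0.0, 0.5])   # scores from one relation projection
q2 = np.array([0.7, 0.2, 0.9, 0.0, 0.5])   # scores from another projection
assert np.allclose(t_norm(q1, q2), [0.63, 0.16, 0.09, 0.0, 0.25])
assert np.allclose(t_conorm(q1, q2), [0.97, 0.84, 0.91, 0.0, 0.75])
```

Because every operation is a smooth function of the scores, the whole logical query becomes end-to-end differentiable.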
(shameless plug) In fact, it was one of the focuses of our team this year 😊 First, in <a href="https://arxiv.org/abs/2205.10128">Zhu et al.</a>, we found that Neural Bellman-Ford nets generalize well from simple link prediction to complex query answering tasks in a new <a href="https://github.com/DeepGraphLearning/GNN-QE"><strong>GNN Query Executor</strong></a> (GNN-QE) model where a GNN based on NBF-Net performs relation projections while the other logical operators are performed via fuzzy-logic <a href="https://en.wikipedia.org/wiki/T-norm">t-norms</a>. Then, in <a href="https://openreview.net/forum?id=-vXEN5rIABY">Inductive Logical Query Answering in Knowledge Graphs</a> we studied ⚗️ <em>the essence of inductiveness</em> ⚗️ and proposed two ways to answer logical queries over unseen entities at inference time: via (1) inductive node representations obtained with a NodePiece encoder paired with an inference-only decoder (less performant but scalable), or via (2) inductive relational structure representations akin to the one in GNN-QE (better quality but more resource-hungry and hard to scale). Overall, we are able to scale to an inductive query setting on graphs <strong>with millions of nodes and 500k unseen nodes and 5m unseen edges</strong> during inference.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*XaCPV1Io68go-we8" /><figcaption>Inductive logical query answering approaches: via node representations (NodePiece-QE) and relational structure representations (GNN-QE). Source: <a href="https://arxiv.org/abs/2210.08008">Galkin et al.</a></figcaption></figure><p>Another cool work in the area is <a href="https://github.com/google-research/smore"><strong>SMORE</strong></a><strong> </strong>by <a href="https://arxiv.org/abs/2110.14890">Ren, Dai, et al.</a> — a large-scale (so far transductive-only) system for complex query answering over very large graphs, scaling up to the full Freebase with about 90M nodes and 300M edges 👀.
In addition to CUDA, training, and pipeline optimizations, SMORE implements a bidirectional query sampler such that training queries can be generated on-the-fly right in the data loader instead of creating and storing huge datasets. Don’t forget to check out a <a href="https://www.youtube.com/watch?v=kzWV57qJmiA&amp;list=PL2iNJC54likoqgKwpFnbBik8Im1sZ27Hm&amp;index=1">fresh hands-on tutorial</a> on large-scale graph reasoning from LOG 2022!</p><p>Last but not least, <a href="https://arxiv.org/pdf/2209.08858.pdf">Yang, Lin and Zhang</a> presented an interesting paper rethinking the evaluation of knowledge graph completion. They point out that knowledge graphs tend to be open-world (i.e., there are facts not encoded by the knowledge graph) rather than closed-world, as assumed by most works. As a result, metrics observed under the closed-world assumption exhibit a log trend w.r.t. the true metric — this means that if you get 0.4 MRR for your model, chances are that the test knowledge graph is incomplete and your model has already done a good job 👍. Maybe we can design new datasets and evaluation protocols to mitigate this issue?</p><blockquote><strong>What to expect in 2023</strong>: an inductive model fully transferable to different KGs with new sets of entities and relations, e.g., training on Wikidata and running inference on DBpedia or Freebase.</blockquote><h3>Algorithmic Reasoning and Alignment</h3><p>2022 was a year of major breakthroughs and milestones for algorithmic reasoning.</p><p>1️⃣ First, the <a href="https://github.com/deepmind/clrs"><strong>CLRS benchmark</strong></a> by <a href="https://arxiv.org/abs/2205.15659">Veličković et al.</a> is now available as the main playground to design and benchmark algorithmic reasoning models and tasks.
CLRS already includes 30 tasks (such as classical sorting algorithms, string algorithms, and graph algorithms) but still allows you to bring your own formulations or modify existing ones.</p><p>2️⃣ Then, the <strong>Generalist Neural Algorithmic Learner</strong> by <a href="https://openreview.net/forum?id=FebadKZf6Gd">Ibarz et al.</a> and DeepMind has shown that it is possible to train a <em>single</em> processor network in multi-task mode on different algorithms — previously, you’d train a single model for a single task, repeating that for all 30 CLRS problems. The paper also describes several modifications and tricks to the model architecture and training procedure that let the model generalize better and prevent forgetting, e.g., triplet reasoning similar to triangular attention (common for molecular models) and <a href="https://arxiv.org/abs/2112.00578">edge transformers</a>. Overall, the new model brings a massive 25% absolute gain over baselines and solves 24 out of 30 CLRS tasks with 60%+ micro-F1.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5HX099G5aCmAYlK5" /><figcaption>Source: <a href="https://openreview.net/forum?id=FebadKZf6Gd">Ibarz et al.</a></figcaption></figure><p>3️⃣ Last year, we <a href="https://towardsdatascience.com/graph-ml-in-2022-where-are-we-now-f7f8242599e0#72d1">discussed</a> the works on algorithmic alignment and saw the signs that GNNs can probably align well with dynamic programming. In 2022, <a href="https://openreview.net/forum?id=wu1Za9dY1GY">Dudzik and Veličković</a> prove that <strong>GNNs are Dynamic Programmers</strong> using category theory, abstract algebra, and the notions of <em>pushforward</em> and <em>pullback</em> operations. This is a wonderful example of applying category theory, which many people consider “abstract nonsense” 😉.
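The textbook instance of this alignment (a sketch for illustration, not code from the paper): Bellman-Ford shortest paths is literally a min-aggregation message-passing loop over the edge list, with each node's distance estimate playing the role of its hidden state.

```python
import math

def bellman_ford_mp(n, edges, source):
    """Bellman-Ford as message passing: in every synchronous round, each node
    min-aggregates messages dist[u] + w arriving along edges (u, v, w)."""
    dist = [math.inf] * n
    dist[source] = 0.0
    for _ in range(n - 1):                    # n-1 rounds suffice
        new = list(dist)
        for u, v, w in edges:                 # message along edge u -> v
            new[v] = min(new[v], dist[u] + w)
        dist = new
    return dist

edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 1.0)]
assert bellman_ford_mp(4, edges, source=0) == [0.0, 3.0, 1.0, 4.0]
```

Swap the fixed `min` and `dist[u] + w` for a learned aggregator and message function, and you get the GNN view of dynamic programming.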
Category theory is likely to have more impact in GNN theory and Graph ML in general, so check out the fresh <a href="https://cats.for.ai/">Cats4AI</a> course for a gentle introduction to the field.</p><p>4️⃣ Finally, the work of <a href="https://openreview.net/forum?id=AiY6XvomZV4">Beurer-Kellner et al.</a> is one of the first practical applications of the neural algorithmic reasoning framework: here it is applied to configuring computer networks, i.e., routing protocols like BGP that are at the core of the internet. The authors show that representing a routing config as a graph makes it possible to frame the routing problem as node property prediction. This approach brings a whopping 👀 <strong>490x</strong> 👀 speedup compared to traditional rule-based routing methods while still maintaining 90+% specification consistency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Kyy7HdkpCjs8RI6k" /><figcaption>Source: <a href="https://openreview.net/forum?id=AiY6XvomZV4">Beurer-Kellner et al.</a></figcaption></figure><p>If you want to follow algorithmic reasoning more closely, don’t miss a fresh <a href="https://algo-reasoning.github.io/">LoG 2022 tutorial</a> by Petar Veličković, Andreea Deac and Andrew Dudzik.</p><blockquote><strong>What to expect in 2023: <em>1️⃣</em> </strong>Algorithmic reasoning tasks are likely to scale to graphs of thousands of nodes and to practical applications in code analysis or databases, <em>2️⃣</em> even more algorithms in the benchmark, <em>3️⃣</em> most unlikely — a model capable of solving quickselect will appear <em>😅</em></blockquote><h3>Cool GNN Applications</h3><p>👃<strong> Learning to Smell with GNNs.</strong> Back in 2019, Google AI started a <a href="https://ai.googleblog.com/2019/10/learning-to-smell-using-deep-learning.html">project</a> on learning representations of smells. From basic chemistry we know that aromaticity depends on the molecular structure, e.g., cyclic compounds.
In fact, the whole group of “aromatic hydrocarbons” was named <em>aromatic</em> because they actually have a smell (compared to many non-organic molecules). If we have a molecular structure, we can employ a GNN on top of it and learn some representations!</p><p>Recently, Google AI released <a href="https://ai.googleblog.com/2022/09/digitizing-smell-using-molecular-maps.html">a new blogpost</a> and a paper by <a href="https://www.biorxiv.org/content/10.1101/2022.07.21.500995v3">Qian et al.</a> describing the next phase of the project — the <strong>Principal Odor Map</strong>, which is able to group molecules into “odor clusters”. The authors conducted 3 cool experiments: classifying 400 new molecules never smelled before and comparing the predictions to the averaged rating of a group of human panelists; linking odor quality to fundamental biology; and probing aromatic molecules for their mosquito-repelling qualities. The GNN-based model shows very good results — now we can finally claim that GNNs can smell! Looking forward to GNNs transforming the perfume industry.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Mlteassl5G0tMz4M" /><figcaption>Embedding of odors. Source: <a href="https://ai.googleblog.com/2022/09/digitizing-smell-using-molecular-maps.html">Google AI blog</a></figcaption></figure><p>⚽<strong> GNNs + Football.</strong> If you thought that sophisticated GNNs for modelling trajectories are only used for molecular dynamics and arcane quantum simulations, fear not! Here is a cool practical application with very high potential outreach: <strong>Graph Imputer</strong> by <a href="https://www.nature.com/articles/s41598-022-12547-0.epdf?sharing_token=HmyoHCAtNdoDfjlObtCiltRgN0jAjWel9jnR3ZoTv0NzQifNnvllGA8o7uZB3n1gdCaC-3jfBQwxpTCJNR7isTeW2uWhYUL8hz8MmWvyYQLogAFNcVp5ZZuTr_O-slFsi4f4-5pz3J2Th9rSxCJV-s63f-q5fojV0FBGNWKYlRQ%3D">Omidshafiei et al.</a>, DeepMind, and FC Liverpool predicts the trajectories of football players (and the ball).
Each game graph consists of 23 nodes and gets updated with a standard message passing encoder and a special time-dependent LSTM. The dataset is quite novel, too — it consists of 105 English Premier League matches (avg 90 min each), all players and the ball were tracked at 25 fps, and the resulting training trajectory sequences encode about 9.6 seconds of gameplay.</p><p>The paper is easy to read and has numerous football illustrations; check it out! Sports tech is actively growing these days, and football analysts can now go even deeper in studying their competitors. Will EPL clubs compete for GNN researchers in the upcoming transfer windows? Time to create a transfermarkt for GNN researchers 😉</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/512/0*o3UD2Ff2J9_g0RU1" /><figcaption>Football match simulation is like molecular dynamics simulation! Source: <a href="https://twitter.com/deepmind/status/1529444212864843777?lang=en">DeepMind</a></figcaption></figure><p>🪐 <strong>Galaxies and Astrophysics. </strong>For astrophysics aficionados: <strong>Mangrove</strong> by <a href="https://arxiv.org/abs/2210.13473">Jespersen et al.</a> applies GraphSAGE to dark matter merger trees to predict a variety of galactic properties like stellar mass, cold gas mass, star formation rate, and even black hole mass. The paper is a bit heavy on astrophysics terminology but pretty easy in terms of GNN parameterization and training. Mangrove works 4–9 orders of magnitude faster than standard models. The experimental charts are pieces of art that you could hang on a wall 🖼️.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7I0fOAqTI9jDFjyk" /><figcaption>Mangrove approach to present dark matter halos as merger trees and graphs. Source: <a href="https://arxiv.org/abs/2210.13473">Jespersen et al.</a></figcaption></figure><p>🤖 <strong>GNNs for code</strong>. Code generation models like AlphaCode and Codex have mind-blowing capabilities.
Although LLMs are at the core of those models, GNNs do help in a few neat ways: <strong>Instruction Pointer Attention GNNs</strong> (IPA-GNNs), first proposed by <a href="https://arxiv.org/abs/2010.12621">Bieber et al</a>, have been used to <a href="https://arxiv.org/abs/2203.03771">predict runtime errors</a> in competitive programming tasks — so it is almost like a virtual code interpreter! <strong>CodeTrek</strong> by <a href="https://openreview.net/forum?id=WQc075jmBmf">Pashakhanloo et al.</a> proposes to model a program as a relational graph and embed it via random walks and a Transformer encoder. Downstream applications include predicting variable misuse, exceptions, and shadowed variables.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*RQGdLHQX9avnnyjV" /><figcaption>Source: <a href="https://openreview.net/forum?id=WQc075jmBmf">Pashakhanloo et al.</a></figcaption></figure><h3>Hardware: IPUs and Graphcore Win OGB Large-Scale Challenge 2022</h3><p>🥇 2022 brought huge success to <a href="https://www.graphcore.ai/">Graphcore</a> and <a href="https://www.graphcore.ai/bow-processors">IPUs</a> — the hardware optimized for the sparse operations that are so needed when working with graphs.
The first success story was optimizing Temporal Graph Nets (TGN) for IPUs with massive performance gains (check the <a href="https://towardsdatascience.com/accelerating-and-scaling-temporal-graph-networks-on-the-graphcore-ipu-c15ac309b765">article</a> in Michael Bronstein’s blog).</p><p><a href="https://towardsdatascience.com/accelerating-and-scaling-temporal-graph-networks-on-the-graphcore-ipu-c15ac309b765">Accelerating and scaling Temporal Graph Networks on the Graphcore IPU</a></p><p>Later on, Graphcore <a href="https://www.graphcore.ai/posts/graphcore-claims-double-win-in-open-graph-benchmark-challenge">stormed the leaderboards</a> of OGB LSC’22 by winning 2 out of 3 tracks: link prediction on the <strong>WikiKG90M v2</strong> knowledge graph and graph regression on the <strong>PCQM4M v2</strong> molecular dataset. In addition to the sheer compute power, the authors made several clever modeling decisions: for link prediction it was <a href="https://arxiv.org/abs/2211.12281">Balanced Entity Sampling and Sharing (BESS)</a> for training an ensemble of shallow LP models (check the <a href="https://towardsdatascience.com/large-scale-knowledge-graph-completion-on-ipu-4cf386dfa826">blog post</a> by Daniel Justus for more details), and GPS++ for the graph regression task (we covered GPS++ above in the GT section). You can <a href="https://ipu.dev/3FwVoLD">try out</a> the pre-trained models using IPU-powered virtual machines on Paperspace. Congratulations to Graphcore and their team! 👏</p><p>PyG partnered with NVIDIA (<a href="https://pyg.org/ns-newsarticle-accelerating-pyg-on-nvidia-gpus">post</a>) and Intel (<a href="https://pyg.org/news/accelerating-pyg-on-intel-cpus">post</a>) to increase the performance of core operations on GPUs and CPUs, respectively. Similarly, DGL <a href="https://www.dgl.ai/release/2022/07/25/release.html">incorporated</a> new GPU optimizations in the recent 0.9 version.
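</p><p>To make concrete the kind of operation these releases speed up: one round of GNN neighborhood aggregation is essentially a sparse-dense matrix product. A toy sketch of ours (SciPy standing in for the libraries’ fused GPU/CPU kernels; the graph and feature sizes are made up):</p>

```python
import numpy as np
from scipy.sparse import coo_matrix

# Toy graph: 4 nodes, directed edges src -> dst, unit edge weights.
src = np.array([0, 1, 2, 3, 0])
dst = np.array([1, 2, 3, 0, 2])
n = 4
A = coo_matrix((np.ones_like(src, dtype=float), (dst, src)), shape=(n, n))

X = np.arange(n * 3, dtype=float).reshape(n, 3)  # node features, d = 3

# Sum-aggregation over in-neighbors: H = A @ X. GNN libraries dispatch
# exactly this sparse matmul to their optimized backends.
H = A @ X
print(H.shape)  # (4, 3)
```

<p>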
These bring massive gains for sparse matmuls and sampling procedures, so we’d encourage you to update your environments to the most recent versions!</p><blockquote><strong>What to expect in 2023</strong>: major GNN libraries are likely to increase the breadth of supported hardware backends such as IPUs or upcoming Intel Max Series GPUs.</blockquote><h3>New Conferences: Learning on Graphs (LoG) and Molecular ML (MoML)</h3><p>This year we witnessed the inauguration of two graph and geometric ML conferences: the <a href="https://logconference.org/#hero">Learning on Graphs Conference (LoG)</a> and the <a href="https://www.moml22.mit.edu/">Molecular ML Conference</a> (MoML).</p><p>LoG is a more general all-around GraphML venue (held virtually this year) while MoML (held at MIT) has a broader mission and influence over the AI4Science community, where graphs and geometry still play a major role. Both conferences were received extremely well. MoML attracted 7 top speakers and 38 posters; LoG had ~3000 registrations, 266 submissions, 71 posters, 12 orals, and 7 awesome tutorials (all recordings of oral talks and tutorials are <a href="https://www.youtube.com/@learningongraphs">already on YouTube</a>). Besides, LoG introduced a great monetary incentive for reviewers, resulting in a well-recognized improvement of the review quality!
From our point of view, the quality of LoG reviews was often better than at NeurIPS or ICML.</p><p>This is a huge win and a celebration for the graph ML community; congrats to everyone working in graph and geometric machine learning on a new “home” venue!</p><blockquote><strong>What to expect in 2023:</strong> LoG and MoML become the main Graph ML venues to include in your submission calendar along with ICLR / NeurIPS / ICML</blockquote><h3>Courses and Educational Materials</h3><ul><li>Geometric Deep Learning Course — <a href="https://www.youtube.com/playlist?list=PLn2-dEmQeTfSLXW8yXP4q_Ii58wFdxb3C">Second Edition</a> (2022) is already on YouTube. The main entry point to the field.</li><li><a href="https://uvagedl.github.io/">An Introduction to Group Equivariant Deep Learning</a> by Erik Bekkers — one of the best new courses about equivariance and equivariant models!</li><li><a href="https://cats.for.ai/">Cats4AI</a> — a new course by Andrew Dudzik, Bruno Gavranović, João Guilherme Araújo, Petar Veličković, and Pim de Haan is the best place to learn about category theory and its connections to Geometric DL.</li><li>Summer School proceedings: <a href="https://www.sci.unich.it/geodeep2022/#home">Italian Summer School on Geometric DL</a>, London Geometry and Machine Learning (<a href="https://www.logml.ai/home-2022">LOGML</a>) Summer School, <a href="https://www.birs.ca/events/2022/5-day-workshops/22w5125">BIRS Workshop on Topological Representation Learning</a>.</li><li><a href="https://snap.stanford.edu/graphlearning-workshop-2022/">Stanford Graph Learning Workshop 2022</a> — latest news from PyG developers and partners and Stanford researchers.</li></ul><h3>New Datasets, Benchmarks, and Challenges</h3><ul><li><a href="https://ogb.stanford.edu/neurips2022/">OGB Large-Scale Challenge 2022</a>: the second large-scale challenge, held at NeurIPS 2022, with large and realistic graph ML tasks covering node-, edge-, and graph-level predictions.</li><li><a
href="https://opencatalystproject.org/challenge.html">Open Catalyst 2022 Challenge</a>: the second edition of the challenge, held at NeurIPS 2022, with the task of designing new machine learning models to predict the outcome of catalyst simulations used to understand catalytic activity</li><li><a href="https://predictioncenter.org/casp15/index.cgi">CASP 15</a>: the protein structure prediction challenge disrupted by AlphaFold a few years ago at CASP 14. Detailed analysis is yet to come, but it seems that MSAs strike back and the best-performing models still rely on MSAs.</li><li><a href="https://arxiv.org/abs/2206.08164">Long Range Graph Benchmark</a>: for measuring the capabilities of GNNs and GTs to capture long-range interactions in graphs.</li><li><a href="https://arxiv.org/abs/2206.07729">Taxonomy of Graph Benchmarks</a>, <a href="https://github.com/Graph-Learning-Benchmarks/gli">Graph Learning Indexer</a>: deeper studies of the dataset landscape in Graph ML outlining open challenges in benchmarking and trustworthiness of results.</li><li><a href="https://ai.googleblog.com/2022/05/graphworld-advances-in-graph.html">GraphWorld</a>: a framework for analyzing the performance of GNN architectures on millions of synthetic benchmark datasets</li><li><a href="https://openreview.net/forum?id=10iA3OowAV3">Chartalist</a> — a collection of blockchain graph datasets</li><li><a href="https://github.com/DeepGraphLearning/PEER_Benchmark">PEER protein learning benchmark</a>: a multi-task benchmark for protein sequence understanding with 17 tasks across 5 categories.</li><li><a href="https://esmatlas.com/">ESM Metagenomic Atlas</a>: a comprehensive database of over 600 million predicted protein structures with nice visualizations and a search UI.</li></ul><h3>Software Libraries and Open Source</h3><ul><li>Mainstream graph ML libraries: <a href="https://www.pyg.org/">PyG 2.2</a> (PyTorch), <a href="https://www.dgl.ai/">DGL 0.9</a> (PyTorch, TensorFlow, MXNet), <a
href="https://github.com/tensorflow/gnn">TF GNN</a> (TensorFlow) and <a href="https://github.com/deepmind/jraph">Jraph</a> (Jax)</li><li><a href="https://torchdrug.ai/">TorchDrug</a> and <a href="https://torchprotein.ai/">TorchProtein</a>: machine learning libraries for drug discovery and protein science</li><li><a href="https://github.com/pykeen/pykeen">PyKEEN</a>: the best platform for training and evaluating knowledge graph embeddings</li><li><a href="https://graphein.ai/">Graphein</a>: a package providing a variety of graph-based representations of proteins</li><li><a href="https://github.com/AnacletoLAB/grape">GRAPE</a> and <a href="https://marius-project.org/">Marius</a>: scalable graph processing and embedding libraries for billion-scale graphs</li><li><a href="https://github.com/IntelLabs/matsciml">MatSci ML Toolkit</a>: a flexible framework for deep learning on the Open Catalyst dataset</li><li><a href="https://github.com/e3nn/e3nn">E3nn</a>: the go-to library for E(3) equivariant neural networks</li></ul><h3>Join the Community</h3><ul><li>Reading Groups: <a href="https://m2d2.io/talks/log2/about/">Learning on Graphs and Geometry</a> (LOG2) reading group, <a href="https://m2d2.io/talks/m2d2/about/">Molecular Modeling &amp; Drug Discovery</a> (M2D2) reading group, and their Slack communities</li><li>Learning on Graphs (LoG) <a href="https://logconference.org/">Slack community</a></li><li><a href="https://michael-bronstein.medium.com/">Michael Bronstein’s blog on Medium</a></li><li><a href="https://medium.com/@pytorch_geometric">PyG medium</a>, <a href="https://pyg.org/blogs-and-tutorials">blog posts</a>, and newsletter</li><li><a href="https://t.me/graphML">GraphML Telegram channel</a></li></ul><h3>The Meme of 2022 🪓</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/462/0*p4pvs3dlrcsd7MOQ" /><figcaption>Created by Michael Galkin and Michael Bronstein</figcaption></figure><img
src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=1ba920cb9232" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/graph-ml-in-2023-the-state-of-affairs-1ba920cb9232">Graph ML in 2023: The State of Affairs</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Denoising Diffusion Generative Models in Graph ML]]></title>
            <link>https://medium.com/data-science/denoising-diffusion-generative-models-in-graph-ml-c496af5811c5?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/c496af5811c5</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[editors-pick]]></category>
            <category><![CDATA[geometric-deep-learning]]></category>
            <category><![CDATA[drug-discovery]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Sat, 26 Nov 2022 21:42:17 GMT</pubDate>
            <atom:updated>2022-11-28T14:21:05.058Z</atom:updated>
<content:encoded><![CDATA[<h4>What’s new in Graph ML?</h4><h4>Is Denoising Diffusion all you need?</h4><p>The breakthrough in <a href="https://arxiv.org/abs/2006.11239">Denoising Diffusion Probabilistic Models</a> (DDPM) happened about 2 years ago. Since then, we have observed dramatic improvements in generation tasks: <a href="https://arxiv.org/abs/2112.10741">GLIDE</a>, <a href="https://openai.com/dall-e-2/">DALL-E 2</a>, <a href="https://gweb-research-imagen.appspot.com/paper.pdf">Imagen</a>, <a href="https://github.com/Stability-AI/stablediffusion">Stable Diffusion</a> for images, <a href="https://arxiv.org/pdf/2205.14217.pdf">Diffusion-LM</a> in language modeling, diffusion for <a href="https://arxiv.org/pdf/2205.09853.pdf">video sequences</a>, and even <a href="https://arxiv.org/pdf/2205.09991.pdf">diffusion for reinforcement learning</a>.</p><p>Diffusion might be the biggest trend in GraphML in 2022 — particularly when applied to drug discovery, molecule and conformer generation, and quantum chemistry in general. Often, diffusion models are paired with the latest advancements in equivariant GNNs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*mS6STisUN8L_xdRyvJy7CA.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*KmLTqvHwaWMWs5rkEOrjUg.png" /><figcaption>Molecule generation. Generated with <a href="https://huggingface.co/spaces/stabilityai/stable-diffusion">Stable Diffusion 2</a></figcaption></figure><h3>The Basics: Diffusion and Diffusion on Graphs</h3><p>Let’s recapitulate the basics of diffusion models with the example of the Equivariant Diffusion paper by <a href="https://arxiv.org/abs/2203.17003">Hoogeboom et al.</a>, using as few equations as possible 😅</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1jVaq8gtLlu8s9Qe" /><figcaption>Forward and backward diffusion processes. Forward process q(z|x,h) gradually adds noise to the graph until it becomes Gaussian noise.
Backward process p(x,h|z) starts from the Gaussian noise and gradually denoises the graph until it becomes a valid graph. Source: <a href="https://arxiv.org/pdf/2203.17003.pdf"><strong>Hoogeboom, Satorras, Vignac, and Welling</strong></a>.</figcaption></figure><ul><li>Input: a graph (<em>N,E</em>) with <em>N</em> nodes and <em>E</em> edges</li><li>Node features often have two parts: <em>z=[x,h]</em> where <em>x</em> ∈ R³ are 3D coordinates and <em>h</em> ∈ R^d are categorical features like atom types</li><li>(Optional) Edge features are bond types</li><li>Output: a graph (<em>N,E</em>) with nodes, edges, and corresponding features</li><li><strong>Forward diffusion</strong> process <em>q(z_t | x,h)</em>: at each time step <em>t</em>, inject noise into the features so that at the final step <em>T</em> they become white noise</li><li><strong>Reverse diffusion</strong> process <em>p(z_{t-1} | z_t)</em>: at each time step <em>t-1,</em> ask the model to predict the noise and <em>“subtract”</em> it from the input such that at the final step <em>t=0</em> we have a new valid generated graph</li><li>A <strong>denoising</strong> neural network learns to predict the injected noise</li><li>Denoising diffusion is known to be equivalent to <em>score matching</em> [<a href="https://arxiv.org/abs/1907.05600"><strong>Song and Ermon (2019</strong></a><strong>)</strong> and <a href="https://arxiv.org/abs/2011.13456"><strong>Song et al. (2021</strong></a><strong>)</strong>] where a neural network learns to predict the score <em>∇_x log p_t(x)</em> of the diffused data. The score-based perspective describes the forward/reverse processes with <a href="https://en.wikipedia.org/wiki/Stochastic_differential_equation">Stochastic Differential Equations</a> (SDEs) driven by the <a href="https://en.wikipedia.org/wiki/Wiener_process">Wiener process</a></li></ul><blockquote>Emiel Hoogeboom, Victor Garcia Satorras, Clément Vignac, Max Welling.
<a href="https://arxiv.org/pdf/2203.17003.pdf">Equivariant Diffusion for Molecule Generation in 3D</a>. ICML 2022. <a href="https://github.com/ehoogeboom/e3_diffusion_for_molecules">GitHub</a></blockquote><p>The work introduces an equivariant diffusion model (<strong>EDM</strong>) for molecule generation that has to maintain E(3) equivariance over atom coordinates <em>x</em> (with respect to <em>rotation</em>, <em>translation</em>, and <em>reflection</em>) while node features <em>h</em> (such as atom types) remain invariant. Importantly, atoms have different feature modalities: atom charge is an ordinal integer, atom types are one-hot categorical features, and atom coordinates are continuous features, so the authors design feature-specific noising processes and loss functions, and scale input features for training stability.</p><p>EDM employs an equivariant <a href="https://arxiv.org/pdf/2102.09844.pdf">E(n) GNN</a> as a neural network that predicts noise based on input features and time step. At inference time, we first sample the desired number of atoms <em>M</em>, then we can condition EDM on a desired property <em>c</em>, and ask EDM to generate molecules (defined by features <em>x</em> and <em>h</em>) as <em>x, h ~ p(x,h | c, M)</em>.</p><p>Experimentally, EDM outperforms normalizing flow- and VAE-based approaches by a large margin in terms of achieved negative log-likelihood, molecule stability, and uniqueness. Ablations demonstrate that an equivariant GNN encoder is crucial, as replacing it with a standard MPNN leads to significant performance drops.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/0*fYHw-wE0C9ZKWgdL.gif" /><figcaption>Diffusion-based generation visualization. Source: <a href="https://twitter.com/emiel_hoogeboom/status/1509838163375706112">Twitter</a></figcaption></figure><h3>DiGress: Diffusion for Graph Generation</h3><blockquote>Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, Pascal Frossard.
<a href="https://arxiv.org/abs/2209.14734">DiGress: Discrete Denoising diffusion for graph generation</a>. <a href="https://github.com/cvignac/DiGress">GitHub</a></blockquote><p><a href="https://arxiv.org/abs/2209.14734">DiGress</a> by Clément Vignac, Igor Krawczuk, and the EPFL team is an unconditional <strong>graph generation</strong> model (with the option to incorporate a score-based function for conditioning on graph-level features like energy MAE). DiGress is a discrete diffusion model, that is, it operates on discrete node types (like atom types C, N, O) and edge types (like single / double / triple bond) where adding noise to a graph corresponds to multiplication with a transition matrix (from one type to another) whose entries are estimated from marginal probabilities in the training set. The denoising neural net is a modified Graph Transformer. DiGress works for many graph families (planar, SBMs, and molecules); <a href="https://github.com/cvignac/DiGress">code</a> is available, and check the <a href="https://www.youtube.com/watch?v=k2saMtP-Fn8">video</a> from the LoGaG reading group presentation!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kBtwYwodtGNjOah6Pf8cGA.png" /><figcaption>DiGress diffusion process. Source: <a href="https://arxiv.org/pdf/2209.14734.pdf"><strong>Vignac, Krawczuk, et al.</strong></a></figcaption></figure><h3>GeoDiff and Torsional Diffusion: Molecular Conformer Generation</h3><p>Given a molecule with 3D coordinates of its atoms, <strong>conformer generation</strong> is the task of generating another set of <strong>valid</strong> 3D coordinates in which the molecule can exist. Recently, GeoDiff and Torsional Diffusion have applied the diffusion framework to this task.</p><blockquote>Minkai Xu, Lantao Yu, Yang Song, Chence Shi, Stefano Ermon, Jian Tang. <a href="https://arxiv.org/abs/2203.02923">GeoDiff: a Geometric Diffusion Model for Molecular Conformation Generation</a>. ICLR 2022.
<a href="https://github.com/MinkaiXu/GeoDiff">GitHub</a></blockquote><p><a href="https://arxiv.org/abs/2203.02923">GeoDiff</a> is the SE(3)-equivariant diffusion model for generating conformers of given molecules. Diffusion is applied to 3D coordinates that gradually get transformed to Gaussian noise (forward process). The reverse process denoises a random sample to a valid set of atomic coordinates. GeoDiff defines an equivariant diffusion framework in the Euclidean space (that postulates which kind of noise can be added) and applies an equivariant GNN as the denoising model. The denoising GNN, a <em>Graph Field Network</em>, is an extension of rather standard EGNNs<em>. </em>For the first time, GeoDiff showed how<em> much better</em> the diffusion models are compared to normalizing flows and VAE-based models 💪</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*YEDDRX_VzDaG5u3gKa6JEA.png" /><figcaption>GeoDiff. Source: <a href="https://arxiv.org/pdf/2203.02923.pdf"><strong>Xu et al.</strong></a></figcaption></figure><blockquote>Bowen Jing, Gabriele Corso, Jeffrey Chang, Regina Barzilay, Tommi Jaakkola. <a href="https://arxiv.org/abs/2206.01729">Torsional Diffusion for Molecular Conformer Generation</a>. NeurIPS 2022. <a href="https://github.com/gcorso/torsional-diffusion">GitHub</a></blockquote><p>While GeoDiff diffuses 3D coordinates of atoms in the Euclidean space, <a href="https://arxiv.org/pdf/2206.01729.pdf">Torsional Diffusion</a> proposes a neat way to perturb torsion angles in freely rotatable bonds of molecules. Since the number of such rotatable bonds is always much smaller than the number of atoms (on average in GEOM-DRUGS, 44 atoms vs 8 torsion angles per molecule), generation can potentially be much faster. 
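</p><p>As a toy sketch of ours (not the paper’s code): a perturbation step then only needs to touch the handful of torsion angles, and, since angles are periodic, the noised values must be wrapped back onto the circle:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_torsions(angles, sigma):
    """Add Gaussian noise to torsion angles and wrap to [-pi, pi).

    Toy stand-in for one perturbation step: the actual Torsional
    Diffusion kernel is the wrapped normal distribution, whose density
    sums the Gaussian over all 2*pi shifts; sampling it amounts to
    adding Gaussian noise and wrapping, as below.
    """
    noised = angles + sigma * rng.standard_normal(angles.shape)
    return (noised + np.pi) % (2 * np.pi) - np.pi

torsions = np.array([3.0, -3.0, 0.5])  # ~8 angles per GEOM-DRUGS molecule
print(perturb_torsions(torsions, sigma=0.1))
```

<p>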
The tricky part is that torsion angles do not form a Euclidean space, but rather a <a href="https://en.wikipedia.org/wiki/Torus">hypertorus</a> (a donut 🍩), so adding some Gaussian noise to coordinates won’t work — instead, the authors design a novel perturbation kernel as the <em>wrapped normal distribution </em>(a normal distribution on the real line wrapped modulo <em>2π</em>)<em>.</em> Torsional Diffusion applies the score-based perspective to training and generation where the score model has to be SE(3)-<strong>invariant</strong> and sign-<strong>equivariant</strong>. The score model is a variation of the <a href="https://arxiv.org/abs/1802.08219">Tensor Field Network</a>.</p><p>Experimentally, Torsional Diffusion indeed works faster — it only needs 5–20 steps compared to 5000 steps of GeoDiff, and is currently the SOTA in conformer generation 🚀</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yP6I3nWbI8dunn_s3ecVrQ.png" /><figcaption>Torsional Diffusion. Source: <a href="https://arxiv.org/pdf/2206.01729.pdf"><strong>Jing, Corso, et al.</strong></a></figcaption></figure><h3>DiffDock: Diffusion for Molecular Docking</h3><blockquote>Gabriele Corso, Hannes Stärk, Bowen Jing, Regina Barzilay, Tommi Jaakkola. <a href="https://arxiv.org/abs/2210.01776">DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking</a>. <a href="https://github.com/gcorso/DiffDock">GitHub</a></blockquote><p><a href="https://arxiv.org/abs/2210.01776">DiffDock</a> is a score-based generative model for <strong>molecular docking</strong>, i.e., given a ligand and a protein, <strong>predicting how the ligand binds to the target protein</strong>.
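</p><p>Two of the degrees of freedom of a docking pose, rotating and translating the ligand as a rigid body, are easy to picture (a toy SciPy sketch of ours, not DiffDock’s code; sizes are made up):</p>

```python
import numpy as np
from scipy.spatial.transform import Rotation

rng = np.random.default_rng(0)

ligand = rng.standard_normal((12, 3))  # toy ligand: 12 atoms in 3D

# One "pose move": a random rotation (SO(3)) plus a translation (T(3)).
R = Rotation.random(random_state=0)
t = rng.standard_normal(3)
moved = R.apply(ligand) + t

# Rigid moves change the pose but preserve the internal geometry:
d0 = np.linalg.norm(ligand[:, None] - ligand[None, :], axis=-1)
d1 = np.linalg.norm(moved[:, None] - moved[None, :], axis=-1)
print(np.allclose(d0, d1))  # True
```

<p>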
DiffDock runs the diffusion process over translations T(3), rotations SO(3), and torsion angles SO(2)^m in the product space: (1) positioning the ligand with respect to the protein (the binding pocket; since the pocket is unknown in advance, this is <em>blind docking</em>); (2) defining the rotational orientation of the ligand; and (3) defining the torsion angles of the conformation (see Torsional Diffusion above for reference).</p><p>DiffDock trains two models: the score model for predicting actual coordinates and the confidence model for estimating the likelihood of the generated prediction. Both models are SE(3)-equivariant networks over point clouds, but the bigger score model (in terms of parameter count) works on protein residues from alpha-carbons (initialized from the <a href="https://github.com/facebookresearch/esm">now-famous ESM2</a> protein LM) while the confidence model uses fine-grained atom representations. Initial ligand structures are generated by RDKit. DiffDock dramatically improves prediction quality, and you can even upload your own proteins (PDB) and ligands (SMILES) in the <a href="https://huggingface.co/spaces/simonduerr/diffdock">online demo on HuggingFace spaces</a> to test it out!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*V3IMIRAFgNeaYj5ZQeKlOg.png" /><figcaption>DiffDock intuition. Source: <a href="https://arxiv.org/pdf/2210.01776.pdf"><strong>Corso, Stärk, Jing, et al.</strong></a></figcaption></figure><h3>DiffSBDD: Diffusion for Generating Novel Ligands</h3><blockquote>Arne Schneuing, Yuanqi Du, Charles Harris, Arian Jamasb, Ilia Igashov, Weitao Du, Tom Blundell, Pietro Lió, Carla Gomes, Max Welling, Michael Bronstein, Bruno Correia. <a href="https://arxiv.org/abs/2210.13695">Structure-based Drug Design with Equivariant Diffusion Models</a>.
<a href="https://github.com/arneschneuing/DiffSBDD">GitHub</a></blockquote><p><a href="https://arxiv.org/abs/2210.13695">DiffSBDD</a> is a diffusion model for <strong>generating novel ligands conditioned on the protein pocket.</strong> DiffSBDD can be implemented in two ways: (1) pocket-conditioned ligand generation when the pocket is fixed; (2) inpainting-like generation that approximates the joint distribution of pocket-ligand pairs. In both approaches, DiffSBDD relies on a tuned equivariant diffusion model (<a href="https://towardsdatascience.com/graph-machine-learning-icml-2022-252f39865c70#7cf5">EDM, ICML 2022</a>) with an equivariant <a href="https://arxiv.org/pdf/2102.09844.pdf">EGNN</a> as the denoising model. Practically, ligands and proteins are represented as point clouds with categorical features and 3D coordinates (proteins can be represented as alpha-carbon residues or full atoms, with one-hot encoding of residues; ESM2 could be used here in the future), so diffusion is performed over the 3D coordinates, ensuring equivariance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lVTpyuid7EKF9mScNOZg2w.png" /><figcaption>DiffSBDD. Source: <a href="https://arxiv.org/pdf/2210.13695.pdf"><strong>Schneuing, Du, et al.</strong></a></figcaption></figure><h3>DiffLinker: Diffusion for Generating Molecular Linkers</h3><blockquote>Ilya Igashov, Hannes Stärk, Clément Vignac, Victor Garcia Satorras, Pascal Frossard, Max Welling, Michael Bronstein, Bruno Correia. <a href="https://arxiv.org/abs/2210.05274">Equivariant 3D-Conditional Diffusion Models for Molecular Linker Design</a>. <a href="https://github.com/igashov/DiffLinker">GitHub</a></blockquote><p><a href="https://arxiv.org/abs/2210.05274">DiffLinker</a> is a diffusion model for <strong>generating molecular linkers</strong> conditioned on 3D fragments.
While previous models are autoregressive (hence not permutation equivariant) and can only link two fragments, DiffLinker generates the whole structure and can link 2+ fragments. In DiffLinker, each point cloud is conditioned on the context (all other known fragments and/or the protein pocket); the context is usually fixed. The diffusion framework is similar to EDM but is now conditioned on the 3D data rather than on scalars. The denoising model is the same equivariant EGNN. Interestingly, DiffLinker has an additional module to predict the linker size (the number of atoms) so you don’t have to specify it beforehand.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*kMXcUamd7tcICNypQpkVwQ.png" /><figcaption>DiffLinker. Source: <a href="https://arxiv.org/pdf/2210.05274.pdf"><strong>Igashov et al.</strong></a></figcaption></figure><h3><strong>Learn More</strong></h3><ul><li><a href="https://arxiv.org/abs/2206.04119">SMCDiff</a> for generating protein scaffolds conditioned on the desired motif (also with EGNN).</li><li>Generally, in graph and molecule generation we’d like to support some discreteness, so any improvements to discrete diffusion are very welcome, e.g., <a href="https://arxiv.org/abs/2210.14784">Richemond, Dieleman, and Doucet propose</a> a new simplex diffusion for categorical data with the Cox-Ingersoll-Ross SDE (rare find!).</li><li>Discrete diffusion is also studied for text generation in the recent <a href="https://arxiv.org/abs/2210.16886">DiffusER</a>.</li><li>Hugging Face maintains the 🧨 <a href="https://github.com/huggingface/diffusers">Diffusers</a> library and has started the <a href="https://github.com/huggingface/diffusion-models-class">open course on Diffusion Models</a> — check them out for practical implementation tips</li><li>Check the recordings of the <a href="https://cvpr2022-tutorial-diffusion-models.github.io/">CVPR 2022 tutorial on diffusion models</a> by Karsten Kreis, Ruiqi Gao, and Arash Vahdat</li></ul><p>We’ll spare your
browser tabs for now 📚 but do expect more diffusion models in Geometric DL!</p><p><em>A special thanks goes to </em><a href="https://hannes-stark.com/"><em>Hannes Stärk</em></a><em> and </em><a href="https://rampasek.github.io/"><em>Ladislav Rampášek</em></a><em> for proofreading the post! Follow </em><a href="https://twitter.com/HannesStaerk"><em>Hannes</em></a><em>, </em><a href="https://twitter.com/rampasek"><em>Ladislav</em></a><em>, and </em><a href="https://twitter.com/michael_galkin"><em>me</em></a><em> on Twitter, or subscribe to the </em><a href="https://t.me/graphML"><em>GraphML</em></a><em> channel in Telegram.</em></p><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*QEHR5PGpu0LWK-jFuyiQYQ.png" /></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/768/1*Y8eeMA5aES2mSEAdN5fg6Q.png" /><figcaption>Molecule generation. Generated with <a href="https://huggingface.co/spaces/stabilityai/stable-diffusion">Stable Diffusion 2</a></figcaption></figure><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c496af5811c5" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/denoising-diffusion-generative-models-in-graph-ml-c496af5811c5">Denoising Diffusion Generative Models in Graph ML</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Graph Machine Learning @ ICML 2022]]></title>
            <link>https://medium.com/data-science/graph-machine-learning-icml-2022-252f39865c70?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/252f39865c70</guid>
            <category><![CDATA[machine-learning]]></category>
            <category><![CDATA[deep-dives]]></category>
            <category><![CDATA[graph]]></category>
            <category><![CDATA[knowledge-graph]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Mon, 25 Jul 2022 06:40:43 GMT</pubDate>
            <atom:updated>2022-07-25T13:42:42.505Z</atom:updated>
            <content:encoded><![CDATA[<h4>What’s New in GraphML?</h4><h4>Recent advancements and hot trends, July 2022 edition</h4><p><a href="https://icml.cc/Conferences/2022/">International Conference on Machine Learning (ICML)</a> is one of the premier venues where researchers publish their best work. ICML 2022 was packed with hundreds of papers and <a href="https://icml.cc/Conferences/2022/Schedule?type=Workshop">numerous workshops</a> dedicated to graphs. We share the overview of the hottest research areas 🔥 in Graph ML.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*WM79_JkL45YRuDGiJ4S7pQ.png" /></figure><p><em>This post was written by </em><a href="https://twitter.com/michael_galkin"><em>Michael Galkin</em></a><em> (Mila) and </em><a href="https://twitter.com/zhu_zhaocheng"><em>Zhaocheng Zhu</em></a><em> (Mila).</em></p><p>We did our best to highlight the major advances in Graph ML at ICML and cover 2–4 papers per topic. Still, due to the sheer volume of accepted papers, we might have missed some works - let us know in comments or on social media.</p><h3>Table of Contents (clickable):</h3><ol><li><a href="#7cf5">Generation: Denoising Diffusion Is All You Need</a></li><li><a href="#b96e">Graph Transformers</a></li><li><a href="#165d">Theory and Expressive GNNs</a></li><li><a href="#5145">Spectral GNNs</a></li><li><a href="#be73">Explainable GNNs</a></li><li><a href="#fe10">Graph Augmentation: Beyond Edge Dropout</a></li><li><a href="#2f59">Algorithmic Reasoning and Graph Algorithms</a></li><li><a href="#bbee">Knowledge Graph Reasoning</a></li><li><a href="#774c">Computational Biology: Molecular Linking, Protein Binding, Property Prediction</a></li><li><a href="#f4b7">Cool Graph Applications</a></li></ol><h3>Generation: Denoising Diffusion Is All You Need</h3><p><strong>Denoising diffusion probabilistic models</strong> (<a href="https://arxiv.org/abs/2006.11239">DDPMs</a>) are taking over the field of Deep Learning in 2022 in pretty 
much all domains with stunning generation quality and better theoretical properties than GANs and VAEs, e.g., in image generation (<a href="https://arxiv.org/abs/2112.10741">GLIDE</a>, <a href="https://openai.com/dall-e-2/">DALL-E 2</a>, <a href="https://gweb-research-imagen.appspot.com/paper.pdf">Imagen</a>), <a href="https://arxiv.org/pdf/2205.09853.pdf">video generation</a>, text generation (<a href="https://arxiv.org/pdf/2205.14217.pdf">Diffusion-LM</a>), and even <a href="https://arxiv.org/pdf/2205.09991.pdf">diffusion for reinforcement learning</a>. Conceptually, diffusion models gradually add noise to an input object (until it becomes pure Gaussian noise) and learn to predict the added level of noise such that we can subtract it from the object (denoise).</p><p>Diffusion might be <strong>the biggest trend</strong> in GraphML in 2022 — particularly when applied to drug discovery, molecule and conformer generation, and quantum chemistry in general. Often, diffusion models are paired with the latest advancements in equivariant GNNs. ICML features several cool implementations of denoising diffusion for graph generation.</p><p>➡️ In “<a href="https://arxiv.org/pdf/2203.17003.pdf"><em>Equivariant Diffusion for Molecule Generation in 3D</em></a><em>”</em> by <strong>Hoogeboom, Satorras, Vignac, and Welling</strong>, the authors define an equivariant diffusion model (<strong>EDM</strong>) for molecule generation that has to maintain E(3) equivariance over atom coordinates <em>x</em> (with respect to <em>rotation</em>, <em>translation</em>, and <em>reflection</em>) and invariance to group transformations over node features <em>h</em>. Importantly, molecules have different feature modalities: atom charge is an ordinal integer, atom types are one-hot categorical features, and atom coordinates are continuous features, so, for instance, you can’t just add Gaussian noise to one-hot features and expect the model to work.
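</p><p>For reference, the vanilla DDPM forward process on continuous features has a one-line closed form (a generic sketch of ours with an assumed linear β schedule, not the paper’s exact parameterization):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative signal fraction

def q_sample(x0, t):
    """Closed-form forward noising: x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) I)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps, eps

x0 = rng.standard_normal((5, 3))        # e.g. 5 atoms, 3D coordinates
x_t, eps = q_sample(x0, t=999)          # at t = T-1, x_t is almost pure noise
print(np.sqrt(alpha_bar[999]))          # tiny: the signal is almost gone
```

<p>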
Instead, the authors design feature-specific noising processes and loss functions, and scale input features for training stability.</p><p>EDM employs a <a href="https://arxiv.org/pdf/2102.09844.pdf">state-of-the-art E(n) GNN</a> as a neural network that predicts noise based on input features and time step. At inference time, we first sample the desired number of atoms <em>M</em>, then we can condition EDM on a desired property <em>c</em>, and ask EDM to generate molecules (defined by features <em>x</em> and <em>h</em>) as <em>x, h ~ p(x,h | c, M)</em>.</p><p>Experimentally, EDM outperforms normalizing flow- and VAE-based approaches by a large margin in terms of achieved negative log-likelihood, molecule stability, and uniqueness. Ablations demonstrate that an equivariant GNN encoder is crucial as replacing it with a standard MPNN leads to significant performance drops. Code is already <a href="https://github.com/ehoogeboom/e3_diffusion_for_molecules">available on GitHub</a>, try it!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*9boZOQmoy-tN-hBI" /><figcaption>Forward and backward diffusion. Source: <a href="https://arxiv.org/pdf/2203.17003.pdf">Hoogeboom, Satorras, Vignac, and Welling</a>.</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/600/1*OXFrmspWBs0EBJRRuUg8GQ.gif" /><figcaption>Diffusion-based generation visualization. Source: <a href="https://twitter.com/emiel_hoogeboom/status/1509838163375706112">Twitter</a></figcaption></figure><p>➡️ For 2D graphs, <a href="https://arxiv.org/pdf/2202.02514.pdf">Jo, Lee, and Hwang</a> propose <strong>Graph Diffusion via the System of Stochastic Differential Equations</strong> (<strong>GDSS</strong>). While the previous EDM is an instance of denoising diffusion probabilistic model (DDPM), <strong>GDSS</strong> belongs to a sister branch of DDPMs, namely, <strong>score-based models</strong>. 
In fact, it was <a href="https://openreview.net/pdf?id=PxTIG12RRHS">recently shown (ICLR’21)</a> that DDPMs and score-based models can be unified into the same framework if we describe the forward diffusion process with stochastic differential equations (SDEs).</p><p>SDEs allow modeling diffusion in continuous time as a <a href="https://en.wikipedia.org/wiki/Wiener_process">Wiener process</a> (for simplicity, a fancy term for the process of adding noise), while DDPMs usually discretize it into 1000 steps (with a learnable time embedding); SDEs, however, require specific solvers. Compared to previous score-based graph generators, <strong>GDSS</strong> takes as input (and predicts) both adjacency <em>A</em> and node features <em>X</em>. The forward and backward diffusion expressed as SDEs require computing <em>scores </em>— here, gradients of the joint log-density of (X, A). For obtaining those scores, we need a <em>score-based model</em>, and here the authors use a <a href="https://openreview.net/pdf?id=JHcqXGaqiGn">GNN with attention pooling</a> (graph multi-head attention).</p><p>At training time, we solve a <strong>forward SDE</strong> and train a score model, while at inference we use the trained score model and solve the <strong>reverse-time SDE</strong>. Usually, you’d employ something like <a href="https://en.wikipedia.org/wiki/Langevin_dynamics">Langevin dynamics</a> here, e.g., Langevin MCMC, but higher-order <a href="https://en.wikipedia.org/wiki/Runge%E2%80%93Kutta_methods">Runge-Kutta</a> solvers should, in principle, work here, too. Experimentally, GDSS outperforms autoregressive generative models and one-shot VAEs by a large margin in 2D graph generation tasks, although sampling speed might still be a bit of a bottleneck due to integrating reverse-time SDEs.
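</p><p>To make the reverse-time SDE concrete, here is a toy Euler-Maruyama sampler for the VP-SDE. This is a hand-rolled sketch that uses the analytic score of a standard Gaussian; real models plug a learned GNN score into <em>score_fn</em>:</p>

```python
import numpy as np

def reverse_sde_sample(score_fn, shape, n_steps=500, rng=None):
    """Euler-Maruyama integration of the reverse-time VP-SDE from t=1 down to t=0.
    score_fn(x, t) approximates the score (gradient of the log-density) of the
    noised data distribution at time t."""
    rng = rng if rng is not None else np.random.default_rng(0)
    x = rng.standard_normal(shape)  # start from the Gaussian prior
    dt = 1.0 / n_steps
    for i in range(n_steps, 0, -1):
        t = i / n_steps
        b = 0.1 + 19.9 * t  # linear beta(t) noise schedule
        drift = -0.5 * b * x - b * score_fn(x, t)  # reverse-time drift
        x = x - drift * dt + np.sqrt(b * dt) * rng.standard_normal(shape)
    return x

# Toy check: for N(0, I) data the VP-SDE keeps the marginal at N(0, I), whose
# score is simply -x, so generated samples should also look like N(0, I).
sample = reverse_sde_sample(lambda x, t: -x, shape=(4, 2))
```

<p>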
<a href="https://github.com/harryjo97/GDSS">GDSS code</a> is already available!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*4xi6x7S6bDoQz4UV" /><figcaption>GDSS intuition. Source: <a href="https://arxiv.org/pdf/2202.02514.pdf">Jo, Lee, and Hwang</a></figcaption></figure><p>👀 Looking at arxiv these days, we’d expect many more diffusion models to be released this year — DDPMs in graphs deserve their own big blog post, stay tuned!</p><p>➡️ Finally, an example of non-diffusion generation is the work by <a href="https://arxiv.org/pdf/2204.01613.pdf">Martinkus et al</a>, who design <a href="https://github.com/KarolisMart/SPECTRE"><strong>SPECTRE</strong></a>, a GAN for one-shot graph generation. Unlike other GANs, which often generate an adjacency matrix right away, the idea of <strong>SPECTRE</strong> is to condition graph generation on top-k (lowest) eigenvalues and eigenvectors of a Laplacian that already give some notion of clusters and connectivity. 1️⃣ <strong>SPECTRE</strong> generates <em>k</em> eigenvalues. 2️⃣ The authors use a clever trick of sampling eigenvectors from the <a href="https://en.wikipedia.org/wiki/Stiefel_manifold">Stiefel manifold</a> induced by top-k eigenvectors. The Stiefel manifold provides a bank of orthonormal matrices from which we can sample one <em>n x k</em> matrix. 3️⃣ Finally, having obtained a Laplacian, the authors use a <a href="https://papers.nips.cc/paper/2019/file/bb04af0f7ecaee4aae62035497da1387-Paper.pdf">Provably Powerful Graph Net</a> to generate the final adjacency.</p><p>Experimentally, <strong>SPECTRE</strong> is orders of magnitude better than other GANs and up to 30x faster than autoregressive graph generators 🚀.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*oT6-JIrrLCAoUu_-" /><figcaption>SPECTRE: a 3-step process to generate eigenvalues -&gt; eigenvectors -&gt; adjacency.
Source: <a href="https://arxiv.org/pdf/2204.01613.pdf">Martinkus et al</a></figcaption></figure><h3>Graph Transformers</h3><p>We have two papers on improving Graph Transformers at this year’s ICML.</p><p>➡️ First, <a href="https://arxiv.org/pdf/2202.03036.pdf">Chen, O’Bray, and Borgwardt</a> propose a <strong>Structure-Aware Transformer (SAT)</strong>. They notice that self-attention can be rewritten as kernel smoothing where the query-key product is an exponential kernel. It then boils down to finding a more generalized kernel — the authors propose using functions of a node and the graph to add structure awareness, namely, <strong>k-subtree</strong> and <strong>k-subgraph</strong> features. <em>K-subtrees</em> are essentially k-hop neighborhoods and can be mined relatively fast, but are eventually limited to the expressiveness of 1-WL. On the other hand, <em>k-subgraphs</em> are more expensive to compute (and hardly scale) but provide a provably better distinguishing power.</p><p>Whatever featurization you select, those subtrees or subgraphs (extracted for each node) are then encoded through any GNN encoder (e.g., PNA), pooled (sum/mean/virtual node), and used as queries and keys in the self-attention computation (see the illustration 👇).</p><p>Experimentally, k of 3 or 4 is enough, and k-subgraph features expectedly work better on graphs where we can afford their computation.
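</p><p>Extracting a k-subtree is just a depth-bounded BFS; a quick plain-Python sketch (our illustration, not the SAT codebase):</p>

```python
from collections import deque

def k_subtree(adj, root, k):
    """Nodes reachable from `root` within k hops (the k-subtree feature support)."""
    seen, frontier = {root}, deque([(root, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nb in adj[node]:
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, depth + 1))
    return seen

# A 6-cycle: node 0's 2-subtree covers itself plus 2 hops in each direction.
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
print(sorted(k_subtree(adj, 0, 2)))  # -> [0, 1, 2, 4, 5]
```

<p>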
Interestingly, positional features like Laplacian eigenvectors and Random Walk features are only helpful for the <em>k-subtree SAT</em>, being rather useless for the <em>k-subgraph SAT</em>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*dyM1WMxaooNLzv58" /><figcaption>Source: <a href="https://arxiv.org/pdf/2202.03036.pdf">Chen, O’Bray, and Borgwardt</a></figcaption></figure><p>➡️ Second, <a href="https://arxiv.org/pdf/2107.07999.pdf">Choromanski, Lin, Chen, et al</a> (the team overlaps a lot with the authors of the famous <a href="https://arxiv.org/abs/2009.14794">Performer</a> with linear attention) study principled mechanisms to enable sub-quadratic attention. In particular, they consider relative positional encodings (RPEs) and their variations for different data modalities like images, sounds, video, and graphs. Considering graphs, we know from <a href="https://github.com/microsoft/Graphormer">Graphormer</a> that infusing shortest path distances into attention works well, but requires materialization of the full attention matrix (hence, not scalable). Can we approximate the softmax attention without full materialization but still incorporate useful graph inductive biases? 🤔</p><p>Yes! And the authors propose 2 such mechanisms. (1) Turns out, we can use <strong>Graph Diffusion Kernels (GDK)</strong> — a.k.a. heat kernels — that model a diffusion process of heat propagation and serve as a soft version of shortest paths. Diffusion, however, requires calling solvers for computing matrix exponentials, so the authors design another way. (2) The Random Walks Graph-Nodes Kernel (RWGNK), whose value for two nodes is the dot product of their frequency vectors obtained from random walks starting at those two nodes.</p><p>Random walks are great, we love random walks 😍 Check out the illustration below for a visual description of diffusion and RW kernel results.
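</p><p>A simplified version of such a kernel is easy to sketch: estimate visit-frequency vectors by simulating walks and take their dot product (a toy approximation of the idea, not the paper’s implementation):</p>

```python
import random
from collections import Counter

def walk_frequencies(adj, start, n_walks=200, walk_len=4, seed=0):
    """Empirical node-visit frequency vector of random walks starting at `start`."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_walks):
        node = start
        for _ in range(walk_len):
            node = rng.choice(adj[node])
            counts[node] += 1
    total = n_walks * walk_len
    return {n: c / total for n, c in counts.items()}

def rw_kernel(adj, u, v):
    """Kernel value for (u, v): dot product of their visit-frequency vectors."""
    fu, fv = walk_frequencies(adj, u), walk_frequencies(adj, v, seed=1)
    return sum(fu.get(n, 0.0) * fv.get(n, 0.0) for n in fu)

# Two disconnected triangles: walks from different components never overlap.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
```

<p>Nodes in different connected components never share visited nodes, so their kernel value is exactly zero; the kernel encodes graph structure for free.</p><p>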
The final transformer with the RWGNK kernel is called <strong>Graph Kernel Attention Transformer</strong> <strong>(GKAT)</strong> and is probed on several tasks, from synthetic identification of topological structures in ER graphs to small compbio and social network datasets. <strong>GKAT</strong> shows much better results on synthetic tasks and performs pretty much on par with GNNs on other graphs. It would be great to see a real scalability study pushing the Transformer to the limits of input set size!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*7NXxc31VcjKTc15V" /><figcaption>Source: <a href="https://arxiv.org/pdf/2107.07999.pdf">Choromanski, Lin, Chen, et al</a></figcaption></figure><h3>Theory and Expressive GNNs</h3><p>The GNN community continues to study ways of breaking through the ceiling of 1-WL expressiveness while retaining at least polynomial time complexity.</p><p>➡️ <a href="https://proceedings.mlr.press/v162/papp22a/papp22a.pdf">Papp and Wattenhofer</a> start with an accurate description of current theoretical studies:</p><blockquote>Whenever a new GNN variant is introduced, the corresponding theoretical analysis usually shows it to be more powerful than 1-WL, and sometimes also compares it to the classical k-WL hierarchy… Can we find a more meaningful way to measure the expressiveness of GNN extensions?</blockquote><p>The authors categorize the literature of expressive GNNs into 4 families: 1️⃣ k-WL and approximations; 2️⃣ substructure counting <strong>(S)</strong>; 3️⃣ subgraph- and neighborhood-aware GNNs <strong>(N)</strong> (<a href="https://towardsdatascience.com/using-subgraphs-for-more-expressive-gnns-8d06418d5ab">covered extensively in the recent post by Michael Bronstein</a>); 4️⃣ GNNs with marking — those are node/edge perturbation approaches and node/edge labeling approaches <strong>(M)</strong>.
Then, the authors come up with a theoretical framework of how all those <strong>k-WL, S, N, and M</strong> families are related and which one is more powerful, and to what extent. The hierarchy is more fine-grained than k-WL and helps design GNNs that are just expressive enough for particular downstream tasks while saving compute.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/822/0*qF6OqUtucFYKFZEw" /><figcaption>The proposed hierarchy of different expressive GNN families. N=subgraph GNNs, S=substructure counting, M=GNNs with markings. Source: <a href="https://proceedings.mlr.press/v162/papp22a/papp22a.pdf">Papp and Wattenhofer</a></figcaption></figure><p>➡️ Perhaps the tastiest ICML’22 work is cooked by chefs <a href="https://proceedings.mlr.press/v162/morris22a/morris22a.pdf">Morris et al</a> with 🥓<a href="https://github.com/chrsmrrs/speqnets">SpeqNets</a> 🥓 (<em>Speck</em> is <em>bacon</em> in German). Known higher-order k-WL GNNs either operate on k-order tensors or consider all <em>k</em>-node subgraphs, implying an exponential dependence on <em>k</em> in memory requirements, and they do not adapt to the sparsity of the graph. <strong>SpeqNets</strong> introduce a new class of heuristics for the graph isomorphism problem, the <strong>(k,s)-WL</strong>, which offers more fine-grained control of the trade-off between expressivity and scalability.</p><p>Essentially, the algorithm is a variant of the <a href="https://arxiv.org/abs/1904.01543">local k-WL</a> but only considers specific tuples to avoid the exponential memory complexity of the k-WL.
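</p><p>To see what the (k,s) restriction buys, here is a brute-force enumeration of the surviving tuples (purely illustrative; SpeqNets never materialize these sets explicitly):</p>

```python
from itertools import combinations

def n_components(adj, nodes):
    """Number of connected components of the subgraph induced by `nodes`."""
    nodes, seen, comps = set(nodes), set(), 0
    for start in nodes:
        if start in seen:
            continue
        comps += 1
        stack = [start]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            stack.extend(v for v in adj[u] if v in nodes and v not in seen)
    return comps

def ks_tuples(adj, k, s):
    """All k-node subsets whose induced subgraph has at most s components."""
    return [c for c in combinations(sorted(adj), k) if n_components(adj, c) <= s]

# Path graph 0-1-2-3: of the four 3-subsets, only two are connected (s=1).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(ks_tuples(adj, k=3, s=1))  # -> [(0, 1, 2), (1, 2, 3)]
```

<p>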
Concretely, the algorithm only considers <strong>k-tuples</strong> or subgraphs on k nodes with at most <strong>s connected</strong> components, effectively exploiting the potential sparsity of the underlying graph — varying <strong>k</strong> and <strong>s</strong> leads to a tradeoff between scalability and expressivity on the theoretical side.</p><p>The authors derive a new hierarchy of permutation-equivariant graph neural networks, denoted <strong>SpeqNets</strong>, based on the above combinatorial insights, reaching universality in the limit. These architectures vastly reduce computation times compared to standard higher-order graph networks in the supervised node- and graph-level classification and regression regimes, and significantly outperform standard graph neural network and graph kernel architectures in predictive performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GGXN-0JHu6jnkz5E" /><figcaption>The hierarchy of 🥓 SpeqNets 🥓. Source: <a href="https://proceedings.mlr.press/v162/morris22a/morris22a.pdf">Morris et al</a></figcaption></figure><p>➡️ Next, <a href="https://proceedings.mlr.press/v162/huang22l/huang22l.pdf">Huang et al</a> take an unorthodox look at permutation-invariant GNNs and suggest that carefully designed<strong> permutation-sensitive</strong> GNNs are actually more expressive. The theory of <a href="https://openreview.net/forum?id=BJluy2RcFm">Janossy pooling</a> says a model becomes invariant to a group of transformations if we show all possible examples of such a transformation, and for permutations of <em>n</em> elements there are an intractable <em>n!</em> of them.
Instead, the authors show that considering only pairwise 2-ary permutations of a node’s neighborhood is enough and is provably more powerful than 2-WL and not less powerful than 3-WL.</p><p>Practically, the proposed <a href="https://github.com/zhongyu1998/PG-GNN"><strong>PG-GNN</strong></a> extends the idea of GraphSAGE and encodes every random permutation of a node’s neighborhood through a 2-layer LSTM instead of traditional <em>sum/mean/min/max</em>. Additionally, the authors design a linear permutation sampling approach based on Hamiltonian cycles.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Rh17chscpE5r5UbK" /><figcaption>PG-GNN permutation-sensitive aggregation idea. Source: <a href="https://proceedings.mlr.press/v162/huang22l/huang22l.pdf">Huang et al</a></figcaption></figure><p>Some other interesting works you might want to check:</p><ul><li><a href="https://proceedings.mlr.press/v162/cai22b/cai22b.pdf">Cai and Wang</a> study convergence properties of <a href="https://arxiv.org/abs/1812.09902">Invariant Graph Networks</a>, different from vanilla MPNNs in that they operate on node and edge features as equivariant operations over monolithic tensors. Based on <a href="https://en.wikipedia.org/wiki/Graphon">graphon</a> theory, the authors find a class of IGNs that provably converge.
More technical details are in the <a href="https://twitter.com/ChenCaiUCSD/status/1550109192803045376">awesome Twitter thread</a>!</li><li><a href="https://proceedings.mlr.press/v162/gao22e/gao22e.pdf">Gao and Ribeiro</a> study ⏳ temporal GNNs ⏳, devising two families: (1) <em>time-and-graph</em> — where we first embed graph snapshots via some GNN and then apply some RNN; (2) <em>time-then-graph</em>, where we first encode all node and edge features (over a unified graph of all snapshots) through an RNN, and only then apply a single GNN pass, e.g., <a href="https://arxiv.org/abs/2006.10637">TGN</a> and <a href="https://openreview.net/forum?id=rJeW1yHYwH">TGAT</a> can be considered instances of this family. Theoretically, the authors find that <em>time-then-graph</em> models are more expressive than <em>time-and-graph</em> models when using standard 1-WL GNN encoders like GCN or GIN, and propose a simple model with a GRU time encoder and a GCN graph encoder. The model shows very competitive performance on temporal node classification and regression tasks while being 3–10x faster and more GPU memory-efficient. Interestingly, the authors find that <strong>neither</strong> <em>time-and-graph</em> nor <em>time-then-graph</em> <strong>is expressive enough</strong> for temporal link prediction 🤔.</li><li>Finally, “<em>Weisfeiler-Lehman Meets Gromov-Wasserstein</em>” by <a href="https://proceedings.mlr.press/v162/chen22o/chen22o.pdf">Chen, Lim, Mémoli, Wan, Wang</a> (a joint 5-first authors paper 👀) derives a polynomial-time <a href="https://github.com/chens5/WL-distance">WL distance</a> from the WL kernel such that we can measure a dissimilarity of two graphs — the WL distance is 0 if and only if they cannot be distinguished by the WL test, and positive otherwise.
The authors further realize that the proposed WL distance has deep connections to the <a href="https://arxiv.org/abs/1808.04337">Gromov-Wasserstein distance</a>!</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/497/0*ybT4mgocmE8xHq5k" /><figcaption>How Weisfeiler-Leman meets Gromov-Wasserstein in practice. Should have been in the paper by <a href="https://proceedings.mlr.press/v162/chen22o/chen22o.pdf">Chen, Lim, Mémoli, Wan, Wang</a>. Source: <a href="https://tenor.com/view/predator-arnold-schwarzenegger-hand-shake-arms-gif-3468629">Tenor</a></figcaption></figure><h3>Spectral GNNs</h3><p>➡️ Spectral GNNs tend to be overlooked in the mainstream of spatial GNNs, but now there is a reason for you to take a look at spectral GNNs 🧐. In “<a href="https://proceedings.mlr.press/v162/wang22am/wang22am.pdf"><em>How Powerful are Spectral Graph Neural Networks</em></a>” by <a href="https://proceedings.mlr.press/v162/wang22am/wang22am.pdf">Wang and Zhang</a>, the authors show that a linear spectral GNN is a universal approximator for any function on a graph under some mild assumptions. What’s even more exciting is that the assumptions turn out to be empirically true for real-world graphs, suggesting that a linear spectral GNN is <strong>powerful enough</strong> for the node classification task.</p><p>But how do we explain the difference in the empirical results of spectral GNNs? The authors prove that different parameterizations (specifically, polynomial filters) of the spectral GNNs influence the convergence speed. We know that the condition number of the Hessian matrix (how round the iso-loss contours are) is highly related to the convergence speed. Based on this intuition, the authors come up with orthogonal polynomials that benefit optimization.
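</p><p>As a refresher, a polynomial spectral filter is a weighted sum of Laplacian powers applied to the signal; the basis in which the weights are parameterized (monomial below, Chebyshev or Jacobi in the papers) is exactly what affects optimization. A toy numpy sketch:</p>

```python
import numpy as np

def normalized_laplacian(A):
    """L = I - D^(-1/2) A D^(-1/2) for an adjacency matrix A."""
    d = A.sum(axis=1)
    d_inv_sqrt = np.divide(1.0, np.sqrt(d), out=np.zeros_like(d), where=d > 0)
    return np.eye(len(A)) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

def poly_filter(L, x, theta):
    """Apply g(L) x = sum_i theta_i * L^i x using only matrix-vector products."""
    out, p = np.zeros_like(x), x.copy()
    for t in theta:
        out = out + t * p
        p = L @ p
    return out

# 4-cycle: a constant signal is the smoothest one, so a low-pass filter keeps it.
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], dtype=float)
L = normalized_laplacian(A)
smooth = poly_filter(L, np.ones(4), theta=[1.0, -0.5])
```

<p>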
The polynomials, named <a href="https://en.wikipedia.org/wiki/Jacobi_polynomials">Jacobi bases</a>, are a generalization of the <a href="https://en.wikipedia.org/wiki/Chebyshev_polynomials">Chebyshev bases</a> used in <a href="https://proceedings.neurips.cc/paper/2016/file/04df4d434d481c5bb723be1b6df1ee65-Paper.pdf">ChebyNet</a>. Jacobi bases are defined by two hyperparameters, <em>a</em> and <em>b</em>. By tuning these hyperparameters, one may find a group of bases well suited to the input graph signal.</p><p>Experimentally, <strong>JacobiConv</strong> performs well on both homophilic and heterophilic datasets, even as a linear model. Probably it’s time to desert those gaudy GNNs, at least for the node classification task 😏.</p><p>➡️ There are two more papers on spectral GNNs. One is <a href="https://proceedings.mlr.press/v162/li22h/li22h.pdf">Graph Gaussian Convolutional Networks</a> (G2CN), based on spectral concentration analysis, which shows good results on heterophilic datasets. The other one, from <a href="https://proceedings.mlr.press/v162/yang22n/yang22n.pdf">Yang et al</a>, analyzes the correlation issue in graph convolutions based on spectral smoothness, achieving an exceptionally good result of <strong>0.0698</strong> MAE on ZINC.</p><h3>Explainable GNNs</h3><p>As most GNN models are black boxes, it is important to explain the predictions of GNNs for applications in critical areas. This year we have two awesome papers in this direction: an efficient and powerful post-hoc model from <a href="https://proceedings.mlr.press/v162/xiong22a/xiong22a.pdf">Xiong et al</a>, and an inherently interpretable model from <a href="https://proceedings.mlr.press/v162/miao22a/miao22a.pdf">Miao et al</a>.</p><p>➡️ <a href="https://proceedings.mlr.press/v162/xiong22a/xiong22a.pdf">Xiong et al</a> extend their previous GNN explanation method, <a href="https://arxiv.org/pdf/2006.03589.pdf">GNN-LRP</a>, to be way more scalable.
Unlike other methods (<a href="https://arxiv.org/pdf/1903.03894.pdf">GNNExplainer</a>, <a href="https://arxiv.org/pdf/2011.04573.pdf">PGExplainer</a>, <a href="https://arxiv.org/pdf/2010.05788.pdf">PGM-Explainer</a>), <a href="https://arxiv.org/pdf/2006.03589.pdf">GNN-LRP</a> is a higher-order subgraph attribution method that considers the joint contribution of nodes in a subgraph. Such a property is necessary for tasks where a subgraph is not simply a set of nodes. For example, in molecules, a subgraph of six carbons (hydrogens are ignored) can be either a benzene (a ring) or a hexane (a chain). As shown in the figure below, a higher-order method can figure out such subgraphs (right) while a lower-order method (left) may not.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*dR_GvltDxvRKpSCh" /><figcaption>Source: <a href="https://proceedings.mlr.press/v162/xiong22a/xiong22a.pdf">Xiong et al</a>.</figcaption></figure><p>However, the drawback of GNN-LRP is that it needs to compute the gradient w.r.t. each random walk in a subgraph, which takes <em>O(|S|^L)</em> for a subgraph <em>S</em> and <em>L</em>-hop random walks. Here, dynamic programming comes to the rescue 😎. Notice that the gradient w.r.t. a random walk is multiplicative (chain rule), and different random walks are aggregated by summation. This can be efficiently computed by the sum-product algorithm. The idea is to use the distributive property of summation over multiplication (more generally, <a href="https://en.wikipedia.org/wiki/Semiring">semiring</a>), and aggregate partial random walks at each step. This constitutes the model, <a href="https://github.com/xiong-ping/sgnn_lrp_via_mp"><strong>subgraph GNN-LRP (sGNN-LRP)</strong></a>.</p><p><strong>sGNN-LRP</strong> also improves over GNN-LRP with a generalized subgraph attribution, which considers both random walks in the subgraph <em>S</em> and its complement <em>G\S</em>.
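</p><p>The sum-product trick itself fits in a few lines: naively summing products over all L-step walks is exponential, while aggregating partial walks step by step reduces it to repeated matrix-vector products (a generic sketch of the idea, not sGNN-LRP’s code):</p>

```python
import numpy as np
from itertools import product

def walk_sum_bruteforce(W, L):
    """Sum over all L-step walks of the product of edge weights along the walk.
    Walks through zero-weight non-edges contribute a zero product, so they are harmless."""
    n = len(W)
    total = 0.0
    for walk in product(range(n), repeat=L + 1):
        w = 1.0
        for a, b in zip(walk, walk[1:]):
            w *= W[a][b]
        total += w
    return total

def walk_sum_dp(W, L):
    """Same quantity via the sum-product trick: aggregate partial walks each step."""
    W = np.asarray(W, dtype=float)
    state = np.ones(len(W))  # contribution of all length-0 walks per node
    for _ in range(L):
        state = W @ state    # extend every partial walk by one step
    return float(state.sum())

W = [[0.0, 0.5], [0.3, 0.2]]
print(walk_sum_bruteforce(W, 3), walk_sum_dp(W, 3))  # identical values
```

<p>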
Complicated as it looks, the generalized subgraph attribution can be computed by two sum-product algorithm passes. Experimentally, <strong>sGNN-LRP</strong> not only finds better attributions than all existing explanation methods, but also runs as fast as a regular message passing GNN. Might be a useful tool for interpretation and visualization! 🔨</p><p>💡 By the way, it is not new to see that models based on random walks are more expressive than simple node or edge models. The NeurIPS’21 paper <a href="https://papers.nips.cc/paper/2021/file/f6a673f09493afcd8b129a0bcf1cd5bc-Paper.pdf">NBFNet</a> solves knowledge graph reasoning with random walks and dynamic programming, and achieves amazing results in both transductive and inductive settings.</p><p>➡️ <a href="https://proceedings.mlr.press/v162/miao22a/miao22a.pdf">Miao et al</a> take another perspective and study inherently interpretable GNN models. They show that post-hoc explanation methods, such as <a href="https://arxiv.org/pdf/1903.03894.pdf">GNNExplainer</a>, are subpar for interpretation since they merely use a fixed pretrained GNN model. By contrast, an inherently interpretable GNN that jointly optimizes the predictor and the interpretation modules is a better solution. Following this idea, the authors derive <a href="https://github.com/Graph-COM/GSAT"><strong>graph stochastic attention (GSAT)</strong></a> from the graph information bottleneck (<strong>GIB</strong>) principle. <strong>GSAT</strong> encodes the input graph and randomly samples a subgraph (interpretation) from the posterior distribution. It makes the prediction based on the sampled subgraph.
As an advantage, <strong>GSAT</strong> doesn’t need to constrain the size of a sampled subgraph.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*qBxaIzgzWvEs8UA7" /><figcaption>Source: <a href="https://proceedings.mlr.press/v162/miao22a/miao22a.pdf">Miao et al</a></figcaption></figure><p>Experimentally, <strong>GSAT</strong> is much better than post-hoc methods in terms of both interpretation and prediction performance. It can also be coupled with a pretrained GNN model. GSAT should be a good candidate if you are building interpretable GNNs for your applications.</p><h3>Graph Augmentation: Beyond Edge Dropout</h3><p>This year brought a few works on improving self-supervised capabilities of GNNs that go beyond random edge index perturbations like node/edge dropout.</p><p>➡️ <a href="https://arxiv.org/pdf/2202.07179.pdf">Han et al</a> bring the idea of <a href="https://github.com/facebookresearch/mixup-cifar10">mixups</a> used in image augmentation since 2017 to graphs with <strong>G-Mixup</strong> (Outstanding Paper Award at ICML 2022 🏅). The idea of mixups is to take two images, mix their features together and mix their labels together (according to a pre-defined weighting factor), and ask the model to predict this mixed label. Such a mixup improves the robustness and generalization of classifiers.</p><blockquote>But how do we mix two graphs that in general might have different numbers of nodes and edges?</blockquote><p>The authors find an elegant answer — let’s mix not the graphs but their <a href="https://en.wikipedia.org/wiki/Graphon">graphons</a>, which are, in simple words, graph generators. Graphs coming from the same generator have the same underlying graphon.
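</p><p>Mixing graphs at the graphon level can be sketched in a few lines (a deliberately crude illustration: degree-sorted block averaging as the estimator; the paper uses proper step-function estimators):</p>

```python
import numpy as np

def estimate_graphon(graphs, resolution=4):
    """Crude step-function graphon estimate: align nodes by sorted degree,
    then average edge densities over a resolution x resolution block grid."""
    W = np.zeros((resolution, resolution))
    for A in graphs:
        order = np.argsort(-A.sum(axis=1))   # degree-based node alignment
        A = A[np.ix_(order, order)]
        blocks = np.array_split(np.arange(len(A)), resolution)
        for i, bi in enumerate(blocks):
            for j, bj in enumerate(blocks):
                W[i, j] += A[np.ix_(bi, bj)].mean()
    return W / len(graphs)

def gmixup_sample(W1, W2, lam, n_nodes, rng):
    """Mix two graphons with weight lam and sample a new graph from the mixture."""
    W = lam * W1 + (1.0 - lam) * W2
    u = rng.integers(0, len(W), size=n_nodes)      # block assignment per node
    P = W[np.ix_(u, u)]                            # edge probabilities
    A = (rng.random((n_nodes, n_nodes)) < P).astype(int)
    A = np.triu(A, 1)
    return A + A.T                                 # symmetric, no self-loops

rng = np.random.default_rng(0)
A_new = gmixup_sample(np.ones((4, 4)), np.zeros((4, 4)), lam=0.5, n_nodes=8, rng=rng)
# A_new is a random graph whose density interpolates between the two sources
```

<p>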
So the algorithm becomes rather straightforward (see the illustration below) — for a pair of graphs, we 1️⃣ estimate their graphons; 2️⃣ mix the two graphons into a new one through a weighted sum; 3️⃣ sample a graph and its new label from the mixed graphon; and 4️⃣ send this to a classifier. In the illustrative example, we have two graphs with 2 and 8 connected components, respectively, and after mixing their graphons we get a new graph of 2 major communities with 4 minor ones in each. Estimating graphons can be done with step functions via several methods of varying computational complexity (the authors mostly resort to <a href="https://arxiv.org/abs/1110.6517">“largest gap”</a>).</p><p>Experimentally, <strong>G-Mixup</strong> stabilizes model training, performs better than or on par with traditional node/edge perturbation methods, but outperforms them by a large margin in the robustness scenarios with label noise or many added/removed edges. Cool adaptation of a well-known augmentation method to graphs 👏! If you are interested, ICML’22 offers a few more general works on mixups: a <a href="https://proceedings.mlr.press/v162/zhang22f.html">study</a> of how mixups improve calibration and <a href="https://proceedings.mlr.press/v162/sohn22a/sohn22a.pdf">how to use them</a> in generative models.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*pLo1sPxfRu50w2SO" /><figcaption>G-Mixup. Source: <a href="https://arxiv.org/pdf/2202.07179.pdf">Han et al</a></figcaption></figure><p>➡️ <a href="https://arxiv.org/pdf/2109.03856.pdf">Liu et al</a> take another look at augmentation, particularly in setups where nodes have small neighborhoods. The idea of <a href="https://github.com/SongtaoLiu0823/LAGNN"><strong>Local Augmentation GNNs (LA-GNN)</strong></a> is to train a generative model to yield an additional feature vector for each node.
The generative model is a conditional VAE trained (on the whole graph) to predict features of connected neighbors conditioned on a center node. That is, once the CVAE is trained, we just pass a feature vector of each node and get another feature vector that is supposed to capture more information than the plain neighborhood.</p><p>We then concatenate the two feature vectors per node and send them to any downstream GNN and task. Note that the CVAE is pre-trained beforehand and doesn’t need to be trained jointly with the GNN. Interestingly, the CVAE can generate features for unseen graphs, i.e., local augmentation can be used in inductive tasks as well! The initial hypothesis is confirmed experimentally — the augmentation approach works particularly well for nodes of small degrees.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*VLK_Q-Jz3bpO5HXy" /><figcaption>The Local Augmentation idea. Source: <a href="https://arxiv.org/pdf/2109.03856.pdf">Liu et al</a></figcaption></figure><p>➡️ Next, <a href="https://arxiv.org/pdf/2206.07161.pdf">Yu, Wang, Wang, et al</a> tackle the GNN scalability task where using standard neighbor samplers a-la GraphSAGE might lead to exponential neighborhood size expansion and stale historical embeddings. The authors propose <a href="https://github.com/divelab/DIG/tree/dig/dig/lsgraph"><strong>GraphFM</strong></a>, a feature momentum approach, where historical node embeddings get updates from their 1-hop neighbors through a momentum step. Generally, momentum updates are often seen in SSL approaches like <a href="https://papers.nips.cc/paper/2020/file/f3ada80d5c4ee70142b17b8192b2958e-Paper.pdf">BYOL</a> and <a href="https://arxiv.org/abs/2102.06514">BGRL</a> for updating model parameters of a <em>target</em> network.
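</p><p>Concretely, a momentum (exponential moving average) update is a one-liner; a tiny numpy sketch with illustrative values:</p>

```python
import numpy as np

def momentum_update(hist, new, m=0.9):
    """EMA: keep a fraction m of the historical embedding, blend in the fresh one."""
    return m * hist + (1.0 - m) * new

# Historical node embeddings drift smoothly toward the fresh mini-batch estimates,
# damping the variance that comes from small sampled neighborhoods.
hist = np.zeros(4)
for _ in range(50):
    hist = momentum_update(hist, np.ones(4))
```

<p>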
Here, GraphFM employs momentum to alleviate the variance of historical representations across different mini-batch sizes and to provide an unbiased estimation of feature updates for differently-sized neighborhoods.</p><p>Generally, GraphFM comes in two flavors: <strong>GraphFM-<em>InBatch</em></strong> and <strong>GraphFM-<em>OutOfBatch</em></strong>. (1) GraphFM-InBatch works for the GraphSAGE-style neighbor sampling by dramatically reducing the number of necessary neighbors — whereas GraphSAGE required 10–20 depending on the level, GraphFM needs only 1 random neighbor per node per layer. Only one 👌! And (2) GraphFM-OutOfBatch builds on top of <a href="https://arxiv.org/pdf/2106.05609.pdf">GNNAutoScale</a>, where we first apply graph partitioning to cut the graph into k mini-batches.</p><p>Experimentally, feature momentum looks especially useful for SAGE-style sampling (the in-batch version) — it seems like a good default choice for all neighbor sampling-based approaches!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*721I25AT_kQq-dhI" /><figcaption>Compared to <a href="https://arxiv.org/pdf/2106.05609.pdf">GNNAutoScale (GAS)</a>, historical node states are also updated from new embeddings and feature momentum (moving average). Source: <a href="https://arxiv.org/pdf/2206.07161.pdf">Yu, Wang, Wang, et al</a></figcaption></figure><p>➡️ Finally, <a href="https://arxiv.org/pdf/2106.02172.pdf">Zhao et al</a> propose a clever augmentation trick for link prediction based on counterfactual links. In essence, the authors ask:</p><blockquote>“would the link still exist if the graph structure became different from observation?”</blockquote><p>It means that we would like to find links that are structurally similar to the given link according to some 💊 <em>treatment</em> (here, those are classical metrics like SBM clustering, k-core decomposition, Louvain, and more) but give the opposite result.
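</p><p>Mining such a counterfactual candidate can be sketched as a nearest-neighbor search under the opposite treatment (a toy illustration; the treatment and pair features here are placeholders, not the paper’s exact choices):</p>

```python
import numpy as np

def counterfactual_pair(pairs, feats, treat, anchor):
    """Return the pair most similar to `anchor` that has the opposite treatment.
    treat[(u, v)] is a 0/1 indicator (e.g., 'u and v fall into the same cluster')."""
    target = 1 - treat[anchor]
    candidates = [p for p in pairs if treat[p] == target]
    dists = [np.linalg.norm(feats[p] - feats[anchor]) for p in candidates]
    return candidates[int(np.argmin(dists))]

# Toy example: the anchor pair is "treated" (same cluster); its counterfactual is
# the nearest untreated pair in some pair-feature space.
pairs = [(0, 1), (0, 2), (1, 2)]
treat = {(0, 1): 1, (0, 2): 0, (1, 2): 0}
feats = {(0, 1): np.array([0.0, 0.0]),
         (0, 2): np.array([1.0, 0.0]),
         (1, 2): np.array([5.0, 0.0])}
cf = counterfactual_pair(pairs, feats, treat, anchor=(0, 1))  # -> (0, 2)
```

<p>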
With <a href="https://github.com/DM2-ND/CFLP"><strong>CFLP</strong></a>, the authors hypothesize that training a GNN to correctly predict both true and counterfactual links helps the model to get rid of spurious correlations and capture only meaningful features for inferring a link between two nodes.</p><p>After obtaining a set of counterfactual links (a pre-processing step based on the chosen <em>treatment function</em>), <strong>CFLP</strong> is first trained on both factual and counterfactual links, then the link prediction decoder is fine-tuned with some balancing and regularization terms. In some sense, the approach resembles mining hard negatives to augment the set of true positive links 🤔. Experimentally, <strong>CFLP</strong> paired with a GNN encoder largely outperforms that single GNN encoder on Cora/Citeseer/Pubmed, and is still in the <a href="https://ogb.stanford.edu/docs/leader_linkprop/">top-3 of the OGB-DDI</a> link prediction task!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*1Ww02qcZ0CKcq-fM" /><figcaption>Counterfactual links (right). Source: <a href="https://arxiv.org/pdf/2106.02172.pdf">Zhao et al</a></figcaption></figure><h3>Algorithmic Reasoning and Graph Algorithms</h3><p>🎆 A huge milestone for the algorithmic reasoning community — the appearance of the <a href="https://github.com/deepmind/clrs"><strong>CLRS benchmark</strong></a> (named after the classical textbook <a href="https://en.wikipedia.org/wiki/Introduction_to_Algorithms">Introduction to Algorithms</a> by Cormen, Leiserson, Rivest, and Stein) by <a href="https://arxiv.org/pdf/2205.15659.pdf">Veličković et al</a>!
Now, there is no need to invent toy evaluation tasks — CLRS contains 30 classical algorithms (sort, search, MST, shortest paths, graphs, dynamic programming, and many more) converting an <a href="https://icpc.global/">ICPC</a> data generator into an ML dataset 😎.</p><p>In <strong>CLRS</strong>, each dataset element is a <em>trajectory</em>, i.e., a collection of inputs, outputs, and intermediate steps. The underlying representation format is a set of nodes (often not a graph, as edges might not be necessary), for example, sorting a list of 5 elements is framed as operations over a set of 5 nodes. Trajectories consist of <em>probes</em> — tuples of format (stage, location, type, values) that encode a current execution step of an algorithm with its states. The output decoder depends on the expected type — in the example illustration 👇 sorting is modeled with pointers.</p><p>Split-wise, training and validation trajectories have 16 nodes (e.g., sort lists of length 16), but the test set probes out-of-distribution (OOD) capabilities of models on tasks with 64 nodes. Interestingly, vanilla GNNs and MPNNs fit training data very well but underperform in the OOD setup where <a href="https://proceedings.neurips.cc//paper/2020/file/176bf6219855a6eb1f3a30903e34b6fb-Paper.pdf">Pointer Graph Network</a> shows better numbers. It is one more data point in the collection of observations that GNNs can’t generalize to larger inference graphs — it’s still an open question how to fix this 🤔. The code is <a href="https://github.com/deepmind/clrs">already available</a> and could be extended with more custom algorithmic tasks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Qc3fqqryPRbinpFl" /><figcaption>Representation of hints in CLRS.
Source: <a href="https://arxiv.org/pdf/2205.15659.pdf">Veličković et al</a></figcaption></figure><p>➡️ On a more theoretical side, <a href="https://proceedings.mlr.press/v162/sanmarti-n22a/sanmarti-n22a.pdf">Sanmartín et al</a> generalize the notion of graph metrics through the <a href="https://www.youtube.com/watch?v=ZzBWh6orSHk">Algebraic Path Problem</a> (APP). APP is a higher-level framework (with <a href="https://arxiv.org/abs/2005.06682">some roots</a> in category theory) unifying many existing graph metrics like the shortest path, <a href="https://en.wikipedia.org/wiki/Cost_distance_analysis">commute cost distance</a>, and minimax distance through the notion of semirings — algebraic structures over sets with specific operators and properties. For instance, shortest paths can be described as a semiring with “<em>min</em>” and “<em>+</em>” operators with neutral elements “<em>+inf</em>” and “<em>0</em>”.</p><p>Here, the authors create a single APP framework of <strong>log-norm distances</strong> that allows one to interpolate between shortest paths, commute costs, and minimax using only two parameters. In essence, you could vary and mix the influence of edge weights and surrounding graph structure (other paths) on the final distance. Although there are no experiments, this is a solid theoretical contribution — if you are learning category theory as “eating your veggies” 🥦, this paper is a blast to read — and will surely find applications in GNNs. 👏</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/570/0*Rxb2p7BVGojtW4Dj" /><figcaption>Log-norm distances. Source: <a href="https://proceedings.mlr.press/v162/sanmarti-n22a/sanmarti-n22a.pdf">Sanmartín et al</a></figcaption></figure><p>➡️ Finally, we’d add to this category a work <a href="https://arxiv.org/pdf/2206.08119.pdf"><em>“Learning to Infer the Structures of Network Games”</em></a> by <strong>Rossi et al</strong> who combine graph theory with game theory.
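</p><p>As a quick aside before we get to games: the semiring view above is easy to play with in code. Below is a minimal sketch (ours, not the paper’s log-norm construction) of a generalized Floyd-Warshall closure parameterized by a semiring:</p>

```python
import math

def closure(adj, add, mul, one):
    """Generalized Floyd-Warshall over a semiring (add, mul).
    `one` is the multiplicative identity (used on the diagonal);
    missing edges in `adj` should be pre-filled with the semiring's
    additive identity, e.g. +inf for (min, +). A sketch of the
    Algebraic Path Problem view, not the log-norm construction."""
    n = len(adj)
    d = [[one if i == j else adj[i][j] for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                d[i][j] = add(d[i][j], mul(d[i][k], d[k][j]))
    return d

INF = math.inf
# a complete 3-node graph (diagonal entries are replaced by `one`)
adj = [[INF, 1.0, 4.0],
       [1.0, INF, 2.0],
       [4.0, 2.0, INF]]

# (min, +) semiring -> shortest-path distances
shortest = closure(adj, min, lambda a, b: a + b, 0.0)
# (max, min) semiring -> widest (minimax-style bottleneck) paths
widest = closure(adj, max, min, INF)
```

<p>Instantiating it with (<em>min</em>, <em>+</em>) recovers shortest paths, while (<em>max</em>, <em>min</em>) yields widest paths: the same algorithm, just a different algebra.</p><p>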
Game theory is used a lot in economics and other multidisciplinary studies — you’ve probably heard about the <a href="https://en.wikipedia.org/wiki/Nash_equilibrium">Nash Equilibrium</a> that defines a solution for non-cooperative games. In this work, the authors consider 3 game types: <em>linear quadratic</em>, <em>linear influence</em>, and <em>Barik-Honorio graphical games</em>. Games are usually defined through their utility functions, but in this work, we assume we don’t know anything about the game’s utility function.</p><p>Games are defined as N players (nodes in a graph) that take specific actions (for simplicity, let’s say we can describe them with a certain numerical feature — check the illustration below 🖼️). Actions can influence neighboring players — and the task is framed as inferring the graph of players given their actions. In essence, this is a graph generation task — given node features X, predict a (normalized) adjacency matrix A. Usually, a game is played K times, and those are independent games, so the encoder model should be invariant to permutations of games (and equivariant to permutation of nodes in each game). The authors propose the <strong>NuGgeT</strong> 🍗 encoder-decoder model where a transformer encoder processes K games of N players and yields latent representations; the decoder is an MLP over a sum of Hadamard products of latent pairwise player features, such that the decoder is permutation-invariant to the order of K games.</p><p>Experimentally, the model works well on both synthetic and real datasets.
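</p><p>The invariance requirement itself fits in a tiny numpy sketch (ours, far simpler than NuGgeT’s transformer): per-game features are summed over the K games, so shuffling the games cannot change the predicted adjacency.</p>

```python
import numpy as np

def infer_graph(actions, w=0.7):
    """Toy game-graph inference: `actions` is (K, N) -- K independent
    plays by N players. Summing per-game features over K makes the
    encoding invariant to the order of games; symmetric pairwise
    products play the role of the decoder. All design choices are ours.
    """
    z = np.tanh(w * actions).sum(axis=0)       # (N,) K-invariant latent
    logits = np.outer(z, z)                    # symmetric pair scores
    np.fill_diagonal(logits, 0.0)              # no self-loops
    return 1.0 / (1.0 + np.exp(-logits))       # soft adjacency in (0, 1)
```

<p>Shuffling the K games leaves the output unchanged, while permuting the N players permutes rows and columns of the predicted adjacency accordingly, which are exactly the two symmetries the paper bakes in.</p><p>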
The paper is definitely a “broaden your horizons” 🔭 work that you might not expect to see at ICML, but later find to be a fascinating read that teaches you a lot of new concepts 👏.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*QdpAB57Qa1a0744B" /><figcaption>Source: <a href="https://arxiv.org/pdf/2206.08119.pdf">Rossi et al</a></figcaption></figure><h3>Knowledge Graph Reasoning</h3><p>Knowledge graph reasoning has long been a playground for GraphML methods. At this year’s ICML, there are quite a few interesting papers on this topic. As a trend of this year, we see a significant drift from embedding methods (<a href="https://proceedings.neurips.cc/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf">TransE</a>, <a href="http://proceedings.mlr.press/v48/trouillon16.pdf">ComplEx</a>, <a href="https://arxiv.org/pdf/1902.10197.pdf">RotatE</a>, <a href="https://ojs.aaai.org/index.php/AAAI/article/view/5701/5557">HAKE</a>) to GNNs and logic rules (in fact, GNNs are also <a href="https://openreview.net/pdf?id=r1lZ7AEKvB">related to logic rules</a>). There are four papers based on GNNs or logic rules, and two papers extending the conventional embedding methods.</p><p>➡️ Let’s begin with the <a href="https://github.com/pkuyzy/CBGNN"><strong>cycle basis GNN (CBGNN)</strong></a> proposed by <a href="https://proceedings.mlr.press/v162/yan22a/yan22a.pdf">Yan et al</a>. The authors draw an interesting connection between logic rules and cycles. For any chain-like logic rule, the head and the body of the logic rule always form a cycle in the knowledge graph. For example, the right plot of the following figure shows the cycle for (X, part of, Y) ∧ (X, lives in, Z) → (Y, located in, Z).
In other words, the inference of a logic rule can be viewed as predicting the plausibility of a cycle, which boils down to learning the representations of cycles.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Vc0atVKc1jvi5cPc" /><figcaption>Blue and Red triangles are cycles within the bigger Green cycle. Source: <a href="https://proceedings.mlr.press/v162/yan22a/yan22a.pdf">Yan et al</a></figcaption></figure><p>An interesting observation is that cycles form a linear space under <em>modulo-2</em> addition and multiplication. In the above example, the summation of the red ❤️ and blue 💙 cycles, which cancels out their common edge, results in the green 💚 cycle. Therefore, we don’t need to learn the representations of all cycles — instead, only a <strong>few cycle bases</strong> of the linear space. The authors generate the cycle bases by picking cycles that have a large overlap with the shortest-path tree. To learn the representations of cycles, they create a cycle graph, where each node is a cycle in the original graph, and each edge indicates an overlap between two cycles. A GNN is applied to the cycle graph to learn representations of its nodes (which are cycles of the original graph).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*G3ePyiK1NED60ZU3" /><figcaption>CBGNN encoding. Source: <a href="https://proceedings.mlr.press/v162/yan22a/yan22a.pdf">Yan et al</a></figcaption></figure><p>To apply <strong>CBGNN</strong> to inductive relation prediction, the authors construct an inductive input representation for each cycle by encoding the relations in the cycle with an LSTM. Experimentally, CBGNN achieves SotA results on the inductive versions of FB15k-237/WN18RR/NELL-995.</p><p>➡️ Next, <a href="https://proceedings.mlr.press/v162/das22a/das22a.pdf">Das and Godbole et al</a> propose <a href="https://github.com/rajarshd/CBR-SUBG"><strong>CBR-SUBG</strong></a>, a case-based reasoning (CBR) method for KBQA.
The core idea is to retrieve similar query-answer pairs from the training set when solving a query. The idea of retrieval is very popular in OpenQA tasks (<a href="https://arxiv.org/pdf/2106.05346.pdf">EMDR</a>, <a href="https://arxiv.org/abs/2005.11401">RAG</a>, <a href="https://arxiv.org/pdf/2010.12688.pdf">KELM</a>, <a href="https://openreview.net/forum?id=OY1A8ejQgEX">Mention Memory LMs</a>), but this is the first time we see such an idea adopted on graphs.</p><p>Given a natural language query, CBR first retrieves similar k-nearest neighbor (kNN) queries based on the query representation encoded by a pretrained language model. All the retrieved queries are from the training set, and therefore their answers are accessible. Then we generate a local subgraph for each query-answer pair, which is believed to be the reasoning pattern (though not necessarily an exact one) for the answer. The local subgraph of the current query (for which we can’t access the answer) is generated by following the relation paths in the subgraphs of its kNN queries. <strong>CBR-SUBG</strong> then applies a GNN to every subgraph, and predicts the answer by comparing the node representations with the answers in the kNN queries.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EWoo_QevY7Ohfyt3" /><figcaption>Case-based reasoning intuition. Source: <a href="https://proceedings.mlr.press/v162/das22a/das22a.pdf">Das and Godbole et al</a></figcaption></figure><p>➡️ There are two neural-symbolic methods for reasoning this year. The first one is <a href="https://github.com/claireaoi/hierarchical-rule-induction"><strong>hierarchical rule induction (HRI)</strong></a> from <a href="https://proceedings.mlr.press/v162/glanois22a/glanois22a.pdf">Glanois et al</a>. HRI extends a previous work, <a href="https://arxiv.org/pdf/1809.02193.pdf">logic rule induction (LRI)</a>, on inductive logic programming.
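</p><p>Rewinding to the retrieval step of CBR-SUBG for a second, the kNN lookup over encoded queries is easy to sketch (ours; the paper encodes queries with a pretrained language model, here we simply assume the vectors are given):</p>

```python
import numpy as np

def knn_queries(query_vec, train_vecs, k=2):
    """Return indices of the k training queries most similar to the
    current query under cosine similarity -- the first stage of the
    case-based reasoning pipeline (a sketch, not the paper's code)."""
    q = query_vec / np.linalg.norm(query_vec)
    t = train_vecs / np.linalg.norm(train_vecs, axis=1, keepdims=True)
    return np.argsort(-(t @ q))[:k]
```

<p>The answers attached to the retrieved neighbors then seed the subgraph construction described above.</p><p>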
The idea of rule induction is to learn a bunch of rules and apply them to deduce facts, as in <a href="https://en.wikipedia.org/wiki/Forward_chaining">forward chaining</a>.</p><p>In both <strong>LRI</strong> and <strong>HRI</strong>, each fact P(s,o) is represented by a predicate embedding <em>𝜃p</em> and a valuation vp (i.e., the probability of the fact being true). Each rule P(X,Y) ← P1(X,Z) ∧ P2(Z,Y) is represented by the embeddings of its predicates. The goal is to iteratively apply rules to deduce new facts. During each iteration, the rules and facts are matched through soft unification, which measures whether two facts satisfy certain rules in the embedding space. Once a rule is selected, a new fact is generated and added to the set of facts. All the embeddings and the soft unification operation are trained end-to-end to maximize the likelihood of observed facts.</p><p>The <strong>HRI</strong> model improves over the LRI model in three aspects: 1) using a hierarchical prior that separates the rules used in each iteration step; 2) using Gumbel-softmax to induce a sparse and interpretable solution for soft unification; 3) proving which logic rules HRI can express.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*iJAW14fQ4A7H9wHI" /><figcaption>Hierarchical Rule Induction. Source: <a href="https://proceedings.mlr.press/v162/glanois22a/glanois22a.pdf">Glanois et al</a></figcaption></figure><p>➡️ The second one is the GNN-QE paper from <a href="https://proceedings.mlr.press/v162/zhu22c/zhu22c.pdf">Zhu et al</a> (<strong>disclaimer</strong>: a paper from the authors of this blog post). GNN-QE solves complex logical queries on knowledge graphs with GNNs and fuzzy sets. It enjoys the advantages of both neural (e.g., strong performance) and symbolic (e.g., interpretability) methods. As there is a lot of interesting stuff in GNN-QE, we will have a separate blog post for it soon. Stay tuned!
🤗</p><p>➡️ Finally, <a href="https://proceedings.mlr.press/v162/kamigaito22a/kamigaito22a.pdf">Kamigaito and Hayashi</a> study the theoretical and empirical effects of <strong>negative sampling</strong> in knowledge graph embeddings. Starting from <a href="https://arxiv.org/pdf/1902.10197.pdf">RotatE</a>, knowledge graph embedding methods use a normalized negative sampling loss, plus a margin binary cross entropy loss. This is different from the negative sampling used in the original <a href="https://proceedings.neurips.cc/paper/2013/file/9aa42b31882ec039965f3c4923ce901b-Paper.pdf">word2vec</a>. In this paper, the authors prove that the normalized negative sampling loss is necessary for distance-based models (<a href="https://proceedings.neurips.cc/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf">TransE</a>, <a href="https://arxiv.org/pdf/1902.10197.pdf">RotatE</a>) to reach the optimal solution. The margin also plays an important role in distance-based models. The <strong>optimal solution</strong> can only be reached if <em>𝛾 ≥ log|V|</em>, which is consistent with the empirical results. Based on this conclusion, now we can determine the optimal margin without hyperparameter tuning! 😄</p><h3>Computational Biology: Molecular Linking, Protein Binding, Property Prediction</h3><p>Generally, comp bio is represented at ICML pretty well. Here, we’ll have a look at new approaches for <strong>molecular linking</strong>, <strong>protein binding</strong>, conformer generation, and molecular property prediction.</p><p><strong>Molecular linking</strong> is a crucial part in designing <a href="https://en.wikipedia.org/wiki/Proteolysis_targeting_chimera">Proteolysis targeting chimera (PROTAC)</a> drugs. 
For us, mere GNN researchers 🤓 without a biological background, it means that given two molecules, we want to generate a valid <em>linker</em> molecule that would attach the two <em>fragment</em> molecules into a single molecule while retaining all properties of the original fragment molecules (check the illustration below for a good example).</p><p>➡️ For generating molecular links, <a href="https://arxiv.org/pdf/2205.07309.pdf">Huang et al</a> created <strong>3DLinker</strong>, an E(3)-equivariant generative model (VAE) that sequentially generates atoms (and connecting bonds) with <strong>absolute</strong> coordinates. Often, equivariant models generate relative coordinates or relative distance matrices, but here, the authors aim at generating absolute <em>(x, y, z)</em> coordinates. To allow a model to generate exact coordinates from equivariant (to coordinates) and invariant (to node features) transformations, the authors apply a clever idea of <a href="https://arxiv.org/pdf/2104.12229.pdf">Vector Neurons</a> which is essentially a ReLU-like nonlinearity for preserving feature equivariance with clever orthogonal projection tricks.</p><p>The E(3)-equivariant encoder enriched with <strong>Vector Neurons</strong> encodes features and coordinates while the decoder sequentially generates the linker step by step (illustrated below as well): 1️⃣ predict an anchor node to which the link will be attached; 2️⃣ predict the node type of a linker node; 3️⃣ predict an edge and its absolute coordinates; 4️⃣ repeat until we hit the stop node in the second fragment. <strong>3DLinker</strong> is (so far) the first equivariant model that generates the linker molecule with <strong>exact 3D coordinates</strong> and predicts the anchor points in fragment molecules — previous models required known anchors before generation — and shows the best experimental results.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*5sGzmdVs6oG1JtJ1" /><figcaption>3DLinker intuition.
Source: <a href="https://arxiv.org/pdf/2205.07309.pdf">Huang et al</a></figcaption></figure><p>➡️<strong> Protein-ligand binding</strong> is the other crucial drug discovery task — predicting where a small molecule could potentially attach to a certain region of a bigger protein. First, <a href="https://arxiv.org/pdf/2202.05146.pdf">Stärk, Ganea, et al </a>create <a href="https://github.com/HannesStark/EquiBind"><strong>EquiBind</strong></a> (ICML Spotlight 💡) that takes as input a protein and a random RDKit conformer of a ligand graph, and outputs the precise 3D location of the binding interaction. EquiBind has already garnered a very warm reception and publicity, as in <a href="https://news.mit.edu/2022/ai-model-finds-potentially-life-saving-drug-molecules-thousand-times-faster-0712">MIT News</a> and <a href="https://www.youtube.com/watch?v=706KjyR-wyQ&amp;list=PLoVkjhDgBOt11Q3wu8lr6fwWHn5Vh3cHJ&amp;index=14">reading groups</a>, so we encourage you to have a closer look at the technical details! <strong>EquiBind</strong> is orders of magnitude faster than commercial software while maintaining high prediction accuracy.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*2kndRG2X4H-MdRWm" /><figcaption>EquiBind. Source: <a href="https://arxiv.org/pdf/2202.05146.pdf">Stärk, Ganea, et al</a></figcaption></figure><p>➡️ If the binding molecule is unknown and we want to generate such a molecule, <a href="https://proceedings.mlr.press/v162/liu22m/liu22m.pdf">Liu et al</a> create <a href="https://github.com/divelab/GraphBP"><strong>GraphBP</strong></a>, an autoregressive molecule generation approach that takes as input a target protein site (denoted as the initial context).
Encoding the context with any 3D GNN (<a href="https://proceedings.neurips.cc/paper/2017/file/303ed4c69846ab36c2904d3ba8573050-Paper.pdf">SchNet</a> here), GraphBP generates atom type and spherical coordinates until there are no more contacting atoms available or the desired number of atoms is reached. Once the atoms are generated, the authors resort to <a href="https://jcheminf.biomedcentral.com/articles/10.1186/1758-2946-3-33">OpenBabel</a> to create bonds.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*y7Or_o3IBX2kkw4m" /><figcaption>Generating a binding molecule with GraphBP. Source: <a href="https://proceedings.mlr.press/v162/liu22m/liu22m.pdf">Liu et al</a></figcaption></figure><p>➡️ In<strong> molecular property prediction, </strong><a href="https://proceedings.mlr.press/v162/yu22a/yu22a.pdf">Yu and Gao</a> propose a simple and surprisingly powerful idea to enrich molecular representations with a bag of motifs. That is, they first mine a vocabulary of motifs in the training dataset and rank them according to <a href="https://en.wikipedia.org/wiki/Tf%E2%80%93idf">TF-IDF</a> scores (hello from NLP 😉). Then, each molecule can be represented as a bag of motifs (multi-hot encoding) and the whole dataset of molecules is converted to one heterogeneous graph with relations “motif-molecule” if any molecule contains this motif, and “motif-motif” if any two motifs share an edge in any molecule. Edge features are those TF-IDF scores mined before.</p><p>The final embedding of a molecule is obtained through a concatenation of any vanilla GNN over the molecule and another heterogeneous GNN over a sampled subgraph from the motif graph. 
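</p><p>The TF-IDF ranking over motifs is the classic text formula with molecules as “documents” and motifs as “words”. A quick sketch (ours; the paper’s exact weighting variant may differ):</p>

```python
import math
from collections import Counter

def motif_tfidf(molecule_motifs):
    """TF-IDF scores of motifs across a dataset of molecules, in the
    spirit of HM-GNN's motif vocabulary ranking (a sketch using the
    textbook TF-IDF formula).

    molecule_motifs: list of motif-name lists, one list per molecule
    """
    n = len(molecule_motifs)
    # document frequency: in how many molecules does each motif occur?
    df = Counter(m for mol in molecule_motifs for m in set(mol))
    scores = []
    for mol in molecule_motifs:
        tf = Counter(mol)
        scores.append({m: (tf[m] / len(mol)) * math.log(n / df[m])
                       for m in tf})
    return scores
```

<p>Motifs that appear in every molecule get a score of zero, while rare but repeated motifs are up-weighted; those scores then become edge features in the heterogeneous motif graph.</p><p>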
Such a <a href="https://github.com/ZhaoningYu1996/HM-GNN"><strong>Heterogeneous Motif GNN (HM-GNN)</strong></a> consistently outperforms <a href="https://arxiv.org/abs/2006.09252">Graph Substructure Networks (GSN)</a>, one of the first GNN architectures that proposed to count triangles in social networks and k-cycles in molecules, and even <a href="https://arxiv.org/pdf/2106.12575.pdf">Cell Isomorphism Networks (CIN)</a>, a top-notch higher-order message passing model. HM-GNNs can serve as a simple powerful baseline for subsequent research in the area of higher-order GNNs 💪.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*nu5xOxemZSCuJZmp" /><figcaption>Building a motif vocabulary in HM-GNN. Source: <a href="https://proceedings.mlr.press/v162/yu22a/yu22a.pdf">Yu and Gao</a></figcaption></figure><p>➡️ Finally, a work by <a href="https://proceedings.mlr.press/v162/stark22a/stark22a.pdf">Stärk et al</a> demonstrates the benefits of pre-training GNNs both on 2D molecular graphs and their 3D conformers with the <a href="https://github.com/HannesStark/3DInfomax"><strong>3D Infomax</strong></a> approach. The idea of <strong>3D Infomax</strong> is in maximizing mutual information between 2D and 3D representations such that at inference time over 2D graphs, when no 3D structure is given, the model could still benefit from implicit knowledge of the 3D structure.</p><p>For that, 2D molecules are encoded with the <a href="https://arxiv.org/abs/2004.05718">Principal Neighborhood Aggregation (PNA)</a> net, 3D conformers are encoded with the <a href="https://openreview.net/forum?id=givsRXsOt9r">Spherical Message Passing (SMP)</a> net, we take the cosine similarity of their representations and pass through the contrastive loss maximizing the similarity of a molecule with its true 3D conformers and treating other samples as negatives. 
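</p><p>The objective is a standard InfoNCE-style contrastive loss over cosine similarities. A minimal numpy sketch (ours: batch rows are matched 2D/3D pairs, the temperature value is arbitrary, and the paper’s exact loss may differ):</p>

```python
import numpy as np

def contrastive_loss(h2d, h3d, tau=0.1):
    """InfoNCE-style loss between matched 2D-graph and 3D-conformer
    embeddings: row i of h2d should be most similar to row i of h3d;
    every other row in the batch acts as a negative."""
    a = h2d / np.linalg.norm(h2d, axis=1, keepdims=True)
    b = h3d / np.linalg.norm(h3d, axis=1, keepdims=True)
    sims = (a @ b.T) / tau                 # scaled cosine similarities
    # cross-entropy with the diagonal (the true conformer) as target
    log_z = np.log(np.exp(sims).sum(axis=1))
    return float(np.mean(log_z - np.diag(sims)))
```

<p>Matched pairs drive the loss toward zero; shuffling conformers against molecules makes it blow up, which is exactly the signal that distills 3D knowledge into the 2D encoder.</p><p>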
Having pre-trained 2D and 3D nets, we can fine-tune the weights of the 2D net on a downstream task — QM9 property prediction in this case — and the results definitely show that pretraining works. By the way, if you are further interested in pre-training, you can check out <a href="https://openreview.net/forum?id=xQUe1pOKPam">GraphMVP</a> published at ICLR 2022 as another 2D/3D pre-training approach.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*3UpGqTxkVbAGVQGm" /><figcaption>In 3D Infomax, we first pre-train 2D and 3D nets, and use a trained 2D net at inference time. Source: <a href="https://proceedings.mlr.press/v162/stark22a/stark22a.pdf">Stärk et al</a></figcaption></figure><h3>Cool Graph Applications</h3><p>Physical simulation along with molecular dynamics received a huge boost with GNNs. A standard setup of physical simulation is a system of particles where node features are several recent velocities and edge features are relative displacements, and the task is to predict where the particles move at the next time step.</p><p>⚛️ This year, <a href="https://proceedings.mlr.press/v162/rubanova22a/rubanova22a.pdf">Rubanova, Sanchez-Gonzalez et al</a> further improve physical simulations by incorporating explicit scalar constraints in the <strong>C-GNS</strong> <strong>(Constraint-based Graph Network Simulator)</strong>. Conceptually, the output of an MPNN encoder is further refined through a solver that minimizes some learned (or specified at inference time) constraint. The solver itself is a differentiable function (5-iteration gradient descent in this case), so we can backprop through the solver as well.
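</p><p>The inner loop is easy to picture: a few differentiable gradient steps on a constraint C(y), applied on top of whatever the network predicted. A toy numpy sketch with a hand-picked constraint (the real C-GNS learns the constraint):</p>

```python
import numpy as np

def refine(y0, grad_c, lr=0.1, steps=5):
    """Refine a prediction y0 with a few gradient steps on a constraint
    C(y) (C-GNS uses 5 iterations). Every step is differentiable, so in
    a real model the refinement sits inside backprop."""
    y = np.asarray(y0, dtype=float).copy()
    for _ in range(steps):
        y = y - lr * grad_c(y)
    return y

# toy constraint: total "momentum" should be zero, C(y) = 0.5 * sum(y)^2
grad_c = lambda y: y.sum() * np.ones_like(y)
refined = refine([1.0, 2.0, 3.0], grad_c)
```

<p>Each step shrinks the constraint violation (here the total sum decays by a factor of 1 − lr·n per iteration) without requiring the encoder itself to output perfectly constraint-satisfying states.</p><p>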
C-GNS is inherently connected to <a href="http://implicit-layers-tutorial.org/">deep implicit layers</a> that are getting more and more visibility, including <a href="https://fabianfuchsml.github.io/equilibriumaggregation/">the GNN applications</a>.</p><p>Physical simulation works are often a source of fancy simulation visualizations — check out the <a href="https://sites.google.com/view/constraint-based-simulator">website with video demos</a>!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*0zUhtBrLrdGLJS7z" /><figcaption>Constraint-based Graph Network Simulator. Source: <a href="https://proceedings.mlr.press/v162/rubanova22a/rubanova22a.pdf">Rubanova, Sanchez-Gonzalez et al</a></figcaption></figure><p>A few other cool applications you might want to have a look at:</p><ul><li><strong>Traffic Prediction</strong>: <a href="https://proceedings.mlr.press/v162/lan22a/lan22a.pdf">Lan, Ma, et al</a> created <a href="https://github.com/SYLan2019/DSTAGNN"><strong>DSTA-GNN</strong></a> (Dynamic Spatial-Temporal Aware Graph Neural Network) for traffic prediction 🚥 evaluated on real-world datasets of busy California roads — predicting traffic with graphs received a boost last year after the massive work by Google and DeepMind on improving Google Maps ETA, <a href="https://towardsdatascience.com/graph-ml-in-2022-where-are-we-now-f7f8242599e0#2ddd">which we covered in the 2021 results</a>.</li><li><strong>Neural Network Pruning</strong>: <a href="https://proceedings.mlr.press/v162/yu22e/yu22e.pdf">Yu et al</a> design <a href="https://github.com/yusx-swapp/GNN-RL-Model-Compression"><strong>GNN-RL</strong></a> to iteratively prune weights of deep neural nets given a desired ratio of FLOPs reduction. For that, the authors treat a neural net’s computational graph as a hierarchical graph of blocks and send it to a hierarchical GNN (with intermediate learnable pooling to coarse-grain the NN architecture).
Encoded representations are sent to the RL agent that decides which block to prune.</li><li><strong>Ranking</strong>: <a href="https://arxiv.org/pdf/2202.00211.pdf">He et al</a> tackle an interesting task — given a matrix of pairwise interactions, e.g., between teams in a football league where <em>Aij &gt; 0 </em>means team <em>i</em> got a better score than team <em>j</em>, find the final ranking of nodes (teams) who scored best. In other words, we want to predict who is the winner of a league after seeing pair-wise results of all games. The authors propose <a href="https://github.com/SherylHYX/GNNRank"><strong>GNNRank</strong></a> that represents pairwise results as a directed graph and applies a directional GNN to get latent node states and compute the <a href="https://en.wikipedia.org/wiki/Algebraic_connectivity">Fiedler vector</a> of the graph Laplacian. Then, they frame the task as a constrained optimization problem with <em>proximal</em> gradient steps as we can’t easily backprop through the computation of the Fiedler vector.</li></ul><p>That’s finally it for ICML 2022! 😅</p><p>Looking forward to seeing NeurIPS 2022 papers as well as submissions to the brand-new <a href="https://logconference.org/"><strong>Learning on Graphs (LoG)</strong></a><strong> </strong>conference!</p><hr><p><a href="https://medium.com/data-science/graph-machine-learning-icml-2022-252f39865c70">Graph Machine Learning @ ICML 2022</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[GraphGPS: Navigating Graph Transformers]]></title>
            <link>https://medium.com/data-science/graphgps-navigating-graph-transformers-c2cc223a051c?source=rss-4d4f8ddd1e68------2</link>
            <guid isPermaLink="false">https://medium.com/p/c2cc223a051c</guid>
            <category><![CDATA[graph-machine-learning]]></category>
            <category><![CDATA[artificial-intelligence]]></category>
            <category><![CDATA[computer-science]]></category>
            <category><![CDATA[geometric-deep-learning]]></category>
            <category><![CDATA[machine-learning]]></category>
            <dc:creator><![CDATA[Michael Galkin]]></dc:creator>
            <pubDate>Tue, 14 Jun 2022 02:04:43 GMT</pubDate>
            <atom:updated>2022-06-14T18:44:06.909Z</atom:updated>
            <content:encoded><![CDATA[<h4>Recent Advances in Graph ML</h4><h4>Recipes for cooking the best graph transformers</h4><p>In 2021, graph transformers (GT) won recent molecular property prediction challenges thanks to alleviating many issues pertaining to vanilla message passing GNNs. Here, we try to organize numerous freshly developed GT models into a single GraphGPS framework to enable general, powerful, and scalable graph transformers with linear complexity for all types of Graph ML tasks. Turns out, just a well-tuned GT is enough to show SOTA results on many practical tasks!</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gOUffNZRkXJB5utkz6hlWg.png" /><figcaption>Message passing GNNs, fully-connected Graph Transformers, and positional encodings. Image by Authors</figcaption></figure><p><em>This post was written together with</em><a href="https://rampasek.github.io/"><em> Ladislav Rampášek</em></a><em>, </em><a href="https://twitter.com/dom_beaini?lang=en"><em>Dominique Beaini</em></a><em>, and </em><a href="https://vijaydwivedi.com.np/"><em>Vijay Prakash Dwivedi</em></a><em> and is based on the paper </em><a href="https://arxiv.org/abs/2205.12454"><em>Recipe for a General, Powerful, Scalable Graph Transformer (2022)</em></a> <em>by Rampášek et al. 
You can also follow </em><a href="https://twitter.com/michael_galkin"><em>me</em></a><em>, </em><a href="https://twitter.com/rampasek"><em>Ladislav</em></a><em>, </em><a href="https://twitter.com/vijaypradwi"><em>Vijay</em></a><em>, and </em><a href="https://twitter.com/dom_beaini"><em>Dominique</em></a><em> on Twitter.</em></p><p>Outline:</p><ol><li><a href="#675f">Message Passing GNNs vs Graph Transformers</a></li><li><a href="#3975">Pros, Cons, and Variety of Graph Transformers</a></li><li><a href="#40a9">The GraphGPS framework</a></li><li><a href="#e373">General: The Blueprint</a></li><li><a href="#cf65">Powerful: Structural and Positional Features</a></li><li><a href="#6be7">Scalable: Linear Transformers</a></li><li><a href="#beae">Recipe time — how to get the best out of your GT</a></li></ol><h3><strong>Message Passing GNNs vs Graph Transformers</strong></h3><p>Message passing GNNs (conventionally analyzed from the <a href="https://towardsdatascience.com/graph-neural-networks-beyond-weisfeiler-lehman-and-vanilla-message-passing-bc8605fa59a">Weisfeiler-Leman perspective</a>) notoriously suffer from <a href="https://openreview.net/forum?id=S1ldO2EFPr"><strong>over-smoothing</strong></a> (increasing the number of GNN layers, the features tend to converge to the same value), <a href="https://openreview.net/forum?id=i80OPhOCVH2"><strong>over-squashing</strong></a> (losing information when trying to aggregate messages from many neighbors into a single vector), and perhaps most importantly, poor capturing of long-range dependencies which is noticeable already on small but sparse molecular graphs.</p><p>Today, we know many ways to break through the glass ceiling of message passing — including <a href="https://towardsdatascience.com/using-subgraphs-for-more-expressive-gnns-8d06418d5ab">higher-order GNNs</a>, better <a href="https://openreview.net/forum?id=7UmjRGzp-A">understanding of graph topology</a>, <a 
href="https://towardsdatascience.com/graph-neural-networks-as-neural-diffusion-pdes-8571b8c0c774">diffusion models</a>, <a href="https://arxiv.org/abs/2110.09443">graph rewiring</a>, and <a href="https://arxiv.org/abs/2012.09699">graph transformers</a>!</p><p>Whereas in the message passing scheme a node’s update is a function over its <strong>neighbors</strong>, in GTs, a node’s update is a function of <strong>all</strong> nodes in a graph (thanks to the self-attention mechanism in the Transformer layer). That is, an input to a GT instance is the whole graph.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*1OG1LhGV9klzHkOCCUYF3Q.png" /><figcaption>Updating a target node representation (red), local message passing aggregates only immediate neighbors while global attention is a function of all nodes in a graph. In GraphGPS, we combine both! Image by Authors</figcaption></figure><h3><strong>Pros and Cons of Graph Transformers</strong></h3><p>Feeding the whole graph into the Transformer layer brings several immediate benefits and drawbacks.</p><p>✅ Pros:</p><ul><li>Akin to graph rewiring, we now decouple the node update procedure from the graph structure.</li><li>No problem handling long-range connections as all nodes are now connected to each other (we often separate <em>true</em> edges coming from the original graph and <em>virtual</em> edges added when computing the attention matrix — check the illustration above where solid lines denote real edges and dashed lines — virtual ones).</li><li>Bringing the “Navigating a Maze” analogy, instead of walking and looking around, we can use a map, destroy maze walls, and use magic wings 🦋. We have to learn the map beforehand though, and later we’ll see how to make the navigation more precise and efficient.</li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fD-jDP5ewyvC7AlMq_2pPw.png" /><figcaption>Graph Transformers give you wings 🦋. 
Source: <a href="https://twitter.com/dom_beaini/status/1499019741234704385">Dominique Beaini @ Twitter</a></figcaption></figure><p>🛑 The cons are similar to those stemming from using transformers in NLP:</p><ul><li>Whereas language input is sequential, graphs are permutation-invariant to node ordering. We need better <strong>identifiability</strong> of nodes in a graph — this is often achieved through some form of positional features. For instance, in NLP, the <a href="https://arxiv.org/abs/1706.03762">original Transformer</a> uses sinusoidal positional features for the absolute position of a token in a sequence, whereas the more recent <a href="https://openreview.net/forum?id=R8sQPpGCv0">ALiBi</a> introduces a relative positional encoding scheme.</li><li>Loss of the inductive bias that enables GNNs to work so well on graphs with <strong>pronounced locality</strong>, which is the case in many real-world graphs, particularly those where edges represent relatedness/closeness. By rewiring the graph to be fully connected, we have to put the structure back in some way, otherwise we are likely to “throw the baby out with the bathwater”.</li><li>Last but not least, a limitation is the <strong>quadratic</strong> computational <strong>complexity</strong> O(N²) in the number of nodes, whereas message passing GNNs are linear in the number of edges, O(E). Graphs are often sparse, i.e., E is on the order of N, so on large graphs full attention becomes dramatically more expensive than message passing.
Can we do something about it?</li></ul><p>2021 brought a great variety of positional and structural features for GTs to make nodes more distinguishable.</p><p>The <a href="https://arxiv.org/abs/2012.09699">first GT architecture by Dwivedi &amp; Bresson</a> used Laplacian <strong>eigenvectors</strong> as positional encodings, <a href="https://arxiv.org/abs/2106.03893">SAN by Kreuzer et al</a> also added Laplacian <strong>eigenvalues</strong> to re-weight attention accordingly, <a href="https://arxiv.org/pdf/2106.05234.pdf">Graphormer by Ying et al</a> added <strong>shortest path distances</strong> as an attention bias, <a href="https://arxiv.org/abs/2201.08821">GraphTrans by Wu, Jain et al</a> runs a GT after passing a graph through a GNN, and the <a href="https://arxiv.org/abs/2202.03036">Structure-Aware Transformer by Chen et al</a> aggregates a k-hop subgraph around each node as its positional feature.</p><p>In the land of graph positional features, in addition to the Laplacian-derived features, a recent batch of works includes <a href="https://arxiv.org/abs/2110.07875">Random Walk Structural Encodings (RWSE) by Dwivedi et al</a>, which take the diagonal of the m-th power of the random walk matrix, <a href="https://arxiv.org/abs/2202.13013">SignNet by Lim, Robinson et al</a>, which ensures sign invariance of Laplacian eigenvectors, and <a href="https://openreview.net/pdf?id=e95i1IHcWj">Equivariant and Stable PEs by Wang et al</a>, which ensure permutation and rotation equivariance of node and position features, respectively.</p><p>Well, there are so many of them 🤯 How do I know what suits my task best?</p><blockquote>Is there a principled way to organize and work with all those graph transformer layers and positional features?</blockquote><p>Yes!
That is what we present in our recent paper <a href="https://arxiv.org/abs/2205.12454">Recipe for a General, Powerful, Scalable Graph Transformer</a>.</p><h3><strong>The GraphGPS framework</strong></h3><p>In GraphGPS, GPS stands for:</p><p><strong>🧩 General</strong> — we propose a blueprint for building graph transformers by combining modules for feature (pre)processing, local message passing, and global attention into a single pipeline</p><p><strong>🏆 Powerful</strong> — the GPS graph transformer is provably more powerful than the 1-WL test when paired with proper positional and structural features</p><p><strong>📈 Scalable</strong> — we introduce linear global attention modules and break through the long-standing limitation of running graph transformers only over molecular graphs (fewer than 100 nodes on average). Now we can do it on graphs of many thousands of nodes each!</p><p>Or maybe it means Graph with Position and Structure? 😉</p><h3><strong>General: The Blueprint</strong></h3><p>Why tinker with message passing GNNs or graph transformers to enable a certain feature when we could use the best of both worlds? Let the model decide what’s important for a given set of tasks and graphs. Generally, the blueprint can be described in one picture:</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*QKN2j0vBNS8fF-W2EuW5NQ.png" /><figcaption>The GraphGPS blueprint proposes a modular architecture for building graph transformers with various positional and structural features, as well as local and global attention. Source: <a href="https://arxiv.org/abs/2205.12454">arxiv</a>. Click to enlarge</figcaption></figure><p>It looks a bit massive, so let’s break it down part by part and check what is happening there.</p><p>Overall, the blueprint consists of <strong>3 major components</strong>:</p><ol><li>Node identification through positional and structural encodings.
After analyzing many recently published methods for adding positionality in graphs, we found they can be broadly grouped into 3 buckets: <strong>local</strong>, <strong>global</strong>, and <strong>relative</strong>. Such features are provably powerful and help to overcome the notorious 1-WL limitation. More on that below!</li><li>Aggregation of node identities with original graph features — those are your input node, edge, and graph features.</li><li>Processing layers (GPS layers) — how we actually process the graphs with the constructed features; here we combine both local message passing (any MPNN) and global attention models (any graph transformer)</li><li>(Bonus 🎁) You can combine any positional and structural feature with any processing layer in our new <a href="https://github.com/rampasek/GraphGPS">GraphGPS</a> library based on <a href="https://www.pyg.org/">PyTorch-Geometric</a>!</li></ol><h3><strong>Powerful: Structural and Positional Features</strong></h3><p>Structural and positional features aim at encoding a unique characteristic of each node or edge. In the most basic case (illustrated below), when all nodes have the same initial features or no features at all, applying positional and structural features helps to distinguish nodes in a graph, assign them diverse features, and provide at least some sense of graph structure.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Uc094aOMW4zd0Ycng0H2Xw.png" /><figcaption>Structural and positional features help distinguish nodes in a graph.
Image by Authors.</figcaption></figure><p>We usually separate <em>positional</em> from <em>structural</em> features (although there are works on the theoretical equivalence of the two, as in <a href="https://arxiv.org/abs/1910.00452">Srinivasan &amp; Ribeiro</a>).</p><blockquote>Intuitively, positional features help nodes to answer the question <strong>“Where am I?”</strong> while structural features answer “What does my neighborhood look like?”</blockquote><p><strong>Positional encodings (PEs)</strong> provide some notion of the position in space of a given node within a graph. They help a node to answer the question <strong>“Where am I?”</strong>. Ideally, we’d like to have some sort of Cartesian coordinates for each node, but since graphs are topological structures and there exists an infinite number of ways to position a graph on a 2D plane, we have to think of something different. Talking about PEs, we categorize existing (and theoretically possible) approaches into 3 branches — local PEs, global PEs, and relative PEs.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lfGeiGoxi1Yyf2BzdYrKWQ.png" /><figcaption>Categorization of Positional encodings (PEs). Click to enlarge. Image by Authors.</figcaption></figure><p>👉 Local PEs (as node features) — within a <strong>cluster</strong>, the closer two nodes are to each other, the closer their local PEs will be, such as the position of a word in a sentence (but not in the whole text). Examples: (1) distance between a node and the centroid of the cluster containing this node; (2) sum of the non-diagonal elements of the m-step random walk matrix (m-th power).</p><p>👉 Global PEs (as node features) — within a <strong>graph</strong>, the closer two nodes are, the closer their global PEs are, such as the position of a word in a text.
Examples: (1) eigenvectors of the adjacency or Laplacian matrix, used in the original <a href="https://arxiv.org/abs/2012.09699">Graph Transformer</a> and <a href="https://arxiv.org/abs/2106.03893">SAN</a>; (2) distance from the centroid of the whole graph; (3) a unique identifier for each connected component</p><p>👉 Relative PEs (as edge features) — edge representations that correlate with the distance given by any local or global PE, such as the distance between two words. Examples: (1) pair-wise distances obtained from heat kernels, random walks, <a href="https://en.wikipedia.org/wiki/Green%27s_function">Green’s function</a>, or the graph geodesic; (2) gradient of eigenvectors of the adjacency or Laplacian, or gradient of any local/global PEs.</p><p>Let’s check an example of various PEs on this famous molecule ☕️ (<a href="https://youtu.be/w6Pw4MOzMuo?t=387">the favorite molecule of Michael Bronstein according to his ICLR’21 keynote</a>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PvFUVl8Avp6PuPEV8jhGGA.png" /><figcaption>Illustration of local, global, and relative <strong>positional</strong> encodings on a caffeine molecule ☕️. Image by Authors</figcaption></figure><p><strong>Structural encodings</strong> <strong>(SEs)</strong> provide a representation of the structure of graphs and subgraphs. They help a node to answer the question <strong>“What does my neighborhood look like?”</strong>. Similarly, we categorize possible SEs into local, global, and relative, although under a different sauce.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*L-xk6SYPVpPMiz5Fjxt0Sw.png" /><figcaption>Categorization of Structural encodings (SEs). Click to enlarge. Image by Authors.</figcaption></figure><p>👉 Local SEs (as node features) allow a node to understand what substructures it is a part of. That is, given two nodes and an SE of radius m, the more similar the m-hop subgraphs around those nodes are, the closer their local SEs will be.
Examples: (1) node degree (used in <a href="https://arxiv.org/pdf/2106.05234.pdf">Graphormer</a>); (2) diagonals of the m-step random walk matrix <a href="https://arxiv.org/abs/2110.07875">(RWSE)</a>; (3) <a href="https://openreview.net/forum?id=7UmjRGzp-A">Ricci curvature</a>; (4) enumerating or counting substructures like triangles and rings (<a href="https://arxiv.org/abs/2006.09252">Graph Substructure Networks</a>, <a href="https://openreview.net/forum?id=Mspk_WYKoEH">GNN As Kernel</a>).</p><p>👉 Global SEs (as a graph feature) provide the network with information about the global structure of a graph. If we compare two graphs, their global SEs will be close if their structure is similar. Examples: (1) eigenvalues of the adjacency or Laplacian (used in <a href="https://arxiv.org/abs/2106.03893">SAN</a>); (2) well-known graph properties like the diameter, number of connected components, girth, or average degree.</p><p>👉 Relative SEs (as edge features) allow two nodes to understand how much their structures differ. Those can be gradients of any local SE or a boolean indicator of whether two nodes are in the same substructure (e.g., as in <a href="https://arxiv.org/abs/2006.09252">GSN</a>).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*A5ZCFTBEmKsdf4aKKIKhNg.png" /><figcaption>Illustration of local, global, and relative <strong>structural</strong> encodings on a caffeine molecule ☕️. Image by Authors</figcaption></figure><p>Depending on the graph structure, positional and structural features can bring a lot of expressive power, surpassing the 1-WL limit.
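</p><p><em>As a concrete illustration of two of the encodings above (Laplacian-eigenvector global PEs and random-walk RWSE), here is a minimal NumPy sketch on a toy graph. This is an illustrative sketch only, not the actual GraphGPS pre-processing, which lives in the library:</em></p>

```python
import numpy as np

# Toy graph: a 4-cycle (0-1-2-3-0), given as an adjacency matrix.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

deg = A.sum(axis=1)
L = np.diag(deg) - A                      # combinatorial Laplacian

# Global PE: smallest non-trivial Laplacian eigenvectors per node.
eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues sorted ascending
lap_pe = eigvecs[:, 1:3]                  # skip the constant 0-eigenvector

# Local SE (RWSE): diagonals of powers of the random-walk matrix,
# i.e., the probability of returning to the start node after m steps.
RW = A / deg[:, None]
rwse = np.stack([np.diag(np.linalg.matrix_power(RW, m))
                 for m in (2, 3, 4)], axis=1)

# Concatenate into a per-node vector to append to the input features.
pe_se = np.concatenate([lap_pe, rwse], axis=1)   # shape (4, 5)
```

<p><em>On this bipartite 4-cycle, for instance, the 2-step return probability is 0.5 for every node while odd-step return probabilities are zero: exactly the kind of structural signal RWSE feeds to the model.</em></p><p>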
For instance, in the highly regular <a href="https://github.com/PurdueMINDS/RelationalPooling/tree/master/">Circular Skip Link (CSL)</a> graphs, eigenvectors of the Laplacian (Global PEs in our framework) assign unique and different node features to the CSL (11, 2) and CSL (11, 3) graphs, making them clearly distinguishable (where 1-WL fails).</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Ztohsfwxp2rB9naHmoAj_A.png" /><figcaption>Positional (PEs) and Structural (SEs) features are more powerful than 1-WL but their effectiveness might depend on the nature of the graphs. Image by Authors.</figcaption></figure><p><strong>Aggregation of PEs and SEs</strong></p><p>Given that PEs and SEs might be beneficial in different scenarios, why limit the model to just one positional / structural feature?</p><p>In GraphGPS, we allow combining an arbitrary number of PEs and SEs, e.g., 16 Laplacian eigenvectors + eigenvalues + 8d RWSE. In pre-processing, we might have many vectors for each node, so we use set aggregation functions that map them to a single vector added to the node features.</p><p>This mechanism enables using <a href="https://towardsdatascience.com/using-subgraphs-for-more-expressive-gnns-8d06418d5ab">subgraph GNNs</a>, <a href="https://arxiv.org/abs/2202.13013">SignNets</a>, <a href="https://openreview.net/pdf?id=e95i1IHcWj">Equivariant and Stable PEs</a>, <a href="https://arxiv.org/abs/2202.03036">k-Subtree SATs</a>, and other models that build node features as complex aggregation functions.</p><p>Now, equipped with expressive positional and structural features, we can tackle the final challenge — scalability.</p><h3><strong>Scalable: Linear Transformers 🚀</strong></h3><p>Pretty much all existing graph transformers employ a standard self-attention mechanism materializing the whole N² attention matrix for a graph of N nodes (thus assuming the graph is fully connected).
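</p><p><em>To make the contrast concrete, here is a minimal, self-contained sketch (plain NumPy, not the GraphGPS code) of full softmax attention next to a kernelized linear attention. The elu(x)+1 feature map below follows the linear transformer of Katharopoulos et al.; Performer would instead use a random-feature approximation of the softmax kernel:</em></p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 6, 4                                      # a tiny "graph": N nodes, d-dim features
Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))

# Full self-attention: materializes the N x N score matrix (O(N^2) memory).
scores = np.exp(Q @ K.T / np.sqrt(d))
full = (scores / scores.sum(axis=1, keepdims=True)) @ V

# Kernelized linear attention: phi(Q) (phi(K)^T V) never forms an N x N matrix.
def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1, keeps features positive

Qp, Kp = phi(Q), phi(K)
KV = Kp.T @ V                                    # (d, d) summary: O(N d^2) time
norm = Qp @ Kp.sum(axis=0)                       # per-node normalizer, shape (N,)
linear = (Qp @ KV) / norm[:, None]               # same output shape as full attention
```

<p><em>Both variants produce an (N, d) update for every node, but the linear one only ever stores d×d summaries, which is what lets attention scale past the 50–100-node regime.</em></p><p>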
On one hand, the full attention matrix makes it easy to imbue GTs with edge features (like in <a href="https://arxiv.org/pdf/2106.05234.pdf">Graphormer</a>, which used edge features as an attention bias) and to separate true edges from virtual edges (as in <a href="https://arxiv.org/abs/2106.03893">SAN</a>). On the other hand, materializing the attention matrix has quadratic complexity O(N²), making GTs hardly scalable beyond molecular graphs of 50–100 nodes.</p><p>Luckily, the vast research around Transformers in NLP has recently produced a number of Linear Transformer architectures such as <a href="https://arxiv.org/abs/2006.04768">Linformer</a>, <a href="https://openreview.net/forum?id=Ua6zuk0WRH">Performer</a>, and <a href="https://arxiv.org/abs/2007.14062">BigBird</a> that scale attention linearly in the input sequence length, O(N). The whole <a href="https://arxiv.org/abs/2011.04006">Long Range Arena</a> benchmark has been created to evaluate linear transformers on extremely long sequences. The essence of linear transformers is to bypass the computation of the full attention matrix and instead approximate its result with various mathematical “tricks” such as the low-rank decomposition in Linformer or the softmax kernel approximation in Performer. Generally, this is a very active research area 🔥 and we expect more and more effective approaches coming soon.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tcPpbRskq_tr7DlknezIYA.png" /><figcaption>Vanilla N² transformers (left) materialize the full attention matrix whereas linear transformers (right) bypass this stage through various approximations without significant performance loss.
Image by Authors.</figcaption></figure><p>Interestingly, there is not much research on linear attention models for graph transformers — to date, we are only aware of the recent <a href="https://arxiv.org/abs/2107.07999">ICML 2022 work of Choromanski et al</a>, which, unfortunately, did not run experiments on reasonably large graphs.</p><p>In GraphGPS, we propose to replace the global attention module (a vanilla Transformer) with pretty much <strong>any</strong> available linear attention module. Applying linear attention raises two important questions that took us a great deal of experimentation to answer:</p><ol><li>Since there is no explicit attention matrix computation, how do we incorporate edge features? Do we need edge features in GTs at all?<br><strong>Answer</strong>: Empirically, on the datasets we benchmarked, we found that linear <em>global</em> attention in GraphGPS works well even without edge features (given that edge features are processed by some <em>local</em> message passing GNN). Further, we theoretically demonstrate that linear global attention does not lose edge information when the input node features already encode the edge features.</li><li>What is the tradeoff between the speed and performance of linear attention models?<br><strong>Answer</strong>: The tradeoff is quite beneficial — we did not find major performance drops when switching from quadratic to linear attention models, but we did find a huge memory improvement. That is, at least on the current benchmarks, you can simply swap full attention for linear attention and train models on dramatically larger graphs without huge performance losses.
Still, if we want to be more confident about linear global attention performance, there is a need for larger benchmarks with larger graphs and long-range dependencies.</li></ol><h3><strong>👨‍🍳 Recipe time — how to get the best out of your GT</strong></h3><p>Long story short — a tuned GraphGPS, combining local and global attention, performs very competitively with more sophisticated and computationally more expensive models and sets a <strong>new SOTA</strong> on many benchmarks!</p><p>For example, on the molecular regression benchmark ZINC, GraphGPS reaches a new all-time low of 0.07 MAE. The progress in the field is really fast — last year’s SAN set the previous SOTA of 0.139, so we improved the error rate by a solid 50%! 📉</p><p>Furthermore, thanks to an efficient implementation, we dramatically improved the speed of graph transformers — about 4.5× faster 🚀 — 196 s/epoch on ogbg-molpcba compared to 883 s/epoch for SAN, the previous SOTA graph transformer model.</p><p>We experiment with <a href="https://openreview.net/forum?id=Ua6zuk0WRH">Performer</a> and <a href="https://arxiv.org/abs/2007.14062">BigBird</a> as linear global attention models and scale GraphGPS to graphs of up to 10,000 nodes fitting on a standard 32 GB GPU, which was previously unattainable by any graph transformer.</p><p>Finally, we open-source the <a href="https://github.com/rampasek/GraphGPS">GraphGPS library</a> (akin to the <a href="https://arxiv.org/abs/2011.08843">GraphGym</a> environment), where you can easily plug, combine, and configure:</p><ul><li>Any local message passing model, with or without edge features</li><li>Any global attention model, e.g., a full Transformer or any linear architecture</li><li>Any structural (SE) and positional (PE) encoding method</li><li>Any combination of SEs and PEs, e.g., Laplacian PEs with Random-Walk RWSE!</li><li>Any method for aggregating SEs and PEs, e.g., SignNet or DeepSets</li><li>Run it on any graph dataset supported by PyG or with a custom wrapper</li><li>Run
large-scale experiments with Wandb tracking</li><li>And, of course, replicate the results of our experiments</li></ul><p>📜 arxiv preprint: <a href="https://arxiv.org/abs/2205.12454">https://arxiv.org/abs/2205.12454</a></p><p>🔧 Github repo: <a href="https://github.com/rampasek/GraphGPS">https://github.com/rampasek/GraphGPS</a></p><hr><p><a href="https://medium.com/data-science/graphgps-navigating-graph-transformers-c2cc223a051c">GraphGPS: Navigating Graph Transformers</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
    </channel>
</rss>