Inspiration
Antimicrobial resistance is a growing global crisis threatening human and animal health. One key computational step in studying resistance is Multiple Sequence Alignment (MSA), which is crucial for detecting the subtle mutations that contribute to antibiotic resistance. However, MSA is NP-hard, meaning that aligning thousands of sequences optimally is infeasible with exact methods.
Recent research has explored reinforcement learning (RL) for MSA, demonstrating promising results. Our work is inspired by:
- RLALIGN (Ramakrishnan et al., 2018), which introduced an A3C-based RL method for MSA and showed that reinforcement learning could outperform classical progressive alignment algorithms.
- Deep Reinforcement Learning with Self-Attention for MSA (Liu et al., 2023), which leveraged self-attention mechanisms and positional encoding to improve accuracy.
We set out to build an AI model that uses RL to learn effective gap-insertion strategies, guided by human intuition.
What it does
Gappy leverages Actor-Critic Reinforcement Learning to optimize MSA. Instead of relying on fixed heuristic rules, Gappy learns the best alignment strategies through trial and error, inserting gaps to maximize sequence similarity while following biological constraints.
Key Features
- Custom RL Environment – Simulates MSA puzzles with a reward function inspired by biological heuristics.
- Action Masking – Prevents invalid moves (e.g., inserting gaps where they don’t improve the alignment).
- Convolutional Neural Networks (CNNs) – Extracts structural patterns from sequences to guide alignment decisions.
- Actor-Critic Architecture – Balances exploration (testing new gap insertions) with exploitation (optimizing scores); a model sketch follows this list.
- Human-readable Move Tracking – Records each gap-insertion move and visualizes it, making alignments transparent and explainable.
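To make the CNN and Actor-Critic features above concrete, here is a minimal PyTorch sketch of a network of this kind. The layer sizes, channel count, and action-space size are illustrative assumptions, not Gappy's exact architecture.

```python
# Illustrative CNN-based actor-critic network (shapes and names are
# assumptions, not Gappy's exact architecture). The alignment board is
# expected as a one-hot tensor over an alphabet that includes the gap symbol.
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, n_channels=5, n_actions=64):
        super().__init__()
        # 2D convolutions scan the (sequence x position) board for
        # structural patterns such as conserved columns.
        self.encoder = nn.Sequential(
            nn.Conv2d(n_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((4, 4)),
            nn.Flatten(),
        )
        self.policy_head = nn.Linear(64 * 4 * 4, n_actions)  # actor: gap-insertion logits
        self.value_head = nn.Linear(64 * 4 * 4, 1)            # critic: state value

    def forward(self, board, action_mask=None):
        features = self.encoder(board)
        logits = self.policy_head(features)
        if action_mask is not None:
            # Invalid gap insertions get a large negative logit (rather than -inf)
            # so they are effectively never sampled but downstream entropy stays finite.
            logits = logits.masked_fill(~action_mask, -1e9)
        value = self.value_head(features).squeeze(-1)
        return logits, value
```

The masked logits can be fed directly into a categorical distribution to sample the next gap insertion.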
How we built it
We designed and implemented a Reinforcement Learning (RL) framework to solve the Multiple Sequence Alignment (MSA) problem by focusing on adaptive gap insertion. At the core of our system is an Actor-Critic neural network, which learns to place gaps in sequences to maximize an alignment quality score we call the gearbox score. The environment simulates the alignment process as a game board, where each action inserts a gap into a specific sequence at a specific position.

To guide learning, we developed a custom reward function, based on the gearbox scoring function, that rewards column consensus and penalizes unnecessary gaps. We also implemented action space masking, which restricts the agent to meaningful moves and improves learning efficiency.

Our agent was trained using Advantage Actor-Critic (A2C), enabling it to balance exploration and exploitation effectively. After training, the model can align new sequences without retraining, providing adaptive, data-driven alignments. We visualized the agent's decision process by plotting each step of the alignment, making the learning process interpretable and transparent.
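The sketch below shows how an environment of this kind could be structured. Because the gearbox scoring function is specific to the challenge, the `_score` method here is only a stand-in that follows the description above (reward column consensus, penalize gaps); the class and method names are illustrative, not Gappy's actual code.

```python
# Illustrative alignment environment (a sketch, not Gappy's implementation).
# Each action inserts a gap into one sequence at one position; the reward is
# the improvement in a consensus-minus-gap-penalty score standing in for the
# gearbox score.
from collections import Counter

import numpy as np

GAP = "-"

class AlignmentEnv:
    def __init__(self, sequences, max_len):
        self.start = [list(s) for s in sequences]
        self.max_len = max_len
        self.reset()

    def reset(self):
        self.board = [row[:] for row in self.start]
        self.prev_score = self._score()
        return self._observe()

    def _score(self):
        # Stand-in score: column consensus minus a small per-gap penalty.
        width = max(len(r) for r in self.board)
        padded = [r + [GAP] * (width - len(r)) for r in self.board]
        score = 0.0
        for col in zip(*padded):
            counts = Counter(c for c in col if c != GAP)
            if counts:
                score += counts.most_common(1)[0][1]    # consensus reward
            score -= 0.1 * sum(c == GAP for c in col)   # gap penalty
        return score

    def valid_actions(self):
        # Action mask: gaps may only go into sequences that still have room.
        mask = np.zeros((len(self.board), self.max_len), dtype=bool)
        for i, row in enumerate(self.board):
            if len(row) < self.max_len:
                mask[i, : len(row) + 1] = True
        return mask.flatten()

    def step(self, action):
        # Decode the flat action into (sequence index, insertion position).
        seq_idx, pos = divmod(action, self.max_len)
        self.board[seq_idx].insert(pos, GAP)
        new_score = self._score()
        reward = new_score - self.prev_score   # shaped reward: score improvement
        self.prev_score = new_score
        done = all(len(r) == self.max_len for r in self.board)
        return self._observe(), reward, done

    def _observe(self):
        # One-hot encoding is left to the agent; return a copy of the raw board.
        return [row[:] for row in self.board]
```

Shaping the reward as the change in score after each move keeps the learning signal dense, rather than only rewarding the agent at the end of an episode.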
Challenges we ran into
One of the key challenges we encountered was integrating reinforcement learning with the existing gearbox scoring system in a way that led to meaningful learning. While the gearbox score provided a clear target for alignment quality, translating it into an effective reward signal for the RL agent required careful tuning.

We also faced difficulties managing the size of the action space, which scales with both the number of sequences and their length. Without restrictions, the agent's learning was slow and inefficient, so we implemented action masking to ensure it focused only on valid and productive gap insertions.

Another challenge was balancing exploration and exploitation: early in training, the agent either explored too randomly or converged prematurely on suboptimal strategies. To address this, we adjusted entropy regularization and carefully tuned hyperparameters to encourage more effective learning behaviors.

Generalization was another concern; models trained on specific puzzles often failed to perform well on new alignments with different sequence lengths or structures. Finally, training the agent was computationally intensive: Actor-Critic methods require significant resources and time, so we optimized the model architecture and training process to improve efficiency.
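For reference, this is the standard A2C objective with the entropy bonus we tuned to balance exploration and exploitation; the coefficient values shown are placeholders, not the values we settled on.

```python
# Standard A2C loss with an entropy bonus (coefficients are placeholders).
import torch
from torch.distributions import Categorical

def a2c_loss(logits, values, actions, returns, entropy_coef=0.01, value_coef=0.5):
    dist = Categorical(logits=logits)              # masked logits from the policy head
    log_probs = dist.log_prob(actions)
    advantages = returns - values.detach()         # advantage = return minus critic baseline
    policy_loss = -(log_probs * advantages).mean()
    value_loss = (returns - values).pow(2).mean()
    entropy = dist.entropy().mean()                # larger entropy_coef -> more exploration
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```

Raising the entropy coefficient early in training discouraged premature convergence, while lowering it later let the agent commit to the gap-insertion strategies it had found.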
Accomplishments that we're proud of
We’re proud to have successfully developed a reinforcement learning framework that learns to perform multiple sequence alignment through adaptive gap insertion. Our Actor-Critic agent navigates a complex action space and makes intelligent decisions that consistently improve alignment quality. We integrated the provided gearbox scoring system into an effective reward function, enabling the agent to learn strategies that generalize to new puzzles without retraining. To improve efficiency, we implemented action masking, allowing the agent to focus only on productive gap insertions. One of our biggest accomplishments was building a user-friendly front end, Mind the Gap, that visualizes the alignment process step-by-step. This interface not only makes the agent’s decision-making transparent and easy to follow, but also helps users understand how reinforcement learning can be applied to sequence alignment. Overall, we’ve built a system that works effectively, explains itself clearly, and has the potential to scale to more complex alignment tasks.
What we learned
- RL provides a novel perspective on MSA by focusing on dynamic gap insertion
- Our system can learn from experience, generalize, and improve alignment quality
What's next for Gappy
- Hyperparameter Optimization & Reward Tuning
- Optimize agent for speed and efficiency
- Multi-agent RL: Competing/cooperating agents for diverse alignment strategies
Built With
- cnn
- matplotlib
- pytorch
- rl