Inspiration
LLMs are awesome, but with hundreds of billions of parameters they're really hard to understand. Right now the best ways to control them are prompt engineering, fine-tuning, and in-context learning. But what if you could steer the model's internals directly, with none of that needed? Inspired by Golden Gate Claude and other mechanistic interpretability work, we set out to make a tool that lets anyone steer LLMs themselves, just like Anthropic has been doing.
Tools like this would allow models to be controlled with far less compute than fine-tuning and without any prompt engineering.
What it does
We let the user type in a concept and easily delete it from the LLM. Our project provides a faster and easier way to control the output of an LLM than fine-tuning, RAG, or prompt engineering. We also use a simpler variation of this approach focused on refusal: by testing how susceptible various open-source LLMs are to a "one-directional refusal cancellation" attack, we get a good measure of how robustly they handle refusals.
How we built it
We used TransformerLens to investigate the activations of open source transformer models and then to modify the activations to create our desired behavior.
To get a distribution of activations, we use the Gemini API to generate a set of prompts that all relate to one topic, run them through the model, and average the resulting activations to get a direction representing that topic. We can then remove any component of the activations along that direction.
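A minimal sketch of this idea in TransformerLens is below. The model name, layer choice, and prompts are illustrative assumptions, not our exact configuration.

```python
# Sketch: compute a "topic direction" from averaged activations, then ablate it.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2b")  # assumed small (<8B) model
LAYER = 12  # hypothetical layer to intervene at
hook_name = f"blocks.{LAYER}.hook_resid_pre"

# 1) Run a batch of topic-related prompts (e.g. generated with Gemini) and
#    average the residual-stream activations to get the topic direction.
topic_prompts = ["How do I bake a chocolate cake?", "Give me a sponge cake recipe."]
acts = []
for prompt in topic_prompts:
    _, cache = model.run_with_cache(
        prompt, names_filter=lambda name: name == hook_name
    )
    acts.append(cache[hook_name][0, -1])  # last-token activation at that layer
direction = torch.stack(acts).mean(dim=0)
direction = direction / direction.norm()

# 2) Hook that projects the topic direction out of the residual stream.
def ablate_direction(resid, hook):
    proj = (resid @ direction).unsqueeze(-1) * direction
    return resid - proj

# 3) Generate with the direction removed.
with model.hooks(fwd_hooks=[(hook_name, ablate_direction)]):
    print(model.generate("Tell me how to bake a cake.", max_new_tokens=50))
```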
The Gemini API was very helpful for this: it let us turn off its safety filtering, and it was quick and cheap.
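A rough sketch of the prompt-generation step is below; the model name and the relaxed safety settings shown are assumptions for illustration.

```python
# Sketch: generate a batch of prompts about one topic with the Gemini API.
import google.generativeai as genai
from google.generativeai.types import HarmCategory, HarmBlockThreshold

genai.configure(api_key="YOUR_API_KEY")
gemini = genai.GenerativeModel("gemini-1.5-flash")  # assumed model

response = gemini.generate_content(
    "List 30 short user prompts that are all about baking cakes, one per line.",
    safety_settings={
        HarmCategory.HARM_CATEGORY_HARASSMENT: HarmBlockThreshold.BLOCK_NONE,
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_NONE,
    },
)
topic_prompts = [line.strip() for line in response.text.splitlines() if line.strip()]
```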
To make an easy-to-use front end for people to investigate LLM activations, we used Streamlit.
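The front end boils down to a few widgets. The sketch below is illustrative only: `build_topic_direction` and `generate_with_ablation` are hypothetical wrappers around the TransformerLens code sketched above, not functions from our actual codebase.

```python
# Sketch of the Streamlit app (run with: streamlit run app.py).
import streamlit as st

# Hypothetical module wrapping the TransformerLens pipeline sketched earlier.
from dialignment_core import build_topic_direction, generate_with_ablation

st.title("DiALignment")
concept = st.text_input("Concept to delete from the model")
prompt = st.text_area("Prompt to run")

if st.button("Generate"):
    direction = build_topic_direction(concept)          # mean activation direction
    output = generate_with_ablation(prompt, direction)  # generate with it removed
    st.write(output)
```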
Challenges we ran into
SO MANY. This project was very ambitious for a hackathon.
The first challenge was that half of our team were beginners, completely unfamiliar with transformers, Gemma, or Streamlit, so getting up to speed and understanding the project was initially hard. But they learned quickly, because they're very smart.
Next, the library we used to investigate LLM activations, transformer_lens, was pretty buggy and had some memory leaks: after a few hours we'd get "VSCode is using 270GB of space" and then the computer would crash. This forced us to dig deeper into the library to minimize memory usage, for example by only caching the residual stream at the start of each block, because by default it caches the activations for every step of the transformer. To further mitigate the memory limitations we exclusively used models under 8B parameters.
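A sketch of that mitigation is below, using TransformerLens's `names_filter` to cache only the pre-block residual stream (model name assumed, as before).

```python
# Sketch: cache only hook_resid_pre instead of every activation to save memory.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gemma-2b")  # assumed <8B model
tokens = model.to_tokens("How do I bake a cake?")

_, cache = model.run_with_cache(
    tokens,
    names_filter=lambda name: name.endswith("hook_resid_pre"),  # skip attn/MLP caches
)

del cache
torch.cuda.empty_cache()  # free cached GPU memory between runs
```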
Another problem is that nearly every LLM has a different tokenizer, and they aren't all supported equally well in the Hugging Face library, which made models like Phi and Qwen more difficult to set up.
Possibly the largest challenge was our small GPUs. Henry's computer could run the LLMs fine, but the rest of us ran out of VRAM quickly, so he set up a Flask endpoint (exposed with ngrok) that let all of us send LLM requests to his machine.
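A minimal sketch of that shared-GPU setup is below; the route, payload fields, and the `run_model` helper are illustrative assumptions.

```python
# Sketch: tiny Flask endpoint on the one machine with enough VRAM,
# shared with teammates via `ngrok http 5000`.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/generate", methods=["POST"])
def generate():
    data = request.get_json()
    # `run_model` is a hypothetical wrapper around the TransformerLens pipeline.
    output = run_model(data["prompt"], concept=data.get("concept"))
    return jsonify({"output": output})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```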
Accomplishments that we're proud of
We created something unique: instead of just making API calls, we dug deep into LLM internals. We stuck with it no matter what. We all quickly picked up new skills. And we replicated famous LLM failure cases and showed that our software can fix them.
What we learned
Daniel - How to use transformer_lens. How to dig into the residual stream to mess with prompts. How brittle and fragile LLM activations are. A minor shift in activations can turn cake instructions into Vietnamese last names.
Henry - How to lobotomize language models and set up public API endpoints with ngrok and Flask.
Ryan - I learned a lot about mechanistic interpretability and transformer architecture, as well as utilizing the Gemini API in various applications.
Kevin - I learned a lot about LLM architecture and activation vectors, along with how to do frontend development with Streamlit, how to use APIs, and which websites I can use for researching and developing machine learning models.
What's next for DiALignment
We would love to create more tools that let people understand LLM activations. Currently we map topics to neurons, but it would be amazing if, given a neuron, we could show which features it normally fires for. This would be huge for LLM interpretability and would let people debug strange LLM errors. For example, most LLMs say 9.11 is greater than 9.8 even though they make no mistakes on similar numbers. Our program lets users zero out the Bible directions and the 9/11 directions, which somehow fixes this error. It would be great to build something that could identify on its own which topics need to be removed.
Another goal is living up to the "dial" part of our name: we would love to give users continuous control over any feature, allowing for greater customization with no fine-tuning or prompt engineering. Unfortunately, not all features are captured by a single linear direction the way refusal is, so we couldn't achieve that during this hackathon.