Patrick Haller

Exploring Subquadratic Language Models for Sample-Efficient Pretraining

Fri, 29 Nov 2024 00:00:00 +0000

Our paper got accepted at EMNLP 2024 at the CoNLL BabyLM Workshop!

Read the full paper here

Abstract

This paper explores the potential of recurrent neural networks (RNNs) and other subquadratic architectures as competitive alternatives to transformer-based models in low-resource language modeling scenarios.
We utilize HGRN2 (Qin et al., 2024), a recently proposed RNN-based architecture, and comparatively evaluate its effectiveness against transformer-based baselines and other subquadratic architectures (LSTM, xLSTM, Mamba). Our experimental results show that BABYHGRN, our HGRN2 language model, outperforms transformer-based models in both the 10M and 100M word tracks of the challenge, as measured by their performance on the BLiMP, EWoK, GLUE and BEAR benchmarks. Further, we show the positive impact of knowledge distillation. Our findings challenge the prevailing focus on transformer architectures and indicate the viability of RNN-based models, particularly in resource-constrained environments.

Chapter 1

What is BabyLM?

We published this paper as part of the BabyLM Challenge. Let’s begin by explaining what this challenge is all about.

The challenge targets researchers interested in pretraining and/or cognitive modeling, and in optimizing pretraining on limited data, inspired by human development. The primary goal is to foster research on this topic; a secondary goal is to democratize pretraining and training practices, which are typically the domain of large, resource-rich research and industry groups.

This is realized through a challenge in which only a restricted amount of pretraining data is allowed. Two tracks are defined, strict-small and strict, where a model may only be trained on 10M and 100M words respectively. How often the model sees the data does not matter.

Submitted models are evaluated zero-shot on three benchmarks (BLiMP, BLiMP-Supplement and EWoK), and fine-tuned and evaluated on a subset of the (Super)GLUE datasets.

Chapter 2

Subquadratic LMs as Alternatives to Transformers

One cool thing about the BabyLM Challenge is that it is not necessarily about pushing benchmark scores to their limits, but about exploring alternative architectures, training strategies, learning paradigms and data augmentation techniques. This led to a wide range of submissions with a lot of creative approaches and interesting findings. I can only recommend checking out the proceedings of the workshop to get an overview of everything.

Link to Proceedings: BabyLM Workshop

One of the key motivations behind our work is to explore the potential of subquadratic architectures as competitive alternatives to transformer-based models in low-resource language modeling scenarios.

But why should we consider subquadratic models in the first place?

Transformer-based models have become the de facto standard for a wide range of NLP tasks due to their strong performance across various benchmarks. A big selling point of transformers is their ability to process input sequences in parallel, which makes them highly efficient, scalable and therefore suitable for large-scale pretraining of language models. This has overshadowed RNNs, whose sequential processing is often seen as slow and computationally expensive.

If we had to write down the computational complexity, it would look like this:
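
As a rough reference, these are the per-layer costs usually quoted for a sequence of length \(n\) and hidden size \(d\) (the standard figures from the original Transformer paper, not measurements from our experiments):

\[\text{Self-attention: } O(n^2 \cdot d) \text{ operations, } O(1) \text{ sequential steps}\]

\[\text{Recurrent layer: } O(n \cdot d^2) \text{ operations, } O(n) \text{ sequential steps}\]

Doubling \(n\) roughly quadruples the cost of self-attention but only doubles the cost of the recurrent layer.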

Doesn’t look too bad for RNNs in terms of complexity, right? The crucial point is the number of operations needed to process a sequence of length n, which is linear for RNNs. Transformers overcome their high computational cost through massive parallelization of the attention mechanism, which is key to their success. A true RNN cannot be parallelized across the sequence in the same way, because each step depends on the previous one, but several recent architectures attempt to address this bottleneck.

There is a wide variety of newly proposed architectures that, at least to some extent, resemble RNNs. The following is a non-exhaustive list of subquadratic architectures:

These architectures share a common goal of reducing the computational complexity of the model by introducing some kind of approximation or by reducing the number of operations needed to process the input sequence. This usually results in a trade-off between performance and computational efficiency. Ideally, a subquadratic model should be able to compete with transformer-based models in terms of performance, while being as efficient for training and more efficient for inference.

In a future post, we will dive deeper into how this is achieved through Linear Attention and all the other cool stuff that is going on in the field of subquadratic models.

So that is what we looked into. We utilized HGRN2, a recently proposed RNN-based architecture, and comparatively evaluated its effectiveness against transformer-based baselines and other subquadratic architectures like LSTM, xLSTM, and Mamba.

Chapter 3

Comparative Evaluation

For a fair comparison, we trained all models on the same data with the same hyperparameters, except for the learning rate, which we tuned individually with a learning rate sweep for each model. Each model was trained on the strict-small track of the challenge for 5 epochs. After each epoch, we evaluated the model on the BabyLM benchmarks. The following table shows the results of our experiments:

The evaluation revealed several interesting patterns across different model architectures. HGRN2 exhibited the strongest overall performance, followed closely by xLSTM and Mamba. Both models outperformed the transformer baseline, suggesting that these architectures offer distinct advantages in low-resource scenarios.

This makes HGRN2 quite useful for BabyLM and other low-resource scenarios, especially given the low computational costs of training and inference!

For our final submission, we wanted to pump those numbers up and decided to use knowledge distillation to further improve the performance of our model. We used one of the simpler setups for knowledge distillation, by training with Cross-Entropy loss and the teacher’s predictions as soft targets.

\[Loss = Loss_{CE} + Loss_{KD}\]

where \(Loss_{CE}\) is the Cross-Entropy loss and \(Loss_{KD}\) is the knowledge distillation loss.

\[Loss_{KD} = KL(\sigma(z_t), \sigma(z_s))\]

where:

\(z_t\) and \(z_s\) are the output logits of the teacher and student model respectively
\(\sigma(z)\) is the softmax function applied to the logits \(z\)
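
To make the two loss terms concrete, here is a minimal sketch in plain Python for a single prediction step. It is not our training code: the function names are made up for illustration, it works on plain lists instead of tensors, and it assumes the KL term is taken from the teacher distribution to the student distribution.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, target_index):
    """Loss_CE: negative log-probability the student assigns to the gold token."""
    probs = softmax(student_logits)
    return -math.log(probs[target_index])


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def distillation_loss(teacher_logits, student_logits, target_index):
    """Sum of Loss_CE and Loss_KD for one prediction step."""
    loss_ce = cross_entropy(student_logits, target_index)
    # Assumption: teacher distribution first, student distribution second.
    loss_kd = kl_divergence(softmax(teacher_logits), softmax(student_logits))
    return loss_ce + loss_kd
```

If the student already matches the teacher exactly, the KL term is zero and only the cross-entropy part remains.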

Instead of the traditional approach of distilling from a larger to a smaller model, we used same-sized teacher and student models, trained on the same dataset!

… Which actually worked out quite well! The knowledge distillation improved the overall performance of our model, which is quite impressive given the simplicity of the setup.

The organizers of the BabyLM Challenge set up this nice leaderboard, where you can see the performance of all submissions.

It's quite impressive to see how many different approaches were taken to tackle this challenge, and that our really simple approach can compete, landing in 5th place on the leaderboard.

For more details about our work, you can find the full paper here.

This concludes our post. Here are more relevant links:

Modelling Explicit Biases in Instruction-Tuned LLMs

Mon, 08 Jul 2024 00:00:00 +0000

Our paper got accepted at NAACL 2024 Demo Track!

Read the full paper here. Try out the online demo here

Problem Extraction and Coding Challenges

Fri, 26 Apr 2024 00:00:00 +0000

The following post is a short summary of a paper I worked on. Read the full paper here (coming soon!).

Our paper got accepted at LREC-COLING 2024

Everything is still under construction. I created a small page that gives a quick overview here.

SOTA Dataset Generation in NLP

Wed, 14 Feb 2024 00:00:00 +0000

The following post is a short summary of a paper I worked on. Read the full paper here.

Our paper got accepted at EMNLP 2023 Demo Track!

In the realm of machine learning, and especially NLP, the creation of high-quality labeled data has been a significant bottleneck. We therefore present Fabricator, a toolkit designed to harness the power of LLMs for generating vast, labeled datasets. This approach not only promises to save time and resources but also opens new avenues for research and application in machine learning.

How It Works

By prompting LLMs to produce data for specific tasks, Fabricator efficiently creates training material for downstream NLP models. Imagine generating hundreds of movie reviews with varying sentiments at the push of a button.

The process of learning via dataset generation. A teacher model (LLM) is prompted to generate 500 movie reviews for each sentiment (positive, negative). A smaller student PLM is trained on the generated dataset.

Versatility and Integration

Fabricator supports a wide array of NLP tasks and offers seamless integration with well-known libraries. Whether you’re working on text classification, entity recognition, or any other NLP challenge, Fabricator helps you generate the data you need.

With FABRICATOR, the generation process involves a prompt template that creates the final prompt using all provided arguments. The generator class creates training examples until the maximum number of prompt calls is reached, or the unlabeled dataset is fully annotated. Ultimately, the generator class produces a HuggingFace Dataset instance.
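
The stopping rule described above can be pictured with a toy loop. This is only a sketch in plain Python: `annotate_until_done`, `label_fn` and `max_prompt_calls` are hypothetical names, not Fabricator's API, and `label_fn` stands in for the LLM call.

```python
def annotate_until_done(unlabeled_texts, label_fn, max_prompt_calls):
    """Label texts one prompt call at a time, stopping at whichever limit hits first."""
    examples = []
    prompt_calls = 0
    for text in unlabeled_texts:
        if prompt_calls >= max_prompt_calls:
            break  # the budget of prompt calls is used up
        # label_fn is a placeholder for prompting the LLM (hypothetical).
        examples.append({"text": text, "label": label_fn(text)})
        prompt_calls += 1
    # Reaching this point without a break means every text was annotated.
    return examples
```

With three texts and a budget of two calls, the loop returns two labeled examples; with a larger budget it stops once all three are annotated.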

Empowering Research and Development

By providing a means to quickly generate and experiment with new datasets, Fabricator paves the way for innovative research and practical applications in NLP.

import os
from datasets import load_dataset
from haystack.nodes import PromptNode
from fabricator import DatasetGenerator, BasePrompt

# Few-shot examples that are shown to the LLM in the prompt
dataset = load_dataset("processed_fewshot_imdb", split="train")

# Prompt template for generating the "text" column, one label option at a time
prompt = BasePrompt(
    task_description="Generate a {} movie review.",
    label_options=["positive", "negative"],
    generate_data_for_column="text",
)

# The LLM that acts as the teacher model
prompt_node = PromptNode(
    model_name_or_path="gpt-3.5-turbo",
    api_key=os.environ.get("OPENAI_API_KEY"),
    max_length=100,
)

generator = DatasetGenerator(prompt_node)
generated_dataset = generator.generate(
    prompt_template=prompt,
    fewshot_dataset=dataset,
    fewshot_sampling_strategy="uniform",
    fewshot_examples_per_class=1,
    fewshot_sampling_column="label",
)
# Upload the generated dataset to the HuggingFace Hub
generated_dataset.push_to_hub("generated-movie-reviews")

A script that uses FABRICATOR and generates additional movie reviews based on few-shot examples

Looking Ahead

As the toolkit evolves, it promises to expand its capabilities, supporting an even broader range of tasks and enhancing the NLP community’s ability to tackle complex problems with novel solutions.

For more details, refer to the original paper: Fabricator

A Rust crate to display duration of time in a human readable format

Thu, 17 Nov 2022 00:00:00 +0000

A Rust crate that displays durations in a human-readable format.

This project is a port of chrono-humanize and now has 0 dependencies.

I created it because chrono, the famous time crate, will no longer be maintained. I work on the open source project onefetch, which relies on chrono and chrono-humanize to display time durations in an easy-to-read format, so I decided to port chrono-humanize to a crate that only uses std::time.

Here is how to use it:

use std::time::Duration;
use time_humanize::HumanTime;

fn main() {
    let duration = Duration::from_secs(60);
    let human_time = HumanTime::from(duration);
    println!("{}", human_time);
    // Output: "in one minute"

    let human_time = HumanTime::from(-60);
    println!("{}", human_time);
    // Output: "a minute ago"
}

You can find it here!

A Runtime Error Debugger

Thu, 01 Oct 2020 00:00:00 +0000

Better runtime error messages!

Are you also constantly staring at the runtime error messages the Python interpreter gives you? They lack color and debug information!

Get good-looking error tracebacks and a beautifully formatted last line, with the last values of its variables before the program crashed.

What frosch does under the hood is basically the following:

import sys

def _hook():
    """Overwrite sys.excepthook"""
    # pytrace_excepthook is frosch's own handler, defined elsewhere in the package
    sys.excepthook = pytrace_excepthook

We simply overwrite sys.excepthook, the function Python calls when an exception is not caught by the program. The CPython runtime catches the error and passes it to this hook.
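
To see the mechanism in isolation, here is a small self-contained sketch using only the standard library. It is not frosch's implementation: `format_exception_with_locals`, `demo_excepthook` and `install` are hypothetical names, and the output is plain text without any coloring.

```python
import sys
import traceback


def format_exception_with_locals(exc_type, exc_value, tb):
    """Return the traceback text plus the local variables of the failing frame."""
    lines = traceback.format_exception(exc_type, exc_value, tb)
    # Walk to the innermost traceback entry, where the exception was raised.
    last = tb
    while last is not None and last.tb_next is not None:
        last = last.tb_next
    if last is not None:
        lines.append("Locals in the failing frame:\n")
        for name, value in sorted(last.tb_frame.f_locals.items()):
            lines.append(f"  {name} = {value!r}\n")
    return "".join(lines)


def demo_excepthook(exc_type, exc_value, tb):
    """Hypothetical stand-in for frosch's hook: write the report to stderr."""
    sys.stderr.write(format_exception_with_locals(exc_type, exc_value, tb))


def install():
    """Overwrite sys.excepthook so uncaught exceptions go through demo_excepthook."""
    sys.excepthook = demo_excepthook
```

After calling `install()`, any uncaught exception in the script is reported through `demo_excepthook` instead of the default traceback printer.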

You can find the source here