
The Research Path to GPT-4, Part 2

2024-03-21T00:00:00-07:00

TLDR: This post follows the thread of papers authored by Alec Radford that ultimately led to GPT-4. It observes that the original motivation for next-token prediction was as a representation learning mechanism, and that there appears to have been a gradual (and somewhat accidental) realization that these models could be used for much more…

Part 1 here

GPT-1: Improving Language Understanding by Generative Pre-Training, June 2018

Authors: Alec Radford, Karthik Narasimhan, Tim Salimans, Ilya Sutskever

The first official paper in the GPT series! (Though the paper doesn’t actually use the GPT acronym…) At its core this paper is a polished, comprehensive study combining the key ideas of the previous two papers we’ve reviewed in part 1. Following the prior works, GPT-1 is viewed as a vehicle to learn good representations via a generative objective, with the plan of using these representations for finetuning a linear layer on downstream classification tasks.

“Although large unlabeled text corpora are abundant, labeled data for… specific tasks is scarce, making it challenging for discriminatively trained models to perform adequately. We demonstrate that large gains on these tasks can be realized by generative pre-training of a language model on a diverse corpus of unlabeled text, followed by discriminative fine-tuning on each specific task.”

GPT-1 is pretrained on the BooksCorpus, a dataset of 7k books (around 1 billion words) with a next token prediction objective. The model has a context length of 512 tokens, trained for one month on eight GPUs (~6k hours).

The most obvious change from prior work is the shift from LSTM to the now ubiquitous (but then hot-off-the-press) transformer architecture — 12 layers, 120M params. An ablation shows this change by itself gives a 5% performance boost (compared to the pretraining boost of 15%).

Another significant change over the previous works is the level of generality the method targets. The model is evaluated on multiple benchmark tasks including sentiment analysis, question answering, and linguistic acceptability. And it does pretty well at all of them. The reason they give for this improvement is the diversity of the pretraining data,

“By pre-training on a diverse corpus with long stretches of contiguous text our model acquires significant world knowledge and ability to process long-range dependencies which are then successfully transferred to solving discriminative tasks such as question answering, semantic similarity assessment, entailment determination, and text classification, improving the state of the art on 9 of the 12 datasets.”

While this benchmarking focuses on training linear classification heads on downstream tasks, they importantly also include a brief investigation into zero-shot capabilities. For example, on question-answering benchmarks they choose the multiple-choice answer that has the highest likelihood under the generative model, and for sentiment analysis the words ‘very positive’ or ‘very negative’ are appended to the end of the sentence and the higher-likelihood option is selected.

“We designed a series of heuristic solutions that use the underlying generative model to perform tasks without supervised finetuning”
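The multiple-choice heuristic can be sketched in a few lines. This is only an illustration under loud assumptions: `TrigramLM` is a toy stand-in for the generative model (the paper uses a transformer), and `zero_shot_choice` and the tiny corpus are hypothetical names and data invented here.

```python
import math
from collections import Counter, defaultdict


class TrigramLM:
    """Toy word-level trigram model with add-one smoothing.

    A hypothetical stand-in for a real pretrained language model.
    """

    def __init__(self, corpus):
        self.counts = defaultdict(Counter)
        self.vocab = set()
        for sentence in corpus:
            words = sentence.lower().split()
            self.vocab.update(words)
            for a, b, c in zip(words, words[1:], words[2:]):
                self.counts[(a, b)][c] += 1

    def log_prob(self, context, continuation):
        """Log-likelihood of `continuation` given `context` (at least two words)."""
        history = context.lower().split()
        total = 0.0
        for word in continuation.lower().split():
            seen = self.counts[tuple(history[-2:])]
            total += math.log((seen[word] + 1) / (sum(seen.values()) + len(self.vocab)))
            history.append(word)
        return total


def zero_shot_choice(lm, context, options):
    """Pick the option the generative model finds most likely; no finetuning involved."""
    return max(options, key=lambda option: lm.log_prob(context, option))


lm = TrigramLM([
    "the capital of france is paris",
    "the capital of england is london",
])
answer = zero_shot_choice(lm, "the capital of france is", ["paris", "london"])
```

The point of the sketch is that nothing task-specific is trained: the same likelihood function that was fit during pretraining is simply queried once per candidate answer.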

It’s hard to overstate the significance of this seemingly innocuous analysis. While it’ll take another year or so for this idea to come to fruition, this marks the initial realization that the models can effectively complete downstream tasks by being used in their native ‘generative’ modes rather than by using model-surgery to extract their representations and finetune.

The stated motivation for this zero-shot investigation is to gain intuition for why generative modeling is useful. This suggests that the discovery of the effectiveness of using the models in their generative mode was gradual and somewhat accidental.

GPT-2: Language Models are Unsupervised Multitask Learners, 2019

Authors: Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever

The transition from GPT-1 to GPT-2 is a smooth one. The pretraining recipe is unchanged, but there is more of everything: more data, more diversity in the data, more parameters.

The model has 1.5B parameters and is trained on 7 billion words from the freshly-scraped WebText dataset (webpages linked from Reddit posts that received at least 3 karma). Context length is 1024.

There is also one conceptual leap made in the paper. Our story so far has seen generative modeling exploited as a mechanism to learn representations that will be repurposed for supervised learning systems. But the generative mode is found to be effective at test time, used directly in downstream tasks, without the need for training new linear classification heads – this was foreshadowed by the investigation of the zero-shot capabilities of GPT-1.

“We would like to move towards more general systems which can perform many tasks – eventually without the need to manually create and label a training dataset for each one”

While predictive language models learn to model p(output | input), in multitask settings this must be changed to p(output | input, task). The key conceptual insight is that language is a special modality that allows the task to be encoded as part of the input. This allows a language model to predict p(output | input-with-task-appended), without requiring architectural changes. (Shout out to [1] for spearheading this.)

“Language provides a flexible way to specify tasks, inputs, and outputs all as a sequence of symbols. For example, a translation training example can be written as the sequence (translate to french, english text, french text). Likewise, a reading comprehension training example can be written as (answer the question, document, question, answer).”
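To make the "task as part of the input" idea concrete, here is a minimal sketch of how the tuples in the quote might be flattened into plain text. The helper name `as_sequence` and the newline separator are assumptions made for illustration; the paper does not prescribe a format.

```python
def as_sequence(task, *fields, sep="\n"):
    """Flatten a task description and its fields into one text sequence."""
    return sep.join((task,) + fields)


# At training time the full tuple appears somewhere in the corpus as ordinary text.
training_example = as_sequence(
    "translate to french", "the cat sat", "le chat s'est assis"
)

# At test time the same format is used with the output left off; the model
# is asked to continue the sequence, so p(output | input, task) becomes
# ordinary next-token prediction on the prompt.
prompt = as_sequence("translate to french", "the cat sat") + "\n"
```

Because the prompt is a prefix of the kind of sequence seen in training, no new head or architectural change is needed to condition on the task.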

It’s hard to overstate the impact of this shift in mindset. It means a user can simply ask, in plain English, for the task they want solved. No labeled datasets, no hacking in of classification heads, no backprop-ing gradients.

Yet there is one caveat to this — the pretraining data distribution must be broad enough to have encountered the task. As such, a diverse, broad, rich, high-quality dataset is the lifeblood of the system.

“Most prior work trained language models on a single domain of text, such as news articles, Wikipedia, or fiction books. Our approach motivates building as large and diverse a dataset as possible in order to collect natural language demonstrations of tasks in as varied of domains and contexts as possible… we want to avoid making assumptions about the tasks to be performed ahead of time.”

Careful readers will note that this diversity is not a sudden introduction. It is a direct consequence of the conclusions in GPT-1 and GP(T)-0; success always depended on how well the pretraining distribution matched the downstream task distribution.

A final highlight of the paper is its focus on model size. Note that by 2019, it was no great secret that bigger was usually better in deep learning (e.g. 2015 ResNet paper [2]). But it’s revealing that almost every figure and table in the paper showed performance progressing from 117M to 1.5B params, suggesting that model scale was top of mind for the authors.

This paper also provides a first taste of some future downsides of LLMs: data contamination between pretraining and test sets, and closed science (no information on training resources is shared, and the open-sourcing of the model weights is delayed).

[1] The Natural Language Decathlon: Multitask Learning as Question Answering

[2] Deep Residual Learning for Image Recognition

The Research Path to GPT-4, Part 1

2024-03-19T00:00:00-07:00

Part 2 here

Pre-GPT: Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks, November 2015

Authors: Alec Radford, Luke Metz, Soumith Chintala

Our story starts with Alec Radford’s first paper. Whilst it may be best-known for introducing the DCGAN, it’s the modeling philosophy that we’re interested in here. It starts with one key observation upon which the entire LLM paradigm ends up being built — unlabeled data is abundant!

“In the context of computer vision, one can leverage the practically unlimited amount of unlabeled images and videos to learn good intermediate representations, which can then be used on a variety of supervised learning tasks”

Recall that at the time (late 2015), nearly all of deep learning was focused on supervised hand-labeled datasets. It’s visionary, really, that this paper foresaw (and pursued) the higher value locked away in larger unlabeled datasets.

The concrete approach in the paper trains a generative model (DCGAN) on a broad-ish dataset (ImageNet). After this pretraining phase, the discriminator can be used to extract features from some new dataset (e.g. CIFAR-10), with a linear classification-head trained on top. The results are in line with other SOTA unsupervised methods of the time.
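The "frozen features plus linear head" recipe can be sketched as follows. Everything here is a toy assumption: `frozen_features` is a hypothetical stand-in for the pretrained discriminator, the data is made up, and the perceptron update is just one simple way to fit a linear head (the paper itself uses a different linear classifier on real image features).

```python
def frozen_features(x):
    """Hypothetical stand-in for a pretrained, frozen feature extractor."""
    return [x[0] + x[1], x[0] - x[1], 1.0]  # last entry acts as a bias term


def train_linear_head(examples, labels, epochs=20, lr=0.1):
    """Perceptron-style training of the head only; the extractor is never updated."""
    weights = [0.0] * len(frozen_features(examples[0]))
    for _ in range(epochs):
        for x, y in zip(examples, labels):  # y is +1 or -1
            feats = frozen_features(x)
            score = sum(w * f for w, f in zip(weights, feats))
            if y * score <= 0:  # misclassified: nudge the head towards y
                weights = [w + lr * y * f for w, f in zip(weights, feats)]
    return weights


def predict(weights, x):
    score = sum(w * f for w, f in zip(weights, frozen_features(x)))
    return 1 if score > 0 else -1


examples = [(2, 1), (3, 1), (-2, -1), (-3, -1)]
labels = [1, 1, -1, -1]
head = train_linear_head(examples, labels)
```

The only learned parameters on the downstream task are the handful of weights in `head`, which is why the quality of the pretrained representation carries so much of the result.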

There’s a long way to go to GPT-4, but this paper has unearthed a key foundational idea here — generative modeling on large unlabeled datasets FTW.

“We give evidence that [GANs] learn good representations of images for supervised learning”

GPT-0(ish): Learning to Generate Reviews and Discovering Sentiment, April 2017

Authors: Alec Radford, Rafal Jozefowicz, Ilya Sutskever

I’ve come to think of this as GPT-0. While it’s not a transformer, it does generative pre-training at large-scale (for the time) on language. The paper makes two findings that will be pivotal to the development of LLMs. 1) A low-level future sequence-prediction objective can lead to a model learning high-level concepts. 2) The pretraining dataset distribution should align with the downstream data distribution.

Tokens were not quite a thing yet; instead, the model is trained to predict the next character in a sequence (more accurately, the next byte of a UTF-8 encoding). The dataset is made up of reviews from Amazon (around 8 billion words). It uses an LSTM (4096 hidden units), context length 256, and is trained for one month on four GPUs.
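As a rough sketch of what a byte-level objective looks like in terms of data, the snippet below turns a string into (context, next byte) pairs. The function name `next_byte_pairs` and the sliding-window construction are assumptions for illustration; only the UTF-8 byte framing and the context length of 256 come from the description above.

```python
def next_byte_pairs(text, context_len=256):
    """Build (context bytes, next byte) prediction targets from raw text."""
    data = text.encode("utf-8")
    pairs = []
    for i in range(1, len(data)):
        context = data[max(0, i - context_len):i]  # at most context_len bytes of history
        pairs.append((context, data[i]))           # target is an integer in 0..255
    return pairs


pairs = next_byte_pairs("hi!")
accented = next_byte_pairs("é")  # one character, but two UTF-8 bytes
```

Working on bytes means the output layer only ever needs 256 classes, at the cost of much longer sequences than a word- or subword-level model would see.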

The paper’s focus is on the representations this pretrained model learns. In particular, they study the task of ‘sentiment analysis’, identifying a single neuron whose activation value is indicative of the sentiment of the sequence it’s parsing. Using this single sentiment neuron, thresholded as a classifier, they report results on IMDB in line with baseline methods.
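Thresholding a single unit as a classifier is simple enough to sketch directly. The activation values below are invented, the assumption that higher activation means positive sentiment is mine, and picking the midpoint between class means is just one plausible way to set the threshold, not necessarily what the authors did.

```python
def fit_threshold(activations, labels):
    """Place the threshold midway between the mean activation of each class."""
    pos = [a for a, y in zip(activations, labels) if y == 1]
    neg = [a for a, y in zip(activations, labels) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def classify(activation, threshold):
    """Label a review from the value of one neuron: 1 = positive, 0 = negative."""
    return 1 if activation > threshold else 0


# Made-up final-step activations of the hypothetical sentiment unit.
activations = [0.9, 0.7, -0.8, -0.6]
labels = [1, 1, 0, 0]
threshold = fit_threshold(activations, labels)
```

The classifier has a single free parameter, so any accuracy it achieves is almost entirely attributable to what the pretrained model already encoded in that one unit.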

This is early support for what has subsequently become a cornerstone intuition about the next-token prediction objective: a model must acquire an understanding of abstract concepts in order to predict the future of a text passage. Here it’s demonstrated that models do infer something as abstract as the sentiment of a review when trained on a simple generative objective. As the authors say of the time,

“it is not immediately clear whether such a low-level training objective supports the learning of high-level representations.”

Something interesting about this paper is that it signals the authors’ dataset-centric thinking. Regarding the pretraining dataset selection, the authors note,

“We train on a very large corpus picked to have a similar distribution as our task of interest.”

Specifically, they’ve chosen to pretrain on Amazon reviews, and test on sentiment analysis benchmarks like IMDB and Yelp. They even blame a performance plateau in Yelp on the reviews being about businesses not products. They do brief experiments on other tasks (semantic relatedness and paraphrase detection) using text from other domains and again find limited performance, blaming the data mismatch in train and test distributions. This turns out to be a key insight that guides the authors to pursue training on increasingly diverse datasets in future work.

As something of an afterthought, they show some shaky generations from the model. Holding the ‘sentiment neuron’ at a fixed value allows control over the sentiment of the generated sentence.

“Although the focus … has been on the properties of our model’s representation, it is trained as a generative model and we are also interested in its generative capabilities”

Machine Learning Eras and their Bottlenecks

2024-02-28T00:00:00-08:00

TLDR: Making sense of where we are in AI research by looking at the bottlenecks of each machine learning era so far, and where this suggests we’re headed.

Eras

The recent history of machine learning might broadly be grouped into three eras. Whilst Eras 1 and 2 are often described in introductory deep learning courses, there has been a more subtle, but just as significant, transition from Era 2 to 3.

Era 1: Shallow Learning (2000’s). Systems were composed of two stages: an initial hand-designed feature extractor (e.g. SIFT, MFCC), followed by a simple learnable model (e.g. SVMs, Gaussian mixture models, decision trees). The main bottleneck to system performance was the quality of the feature extractors.

Era 2: Supervised Deep Learning (2010’s). Feature extraction was absorbed into the learnable portion of the model, removing the need for hand-designed heuristics. Systems were optimized end-to-end on human-labelled datasets using neural networks composed of multiple layers, with architectures matched to the data modality (e.g. CNNs for images, RNNs for sequences). With enough examples, any input-output mapping could be learned, but the main bottleneck was the quantity of labelled data that could feasibly be collected.

Era 3: Self-Supervised Transformers (2020’s). Supervised learning objectives were replaced by simple self-supervised (generative) objectives. This removed the requirement for human-labelled datasets and alleviated the bottleneck on data quantity. There was convergence across data modalities on the transformer architecture (text, images, videos, robotics…). A new bottleneck emerged caused by the misalignment between the self-supervised objective on the pretraining distribution, and the downstream use-cases. Post-hoc methods such as RLHF emerged to address this.

Observations

From these descriptions, several trends become clear.

Observation 1. Sidestep the bottleneck, don’t widen it.

Progression from an era is only achieved by removing the bottleneck through a new modeling philosophy, not by widening it! We did not surpass Era 1 by hand-crafting better feature extractors (though most people worked on that), we moved forward by coming up with an approach that avoided the need for hand-crafting at all. We moved from Era 2 to 3 not by increasing the amount of labelled data available (although many research communities focused on variants of this — active learning, semi-supervised learning), but by avoiding the need to label data by using a self-supervised objective.

Observation 2. Human requirements are decreasing, compute requirements are increasing.

In each era, the manual input required by humans becomes more abstracted, from feature engineers, to labellers and architects, to demonstrators. This also means that systems must increasingly learn by themselves, which has led to each era leveraging more compute than its predecessor.

Observation 3. Systems for different data modalities are increasingly homogeneous.

In Era 1, each task within a modality required a new system (replacing the learned component, and possibly requiring new features to be extracted). In Era 2, architectural components would be common within a modality (e.g. convolutions for image classification/segmentation/depth prediction) though different across modalities. Datasets would largely be task-specific. Era 3 has seen a single architecture (transformer) dominate across tasks and modalities, and datasets for each modality provide a strong pretrained start-point for multiple tasks within that modality. We are beginning to see multi-modal models, though these are bottlenecked by the quantity of aligned data across modalities.

Extrapolations

It’s risky to extrapolate based on three datapoints, but since we are collecting them at a rate of one per decade, let’s have a guess at what Era 4 might look like.

Extrapolation 1. Era 4 will not have distinct pretraining and alignment phases.

Currently, a large amount of effort goes into taking a pretrained self-supervised transformer, and aligning it (e.g. SFT, RLHF, DPO) to suit downstream users’ needs. It’s hard work and numerous issues persist (e.g. hallucinations, bias, jailbreaking). New eras are created by adopting a modeling paradigm that sidesteps the bottleneck of the previous era (Observation 1). As such, rather than being advanced by improved alignment methods, Era 4 will instead sidestep the need for this alignment phase altogether.

Extrapolation 2. A more abstracted reliance on humans.

Observation 2 notes a trend towards less reliance on manual human input. While Era 3 systems largely ignore human design, human-generated datasets remain at their core. This imposes a hard limit on what a system can learn. Era 4 will continue to reduce reliance on human input, perhaps in a role even more abstract than as demonstrators (e.g. learning from interaction with humans).

Extrapolation 3. The AI research community will become monolithic.

Observation 3 suggests a trend towards a single system that will work across all modalities. It may not make sense to have separate communities working on computer vision, language, robotics etc., as all these areas will be jointly training and utilizing the same common model(s).