A PyTorch implementation of a GPT-2 style autoregressive language model (124M parameter configuration). The project focuses on clarity, modular research code, and efficient training on custom datasets such as FineWeb-Edu.
- Architecture:
  - GPT-2 Small spec: 12 layers, 12 heads, 768 embedding dim.
  - Modern enhancements:
    - RoPE (Rotary Positional Embeddings) instead of absolute learned positional embeddings.
    - KVCache support for efficient autoregressive generation (a minimal sketch follows this list).
    - RMSNorm or LayerNorm normalization (configurable).
    - Weight tying between the embedding and output projection layers.
    - Gradient checkpointing support for memory efficiency.
    - AMP (Automatic Mixed Precision) training enabled by default.
- Tokenizer: Custom byte-level BPE tokenizer with GPT-2 regex-based pre-tokenization for correct handling of contractions, punctuation, and whitespace.
- Training Pipeline: Custom `Trainer` loop integrated with Weights & Biases (WandB) for experiment tracking.
- Data Handling: Efficient streaming and processing of large datasets (e.g., Hugging Face `fineweb-edu`).
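On the KVCache point: during generation, the keys and values of already-processed tokens are cached so each new token only computes attention against stored history instead of re-encoding the whole prefix. Below is a minimal sketch of the idea, with illustrative names rather than this repository's actual `KVCache` API:

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q, k_new, v_new, cache=None):
    """One decoding step of cached self-attention.

    q, k_new, v_new: (batch, heads, 1, head_dim) projections for the new token.
    cache: optional tuple (k_past, v_past) of shape (batch, heads, t_past, head_dim).
    Returns the attention output for the new token and the updated cache.
    """
    if cache is not None:
        k_past, v_past = cache
        k = torch.cat([k_past, k_new], dim=2)  # append new key to cached keys
        v = torch.cat([v_past, v_new], dim=2)
    else:
        k, v = k_new, v_new
    # No causal mask needed: the single query token attends only to cached
    # positions, all of which lie in its past.
    out = F.scaled_dot_product_attention(q, k, v)
    return out, (k, v)

# Illustrative shapes only: batch=1, heads=12, head_dim=64 (768 / 12).
q = torch.randn(1, 12, 1, 64)
k = torch.randn(1, 12, 1, 64)
v = torch.randn(1, 12, 1, 64)
out, cache = attend_with_cache(q, k, v)         # first token: cache is built
out, cache = attend_with_cache(q, k, v, cache)  # later tokens: cache is reused
```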
```
├── data/                    # Dataset storage and tokenizer files
│   └── raw/
├── notebooks/               # Experimentation and training notebooks
│   └── model_train.ipynb
├── scripts/                 # Standalone utility scripts
│   └── 01_data_preprocessing.py
├── src/                     # Source code
│   ├── callbacks/           # Training callbacks
│   ├── configs/             # Configuration dataclasses (GPTConfig)
│   ├── data/                # Dataset, tokenizer, and loading utilities
│   ├── inference/           # Generation logic (chat)
│   ├── model/               # Model definition (Transformer, Attention, RoPE)
│   ├── training/            # Trainer loop implementation
│   └── utils/               # Helpers (caching, etc.)
└── requirements.txt         # Dependencies
```
- Clone the repository:
  ```bash
  git clone https://github.com/your-username/gpt-project.git
  cd gpt-project
  ```
- Create a virtual environment (recommended):
  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
You can preprocess your data using the provided scripts or load it directly via the notebooks. The default setup uses `fineweb-edu`.
```bash
python scripts/01_data_preprocessing.py
```
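For orientation, the core of such a preprocessing pass usually streams the dataset and writes token IDs to disk. The sketch below is an illustration under stated assumptions, not the actual contents of `01_data_preprocessing.py`: it uses the public `HuggingFaceFW/fineweb-edu` dataset and its `sample-10BT` config, `tiktoken`'s GPT-2 encoding as a stand-in for the repository's custom tokenizer, and a made-up output filename.

```python
import os
import numpy as np
import tiktoken
from datasets import load_dataset

# Stream the dataset so the full corpus never has to fit on disk or in RAM.
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

enc = tiktoken.get_encoding("gpt2")  # stand-in; the repo ships its own BPE
eot = enc.eot_token                  # <|endoftext|> separates documents

tokens = []
for i, example in enumerate(ds):
    tokens.extend(enc.encode_ordinary(example["text"]) + [eot])
    if i >= 1_000:                   # small demo slice; remove for a full pass
        break

# uint16 is enough for a 50,257-token vocabulary and halves the file size.
os.makedirs("data/raw", exist_ok=True)
np.array(tokens, dtype=np.uint16).tofile("data/raw/fineweb_edu_sample.bin")
```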
The main training entry point is currently organized in the Jupyter notebook `notebooks/model_train.ipynb`.

- Open the notebook:
  ```bash
  code notebooks/model_train.ipynb
  ```
- Configure parameters in `src/configs/config.py` or override them in the notebook (see the snippet after this list).
- Run the cells to initialize the model and tokenizer and start the training loop.
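For example, a notebook cell could override the defaults by constructing the dataclass directly. The import path assumes you run from the repository root, and the smaller values are purely illustrative:

```python
from src.configs.config import GPTConfig

# Hypothetical smoke-test configuration: shorter context, shallower model.
cfg = GPTConfig(block_size=256, num_layers=6, num_heads=6, embedding_dim=384)
print(cfg)
```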
The model hyperparameters are defined in `src/configs/config.py`. The current default is set to replicate GPT-2 Small:
```python
@dataclass
class GPTConfig:
    vocab_size: int = 50257
    block_size: int = 1024     # Context window
    embedding_dim: int = 768   # Hidden size
    num_heads: int = 12        # Attention heads
    num_layers: int = 12       # Transformer layers
    # ...
```

The model deviates from vanilla GPT-2 in specific modern ways to improve performance:
- RoPE: Rotary Positional Embeddings are applied to the queries and keys in the attention mechanism, allowing better generalization to sequence lengths longer than those seen during training (a minimal sketch follows this list).
- Custom BPE: A from-scratch implementation of the Byte-Pair Encoding algorithm, ensuring full control over the vocabulary construction.
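To make the RoPE point concrete: rotary embeddings rotate each (even, odd) pair of query/key channels by a position-dependent angle, so relative offsets show up directly in the attention dot product. Below is a minimal sketch of the standard formulation, not necessarily how `src/model` factors it:

```python
import torch

def rope(x, theta: float = 10000.0):
    """Apply rotary positional embeddings.

    x: (batch, heads, seq_len, head_dim) queries or keys; head_dim must be even.
    """
    b, h, t, d = x.shape
    # One rotation frequency per pair of channels.
    freqs = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]   # even / odd channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # 2-D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 12, 16, 64)
q_rot = rope(q)  # same shape, now position-aware
```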
Training metrics (Loss, Learning Rate, Gradient Norm) are logged automatically to WandB. Ensure you have an account and are logged in:
```bash
wandb login
```
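Inside the training loop, that logging typically boils down to a `wandb.init` call plus periodic `wandb.log` calls. The sketch below is illustrative only; the project name, config keys, and metric values are not the exact ones the `Trainer` uses:

```python
import wandb

# Illustrative run setup; the Trainer wires this up with real values.
wandb.init(project="gpt2-small-fineweb-edu", config={"lr": 6e-4, "block_size": 1024})

for step in range(10):                    # stand-in for the real training loop
    loss, lr, grad_norm = 2.5, 6e-4, 1.0  # placeholder metrics
    wandb.log({"loss": loss, "lr": lr, "grad_norm": grad_norm}, step=step)

wandb.finish()
```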