A PyTorch implementation of a GPT-2 style autoregressive language model (124M parameter configuration). The project focuses on clarity, modular research code, and efficient training on custom datasets such as FineWeb-Edu.
- Architecture:
  - GPT-2 Small spec: 12 layers, 12 heads, 768 embedding dim.
  - Modern enhancements:
    - RoPE (Rotary Positional Embeddings) instead of absolute learned positional embeddings.
    - KVCache support for efficient autoregressive generation (a minimal sketch follows this list).
    - RMSNorm or LayerNorm normalization (configurable).
    - Weight tying between the embedding and output projection layers.
    - Gradient checkpointing support for memory efficiency.
    - AMP (Automatic Mixed Precision) training enabled by default.
- Tokenizer: Custom byte-level BPE tokenizer with GPT-2 regex-based pre-tokenization for correct handling of contractions, punctuation, and whitespace.
- Training Pipeline: Custom `Trainer` loop integrated with Weights & Biases (WandB) for experiment tracking.
- Data Handling: Efficient streaming and processing of large datasets (e.g., Hugging Face `fineweb-edu`).
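On the KVCache point: during generation, the keys and values of already-processed tokens are cached so each new token only computes attention against stored history instead of re-encoding the whole prefix. Below is a minimal sketch of the idea, with illustrative names rather than this repository's actual `KVCache` API:

```python
import torch
import torch.nn.functional as F

def attend_with_cache(q, k_new, v_new, cache=None):
    """One decoding step of cached self-attention.

    q, k_new, v_new: (batch, heads, 1, head_dim) projections for the new token.
    cache: optional tuple (k_past, v_past) of shape (batch, heads, t_past, head_dim).
    Returns the attention output for the new token and the updated cache.
    """
    if cache is not None:
        k_past, v_past = cache
        k = torch.cat([k_past, k_new], dim=2)  # append new key to cached keys
        v = torch.cat([v_past, v_new], dim=2)
    else:
        k, v = k_new, v_new
    # No causal mask needed: the single query token attends only to cached
    # positions, all of which lie in its past.
    out = F.scaled_dot_product_attention(q, k, v)
    return out, (k, v)

# Illustrative shapes only: batch=1, heads=12, head_dim=64 (768 / 12).
q = torch.randn(1, 12, 1, 64)
k = torch.randn(1, 12, 1, 64)
v = torch.randn(1, 12, 1, 64)
out, cache = attend_with_cache(q, k, v)         # first token: cache is built
out, cache = attend_with_cache(q, k, v, cache)  # later tokens: cache is reused
```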
```
├── data/                    # Dataset storage and tokenizer files
│   └── raw/
├── notebooks/               # Experimentation and training notebooks
│   └── model_train.ipynb
├── scripts/                 # Standalone utility scripts
│   └── 01_data_preprocessing.py
├── src/                     # Source code
│   ├── callbacks/           # Training callbacks
│   ├── configs/             # Configuration dataclasses (GPTConfig)
│   ├── data/                # Dataset, tokenizer, and loading utilities
│   ├── inference/           # Generation logic (chat)
│   ├── model/               # Model definition (Transformer, Attention, RoPE)
│   ├── training/            # Trainer loop implementation
│   └── utils/               # Helpers (caching, etc.)
└── requirements.txt         # Dependencies
```
- Clone the repository:
  ```bash
  git clone https://github.com/your-username/gpt-project.git
  cd gpt-project
  ```
- Create a virtual environment (recommended):
  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
You can preprocess your data using the provided scripts or load it directly via the notebooks. The default setup uses `fineweb-edu`.
```bash
python scripts/01_data_preprocessing.py
```
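For orientation, the core of such a preprocessing pass usually streams the dataset and writes token IDs to disk. The sketch below is an illustration under stated assumptions, not the actual contents of `01_data_preprocessing.py`: it uses the public `HuggingFaceFW/fineweb-edu` dataset and its `sample-10BT` config, `tiktoken`'s GPT-2 encoding as a stand-in for the repository's custom tokenizer, and a made-up output filename.

```python
import os
import numpy as np
import tiktoken
from datasets import load_dataset

# Stream the dataset so the full corpus never has to fit on disk or in RAM.
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

enc = tiktoken.get_encoding("gpt2")  # stand-in; the repo ships its own BPE
eot = enc.eot_token                  # <|endoftext|> separates documents

tokens = []
for i, example in enumerate(ds):
    tokens.extend(enc.encode_ordinary(example["text"]) + [eot])
    if i >= 1_000:                   # small demo slice; remove for a full pass
        break

# uint16 is enough for a 50,257-token vocabulary and halves the file size.
os.makedirs("data/raw", exist_ok=True)
np.array(tokens, dtype=np.uint16).tofile("data/raw/fineweb_edu_sample.bin")
```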
The main training entry point is currently organized in the Jupyter notebook `notebooks/model_train.ipynb`.

- Open the notebook:
  ```bash
  code notebooks/model_train.ipynb
  ```
- Configure parameters in `src/configs/config.py` or override them in the notebook (see the snippet after this list).
- Run the cells to initialize the model and tokenizer and start the training loop.
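For example, a notebook cell could override the defaults by constructing the dataclass directly. The import path assumes you run from the repository root, and the smaller values are purely illustrative:

```python
from src.configs.config import GPTConfig

# Hypothetical smoke-test configuration: shorter context, shallower model.
cfg = GPTConfig(block_size=256, num_layers=6, num_heads=6, embedding_dim=384)
print(cfg)
```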
The model hyperparameters are defined in `src/configs/config.py`. The current default is set to replicate GPT-2 Small:
```python
@dataclass
class GPTConfig:
    vocab_size: int = 50257
    block_size: int = 1024     # Context window
    embedding_dim: int = 768   # Hidden size
    num_heads: int = 12        # Attention heads
    num_layers: int = 12       # Transformer layers
    # ...
```

The model deviates from vanilla GPT-2 in specific modern ways to improve performance:
- RoPE: Rotary Positional Embeddings are applied to the queries and keys in the attention mechanism, allowing better generalization to sequence lengths longer than those seen during training (a minimal sketch follows this list).
- Custom BPE: A from-scratch implementation of the Byte-Pair Encoding algorithm, ensuring full control over the vocabulary construction.
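To make the RoPE point concrete: rotary embeddings rotate each (even, odd) pair of query/key channels by a position-dependent angle, so relative offsets show up directly in the attention dot product. Below is a minimal sketch of the standard formulation, not necessarily how `src/model` factors it:

```python
import torch

def rope(x, theta: float = 10000.0):
    """Apply rotary positional embeddings.

    x: (batch, heads, seq_len, head_dim) queries or keys; head_dim must be even.
    """
    b, h, t, d = x.shape
    # One rotation frequency per pair of channels.
    freqs = 1.0 / (theta ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.arange(t, dtype=torch.float32)[:, None] * freqs[None, :]  # (t, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]   # even / odd channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # 2-D rotation of each pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 12, 16, 64)
q_rot = rope(q)  # same shape, now position-aware
```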
Training metrics (Loss, Learning Rate, Gradient Norm) are logged automatically to WandB. Ensure you have an account and are logged in:
```bash
wandb login
```
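Inside the training loop, that logging typically boils down to a `wandb.init` call plus periodic `wandb.log` calls. The sketch below is illustrative only; the project name, config keys, and metric values are not the exact ones the `Trainer` uses:

```python
import wandb

# Illustrative run setup; the Trainer wires this up with real values.
wandb.init(project="gpt2-small-fineweb-edu", config={"lr": 6e-4, "block_size": 1024})

for step in range(10):                    # stand-in for the real training loop
    loss, lr, grad_norm = 2.5, 6e-4, 1.0  # placeholder metrics
    wandb.log({"loss": loss, "lr": lr, "grad_norm": grad_norm}, step=step)

wandb.finish()
```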