GPT Training Implementation

A PyTorch implementation of a GPT-2 style autoregressive language model (124M-parameter configuration). The project focuses on clarity, modular code, and efficient training on custom datasets such as FineWeb-Edu.

🚀 Key Features

  • Architecture:

    • GPT-2 Small spec: 12 layers, 12 heads, 768-dimensional embeddings.
    • Modern Enhancements:
      • RoPE (Rotary Positional Embeddings) instead of learned absolute position embeddings.
      • KV cache support for efficient autoregressive generation at inference time (see the sketch after this list).
      • Configurable normalization: RMSNorm or LayerNorm.
      • Weight tying between the embedding and output projection layers.
    • Gradient checkpointing support for memory efficiency.
    • AMP (Automatic Mixed Precision) training enabled by default.
  • Tokenizer: Custom byte-level BPE tokenizer with GPT-2 regex-based pre-tokenization for correct handling of contractions, punctuation, and whitespace.

  • Training Pipeline: Custom Trainer loop integrated with Weights & Biases (WandB) for experiment tracking.

  • Data Handling: Efficient streaming and processing of large datasets (e.g., Hugging Face fineweb-edu).
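As a minimal sketch of KV-cached generation: the forward signature model(inp, kv_cache=...) and argument names below are illustrative, not the repository's exact API.

import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, temperature=1.0):
    # Cache keys/values so each step only processes the newest token
    # instead of re-encoding the full prefix.
    kv_cache = None
    for _ in range(max_new_tokens):
        inp = idx if kv_cache is None else idx[:, -1:]    # full prompt first, then one token
        logits, kv_cache = model(inp, kv_cache=kv_cache)  # assumed forward signature
        logits = logits[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_token], dim=1)
    return idx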

📂 Project Structure

├── data/               # Dataset storage and tokenizer files
│   └── raw/           
├── notebooks/          # Experimentation and training notebooks
│   └── model_train.ipynb
├── scripts/            # Standalone utility scripts
│   └── 01_data_preprocessing.py
├── src/                # Source code
│   ├── callbacks/      # Training callbacks
│   ├── configs/        # Configuration dataclasses (GPTConfig)
│   ├── data/           # Dataset, Tokenizer, Loading utilities
│   ├── inference/      # Generation logic (chat)
│   ├── model/          # Model definition (Transformer, Attention, RoPE)
│   ├── training/       # Trainer loop implementation
│   └── utils/          # Helpers (caching, etc.)
└── requirements.txt    # Dependencies

🛠️ Installation

  1. Clone the repository:

    git clone https://github.com/olavodd42/microGPT.git
    cd microGPT
  2. Create a virtual environment (recommended):

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:

    pip install -r requirements.txt

🏃 Usage

1. Data Preparation

You can preprocess your data using the provided scripts or load it directly via the notebooks. The default setup uses fineweb-edu.

python scripts/01_data_preprocessing.py
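The default corpus is fineweb-edu streamed from the Hugging Face Hub. If you adapt the preprocessing to your own data, the general streaming pattern looks roughly like this (the dataset config name and the place where the tokenizer would be called are assumptions; check scripts/01_data_preprocessing.py for the actual interface):

from datasets import load_dataset

# Stream the corpus so it never has to be fully downloaded up front.
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

for example in ds.take(3):
    text = example["text"]
    print(text[:80])   # tokenizer.encode(text) would go here in the real script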

2. Training

The main training entry point currently lives in the Jupyter notebook notebooks/model_train.ipynb.

  1. Open the notebook:
    code notebooks/model_train.ipynb
  2. Configure parameters in src/configs/config.py or override them in the notebook.
  3. Run the cells to initialize the model, tokenizer, and start the training loop.
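AMP training is enabled by default. As a self-contained illustration of the mixed-precision training-step pattern (a toy linear model stands in for the GPT; this is not the repository's actual Trainer):

import torch
import torch.nn as nn

# Toy stand-in for the GPT model, purely to illustrate the AMP pattern.
model = nn.Linear(768, 50257).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 768, device="cuda")
y = torch.randint(0, 50257, (8,), device="cuda")

with torch.autocast(device_type="cuda", dtype=torch.float16):
    logits = model(x)                            # forward pass runs in half precision
    loss = nn.functional.cross_entropy(logits, y)

scaler.scale(loss).backward()    # scale the loss to avoid fp16 gradient underflow
scaler.step(optimizer)           # unscale gradients and apply the optimizer step
scaler.update()
optimizer.zero_grad(set_to_none=True)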

3. Configuration

The model hyperparameters are defined in src/configs/config.py. The defaults replicate GPT-2 Small:

@dataclass
class GPTConfig:
    vocab_size: int = 50257
    block_size: int = 1024    # Context Window
    embedding_dim: int = 768  # Hidden Size
    num_heads: int = 12       # Attention Heads
    num_layers: int = 12      # Transformer Layers
    # ...
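Because GPTConfig is a dataclass, individual fields can be overridden per experiment, for example (import path taken from the project structure above):

from dataclasses import replace
from src.configs.config import GPTConfig

cfg = GPTConfig()                          # GPT-2 Small defaults
short_ctx = replace(cfg, block_size=512)   # override a single field for an experiment
print(short_ctx.block_size, short_ctx.embedding_dim)   # 512 768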

🧠 Model Architecture Details

The model deviates from vanilla GPT-2 in a few modern ways to improve performance:

  • RoPE: Rotary Positional Embeddings are applied to queries and keys in the attention mechanism, allowing better generalization to sequence lengths longer than those seen during training; a minimal sketch follows this list.
  • Custom BPE: A from-scratch implementation of the Byte-Pair Encoding algorithm, ensuring full control over the vocabulary construction.
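The sketch below shows the standard RoPE rotation applied to queries and keys; it mirrors the common formulation and is not necessarily this repository's exact implementation.

import torch

def rope_rotate(x, base=10000.0):
    # x: (batch, heads, seq, head_dim), head_dim even.
    # Rotate channel pairs by a position-dependent angle.
    b, h, t, d = x.shape
    pos = torch.arange(t, dtype=torch.float32)
    inv_freq = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = torch.outer(pos, inv_freq)            # (seq, head_dim / 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.stack([x1 * cos - x2 * sin,
                       x1 * sin + x2 * cos], dim=-1)
    return out.flatten(-2)                         # back to (batch, heads, seq, head_dim)

# Applied to queries and keys before the attention scores are computed:
q = torch.randn(1, 12, 16, 64)
k = torch.randn(1, 12, 16, 64)
q, k = rope_rotate(q), rope_rotate(k)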

📊 Experiment Tracking

Training metrics (loss, learning rate, gradient norm) are logged automatically to WandB. Make sure you have an account and are logged in:

wandb login
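Within the training loop, metrics are sent via the usual WandB calls, roughly like this (the project and run names are placeholders, and the logged values are dummy numbers for illustration):

import wandb

wandb.init(project="microgpt", name="gpt2-small-run")   # placeholder names
for step in range(3):
    # In the real trainer these values come from the training loop itself.
    wandb.log({"loss": 4.0, "lr": 3e-4, "grad_norm": 1.2}, step=step)
wandb.finish()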

📝 License

MIT License
