YapFormer: A Transformer Implementation with Modern Optimizations ✧ദ്ദി(˵ •̀ ᴗ - ˵ ) ✧
YapFormer is a transformer model built entirely from scratch, featuring modern architectural components and efficient training optimizations.
The final model contains ~56 million parameters and was trained for 15,000 steps (~4.5 hours) on the TinyStories dataset.
Despite the small size and short training time, YapFormer produces surprisingly high-quality short stories, demonstrating that well-designed architectures can go a long way even with limited compute.
What is YapFormer? ૮ ◕ ﻌ ◕ა
YapFormer is a from-scratch GPT-style autoregressive transformer that integrates many techniques used in contemporary LLMs:
Rotary Position Embeddings (RoPE)
Grouped Query Attention (GQA)
KV caching for fast inference
RMSNorm
SwiGLU feed-forward layers
Mixed precision training
Gradient accumulation
Cosine learning-rate decay
Gradient clipping
This project serves as both a learning exercise and a practical lightweight generative model.
How It Works ૮₍ • ᴥ • ₎ა
1. Input & Embeddings
Input text is tokenized with a custom tokenizer, and the resulting token IDs are mapped to learned embedding vectors.
RoPE is applied to attention queries/keys instead of absolute positional embeddings.
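A minimal sketch of how a rotate-pair RoPE can be applied to a query or key tensor is shown below. Names and shapes are illustrative assumptions, not necessarily the exact code in this repo:

```python
import torch

def rope_angles(head_dim: int, seq_len: int, base: float = 10000.0):
    # Per-pair inverse frequencies and per-position angles, as in the RoPE paper.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float()
    angles = torch.outer(positions, inv_freq)          # (seq_len, head_dim/2)
    return angles.cos(), angles.sin()

def apply_rope(x, cos, sin):
    # x: (batch, n_heads, seq_len, head_dim); rotate each (even, odd) feature pair
    # by an angle that depends on the token's position.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)
```

The same rotation is applied to both queries and keys before attention scores are computed, so relative position information falls out of their dot product.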
2. Attention (with GQA + KV Cache)
Grouped Query Attention (GQA):
Multiple query heads share a smaller number of key/value heads → faster and more memory-efficient.
KV Caching:
During inference, previous keys/values are stored so the model only attends to new tokens.
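The two ideas combine naturally in the attention forward pass. The sketch below is illustrative (function and argument names are assumptions, not the repo's API): a small set of KV heads is repeated to match the query heads, and cached keys/values are concatenated during decoding so only the new token is processed.

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k_new, v_new, cache=None, n_rep=4):
    # q: (B, n_q_heads, T_new, D); k_new/v_new: (B, n_kv_heads, T_new, D)
    # n_rep = n_q_heads // n_kv_heads
    prefill = cache is None
    if prefill:
        k, v = k_new, v_new
    else:
        past_k, past_v = cache                   # keys/values from earlier steps
        k = torch.cat([past_k, k_new], dim=2)    # append along the time axis
        v = torch.cat([past_v, v_new], dim=2)
    new_cache = (k, v)                           # reused on the next decode step

    # Each KV head serves n_rep query heads: repeat KV along the head axis.
    k = k.repeat_interleave(n_rep, dim=1)
    v = v.repeat_interleave(n_rep, dim=1)

    # A causal mask is only needed for the initial prompt (prefill); during
    # single-token decoding the new query may attend to everything cached.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=prefill)
    return out, new_cache
```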
3. Transformer Blocks
Each block contains:
RMSNorm
Multi-Head Attention (with RoPE, GQA, KV cache)
SwiGLU feed-forward network
Residual connections
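Put together, a block looks roughly like the pre-norm sketch below. This is a minimal illustration under assumed layer names; the attention module is passed in (e.g. a GQA wrapper like the one sketched above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        # Normalize by the root mean square of the features (no mean subtraction).
        rms = torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.gate = nn.Linear(dim, hidden_dim, bias=False)
        self.up = nn.Linear(dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        # SiLU-gated feed-forward: down( silu(gate(x)) * up(x) ).
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Block(nn.Module):
    def __init__(self, dim, hidden_dim, attn):
        super().__init__()
        self.norm1, self.attn = RMSNorm(dim), attn   # attn: any module mapping (B, T, dim) -> (B, T, dim)
        self.norm2, self.ffn = RMSNorm(dim), SwiGLU(dim, hidden_dim)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))   # pre-norm attention + residual
        x = x + self.ffn(self.norm2(x))    # pre-norm SwiGLU FFN + residual
        return x
```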
4. Output Projection
Final RMSNorm
Linear layer → logits → softmax for next-token prediction
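For generation, this head reduces to a few lines. The sketch below assumes simple temperature sampling; the names and decoding strategy are illustrative:

```python
import torch

@torch.no_grad()
def sample_next_token(hidden, final_norm, lm_head, temperature=1.0):
    # hidden: (B, T, dim) output of the last transformer block.
    h = final_norm(hidden)[:, -1, :]                 # normalize, keep last position
    logits = lm_head(h) / temperature                # (B, vocab_size)
    probs = torch.softmax(logits, dim=-1)            # next-token distribution
    return torch.multinomial(probs, num_samples=1)   # sampled token ids, (B, 1)
```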
5. Training Loop
Modern GPU-friendly techniques:
AMP mixed precision for speed + memory efficiency
Gradient accumulation to simulate large batch sizes
Cosine LR decay for smooth convergence
Gradient clipping to prevent instability
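A condensed sketch of how these four pieces typically fit together in one loop. Hyperparameters, the GradScaler usage shown, and the endless batch iterator are assumptions for illustration, not the repo's exact settings:

```python
import math
import torch
import torch.nn.functional as F

def train(model, batches, opt, total_steps, accum_steps=8,
          max_lr=3e-4, min_lr=3e-5, clip_norm=1.0, device="cuda"):
    scaler = torch.cuda.amp.GradScaler()            # loss scaling for fp16 AMP
    for step in range(total_steps):
        # Cosine decay of the learning rate from max_lr down to min_lr.
        lr = min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * step / total_steps))
        for group in opt.param_groups:
            group["lr"] = lr

        opt.zero_grad(set_to_none=True)
        for _ in range(accum_steps):                # gradient accumulation
            x, y = next(batches)                    # assumes an endless batch iterator
            x, y = x.to(device), y.to(device)
            with torch.autocast(device, dtype=torch.float16):   # mixed precision
                logits = model(x)
                loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
            # Divide so the accumulated gradient matches one large batch.
            scaler.scale(loss / accum_steps).backward()

        scaler.unscale_(opt)                        # back to fp32 grads before clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
        scaler.step(opt)
        scaler.update()
```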
Architecture ૮ - ﻌ • ა
Model Structure
```
Token Embedding
        ↓
RoPE Rotary Position Encoding
        ↓
N × [   RMSNorm
        ├── Multi-Head Attention (GQA + KV Cache)
        ↓
        Residual
        ↓
        RMSNorm
        ├── SwiGLU Feed-Forward
        ↓
        Residual   ]
        ↓
Final RMSNorm
        ↓
Linear LM Head
```
Technology Stack ૮ฅ・ﻌ・აฅ
Language: Python
Framework: PyTorch
Built With:
- Custom attention mechanisms
- Custom embeddings
- Custom RMSNorm + SwiGLU layers
- Mixed precision training tools
Ecosystem Tools:
- 🤗 Hugging Face (datasets/tokenization)
- PyTorch (core autograd & tensor ops)
How to Run ૮⎚ﻌ⎚ა
1. Clone the Repository
git clone https://github.com/Aravind-808/YapFormer
cd YapFormer
2. Install Dependencies
pip install -r requirements.txt
3. Generate Text
python inference.py
4. Enter your prompt
Prompt: Once upon a time
5. Example Output
Once upon a time there was a tiny mouse who loved reading stories...
