Recently, AI researcher Andrej Karpathy released a tiny but powerful project: MicroGPT.
A full GPT-style model.
Written in roughly 243 lines of pure Python.
No deep learning frameworks.
No GPU dependencies.
No complex infrastructure.
Just the raw algorithm.
For many engineers, it became something like a “Hello World” moment for large language models.
Because it proves something surprising:
The core logic behind modern AI systems is not thousands of files of code.
It’s a small set of mathematical ideas repeated at scale.
And that realization often leads developers to experiment.
Which is exactly what happened next.

Training a GPT-Style Model on Healthcare Data
Inspired by this minimalist approach, I tried something different.
Instead of training on names or generic text, I trained a tiny GPT-style model on structured medical data.
Specifically:
ICD-10 codes and their descriptions pulled from ICD10Data.com.
The dataset came from a multi-tab Google Sheet.
Each sheet contained fragments of structured healthcare data that needed to be merged programmatically.
Once cleaned and unified, the dataset looked like this:
97,633 merged rows
77-token vocabulary
4,928 parameters
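The merge itself needs nothing beyond the standard library. Here is a minimal sketch, assuming each tab was exported as a CSV with `code` and `description` columns (the filenames, function name, and column names below are hypothetical, not the experiment's actual pipeline):

```python
import csv

def merge_sheets(paths):
    """Merge rows from several CSV exports into one deduplicated list."""
    seen = set()
    merged = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                key = (row["code"], row["description"])
                if key not in seen:  # drop exact duplicates across tabs
                    seen.add(key)
                    merged.append(row)
    return merged
```

Each merged row can then be serialized as a `code | description` line for training.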
And just like MicroGPT, the entire training process ran with:
No PyTorch
No TensorFlow
No Hugging Face Transformers
Just pure Python with a simple autograd implementation.
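"Simple autograd" here means a scalar reverse-mode engine in the spirit of Karpathy's micrograd: each number remembers how it was computed, so gradients can flow backward through the graph. A stripped-down sketch (not the exact implementation used):

```python
class Value:
    """A scalar that records its computation graph for backpropagation."""
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        order, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()
```

With add and multiply (plus a few more ops like exp and log for the real loss), this is enough to train the model by gradient descent.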
The Model Architecture
The architecture was intentionally minimal.
Embedding → Linear → Softmax
That’s it.
The model operated at the character level: it learned from individual characters rather than words or subword tokens.
There was also:
No attention mechanism.
No context memory.
No transformer layers.
In other words, this was a deliberately constrained model.
The goal wasn’t performance.
The goal was understanding the learning dynamics.
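The Embedding → Linear → Softmax pipeline can be sketched in pure Python. The dimensions below are illustrative, not the experiment's actual hyperparameters:

```python
import math
import random

VOCAB = 77  # character vocabulary size (from the dataset)
DIM = 16    # embedding width (illustrative)

random.seed(0)
emb = [[random.gauss(0, 0.02) for _ in range(DIM)] for _ in range(VOCAB)]
W = [[random.gauss(0, 0.02) for _ in range(VOCAB)] for _ in range(DIM)]

def forward(char_id):
    """One step: look up the embedding, project to logits, normalize."""
    x = emb[char_id]                                    # Embedding
    logits = [sum(x[i] * W[i][j] for i in range(DIM))   # Linear
              for j in range(VOCAB)]
    m = max(logits)                                     # stability shift
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]                    # Softmax
```

Note what this implies: with no attention and no context, the predicted next-character distribution depends only on the single current character.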
What the Model Learned
After training, the model began generating outputs that looked like ICD-style medical entries:
A01.03 | Typhoid pneumonia
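Samples like this come from repeatedly drawing the next character from the model's output distribution. A hedged sketch of that loop, where `next_probs` stands in for the trained model's forward pass (the function names here are hypothetical):

```python
import random

def sample(next_probs, stoi, itos, start="A", max_len=40, seed=0):
    """Generate text by repeatedly sampling the next character
    from the model's next-character distribution."""
    rng = random.Random(seed)
    out = start
    ch = start
    while len(out) < max_len:
        probs = next_probs(stoi[ch])  # model's softmax output
        idx = rng.choices(range(len(itos)), weights=probs)[0]
        ch = itos[idx]
        if ch == "\n":                # newline ends one generated entry
            break
        out += ch
    return out
```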
Were they perfect?
Not even close.
The training loss hovered around 4.34, which is close to a random baseline.
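That 4.34 figure is not arbitrary. A model that guesses uniformly over a 77-token vocabulary has a cross-entropy of ln(77) nats, which works out to almost exactly the observed loss:

```python
import math

vocab_size = 77
random_baseline = math.log(vocab_size)  # cross-entropy of a uniform guess, in nats
print(round(random_baseline, 2))        # 4.34
```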
But something interesting still happened.
Even this tiny 5K-parameter model began learning the statistical structure of the dataset.
It learned:
• The ICD formatting pattern
• Code-description relationships
• Character distribution in medical terminology
In other words, the model learned structure without understanding.
And that’s exactly how large language models start.
What This Reveals About Large Models
When people look at systems like GPT-4 or future models like GPT-5, they often assume something mysterious is happening.
But the underlying mechanism is not magic.
The difference between a tiny notebook experiment and a frontier model comes down to five factors:
Scale
Attention mechanisms
Training data
Optimization techniques
Compute power
The mathematics is largely the same.
The magnitude is dramatically different.
Why Training Even a Tiny Model Matters
Running experiments like this changes how you think about AI systems.
When you train even a microscopic model yourself, several insights become obvious:
Why loss behaves the way it does.
Why context windows matter.
Why architecture often matters more than dataset size.
Why character-level modeling struggles with structured reasoning.
Why prompting works only when a model has memory and context.
These lessons are hard to grasp when working exclusively with API-based models.
But they become very clear when you build systems from the ground up.
What This Experiment Proved
The project produced several practical insights for healthcare AI.
Structured medical data can be programmatically merged and ingested into language models.
Domain-specific token distributions are statistically learnable.
Even complex-looking formats like ICD codes behave statistically like a structured language.
Most importantly:
Model architecture often limits capability more than dataset size.
What Comes Next
The next iteration of this experiment will introduce several upgrades:
Token-level modeling instead of characters.
Transformer attention layers.
Context windows that allow the model to reason across multiple tokens.
Because once context enters the system, everything changes.
Language models stop predicting isolated characters and begin modeling relationships between concepts.
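The attention step planned above can be sketched in pure Python: each position scores every position in the context by similarity, and the softmax-weighted average of their values becomes the output. A minimal single-query sketch (illustrative, not the planned implementation):

```python
import math

def attend(query, keys, values):
    """Scaled dot-product attention for one query over a short context."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)                        # stability shift
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]    # softmax over positions
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]
```

The output is a blend of the whole context, weighted by relevance, which is precisely the "relationships between concepts" that a context-free character model cannot capture.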
The Bigger Lesson for Healthcare AI
Much of the current conversation around healthcare AI focuses on using APIs.
Plugging models into workflows.
Calling inference endpoints.
But building meaningful AI systems in healthcare requires something deeper.
It requires understanding the mathematics well enough to adapt the models to domain-specific data.
Experiments like this demonstrate that the journey toward advanced healthcare AI doesn’t always begin with massive infrastructure.
Sometimes it starts with a small dataset.
A tiny model.
And a few hundred lines of code.
From there, everything else scales.


