Published: Wed 15 April 2026
Updated: Wed 15 April 2026
By John McCardle
In programming.
What I Learned Training My First Transformer
I had a question: can you make a code-generation model with a vocabulary 50 times smaller than GPT-2's? I spent two weeks finding out.
The Question
Traditional code LLMs have 50,000+ token vocabularies. Most of those tokens are rare identifiers that appear once and never again. A function called calculate_quarterly_revenue_adjustment gets its own token, shows up in one file, and never helps the model learn anything generalizable. What if you could get away with 1,024 tokens by replacing all the identifiers with symbolic placeholders?
Instead of:
def calculate_total(items):
    return sum(item.price for item in items)
You'd train on:
def <|SYMBOL|><|0|>(<|SYMBOL|><|1|>):
    return <|SYMBOL|><|2|>(<|SYMBOL|><|3|>.<|SYMBOL|><|4|> for <|SYMBOL|><|3|> in <|SYMBOL|><|1|>)
The model learns structure -- how code fits together -- without wasting vocabulary on names it can't predict anyway. I went further and added <|TEXT|> placeholders for all prose content -- comments, docstrings, anything that wasn't syntactically meaningful. The model would only ever see code structure and Python reserved words.
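The transformation above can be sketched with Python's own tokenizer. This is an illustrative reimplementation of the idea, not the project's actual pipeline -- but it reproduces the example from this post, mapping repeated identifiers to the same number:

```python
import io
import keyword
import tokenize

# Token types that carry no structural information for this demo.
SKIP = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
        tokenize.DEDENT, tokenize.ENDMARKER}

def symbolize(source: str) -> list[str]:
    """Replace identifiers with numbered <|SYMBOL|> placeholders and
    prose (strings, comments) with <|TEXT|>. Repeated identifiers map
    to the same number so the model can track co-reference."""
    symbols: dict[str, int] = {}
    out: list[str] = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIP:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            idx = symbols.setdefault(tok.string, len(symbols))
            out.append(f"<|SYMBOL|><|{idx}|>")
        elif tok.type in (tokenize.STRING, tokenize.COMMENT):
            out.append("<|TEXT|>")
        else:
            out.append(tok.string)
    return out
```

Running it on the `calculate_total` example yields the placeholder sequence shown above, with `item` and `items` each reusing their assigned index.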
It's the kind of idea that either works brilliantly or teaches you something important. This one did the second thing.
The Approach
The architecture was an encoder-decoder setup: a frozen ModernBERT encoder feeding into a small trainable decoder. ModernBERT already understands code structure from pre-training, so the theory was that I just needed a decoder small enough to train on my home GPU. Freeze the expensive part, train the cheap part.
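In PyTorch terms, "freeze the expensive part, train the cheap part" is a few lines. This toy module stands in for the real ModernBERT encoder and custom decoder (layer sizes here are illustrative, not the project's):

```python
import torch
from torch import nn

class EncoderDecoder(nn.Module):
    """Frozen encoder feeding a small trainable decoder (sketch)."""

    def __init__(self, d_model: int = 64):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Freeze the expensive part: no gradients flow into the encoder,
        # so the optimizer only ever touches decoder weights.
        for p in self.encoder.parameters():
            p.requires_grad = False

    def trainable_params(self):
        return [p for p in self.parameters() if p.requires_grad]
```

Only `trainable_params()` gets handed to the optimizer, which is what makes a 27M-50M decoder feasible on a single home GPU.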
The vocabulary consisted of: Python reserved words, special control tokens (a string placeholder, a comment placeholder), the <|SYMBOL|> tokens for identifiers, and numbered placeholders so the model could track which symbols referred to the same variable. That's it. 1,024 tokens total, where GPT-2 uses 50,257.
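A vocabulary like that can be assembled mechanically. The categories below come from the description above; the specific special tokens, punctuation set, and reserved-slot padding are my guesses at plausible choices, not the project's exact token list:

```python
import keyword

def build_vocab(num_symbols: int = 256, vocab_size: int = 1024) -> list[str]:
    """Sketch of a fixed-size symbolic vocabulary for Python code."""
    specials = ["<|PAD|>", "<|BOS|>", "<|EOS|>", "<|TEXT|>", "<|SYMBOL|>"]
    # Python reserved words (soft keywords like 'match' included, 3.9+).
    keywords = keyword.kwlist + keyword.softkwlist
    numbers = [f"<|{i}|>" for i in range(num_symbols)]   # symbol indices
    punct = list("()[]{}:,.=+-*/%<>@&|^~;")              # structural tokens
    vocab = specials + keywords + numbers + punct
    # Pad out to the fixed size with reserved slots.
    vocab += [f"<|RESERVED_{i}|>" for i in range(vocab_size - len(vocab))]
    return vocab
```

Even with every keyword, 256 symbol indices, and all the structural punctuation, the list comes in a few hundred entries under 1,024 -- which is the whole point.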
I had a curriculum learning plan: Phase 1 would train on scrambled code to learn pure syntax, Phase 2 on properly ordered code to learn semantics, and Phase 3 on instruction-following to make it actually useful. In retrospect, this was ambitious for a first transformer training project.
I named the model variants after Arizona mining towns, because this was speculative research and I wanted psychological permission to let branches become "ghost towns" if they didn't pan out. It's easier to abandon COMSTOCK than to abandon "my_model_v3_final_FINAL."
| Model | Encoder | Decoder | Best Loss | Parse Rate |
|------------|------------------|---------------------|-----------|------------|
| COMSTOCK | ModernBERT-base | 27M (6L, 8H, 512d) | 1.37 | 67-80% |
| BISBEE_02 | ModernBERT-large | 27M | 1.22 | 85% |
| BISBEE_03 | ModernBERT-base | 50M (8L, 8H, 640d) | 1.18 | 90-91% |
| BISBEE_03a | ModernBERT-base | 50M | 1.16 | ~90% |
| BISBEE_04 | ModernBERT-large | 50M | 1.17 | 89% |
The progression tells a story. COMSTOCK was the baseline proof-of-concept where I was still fixing tokenizer bugs and figuring out how plateau detection worked. BISBEE was the systematic exploration -- I tried bigger encoders, bigger decoders, and different training strategies. BISBEE_03a skipped the curriculum entirely and went straight to Phase 2 training, which turned out to work almost as well.
The best model (BISBEE_03a) generates syntactically valid Python 9 times out of 10 with a vocabulary of just 1,024 tokens.
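"Parse rate" here means the fraction of generated samples that are syntactically valid Python, which the standard library can check directly. This is how such a number could be measured -- the project's actual evaluation harness is an assumption on my part:

```python
import ast

def parse_rate(samples: list[str]) -> float:
    """Fraction of generated source strings that parse as valid Python."""
    ok = 0
    for src in samples:
        try:
            ast.parse(src)   # raises SyntaxError on invalid code
            ok += 1
        except SyntaxError:
            pass
    return ok / len(samples) if samples else 0.0
```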
What Worked
Decoder size matters more than encoder size. This was the clearest finding. BISBEE_03 (50M decoder with base encoder) outperformed BISBEE_02 (27M decoder with large encoder). BISBEE_04 confirmed it -- giving the decoder 50M params and using a large encoder didn't help. When in doubt, put the parameters where the generation happens.
1,024 tokens is enough for code structure. I honestly wasn't sure this would work. The symbolic vocabulary captures the structural patterns of code without needing to know what anything is called. Python has enough syntactic regularity that 1,024 tokens can describe how it fits together.
Plateau detection with early stopping prevented the models from overtraining. I added this after COMSTOCK and it turned out to be one of the most important decisions in the whole project. Simple but essential.
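The plateau logic itself is simple enough to fit in a small class. This is a sketch of the kind of early stopping described -- the patience and delta values are illustrative, not the project's settings:

```python
class PlateauStopper:
    """Stop training when validation loss hasn't improved by at least
    min_delta for `patience` consecutive evaluations."""

    def __init__(self, patience: int = 3, min_delta: float = 0.01):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.stale = 0

    def should_stop(self, loss: float) -> bool:
        if loss < self.best - self.min_delta:
            self.best = loss      # meaningful improvement: reset the counter
            self.stale = 0
        else:
            self.stale += 1       # within-noise change: one strike
        return self.stale >= self.patience
```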
What Didn't Work
Catastrophic forgetting killed the curriculum. The whole plan depended on sequential training: learn syntax, then semantics, then instruction-following. COMSTOCK's Phase 3 jumped from 1.42 to 1.84 loss in a single epoch -- the model forgot everything from Phase 2. It was like watching two weeks of training evaporate. I designed a fix (mixed-phase training where 50% of each batch came from Phase 2), code-named TOMBSTONE. Infrastructure ready, never trained. More on why in a moment.
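The TOMBSTONE fix amounts to a replay sampler: every batch mixes new Phase 3 examples with replayed Phase 2 examples so the old distribution never leaves the gradient. A minimal sketch of the idea (the function and argument names are made up; only the 50/50 mixing ratio comes from the plan above):

```python
import random

def mixed_batch(phase2_pool, phase3_pool, batch_size=8,
                replay_frac=0.5, rng=None):
    """Build one batch where replay_frac of examples are replayed
    from Phase 2 to fight catastrophic forgetting."""
    rng = rng or random.Random()
    n_replay = int(batch_size * replay_frac)
    batch = (rng.sample(phase2_pool, n_replay)
             + rng.sample(phase3_pool, batch_size - n_replay))
    rng.shuffle(batch)   # don't let the model see phase boundaries
    return batch
```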
My home office became a sauna. Multi-day GPU training on my RTX blocked voice transcription and daily work, and the room hit 90 degrees Fahrenheit during training runs. I'd start a training run before bed and wake up to a tropical bedroom and a model that had plateaued at hour three. This project is honestly the single biggest reason I started pursuing a PhD -- not because the research questions require institutional guidance, but because I need institutional compute resources. I need 80+ GB of VRAM and a room that stays below 80 degrees.
The prose problem has no cheap solution. Even a perfect symbolic code generator can't name things. The model generates correct structures, but <|SYMBOL|><|24|> is not a helpful variable name. You'd still need downstream models for identifier naming, comments, and documentation. I knew this going in, but experiencing it made the limitation feel more fundamental than I'd expected.
What It Led To
The project ran from October 19 to November 3, 2025 -- two weeks of active development that taught me what's tractable in transformer training and what isn't. More practically, the encoder architecture experiments fed directly into modernbert-nli-heads, a published HuggingFace model for natural language inference. The intuitions about what ModernBERT's representations can and can't do came straight from watching BISBEE variants succeed and fail.
The Mining Town Mindset
I think the naming convention captured something important about how I approach research now. Not every vein has gold. COMSTOCK was a proof of concept that taught me what the real problems were. BISBEE was systematic exploration -- staking claims in different directions to see which had ore. TOMBSTONE was a promising lead I haven't followed up on yet, and that's fine.
The point is to prospect systematically, document what you find, and move on when the yield doesn't justify the cost. Two weeks and five models taught me that extreme vocabulary compression is viable but insufficient alone. The model can learn that Python has structure; it can't learn what the structure means to humans. That's a real finding, even if it's not the one I was hoping for.
This article was scaffolded with backblog.