Show HN: I built a tiny LLM to demystify how language models work
A ~9M-parameter LLM that talks like a small fish.
This project exists to show that training your own language model is not magic. No PhD required. No massive GPU cluster. One Colab notebook, 5 minutes, and you have a working LLM that you built from scratch — data generation, tokenizer, model architecture, training loop, and inference. If you can run a notebook, you can train a language model.
It won't produce a billion-parameter model that writes essays. But it will show you exactly how every piece works — from raw text to trained weights to generated output — so the big models stop feeling like black boxes.
You> hi guppy
Guppy> hi there. i just found a nice spot near the rock. the temperature feels nice.
You> are you hungry
Guppy> yes. always yes. i will swim to the top right now. i promise to eat all of it.
You> do you like bubbles
Guppy> i love bubbles. they make the water feel slightly different.
You> what is the meaning of life
Guppy> food. the answer is always food.
You> tell me a joke
Guppy> what did the fish say when it hit the wall. dam.
You> do you love me
Guppy> you're my favorite big shape. my mouth are happy when you're here.
You> goodnight guppy
Guppy> ok sleep time. i was following a bubble but now i'll stop. goodnight tank. goodnight water.
What is GuppyLM?
GuppyLM is a tiny language model that pretends to be a fish named Guppy. It speaks in short, lowercase sentences about water, food, light, and tank life. It doesn't understand human abstractions like money, phones, or politics — and it's not trying to.
It's trained from scratch on 60K synthetic conversations across 60 topics, runs on a single GPU in ~5 minutes, and produces a model small enough to run in a browser.
Architecture
Parameters 8.7M
Layers 6
Hidden dim 384
Heads 6
FFN 768 (ReLU)
Vocab 4,096 (BPE)
Max sequence 128 tokens
Norm LayerNorm
Position Learned embeddings
LM head Weight-tied with embeddings
Vanilla transformer. No GQA, no RoPE, no SwiGLU, no early exit. As simple as it gets.
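As a sanity check, the 8.7M figure can be reproduced from the table's own numbers. The exact layout of biases and the final LayerNorm is an assumption about the implementation, but with the standard choices the arithmetic lands on the stated count:

```python
# Back-of-the-envelope parameter count for the table above.
# Bias/LayerNorm layout is assumed; the LM head is weight-tied with the
# token embeddings, so it adds no parameters of its own.
VOCAB, D, LAYERS, FFN, MAX_SEQ = 4096, 384, 6, 768, 128

token_emb = VOCAB * D                    # also serves as the LM head (tied)
pos_emb = MAX_SEQ * D                    # learned position embeddings
attn = 4 * (D * D + D)                   # q, k, v, out projections + biases
ffn = (D * FFN + FFN) + (FFN * D + D)    # up + down projections + biases
norms = 2 * 2 * D                        # two LayerNorms (weight + bias) per block
per_block = attn + ffn + norms
final_norm = 2 * D

total = token_emb + pos_emb + LAYERS * per_block + final_norm
print(total)  # 8,726,016 ≈ 8.7M
```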
Personality
Guppy:

- Speaks in short, lowercase sentences
- Experiences the world through water, temperature, light, vibrations, and food
- Doesn't understand human abstractions
- Is friendly, curious, and a little dumb
- Thinks about food a lot
60 topics: greetings, feelings, temperature, food, light, water, tank, noise, night, loneliness, bubbles, glass, reflection, breathing, swimming, colors, taste, plants, filter, algae, snails, scared, excited, bored, curious, happy, tired, outside, cats, rain, seasons, music, visitors, children, meaning of life, time, memory, dreams, size, future, past, name, weather, sleep, friends, jokes, fear, love, age, intelligence, health, singing, TV, and more.
Quick Start
Chat with Guppy (no training needed)
Downloads the pre-trained model from HuggingFace and lets you chat. Just run all cells.
Train your own
- Set runtime to T4 GPU
- Run all cells — downloads dataset, trains tokenizer, trains model, tests it
- Upload to HuggingFace or download locally
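The training loop uses a cosine learning-rate schedule (see train.py in the project structure). A minimal sketch of what such a schedule looks like — the base LR, warmup length, and function name here are illustrative assumptions, not the project's actual values:

```python
import math

def cosine_lr(step, total_steps, base_lr=3e-4, warmup_steps=100):
    """Linear warmup, then cosine decay to ~0. Hyperparameters are assumed."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps           # ramp up linearly
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# LR peaks at the end of warmup and decays smoothly toward zero.
print(cosine_lr(100, 1000))   # peak
print(cosine_lr(1000, 1000))  # end of training
```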
Chat locally
pip install torch tokenizers
python -m guppylm chat

You> the cat is looking at you
Guppy> i hide behind the plant when the furry one comes.
You> it is raining outside
Guppy> i think rain is the best thing about outside.
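Under the hood, each reply is produced by repeatedly sampling the next token from the model's output distribution. A minimal temperature-sampling step over a toy vocabulary might look like this — a sketch only; the actual decode logic in inference.py may differ:

```python
import math
import random

def sample_next(logits, rng, temperature=0.8):
    # Temperature-scaled softmax, then one draw from the distribution.
    # Lower temperature sharpens the distribution toward the top token.
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    r = rng.random()
    acc = 0.0
    for i, e in enumerate(exps):
        acc += e / total
        if r < acc:
            return i
    return len(logits) - 1                   # guard against rounding

# With one dominant logit, the draw almost always picks that token.
print(sample_next([10.0, 0.0, 0.0], random.Random(0)))
```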
Dataset
arman-bd/guppylm-60k-generic on HuggingFace.
Samples 60,000 (57K train / 3K test)
Format
{"input": "...", "output": "...", "category": "..."}
Categories 60
Generation Synthetic template composition
from datasets import load_dataset

ds = load_dataset("arman-bd/guppylm-60k-generic")
print(ds["train"][0])
# {'input': 'hi guppy', 'output': 'hello. the water is nice today.', 'category': 'greeting'}
Project Structure
guppylm/
├── config.py          Hyperparameters (model + training)
├── model.py           Vanilla transformer
├── dataset.py         Data loading + batching
├── train.py           Training loop (cosine LR, AMP)
├── generate_data.py   Conversation data generator (60 topics)
├── eval_cases.py      Held-out test cases
├── prepare_data.py    Data prep + tokenizer training
└── inference.py       Chat interface

tools/
├── make_colab.py      Generates guppy_colab.ipynb
├── export_dataset.py  Push dataset to HuggingFace
└── dataset_card.md    HuggingFace dataset README
Design Decisions
Why no system prompt? Every training sample had the same one. A 9M model can't conditionally follow instructions — the personality is baked into the weights. Removing it saves ~60 tokens per inference.
Why single-turn only? Multi-turn degraded at turn 3-4 due to the 128-token context window. A fish that forgets is on-brand, but garbled output isn't. Single-turn is reliable.
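The degradation point follows from simple budget arithmetic. With an assumed average message length (the per-message token count below is illustrative, not measured from the real tokenizer), the 128-token window fills up after a few exchanges:

```python
# Why multi-turn degrades: a rough budget for the 128-token window.
# TOKENS_PER_MESSAGE is an illustrative assumption, not a measurement.
MAX_SEQ = 128
TOKENS_PER_MESSAGE = 20  # assumed average for one user line or one reply

exchanges_that_fit = MAX_SEQ // (2 * TOKENS_PER_MESSAGE)
print(exchanges_that_fit)  # 3 -- consistent with degradation at turn 3-4
```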
Why vanilla transformer? GQA, SwiGLU, RoPE, and early exit add complexity that doesn't help at 9M params. Standard attention + ReLU FFN + LayerNorm produces the same quality with simpler code.
Why synthetic data? A fish character with consistent personality needs consistent training data. Template composition with randomized components (30 tank objects, 17 food types, 25 activities) generates ~16K unique outputs from ~60 templates.
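Template composition is easy to sketch. The templates and slot fillers below are hypothetical stand-ins (the real generator uses 30 tank objects, 17 food types, and 25 activities across ~60 templates), but the mechanism is the same: a handful of templates times a few slot values already multiplies into many distinct outputs:

```python
import random

# Hypothetical slot fillers -- the real generate_data.py uses far more.
OBJECTS = ["rock", "plant", "filter", "glass", "snail shell"]
FOODS = ["flakes", "pellets", "bloodworms"]
ACTIVITIES = ["chasing a bubble", "hiding", "swimming laps"]

TEMPLATES = [
    "i was {activity} near the {object}.",
    "is it {food} time. i hope it is {food} time.",
    "the {object} looks different today. i will keep watching it.",
]

def compose(rng: random.Random) -> str:
    # Pick a template, then fill every slot it uses from the pools.
    t = rng.choice(TEMPLATES)
    return t.format(
        object=rng.choice(OBJECTS),
        food=rng.choice(FOODS),
        activity=rng.choice(ACTIVITIES),
    )

rng = random.Random(0)
samples = {compose(rng) for _ in range(1000)}
# 3 templates yield at most 3*5 + 3 + 5 = 23 distinct outputs here;
# the real 60-template generator reaches ~16K.
print(len(samples))
```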
License
MIT