Language Modeling Reading List (to Start Your Paper Club)
Some fundamental papers and a one-sentence summary for each; start your own paper club!
Some friends and I started a weekly paper club to read and discuss fundamental papers in language modeling. By pooling together our shared knowledge, experience, and questions, we learned more as a group than we could have individually. To encourage others to do the same, here’s a year’s worth of language modeling papers and a one-sentence summary for each. (Also, why and how to read papers.)
See an error or have a paper to suggest? Raise a PR or issue here!
• • •
- Attention Is All You Need: Query, Key, and Value are all you need* (*Also position embeddings, multiple heads, feed-forward layers, skip-connections, etc.)
- GPT: Improving Language Understanding by Generative Pre-Training: Decoder is all you need* (*Also, pre-training + finetuning)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: Encoder is all you need*. Left-to-right language modeling is NOT all you need. (*Also, pre-training + finetuning)
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: Encoder-only or decoder-only is NOT all you need, though text-to-text is all you need* (*Also, pre-training + finetuning)
- GPT2: Language Models are Unsupervised Multitask Learners: Unsupervised pre-training is all you need?!
- GPT3: Language Models are Few-Shot Learners: Unsupervised pre-training + a few* examples is all you need. (*From 5 examples, in Conversational QA, to 50 examples in Winogrande, PhysicalQA, and TriviaQA)
- Scaling Laws for Neural Language Models: Larger models trained on less data* are what you need. (*10x more compute should be spent on a 5.5x larger model and 1.8x more tokens)
- Chinchilla: Training Compute-Optimal Large Language Models: Smaller models trained on more data* are what you need. (*10x more compute should be spent on a 3.2x larger model and 3.2x more tokens)
- LLaMA: Open and Efficient Foundation Language Models: Smaller models trained longer—on public data—are all you need
- InstructGPT: Training language models to follow instructions with human feedback: 40 labelers are all you need* (*Plus supervised fine-tuning, reward modeling, and PPO)
- LoRA: Low-Rank Adaptation of Large Language Models: One rank is all you need
- QLoRA: Efficient Finetuning of Quantized LLMs: 4-bit is all you need* (*Plus double quantization and paged optimizers)
- DPR: Dense Passage Retrieval for Open-Domain Question Answering: Dense embeddings are all you need* (*Also, high precision retrieval)
- RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks: Semi-parametric models* are all you need (*Dense vector retrieval as non-parametric component; pre-trained LLM as parametric component)
- RETRO: Improving language models by retrieving from trillions of tokens: Retrieving based on input chunks and chunked cross attention are all you need
- Internet-augmented language models through few-shot prompting for open-domain question answering: Google Search as retrieval is all you need
- HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels: LLM-generated, hypothetical documents are all you need
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness: For-loops in SRAM are all you need
- ALiBi: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation: Constant bias on the query-key dot-product is all you need* (*Also hyperparameter m and cached Q, K, V representations)
- Codex: Evaluating Large Language Models Trained on Code: Finetuning on code is all you need
- Layer Normalization: Consistent mean and variance at each layer is all you need
- On Layer Normalization in the Transformer Architecture: Pre-layer norm, instead of post-layer norm, is all you need
- PPO: Proximal Policy Optimization Algorithms: Clipping your surrogate function is all you need
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct: Asking the model to make the question harder is all you need* (*Where do they get the responses to these harder questions though?!)
- Llama 2: Open Foundation and Fine-Tuned Chat Models: Iterative finetuning, PPO, rejection sampling, and ghost attention are all you need* (*Also, 27,540 SFT annotations and more than 1 million binary comparison preference data)
- RWKV: Reinventing RNNs for the Transformer Era: Linear attention during inference, via RNNs, is what you need
- RLAIF - Constitutional AI: Harmlessness from AI Feedback: A natural language constitution* and model feedback on harmlessness is all you need (*16 different variants of harmlessness principles)
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer: Noise in your softmax and expert regularization are all you need
- CLIP: Learning Transferable Visual Models From Natural Language Supervision: A projection layer between text and image embeddings is all you need* (*Also, 400 million image-text pairs)
- ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: Flattened 2D patches are all you need
- Generative Agents: Interactive Simulacra of Human Behavior: Reflection, memory, and retrieval are all you need
- Out-of-Domain Finetuning to Bootstrap Hallucination Detection: Open-source, permissive-use data is what you need
- DPO: Direct Preference Optimization: Your Language Model is Secretly a Reward Model: A separate reward model is NOT what you need
- Consistency Models: Mapping to how diffusion adds Gaussian noise to images is all you need
- LCM: Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference: Consistency modeling in latent space is all you need* (*Also, a diffusion model to distill from)
- LCM-LoRA: A Universal Stable-Diffusion Acceleration Module: Combining LoRAs is all you need
- Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models: Asking the LLM to reflect on retrieved documents is all you need
- Emergent Abilities of Large Language Models: The Bitter Lesson is all you need
- Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions: The Bellman equation and replay buffers are all you need
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations: Classification guidelines and the multiple-choice response are all you need
- ReST^EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models: Synthetic data and a reward function are all you need
- Mixture of Experts Explained: Conditional computation and sparsity are all you need
- SPIN: Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models: Generator and discriminator are all you need.
- Self-Instruct: Aligning Language Models with Self-Generated Instructions: 54% valid instruction-input-output tuples is all you need.
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling: Well documented, publicly available model checkpoints are all you need.
- Self-Rewarding Language Models: Asking the model to evaluate itself is all you need.
- Building Your Own Product Copilot - Challenges, Opportunities, and Needs: Prompt engineering LLMs is NOT all you need.
- Matryoshka Representation Learning: Aggregated losses across 2^n-dim embeddings is all you need.
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems: Bigger GPUs are NOT all you need.
- How to Generate and Use Synthetic Data for Finetuning: Synthetic data is almost all you need.
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision: 680k hours of audio and multitask training formulated as a sequence are all you need.
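To make a few of these one-liners concrete, here are some minimal sketches. First, the LoRA entry: instead of finetuning a full weight matrix W, you learn a low-rank update BA. This is a toy numpy illustration (dimensions and variable names are my own, not from the paper), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 64, 4          # layer dims; rank r << d, k
W = rng.normal(size=(d, k))  # frozen pretrained weight

# LoRA trains only the low-rank factors B and A.
B = np.zeros((d, r))              # zero init, so training starts exactly at W
A = rng.normal(size=(r, k)) * 0.01

def lora_forward(x, scale=1.0):
    """Forward pass through the frozen weight plus the low-rank update."""
    return x @ (W + scale * (B @ A)).T

x = rng.normal(size=(2, k))
# Because B is zero at init, the adapted layer matches the frozen one.
assert np.allclose(lora_forward(x), x @ W.T)
# Trainable parameters: d*r + r*k instead of d*k.
print(f"trainable fraction: {(d * r + r * k) / (d * k):.3f}")
```

The appeal is the parameter count: at rank 4 on a 64x64 layer, you train roughly 12% of the weights; at LLM scale the fraction is far smaller.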
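The ALiBi entry can be sketched similarly: rather than position embeddings, each head subtracts a head-specific slope m times the query-key distance from the attention scores. A single-head numpy sketch (my own simplification; the paper also specifies a geometric recipe for the per-head slopes, omitted here):

```python
import numpy as np

def alibi_scores(scores, m):
    """Add ALiBi's linear bias to one head's causal attention scores:
    each query-key pair is penalized by m times its distance."""
    seq_len = scores.shape[-1]
    i = np.arange(seq_len)[:, None]     # query positions
    j = np.arange(seq_len)[None, :]     # key positions
    bias = -m * (i - j).astype(float)   # 0 on the diagonal, -m, -2m, ... further back
    bias = np.where(j <= i, bias, -np.inf)  # causal mask: never attend to the future
    return scores + bias

out = alibi_scores(np.zeros((4, 4)), m=0.5)
assert out[3, 3] == 0.0     # no penalty for attending to the current token
assert out[3, 2] == -0.5    # one step back costs m
```

Because the bias grows with distance rather than with absolute position, the same scores extrapolate to sequences longer than those seen in training.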
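The layer normalization entry ("consistent mean and variance at each layer") is compact enough to write out in full; this is a plain numpy version of the standard formulation, with scalar gamma/beta for brevity where real implementations learn per-feature vectors:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each example's features to zero mean and unit variance,
    then apply a learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(2, 8))
y = layer_norm(x)
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)
assert np.allclose(y.std(axis=-1), 1.0, atol=1e-2)
```

Unlike batch norm, the statistics are computed per example over the feature dimension, so it behaves identically at train and inference time and is indifferent to batch size.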
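"Clipping your surrogate function," from the PPO entry, refers to taking the pessimistic minimum of the clipped and unclipped policy-ratio terms. A scalar numpy sketch of the per-sample objective (argument names are my own):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: take the minimum of the unclipped and
    clipped terms, so moving the policy ratio outside [1-eps, 1+eps]
    yields no additional objective improvement."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# An over-eager update (ratio 1.5, positive advantage) is capped at (1+eps)*A.
assert ppo_clip_objective(1.5, 1.0) == 1.2
# A ratio inside the trust region passes through unchanged.
assert ppo_clip_objective(1.1, 1.0) == 1.1
```

The min makes the clipping one-sided in the direction that matters: the objective can only ever be a lower bound on the unclipped surrogate, which discourages destructively large policy steps.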
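Finally, "your language model is secretly a reward model" from the DPO entry: the policy's log-probability ratios against a frozen reference act as an implicit reward, so the preference loss needs no separate reward model or PPO loop. A scalar sketch of the loss on one preference pair (function and argument names are my own):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid of beta times
    the difference of the policy's log-ratios against the reference model."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy prefers the chosen answer more than the reference does:
# positive margin, loss below the log(2) chance level.
assert dpo_loss(-1.0, -3.0, -2.0, -2.5, beta=1.0) < math.log(2)
```

Driving this loss down pushes up the chosen completion's likelihood relative to the rejected one, with beta controlling how far the policy may drift from the reference.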
If you found this useful, please cite this write-up as:
Yan, Ziyou. (Jan 2024). Language Modeling Reading List (to Start Your Paper Club). eugeneyan.com. https://eugeneyan.com/writing/llm-reading-list/.
or
@article{yan2024default,
  title   = {Language Modeling Reading List (to Start Your Paper Club)},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2024},
  month   = {Jan},
  url     = {https://eugeneyan.com/writing/llm-reading-list/}
}