Language Modeling Reading List (to Start Your Paper Club)
Some fundamental papers and a one-sentence summary for each; start your own paper club!
Some friends and I started a weekly paper club to read and discuss fundamental papers in language modeling. By pooling together our shared knowledge, experience, and questions, we learned more as a group than we could have individually. To encourage others to do the same, here’s a year’s worth of language modeling papers and a one-sentence summary for each. (Also, why and how to read papers.)
See an error or have a paper to suggest? Raise a PR or issue here!
• • •
- Attention Is All You Need: Query, Key, and Value are all you need* (*Also position embeddings, multiple heads, feed-forward layers, skip-connections, etc.)
- GPT: Improving Language Understanding by Generative Pre-Training: Decoder is all you need* (*Also, pre-training + finetuning)
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding: Encoder is all you need*. Left-to-right language modeling is NOT all you need. (*Also, pre-training + finetuning)
- T5: Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer: Encoder-only or decoder-only is NOT all you need, though text-to-text is all you need* (*Also, pre-training + finetuning)
- GPT2: Language Models are Unsupervised Multitask Learners: Unsupervised pre-training is all you need?!
- GPT3: Language Models are Few-Shot Learners: Unsupervised pre-training + a few* examples is all you need. (*From 5 examples, in Conversational QA, to 50 examples in Winogrande, PhysicalQA, and TriviaQA)
- Scaling Laws for Neural Language Models: Larger models trained on less data* are what you need. (*10x more compute should be spent on a 5.5x larger model and 1.8x more tokens)
- Chinchilla: Training Compute-Optimal Large Language Models: Smaller models trained on more data* are what you need. (*10x more compute should be spent on a 3.2x larger model and 3.2x more tokens)
- LLaMA: Open and Efficient Foundation Language Models: Smaller models trained longer—on public data—are all you need
- InstructGPT: Training language models to follow instructions with human feedback: 40 labelers are all you need* (*Plus supervised fine-tuning, reward modeling, and PPO)
- LoRA: Low-Rank Adaptation of Large Language Models: One rank is all you need
- QLoRA: Efficient Finetuning of Quantized LLMs: 4-bit is all you need* (*Plus double quantization and paged optimizers)
- DPR: Dense Passage Retrieval for Open-Domain Question Answering: Dense embeddings are all you need* (*Also, high precision retrieval)
- RAG: Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks: Semi-parametric models* are all you need (*Dense vector retrieval as non-parametric component; pre-trained LLM as parametric component)
- RETRO: Improving language models by retrieving from trillions of tokens: Retrieving based on input chunks and chunked cross attention are all you need
- Internet-augmented language models through few-shot prompting for open-domain question answering: Google Search as retrieval is all you need
- HyDE: Precise Zero-Shot Dense Retrieval without Relevance Labels: LLM-generated, hypothetical documents are all you need
- FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness: For-loops in SRAM are all you need
- ALiBi: Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation: Constant bias on the query-key dot-product is all you need* (*Also hyperparameter m and cached Q, K, V representations)
- Codex: Evaluating Large Language Models Trained on Code: Finetuning on code is all you need
- Layer Normalization: Consistent mean and variance at each layer is all you need
- On Layer Normalization in the Transformer Architecture: Pre-layer norm, instead of post-layer norm, is all you need
- PPO: Proximal Policy Optimization Algorithms: Clipping your surrogate function is all you need
- WizardCoder: Empowering Code Large Language Models with Evol-Instruct: Asking the model to make the question harder is all you need* (*Where do they get the responses to these harder questions though?!)
- Llama 2: Open Foundation and Fine-Tuned Chat Models: Iterative finetuning, PPO, rejection sampling, and ghost attention are all you need* (*Also, 27,540 SFT annotations and more than 1 million binary comparison preference data)
- RWKV: Reinventing RNNs for the Transformer Era: Linear attention during inference, via RNNs, is what you need
- RLAIF - Constitutional AI: Harmlessness from AI Feedback: A natural language constitution* and model feedback on harmlessness is all you need (*16 different variants of harmlessness principles)
- Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer: Noise in your softmax and expert regularization are all you need
- CLIP: Learning Transferable Visual Models From Natural Language Supervision: A projection layer between text and image embeddings is all you need* (*Also, 400 million image-text pairs)
- ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale: Flattened 2D patches are all you need
- Generative Agents: Interactive Simulacra of Human Behavior: Reflection, memory, and retrieval are all you need
- Out-of-Domain Finetuning to Bootstrap Hallucination Detection: Open-source, permissive-use data is what you need
- DPO: Direct Preference Optimization: Your Language Model is Secretly a Reward Model: A separate reward model is NOT what you need
- Consistency Models: Mapping to how diffusion adds Gaussian noise to images is all you need
- LCM: Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference: Consistency modeling in latent space is all you need* (*Also, a diffusion model to distill from)
- LCM-LoRA: A Universal Stable-Diffusion Acceleration Module: Combining LoRAs is all you need
- Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models: Asking the LLM to reflect on retrieved documents is all you need
- Emergent Abilities of Large Language Models: The Bitter Lesson is all you need
- Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions: The Bellman equation and replay buffers are all you need
- Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations: Classification guidelines and the multiple-choice response are all you need
- ReST^EM: Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models: Synthetic data and a reward function are all you need
- Mixture of Experts Explained: Conditional computation and sparsity are all you need
- SPIN: Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models: Generator and discriminator are all you need.
- Self-Instruct: Aligning Language Models with Self-Generated Instructions: 54% valid instruction-input-output tuples is all you need.
- Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling: Well documented, publicly available model checkpoints are all you need.
- Self-Rewarding Language Models: Asking the model to evaluate itself is all you need.
- Building Your Own Product Copilot - Challenges, Opportunities, and Needs: Prompt engineering LLMs is NOT all you need.
- Matryoshka Representation Learning: Aggregated losses across 2^n-dim embeddings is all you need.
- Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems: Bigger GPUs are NOT all you need.
- How to Generate and Use Synthetic Data for Finetuning: Synthetic data is almost all you need.
- Whisper: Robust Speech Recognition via Large-Scale Weak Supervision: 680k hours of audio and multitask training formulated as a sequence are all you need.
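To make a few of these one-liners concrete, here are some minimal sketches. First, the LoRA entry: instead of finetuning a full weight matrix W, you learn a low-rank update BA. This is a toy numpy illustration (dimensions and variable names are my own, not from the paper), not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

d, k, r = 64, 64, 4          # layer dims; rank r << d, k
W = rng.normal(size=(d, k))  # frozen pretrained weight

# LoRA trains only the low-rank factors B and A.
B = np.zeros((d, r))              # zero init, so training starts exactly at W
A = rng.normal(size=(r, k)) * 0.01

def lora_forward(x, scale=1.0):
    """Forward pass through the frozen weight plus the low-rank update."""
    return x @ (W + scale * (B @ A)).T

x = rng.normal(size=(2, k))
# Because B is zero at init, the adapted layer matches the frozen one.
assert np.allclose(lora_forward(x), x @ W.T)
# Trainable parameters: d*r + r*k instead of d*k.
print(f"trainable fraction: {(d * r + r * k) / (d * k):.3f}")
```

The appeal is the parameter count: at rank 4 on a 64x64 layer, you train roughly 12% of the weights; at LLM scale the fraction is far smaller.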
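The ALiBi entry can be sketched similarly: rather than position embeddings, each head subtracts a head-specific slope m times the query-key distance from the attention scores. A single-head numpy sketch (my own simplification; the paper also specifies a geometric recipe for the per-head slopes, omitted here):

```python
import numpy as np

def alibi_scores(scores, m):
    """Add ALiBi's linear bias to one head's causal attention scores:
    each query-key pair is penalized by m times its distance."""
    seq_len = scores.shape[-1]
    i = np.arange(seq_len)[:, None]     # query positions
    j = np.arange(seq_len)[None, :]     # key positions
    bias = -m * (i - j).astype(float)   # 0 on the diagonal, -m, -2m, ... further back
    bias = np.where(j <= i, bias, -np.inf)  # causal mask: never attend to the future
    return scores + bias

out = alibi_scores(np.zeros((4, 4)), m=0.5)
assert out[3, 3] == 0.0     # no penalty for attending to the current token
assert out[3, 2] == -0.5    # one step back costs m
```

Because the bias grows with distance rather than with absolute position, the same scores extrapolate to sequences longer than those seen in training.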
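The layer normalization entry ("consistent mean and variance at each layer") is compact enough to write out in full; this is a plain numpy version of the standard formulation, with scalar gamma/beta for brevity where real implementations learn per-feature vectors:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each example's features to zero mean and unit variance,
    then apply a learned scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(0).normal(loc=5.0, scale=3.0, size=(2, 8))
y = layer_norm(x)
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)
assert np.allclose(y.std(axis=-1), 1.0, atol=1e-2)
```

Unlike batch norm, the statistics are computed per example over the feature dimension, so it behaves identically at train and inference time and is indifferent to batch size.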
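"Clipping your surrogate function," from the PPO entry, refers to taking the pessimistic minimum of the clipped and unclipped policy-ratio terms. A scalar numpy sketch of the per-sample objective (argument names are my own):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: take the minimum of the unclipped and
    clipped terms, so moving the policy ratio outside [1-eps, 1+eps]
    yields no additional objective improvement."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# An over-eager update (ratio 1.5, positive advantage) is capped at (1+eps)*A.
assert ppo_clip_objective(1.5, 1.0) == 1.2
# A ratio inside the trust region passes through unchanged.
assert ppo_clip_objective(1.1, 1.0) == 1.1
```

The min makes the clipping one-sided in the direction that matters: the objective can only ever be a lower bound on the unclipped surrogate, which discourages destructively large policy steps.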
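Finally, "your language model is secretly a reward model" from the DPO entry: the policy's log-probability ratios against a frozen reference act as an implicit reward, so the preference loss needs no separate reward model or PPO loop. A scalar sketch of the loss on one preference pair (function and argument names are my own):

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid of beta times
    the difference of the policy's log-ratios against the reference model."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Policy prefers the chosen answer more than the reference does:
# positive margin, loss below the log(2) chance level.
assert dpo_loss(-1.0, -3.0, -2.0, -2.5, beta=1.0) < math.log(2)
```

Driving this loss down pushes up the chosen completion's likelihood relative to the rejected one, with beta controlling how far the policy may drift from the reference.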
If you found this useful, please cite this write-up as:
Yan, Ziyou. (Jan 2024). Language Modeling Reading List (to Start Your Paper Club). eugeneyan.com. https://eugeneyan.com/writing/llm-reading-list/.
or
@article{yan2024default,
  title   = {Language Modeling Reading List (to Start Your Paper Club)},
  author  = {Yan, Ziyou},
  journal = {eugeneyan.com},
  year    = {2024},
  month   = {Jan},
  url     = {https://eugeneyan.com/writing/llm-reading-list/}
}