Evaluating Different Fewshot Description Prompts on GPT-3
We evaluate different fewshot prompts on GPT-3 (and GPT-Neo) to see how prompt choice affects performance.
Adam Shimi suggested the idea of trying different fewshot prompts on GPT-3, in the hope of observing evidence that larger models can handle a wider variety of prompting. He also wrote up a number of prompts to try on SST.
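To make the setup concrete, here is a minimal sketch of how a fewshot SST prompt might be assembled. The task description and example sentences below are hypothetical stand-ins, not the actual prompts used in the experiment:

```python
# Sketch of assembling a few-shot sentiment prompt for SST.
# The description and examples are illustrative placeholders.

def build_fewshot_prompt(description, examples, query):
    """Join a task description, labeled examples, and the query sentence."""
    lines = [description, ""]
    for sentence, label in examples:
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Sentence: {query}")
    lines.append("Sentiment:")  # the model completes this line
    return "\n".join(lines)

prompt = build_fewshot_prompt(
    "Classify each sentence as positive or negative.",
    [("A moving and heartfelt film.", "positive"),
     ("A tedious, lifeless slog.", "negative")],
    "An unexpectedly delightful ride.",
)
print(prompt)
```

Varying the description line while holding the examples fixed is the kind of prompt variation being compared here.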
Unfortunately, the results were somewhat mixed: the GPT-2 models all performed terribly, so their results were basically useless; performance wasn't monotonic with model size (1.3B did better than 2.7B, and babbage did better than curie); and in general, variance increased along with performance.
| Model | Mean Accuracy | Standard Deviation in Accuracy |
|---|---|---|
| gpt3-ada | 51.9 | 0.0368 |
| gpt3-babbage | 69.4 | 0.0840 |
| gpt3-curie | 67.4 | 0.0807 |
| neo-1.3B | 63.0 | 0.0522 |
| neo-2.7B | 56.5 | 0.0684 |
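The per-model summary statistics are just the mean and sample standard deviation of accuracy across prompts. A minimal sketch with made-up per-prompt accuracies (the real values come from running each prompt through the evaluation harness):

```python
import statistics

# Hypothetical per-prompt SST accuracies for one model, for illustration only.
accuracies = [0.62, 0.58, 0.66, 0.61, 0.65]

mean_acc = statistics.mean(accuracies)
std_acc = statistics.stdev(accuracies)  # sample standard deviation across prompts
print(f"mean={mean_acc:.3f} std={std_acc:.4f}")
```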
However, there was one interesting and unexpected result: there's basically no correlation between models in which prompts do best. This is highly unexpected, because a priori I'd expect models trained on the same or similar data to have similar preferences for what kinds of prompts work well, and that surely some prompts must simply be better than others in general.
Here's what that looks like plotted out. Each point in these plots is one prompt, and the axes are different models. The values are SST accuracy:
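The no-correlation claim can be checked numerically as well as visually, by computing the Pearson correlation between two models' per-prompt accuracy vectors. A sketch with hypothetical accuracies (one entry per prompt); a near-zero value would mirror the scatter plots:

```python
import math

# Hypothetical per-prompt accuracies for two models, for illustration only.
model_a = [0.70, 0.52, 0.66, 0.58, 0.73, 0.61]
model_b = [0.55, 0.68, 0.57, 0.71, 0.54, 0.66]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(f"prompt-accuracy correlation: {pearson(model_a, model_b):.3f}")
```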
The code for the experiment is here.