Evaluating Different Fewshot Description Prompts on GPT-3
We evaluate different fewshot prompts on GPT-3 (and GPT-Neo) to see how prompt choice affects performance.
Adam Shimi suggested the idea of trying different fewshot prompts on GPT-3, in the hope of observing evidence that larger models can handle a wider variety of prompting. He also wrote up a number of prompts to try on SST.
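To make the setup concrete, here is a minimal sketch of how a fewshot SST prompt might be assembled. The task description and example sentences below are hypothetical stand-ins, not the actual prompts used in the experiment:

```python
# Sketch of assembling a few-shot sentiment prompt for SST.
# The description and examples are illustrative placeholders.

def build_fewshot_prompt(description, examples, query):
    """Join a task description, labeled examples, and the query sentence."""
    lines = [description, ""]
    for sentence, label in examples:
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Sentence: {query}")
    lines.append("Sentiment:")  # the model completes this line
    return "\n".join(lines)

prompt = build_fewshot_prompt(
    "Classify each sentence as positive or negative.",
    [("A moving and heartfelt film.", "positive"),
     ("A tedious, lifeless slog.", "negative")],
    "An unexpectedly delightful ride.",
)
print(prompt)
```

Varying the description line while holding the examples fixed is the kind of prompt variation being compared here.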
Unfortunately, the results were somewhat mixed: the GPT-2 models all performed terribly, so their results were basically useless; performance wasn't monotonic with model size (1.3B did better than 2.7B, and babbage did better than curie); and in general, variance increased along with performance.
| Model | Mean Accuracy | Standard Deviation in Accuracy |
|---|---|---|
| gpt3-ada | 51.9 | 0.0368 |
| gpt3-babbage | 69.4 | 0.0840 |
| gpt3-curie | 67.4 | 0.0807 |
| neo-1.3B | 63.0 | 0.0522 |
| neo-2.7B | 56.5 | 0.0684 |
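The per-model summary statistics are just the mean and sample standard deviation of accuracy across prompts. A minimal sketch with made-up per-prompt accuracies (the real values come from running each prompt through the evaluation harness):

```python
import statistics

# Hypothetical per-prompt SST accuracies for one model, for illustration only.
accuracies = [0.62, 0.58, 0.66, 0.61, 0.65]

mean_acc = statistics.mean(accuracies)
std_acc = statistics.stdev(accuracies)  # sample standard deviation across prompts
print(f"mean={mean_acc:.3f} std={std_acc:.4f}")
```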
However, there was one interesting and unexpected result: there's basically no correlation between models in which prompts do best. This is highly unexpected, because a priori I'd expect models trained on the same or similar data to have similar preferences for what kinds of prompts work well, and that surely some prompts must simply be better than others in general.
Here's what that looks like plotted out. Each point in these plots is one prompt, and the axes are different models. The values are SST accuracy:
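The no-correlation claim can be checked numerically as well as visually, by computing the Pearson correlation between two models' per-prompt accuracy vectors. A sketch with hypothetical accuracies (one entry per prompt); a near-zero value would mirror the scatter plots:

```python
import math

# Hypothetical per-prompt accuracies for two models, for illustration only.
model_a = [0.70, 0.52, 0.66, 0.58, 0.73, 0.61]
model_b = [0.55, 0.68, 0.57, 0.71, 0.54, 0.66]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(f"prompt-accuracy correlation: {pearson(model_a, model_b):.3f}")
```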
The code for the experiment is here.