Smuggled Intelligence
by Dan Shipper, in Chain of Thought

Was this newsletter forwarded to you? Sign up (https://every.to/account) to get it in your inbox.
Here’s a question: Are we officially in the part of the movie where human experts lose their livelihoods and we realize we’ve been training our replacements the whole time?
I ask because the current rate of AI progress is both exciting and unsettling.
GPT-5 Pro has begun to cross boundaries that, until recently, felt securely human. This month, it solved Yu Tsumura’s 554th problem—a notoriously tricky exercise in abstract algebra that every major model before it had failed—producing a clean proof in 15 minutes. A week later, the noted quantum computing researcher Scott Aaronson credited GPT-5 with providing a key technical step in a proof he was working on.
OpenAI recently came out with a benchmark called GDPval, which evaluates how well AI performs real expert-level tasks drawn from 44 different occupations. For instance, one asks the model to play the role of a wholesale sales analyst: It needs to audit an Excel file of customer orders to find pricing mismatches and packaging errors, and summarize the findings and recommendations in a short report.
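To make the sales-analyst task concrete, here is a minimal sketch of the kind of check it describes. The column names, SKUs, and figures are invented for illustration; the real GDPval task supplies an actual Excel workbook, which an analyst would typically load with a library like pandas rather than inline records.

```python
# Hypothetical price list and customer orders; in the real GDPval task the
# analyst works from an Excel file, but plain records illustrate the check.
price_list = {
    "A100": {"list_price": 4.50, "units_per_case": 24},
    "B200": {"list_price": 12.00, "units_per_case": 12},
    "C300": {"list_price": 7.25, "units_per_case": 6},
}

orders = [
    {"order_id": 1, "sku": "A100", "billed_price": 4.50, "units_shipped": 24},
    {"order_id": 2, "sku": "B200", "billed_price": 11.00, "units_shipped": 12},  # underbilled
    {"order_id": 3, "sku": "C300", "billed_price": 7.25, "units_shipped": 5},    # short one unit
]

def audit(orders, price_list):
    """Flag orders whose billed price or shipped quantity disagrees with the price list."""
    findings = []
    for order in orders:
        ref = price_list[order["sku"]]
        pricing_mismatch = order["billed_price"] != ref["list_price"]
        packaging_error = order["units_shipped"] != ref["units_per_case"]
        if pricing_mismatch or packaging_error:
            findings.append({
                "order_id": order["order_id"],
                "sku": order["sku"],
                "pricing_mismatch": pricing_mismatch,
                "packaging_error": packaging_error,
            })
    return findings

findings = audit(orders, price_list)
```

The benchmark's point is that this is routine professional work: mechanical cross-referencing plus a written summary of the findings, which is exactly the kind of task models are now scored against experts on.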
Overall, the research showed that GPT-5 was as good as or better than human professionals 40.6 percent of the time. Claude Opus 4.1, meanwhile, was better than human experts a whopping 49 percent of the time.
Cue a slew of headlines like, “OpenAI tool shows AI catching up to human work” from Axios, or, “AI models are already as good as experts at half of tasks, new OpenAI benchmark GDPval suggests” from Fortune.