Live
Black Hat USADark ReadingBlack Hat AsiaAI BusinessThe jury verdicts against Meta and YouTube recognized some platform design features as defective, distinct from what Section 230 was created to protect (Casey Newton/Platformer)TechmemeA profile of Pakistan Virtual Assets Regulatory Authority Chairman Bilal Bin Saqib, who has used crypto diplomacy to help Pakistan win over President Trump (Faseeh Mangi/Bloomberg)TechmemeClaude Code Source Leaked via npm Packaging Error, Anthropic Confirms - thehackernews.comGoogle News: ClaudeSocial media platforms differ in transparency on defamation and AI issues - japantimes.co.jpGoogle News: Generative AIMad Bugs: Vim vs. Emacs vs. ClaudeHacker NewsBuild a Price Comparison Tool in 15 Minutes with the Marketplace Price APIDEV CommunityKubernetes - A Beginner's Guide to Container OrchestrationDEV CommunityGamers push back against Nvidia’s new AI tool redesigning female characters - Startup DailyGoogle News: Machine Learning5 Free Copilot Alternatives That Actually Work in 2026DEV CommunityGoogle rolls out AI Inbox feature to organise emails in Gmail: Report - Business StandardGoogle News: GeminiCodiumAI vs Codium (Open Source): They Are NOT the SameDEV CommunityNews - Realising AI’s promise depends on protecting its foundations - ST EngineeringGoogle News: AI SafetyBlack Hat USADark ReadingBlack Hat AsiaAI BusinessThe jury verdicts against Meta and YouTube recognized some platform design features as defective, distinct from what Section 230 was created to protect (Casey Newton/Platformer)TechmemeA profile of Pakistan Virtual Assets Regulatory Authority Chairman Bilal Bin Saqib, who has used crypto diplomacy to help Pakistan win over President Trump (Faseeh Mangi/Bloomberg)TechmemeClaude Code Source Leaked via npm Packaging Error, Anthropic Confirms - thehackernews.comGoogle News: ClaudeSocial media platforms differ in transparency on defamation and AI issues - japantimes.co.jpGoogle News: Generative AIMad Bugs: Vim vs. Emacs vs. ClaudeHacker NewsBuild a Price Comparison Tool in 15 Minutes with the Marketplace Price APIDEV CommunityKubernetes - A Beginner's Guide to Container OrchestrationDEV CommunityGamers push back against Nvidia’s new AI tool redesigning female characters - Startup DailyGoogle News: Machine Learning5 Free Copilot Alternatives That Actually Work in 2026DEV CommunityGoogle rolls out AI Inbox feature to organise emails in Gmail: Report - Business StandardGoogle News: GeminiCodiumAI vs Codium (Open Source): They Are NOT the SameDEV CommunityNews - Realising AI’s promise depends on protecting its foundations - ST EngineeringGoogle News: AI Safety

Evaluating frontier AI R&D capabilities of language model agents against human experts

METR BlogNovember 22, 202421 min read0 views
Source Quiz

We’re releasing RE-Bench, a new benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks. We also share data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s o1-preview, including full transcripts of all runs. Full paper | Github repo Each of the 7 environments in the benchmark is centered around a research engineering task, such as fitting a scaling law or optimizing a GPU kernel. The environments were selected in consultation with ML researchers in academia and top industry labs for realism and coverage. In each environment, the agent, which can be a model or a human, is given access to a computer (often with several GPUs), a scoring function (e.g., maximizing accuracy on a dataset or making a tra

Fetching article from METR Blog…

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by AI News Hub · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

Knowledge Map

Knowledge Map
TopicsEntitiesSource
Evaluating …claudemodellanguage mo…transformerbenchmarktrainingMETR Blog

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Building knowledge graph…

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!