Evaluating frontier AI R&D capabilities of language model agents against human experts
We’re releasing RE-Bench, a new benchmark for measuring the performance of humans and frontier model agents on ML research engineering tasks. We also share data from 71 human expert attempts and results for Anthropic’s Claude 3.5 Sonnet and OpenAI’s o1-preview, including full transcripts of all runs.

Full paper | GitHub repo

Each of the 7 environments in the benchmark is centered around a research engineering task, such as fitting a scaling law or optimizing a GPU kernel. The environments were selected in consultation with ML researchers in academia and top industry labs for realism and coverage. In each environment, the agent, which can be a model or a human, is given access to a computer (often with several GPUs) and a scoring function (e.g., maximizing accuracy on a dataset or making a training script run faster).
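As a rough illustration of this setup (not the benchmark's actual API), an environment of this kind could expose a scoring function the agent calls repeatedly within its time budget, with the final result taken as the best score achieved. The sketch below is a minimal, hypothetical interface; names like ResearchEnv and ScoreResult are assumptions, not code from the RE-Bench repository.

```python
# Hypothetical sketch of an RE-Bench-style environment interface.
# Names and structure are illustrative assumptions, not METR's actual code.
import time
from dataclasses import dataclass, field


@dataclass
class ScoreResult:
    score: float          # higher is better (e.g., accuracy, or negative runtime)
    timestamp: float      # when the solution was scored
    details: str = ""     # optional diagnostics returned to the agent


@dataclass
class ResearchEnv:
    """Gives the agent a task description and a scoring function it can
    call as often as it likes within a fixed time budget."""
    description: str
    time_budget_hours: float
    history: list = field(default_factory=list)

    def score(self, solution_path: str) -> ScoreResult:
        """Evaluate the agent's current solution and record the result."""
        value = self._evaluate(solution_path)   # task-specific evaluation
        result = ScoreResult(score=value, timestamp=time.time())
        self.history.append(result)
        return result

    def best_score(self) -> float:
        """The reported score is the best achieved over all attempts."""
        return max((r.score for r in self.history), default=float("-inf"))

    def _evaluate(self, solution_path: str) -> float:
        # Task-specific: e.g., time a GPU kernel or run eval on a held-out set.
        raise NotImplementedError
```

Under this framing, an agent's trajectory is simply a sequence of solution attempts and score calls, which is what the released transcripts record for both humans and models.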