The Journalism Benchmark Cookbook: A Template for Benchmarking LLMs in Newsrooms - Generative AI in the Newsroom
Source: Generative AI in the Newsroom, via Google News: https://news.google.com/rss/articles/CBMixgFBVV95cUxOU3J3X1AwcEVKVzJRbEFLUl81c0laR2xHbVd1YjV5dmR0NVhuclRLa3RReThibWh4dWpPSkZFV1lLOGRhdkpmcGQ1bU1aY2xPVGpiNUtSRVNodUtCQkpweWV5SDdrRXF0RjlVdnRwYTRKMmNvbWk5OTJrR0JVNDRYMW9OTm9SNURMbS1fZlhlcXlGUUI4d2ZJdkJEZGtZUER1T09LaHV5UmQyVjBUMG9zYTkwcGhwZloxUFF4NlhPNnhMX05OUFE?oc=5

More about: benchmark
We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers
Projects are still submitting new scores on LoCoMo as of March 2026. We audited it and found that 6.4% of the answer key is wrong, and that the LLM judge accepts up to 63% of intentionally wrong answers. LongMemEval-S is often raised as an alternative, but each question's corpus fits entirely in modern context windows, making it more of a context-window test than a memory test. Here's what we found.

LoCoMo
LoCoMo (Maharana et al., ACL 2024) is one of the most widely cited long-term memory benchmarks. We conducted a systematic audit of the ground truth and identified 99 score-corrupting errors in 1,540 questions (6.4%). Error categories include hallucinated facts in the answer key, incorrect temporal reasoning, and speaker attribution errors. Examples: the answer key specifies "Ferrari 488 GTB," bu…
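The judge-acceptance figure above implies a simple audit harness: feed the judge deliberately corrupted answers and measure how many it still accepts. A minimal sketch of that idea (all names here are hypothetical; `judge_fn` stands in for an actual LLM call, and this is not LoCoMo's real evaluation code):

```python
def adversarial_acceptance_rate(judge_fn, qa_pairs, corrupt_fn):
    """Fraction of intentionally wrong answers the judge still accepts.

    judge_fn(question, gold, candidate) -> bool  (stand-in for an LLM judge)
    qa_pairs: list of (question, gold_answer) tuples
    corrupt_fn(gold_answer) -> str  (produces a deliberately wrong answer)
    """
    accepted = 0
    for question, gold in qa_pairs:
        wrong = corrupt_fn(gold)  # intentionally incorrect candidate answer
        if judge_fn(question, gold, wrong):
            accepted += 1
    return accepted / len(qa_pairs)


# Example of how a lenient judge gets fooled: it only checks that the
# candidate mentions the first word of the gold answer.
lenient_judge = lambda q, gold, cand: gold.split()[0] in cand
swap_model_number = lambda gold: gold.replace("488", "458")
```

A judge that scores 1.0 here is accepting every corrupted answer, which is exactly the failure mode the audit describes: keyword overlap passing for correctness.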

AI Safety at the Frontier: Paper Highlights of February & March 2026
tl;dr
Paper of the month: A benchmark of 56 model organisms with hidden behaviors finds that auditing-tool rankings depend heavily on how the organism was trained; the investigator agent, not the tools, is the bottleneck.

Research highlights:
- Linear "emotion vectors" in Claude causally drive misalignment: "desperate" steering raises blackmail from 22% to 72%; "calm" steering drops it to 0%.
- Emergent misalignment is the optimizer's preferred solution: more efficient and more stable than staying narrowly misaligned.
- Scheming propensity in realistic settings is near 0%, but can increase dramatically from a single prompt snippet or tool change.
- AI self-monitors are up to 5× more likely to approve an action shown as their own prior turn, driven by implicit cues rather than stated authorship.
- Reasoning models…
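The "emotion vector" result describes linear activation steering: adding a scaled direction vector to a model's hidden state. A toy NumPy sketch of just the arithmetic (names are hypothetical; in the actual technique the vector is derived from contrastive prompts and added inside a transformer's residual stream, not to a bare array):

```python
import numpy as np

def apply_steering(hidden, direction, alpha):
    """Shift a hidden-state vector along a unit-normalized direction.

    hidden:    the activation to modify
    direction: the (unnormalized) steering direction, e.g. an "emotion vector"
    alpha:     steering strength; positive pushes toward the concept,
               negative pushes away from it
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit
```

The sign and magnitude of `alpha` are what the "desperate" vs. "calm" interventions vary: the same direction, scaled positively or negatively.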

Compute Curse
Epistemic status: romantic speculation. The core claim: compute growth can be rather neatly analogized to natural resource abundance.

Before compute curse, there was resource curse
Countries that discover oil often end up worse off than countries that don't; this is known as the resource curse. The mechanisms are well understood: a booming resource sector draws capital and labor away from other industries, creates incentives for rent-seeking over productive investment, crowds out human capital development, and corrodes the institutions needed to sustain long-term growth. I argue that something structurally similar has been happening with compute. The exponential growth of available computation over the past several decades, and, critically, the widespread expe…


