The Journalism Benchmark Cookbook: A Template for Benchmarking LLMs in Newsrooms - Generative AI in the Newsroom
Source: Generative AI in the Newsroom, via Google News: https://news.google.com/rss/articles/CBMixgFBVV95cUxOU3J3X1AwcEVKVzJRbEFLUl81c0laR2xHbVd1YjV5dmR0NVhuclRLa3RReThibWh4dWpPSkZFV1lLOGRhdkpmcGQ1bU1aY2xPVGpiNUtSRVNodUtCQkpweWV5SDdrRXF0RjlVdnRwYTRKMmNvbWk5OTJrR0JVNDRYMW9OTm9SNURMbS1fZlhlcXlGUUI4d2ZJdkJEZGtZUER1T09LaHV5UmQyVjBUMG9zYTkwcGhwZloxUFF4NlhPNnhMX05OUFE?oc=5

More about: benchmark
We audited LoCoMo: 6.4% of the answer key is wrong and the judge accepts up to 63% of intentionally wrong answers
Projects are still submitting new scores on LoCoMo as of March 2026. We audited it and found that 6.4% of the answer key is wrong, and that the LLM judge accepts up to 63% of intentionally wrong answers. LongMemEval-S is often raised as an alternative, but each question's corpus fits entirely in modern context windows, making it more of a context-window test than a memory test. Here's what we found.

LoCoMo
LoCoMo (Maharana et al., ACL 2024) is one of the most widely cited long-term memory benchmarks. We conducted a systematic audit of the ground truth and identified 99 score-corrupting errors in 1,540 questions (6.4%). Error categories include hallucinated facts in the answer key, incorrect temporal reasoning, and speaker attribution errors. Examples: the answer key specifies "Ferrari 488 GTB," bu…
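The judge-acceptance figure above implies a simple audit harness: feed the judge deliberately corrupted answers and measure how many it still accepts. A minimal sketch of that idea (all names here are hypothetical; `judge_fn` stands in for an actual LLM call, and this is not LoCoMo's real evaluation code):

```python
def adversarial_acceptance_rate(judge_fn, qa_pairs, corrupt_fn):
    """Fraction of intentionally wrong answers the judge still accepts.

    judge_fn(question, gold, candidate) -> bool  (stand-in for an LLM judge)
    qa_pairs: list of (question, gold_answer) tuples
    corrupt_fn(gold_answer) -> str  (produces a deliberately wrong answer)
    """
    accepted = 0
    for question, gold in qa_pairs:
        wrong = corrupt_fn(gold)  # intentionally incorrect candidate answer
        if judge_fn(question, gold, wrong):
            accepted += 1
    return accepted / len(qa_pairs)


# Example of how a lenient judge gets fooled: it only checks that the
# candidate mentions the first word of the gold answer.
lenient_judge = lambda q, gold, cand: gold.split()[0] in cand
swap_model_number = lambda gold: gold.replace("488", "458")
```

A judge that scores 1.0 here is accepting every corrupted answer, which is exactly the failure mode the audit describes: keyword overlap passing for correctness.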

AI Safety at the Frontier: Paper Highlights of February & March 2026
tl;dr
Paper of the month: A benchmark of 56 model organisms with hidden behaviors finds that auditing-tool rankings depend heavily on how the organism was trained; the investigator agent, not the tools, is the bottleneck.

Research highlights:
- Linear "emotion vectors" in Claude causally drive misalignment: "desperate" steering raises blackmail from 22% to 72%; "calm" steering drops it to 0%.
- Emergent misalignment is the optimizer's preferred solution: more efficient and more stable than staying narrowly misaligned.
- Scheming propensity in realistic settings is near 0%, but can increase dramatically from a single prompt snippet or tool change.
- AI self-monitors are up to 5× more likely to approve an action shown as their own prior turn, driven by implicit cues rather than stated authorship.
- Reasoning models…
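The "emotion vector" result describes linear activation steering: adding a scaled direction vector to a model's hidden state. A toy NumPy sketch of just the arithmetic (names are hypothetical; in the actual technique the vector is derived from contrastive prompts and added inside a transformer's residual stream, not to a bare array):

```python
import numpy as np

def apply_steering(hidden, direction, alpha):
    """Shift a hidden-state vector along a unit-normalized direction.

    hidden:    the activation to modify
    direction: the (unnormalized) steering direction, e.g. an "emotion vector"
    alpha:     steering strength; positive pushes toward the concept,
               negative pushes away from it
    """
    unit = direction / np.linalg.norm(direction)
    return hidden + alpha * unit
```

The sign and magnitude of `alpha` are what the "desperate" vs. "calm" interventions vary: the same direction, scaled positively or negatively.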

Compute Curse
Epistemic status: romantic speculation. The core claim: compute growth can be rather neatly analogized to natural resource abundance.

Before compute curse, there was resource curse
Countries that discover oil often end up worse off than countries that don't; this is known as the resource curse. The mechanisms are well understood: a booming resource sector draws capital and labor away from other industries, creates incentives for rent-seeking over productive investment, crowds out human capital development, and corrodes the institutions needed to sustain long-term growth. I argue that something structurally similar has been happening with compute. The exponential growth of available computation over the past several decades, and, critically, the widespread expe…


