Extended NYT Connections Benchmark scores: MiniMax-M2.7 34.4, Gemma 4 31B 30.1, Arcee Trinity Large Thinking 29.5
More info: github.com/lechmazur/nyt-connections/
Submitted by /u/zero0_one1
Source: Reddit r/LocalLLaMA
https://www.reddit.com/r/LocalLLaMA/comments/1scl7pl/extended_nyt_connections_benchmark_scores/





