b8603
CANN: fix multi-thread set_tensor race conditions (#20151)
- CANN: fix multi-thread set_tensor race conditions
When ollama calls ggml_backend_tensor_set from multiple threads (each writing a different chunk of the same tensor), the CANN backend had three concurrency issues:
- Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform before uploading to device. Per-chunk transforms produced corrupt data.
- ND-to-NZ weight conversion requires complete tensor data on device. Per-chunk conversion operated on incomplete data.
- The global g_nz_workspaces array had unprotected concurrent access.
Fix by introducing a TensorSetTracker that accumulates write progress per tensor. For quantized tensors, raw data is staged in a host buffer and the transform and upload are deferred until all chunks arrive. For NZ weights, chunks are uploaded directly but conversion is deferred. The tracker and its staging buffer are released immediately after post-processing completes.
Add per-device mutex to g_nz_workspaces to prevent data races.
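The deferred-upload idea above can be sketched as follows. This is an illustrative miniature, not the actual CANN backend code: the names `TensorSetTracker`, `tensor_set_chunk`, and `flush_to_device` are hypothetical, and the real backend defers a quantization transform or ND-to-NZ conversion rather than the stub shown here.

```cpp
#include <cstddef>
#include <cstring>
#include <map>
#include <mutex>
#include <vector>

// Records what reached the "device" (test stub standing in for the real
// full-tensor transform + upload, e.g. Q4_0 repack or ND->NZ conversion).
static std::vector<char> g_flushed;
static void flush_to_device(const void *, const char *data, size_t size) {
    g_flushed.assign(data, data + size);
}

// Per-tensor write-progress tracker: chunks are staged host-side, and the
// expensive full-tensor post-processing runs exactly once, after the last
// chunk lands.
struct TensorSetTracker {
    std::vector<char> staging;   // host-side staging buffer for raw chunks
    size_t written = 0;          // bytes accumulated so far
    size_t total   = 0;          // full tensor size in bytes
};

static std::mutex g_tracker_mutex;
static std::map<const void *, TensorSetTracker> g_trackers;

void tensor_set_chunk(const void *tensor, const void *chunk,
                      size_t offset, size_t size, size_t tensor_size) {
    std::lock_guard<std::mutex> lock(g_tracker_mutex);  // protect shared state
    TensorSetTracker &t = g_trackers[tensor];
    if (t.staging.empty()) {
        t.staging.resize(tensor_size);
        t.total = tensor_size;
    }
    std::memcpy(t.staging.data() + offset, chunk, size);
    t.written += size;
    if (t.written == t.total) {
        // All chunks arrived: run the full-tensor transform once, then
        // release the tracker and its staging buffer immediately.
        flush_to_device(tensor, t.staging.data(), t.total);
        g_trackers.erase(tensor);
    }
}
```

The same mutex discipline applies to the `g_nz_workspaces` fix: any globally shared per-device array touched by concurrent `set_tensor` calls needs a lock around access.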
- CANN: fix L2_NORM ignoring eps parameter
The L2_NORM implementation was not using the eps parameter from op_params, causing incorrect results when eps is large (e.g. 10.0). The CPU reference computes scale = 1/fmaxf(norm, eps), so add a Clamp step to clamp the norm to at least eps before dividing.
- ggml/cann: compare op_params for POOL_2D in ACL graph cache matching
When ACL graph mode is enabled, the graph LRU cache checks whether a cached graph matches the current computation graph. Previously, GGML_OP_POOL_2D was not included in the op_params comparison, so two POOL_2D nodes with different pooling parameters (kernel size, stride, padding) but identical tensor shapes and addresses could incorrectly reuse a cached graph, leading to wrong results or aclnn errors.
Add GGML_OP_POOL_2D to the list of ops that require op_params matching in ggml_graph_node_properties::has_matching_properties().
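A miniature of the cache-matching check makes the failure mode concrete: two POOL_2D nodes with identical shapes can still differ in kernel/stride/padding, which live only in op_params. The struct and enum below are illustrative stand-ins, not the actual ggml definitions.

```cpp
#include <cstdint>
#include <cstring>

enum op_kind { OP_ADD, OP_POOL_2D, OP_SCALE };

// Hypothetical miniature of ggml_graph_node_properties: shape alone is not
// enough to prove two POOL_2D nodes compute the same thing.
struct node_properties {
    op_kind op;
    int64_t ne[4];              // tensor shape
    int32_t op_params[16];      // kernel size, stride, padding, ...

    bool has_matching_properties(const node_properties &other) const {
        if (op != other.op) return false;
        if (std::memcmp(ne, other.ne, sizeof(ne)) != 0) return false;
        // The fix: POOL_2D joins the ops whose op_params must match exactly.
        const bool needs_params = (op == OP_POOL_2D || op == OP_SCALE);
        if (needs_params &&
            std::memcmp(op_params, other.op_params, sizeof(op_params)) != 0) {
            return false;
        }
        return true;
    }
};
```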
- cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison
The ACL graph LRU cache was incorrectly reusing cached graphs for operations with different tensor types or op_params, causing test failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD, RMS_NORM_MUL_ADD, and ADD_RMS_NORM.
Changes:
- Add node_type and src_type[] fields to ggml_graph_node_properties so the cache can distinguish tensors with different types but identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
- Compare op_params unconditionally for all ops instead of only for SCALE/UNARY/GLU/ROPE/POOL_2D
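Both changes can be sketched together: the type fields catch f16-vs-bf16 aliasing (same element size, hence identical ne/nb), and op_params participate in every comparison rather than a whitelist. Field and enum names below are illustrative, not the actual ggml definitions.

```cpp
#include <cstdint>
#include <cstring>

enum tensor_type { TYPE_F32, TYPE_F16, TYPE_BF16 };

// Hypothetical sketch of the strengthened match.
struct graph_node_properties {
    tensor_type node_type;       // destination tensor type
    tensor_type src_type[2];     // source tensor types
    int64_t     ne[4];           // shape (identical for f16 and bf16 tensors)
    int32_t     op_params[16];

    bool has_matching_properties(const graph_node_properties &o) const {
        return node_type == o.node_type &&
               std::memcmp(src_type, o.src_type, sizeof(src_type)) == 0 &&
               std::memcmp(ne, o.ne, sizeof(ne)) == 0 &&
               // unconditional: every op's params participate in the match
               std::memcmp(op_params, o.op_params, sizeof(op_params)) == 0;
    }
};
```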