The Parallel Lanes Nobody Uses
SIMD and the Eight-Lane Highway You've Been Driving Solo
Reading time: ~13 minutes
You ran ripgrep across a 2GB log file and it finished in half a second. grep would have taken ten. You called np.array * 2 and it finished before the function call overhead had time to register.
Here's what actually happened: your CPU has 256-bit registers that can process 8 floats simultaneously. Those tools used all eight lanes of an eight-lane highway. Your Python for-loop uses one.
This is what your CPU can actually do.
The Fundamental Idea
SIMD stands for Single Instruction, Multiple Data. It's not a clever trick. It's a first-class feature of every CPU you've used in the last twenty years.
The idea is direct. A normal CPU instruction operates on one value:
```
ADD rax, rbx    # add one 64-bit integer to one other 64-bit integer
```
A SIMD instruction operates on a packed vector of values in a single clock:
```
VADDPS ymm0, ymm1, ymm2    # add eight 32-bit floats at once
```
Eight additions. One instruction. One cycle.
The register ymm0 is 256 bits wide. You pack 8 floats (each 32 bits) into it and treat the whole thing as a single operand. The arithmetic unit is physically wider — eight adders in parallel — and the instruction wires them all to fire simultaneously.
This is not a metaphor. It's silicon.
How We Got Here: The Register Zoo
The story of SIMD is a story of Intel and AMD racing to add bigger and bigger registers while pretending backward compatibility wasn't getting worse.
MMX (1996) — Intel introduced the first SIMD extension in the Pentium MMX. Eight 64-bit registers (mm0–mm7) for integer operations. The catch: those registers were aliased to the mantissa fields of the x87 ST(0)–ST(7) floating-point registers. Switching between MMX and x87 FP required executing EMMS to reset the x87 tag word first. (I'm simplifying the aliasing here — the full story involves how x87 tracks "empty" register slots.) Programmers used it. Suffered for it. Moved on.
SSE (1999) — Streaming SIMD Extensions. Eight new 128-bit registers (xmm0–xmm7), finally independent of the FPU stack. Supported 4 single-precision floats or integer variants. Used heavily for 3D graphics and audio in the early 2000s.
SSE2 (2001) — Added double-precision floats and 128-bit integer operations. x86-64 made SSE2 mandatory, so as of 64-bit mode you can assume it exists. This is the baseline.
SSE3, SSSE3, SSE4.1, SSE4.2 (2004–2007) — A string of incremental additions. String comparison instructions, dot products, population counts. Useful but baroque. The naming got embarrassing.
AVX (2011) — Intel widened the registers to 256 bits (ymm0–ymm15). Now you could do 8 floats or 4 doubles at once. The ymm registers are actually the full-width versions of the xmm registers — xmm0 is the lower 128 bits of ymm0.
AVX2 (2013) — Extended AVX to integer operations and added gather instructions (load scattered values from memory into a vector register). Available on Intel Haswell and later, AMD Ryzen. This is the register set most production code targets today.
AVX-512 (2017) — 512-bit registers (zmm0–zmm31). 16 floats or 8 doubles at once. Intel pushed this hard in server chips; it's common in the data center. Desktop support is inconsistent — Intel disabled AVX-512 on Alder Lake desktop SKUs specifically because AVX-512 instructions are power-hungry enough to trigger thermal throttling, and Alder Lake's big/little core design made the behavior unpredictable. AMD added AVX-512 starting with Zen 4. The instruction set is 300+ pages of documentation.
The registers kept doubling. The theoretical throughput kept doubling. Most application code never noticed.
Why the Compiler Sometimes Does This For You
Modern compilers — GCC, Clang, MSVC, and rustc (which uses LLVM) — can auto-vectorize loops. This is when the compiler looks at your scalar loop and emits SIMD instructions for it without you asking.
This works well when:
- The loop has no data dependencies between iterations (iteration N doesn't use the result of iteration N-1)
- The data is contiguous in memory (array, not linked list)
- The compiler can prove there's no aliasing (the input and output arrays don't overlap)
- The trip count is known or the compiler can generate a scalar fallback for the remainder
A simple sum-of-squares is a textbook case the compiler handles automatically:
```rust
pub fn sum_squares(a: &[f32]) -> f32 {
    a.iter().map(|x| x * x).sum()
}
```
Compile with --release and an AVX2 target (for example -C target-cpu=x86-64-v3) and... the multiply vectorizes (vmulps) but the sum stays scalar (vaddss). Wait, what?
Floating-point addition isn't associative — (a + b) + c can give a different result from a + (b + c) due to rounding. The compiler won't reorder your additions without permission, which means it can't pack 8 sums into a single vaddps. Switch to integers and the story changes:
```rust
pub fn sum_squares_i32(a: &[i32]) -> i32 {
    a.iter().map(|x| x * x).sum()
}
```
Now you get vpmulld and vpaddd on ymm registers — 8 integers at once, fully vectorized. Integer addition is associative, so LLVM can reorder freely. You can see both versions side by side on Compiler Explorer.
This is the kind of thing that makes auto-vectorization both powerful and frustrating. The compiler is doing the right thing — it won't change your program's semantics — but it means the "just write clean code and the compiler will vectorize it" advice has a large asterisk on it.
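The non-associativity is easy to demonstrate with values chosen so rounding bites (an illustrative snippet, not from any benchmark):

```rust
fn main() {
    let a = 1.0e8_f32; // exactly representable in f32 (a multiple of the ulp at this magnitude)
    let b = -1.0e8_f32;
    let c = 1.0_f32;

    // Left-to-right order: a + b cancels exactly, then c survives.
    let left = (a + b) + c;

    // Reassociated order: b + c rounds back to -1.0e8 (the ulp near 1e8 is 8.0),
    // so c vanishes before a ever sees it.
    let right = a + (b + c);

    assert_eq!(left, 1.0);
    assert_eq!(right, 0.0);
    println!("left = {left}, right = {right}");
}
```

Same three numbers, different grouping, different answer. That is exactly the reordering a packed vaddps would require, and why the compiler refuses it by default.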
This breaks down further the moment things get complicated. Add a branch inside the loop: the compiler has to use masked operations or give up. Use a data structure it can't prove is contiguous: it has to generate both a vectorized path and a scalar fallback, with a runtime check. Access non-contiguous memory: it has to use gather instructions, which are slower than you'd hope. Add any function call it can't inline: it bails entirely.
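To make the branch point concrete, here is a hypothetical ReLU-style clamp written both ways. A branch this simple can often be if-converted into a masked select automatically, but the max form expresses the select directly and leaves the vectorizer nothing to prove:

```rust
/// Branchy clamp: the compiler must if-convert the `if` into a masked
/// select before it can vectorize; more complex branch bodies make it bail.
fn relu_branchy(v: &mut [f32]) {
    for x in v.iter_mut() {
        if *x < 0.0 {
            *x = 0.0;
        }
    }
}

/// Branch-free clamp: f32::max states the select outright, which maps
/// onto a single vmaxps per 8 lanes under AVX.
fn relu_branchless(v: &mut [f32]) {
    for x in v.iter_mut() {
        *x = x.max(0.0);
    }
}

fn main() {
    let mut a = [-2.0_f32, -1.0, 0.5, 1.0, 2.0];
    let mut b = a;
    relu_branchy(&mut a);
    relu_branchless(&mut b);
    assert_eq!(a, b);
    assert_eq!(a, [0.0, 0.0, 0.5, 1.0, 2.0]);
    println!("{a:?}");
}
```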
Rust's ownership model actually helps here — slices guarantee contiguous memory and the borrow checker proves non-aliasing at compile time. That's information the auto-vectorizer can use. In C, the compiler has to assume two float* arguments might alias unless you annotate with restrict.
The compiler's auto-vectorizer is optimistic but conservative. You can inspect the emitted SIMD with cargo rustc --release -- --emit asm, or use Compiler Explorer to see exactly what LLVM generated. Read that output. It's educational in a way that is sometimes painful.
Intrinsics: Taking the Wheel
When auto-vectorization isn't enough, you can write SIMD code directly using intrinsics — functions in Rust's std::arch module that map one-to-one to specific CPU instructions.
This is not assembly. You're still writing Rust. You're just telling the compiler exactly which instruction to emit. The ISA-specific code lives inside unsafe blocks, making it explicit where you're stepping outside the compiler's guarantees:
```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// Add two float slices element-wise using AVX.
/// Handles lengths that aren't a multiple of 8 with a scalar tail.
#[target_feature(enable = "avx")]
unsafe fn add_arrays(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = a.len().min(b.len()).min(out.len());
    let mut i = 0;
    while i + 8 <= n {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));    // load 8 floats
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));    // load 8 floats
        let vc = _mm256_add_ps(va, vb);                 // add all 8
        _mm256_storeu_ps(out.as_mut_ptr().add(i), vc);  // store 8 floats
        i += 8;
    }
    // scalar tail for remainder (if n % 8 != 0)
    for j in i..n {
        out[j] = a[j] + b[j];
    }
}
```
The __m256 type is a 256-bit vector. _mm256_loadu_ps loads 8 unaligned single-precision floats. _mm256_add_ps adds them. One call, one instruction. The #[target_feature(enable = "avx")] attribute tells the compiler this function requires AVX — calling it on hardware without AVX is undefined behavior, which is why the function is unsafe.
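Because of that undefined behavior, real code pairs the unsafe kernel with a runtime feature check. A minimal dispatch sketch (the function names here are mine, not a published API):

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

/// AVX kernel: only sound to call after verifying AVX support.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add_avx(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = a.len().min(b.len()).min(out.len());
    let mut i = 0;
    while i + 8 <= n {
        let va = _mm256_loadu_ps(a.as_ptr().add(i));
        let vb = _mm256_loadu_ps(b.as_ptr().add(i));
        _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_add_ps(va, vb));
        i += 8;
    }
    for j in i..n {
        out[j] = a[j] + b[j]; // scalar tail
    }
}

/// Safe entry point: checks the CPU once, falls back to scalar.
fn add(a: &[f32], b: &[f32], out: &mut [f32]) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Sound: we just verified this CPU has AVX.
            return unsafe { add_avx(a, b, out) };
        }
    }
    for i in 0..a.len().min(b.len()).min(out.len()) {
        out[i] = a[i] + b[i];
    }
}

fn main() {
    let a = [1.0_f32; 11];
    let b = [2.0_f32; 11];
    let mut out = [0.0_f32; 11];
    add(&a, &b, &mut out);
    assert!(out.iter().all(|&v| v == 3.0));
    println!("ok");
}
```

The detection macro costs a branch on the first call; libraries typically cache the result in a function pointer so the hot path pays nothing.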
Intrinsics code is not fun to write. The naming convention (_mm256_loadu_ps vs _mm256_load_ps vs _mm512_loadu_ps) requires memorizing a taxonomy. The Intel Intrinsics Guide is the reference — it lists every intrinsic, the instruction it maps to, the latency, and the throughput. You'll spend time there.
The upside over C: Rust's type system catches width mismatches at compile time. If you accidentally pass an __m128 where an __m256 is expected, that's a type error, not a silent runtime bug. The unsafe boundary also makes it easy to audit — every line that touches raw SIMD is visually contained.
For a higher-level alternative, Rust's portable SIMD API (std::simd) provides type-safe, architecture-independent vector types like f32x8. It's available on nightly and progressing toward stable. When it lands, it will be the preferred way to write explicit SIMD without unsafe or platform-specific intrinsics.
Most application programmers don't write intrinsics. But the programmers who write the libraries you depend on — numpy, simdjson, ripgrep — absolutely do.
Where SIMD Actually Lives
String Search
Finding a byte in a buffer. You do it constantly, you never think about it, and it's the single operation where SIMD makes the most visceral difference. A naive loop checks one byte at a time. SIMD checks 32 with a single _mm256_cmpeq_epi8 — compare 32 bytes simultaneously, get a 32-bit mask of which positions matched.
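A scalar model of that compare-and-mask step may help. The loop below computes exactly what _mm256_cmpeq_epi8 followed by _mm256_movemask_epi8 produce in two instructions:

```rust
/// Build a 32-bit mask with bit i set where chunk[i] == needle.
/// The SIMD version computes this for all 32 bytes in one compare
/// plus one movemask; this scalar loop is the same function, slowly.
fn chunk_mask(chunk: &[u8], needle: u8) -> u32 {
    let mut mask = 0u32;
    for (i, &b) in chunk.iter().enumerate().take(32) {
        if b == needle {
            mask |= 1 << i;
        }
    }
    mask
}

fn main() {
    let chunk = b"GET /index.html HTTP/1.1\r\nHost: x";
    let mask = chunk_mask(chunk, b'/');
    // First match = count of trailing zeros in the mask.
    assert_eq!(mask.trailing_zeros(), 4);
    println!("mask = {mask:#034b}");
}
```

Once you have the mask, "find the first match" is a single trailing_zeros (the tzcnt instruction), which is how a vectorized memchr turns 32 comparisons into two or three instructions per chunk.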
memchr — the fundamental byte-search operation — is implemented with SIMD at every level: glibc's C implementation, and Rust's memchr crate (which we'll get to in a moment). The function you call every day is already vectorized.
ripgrep is fast partly because of SIMD-accelerated memchr. The memchr crate by Andrew Gallant implements memchr, memmem, and substring search using AVX2 (and AVX-512 where available). One of the core ideas, used for multi-pattern search in ripgrep's regex engine, is Teddy — an algorithm that uses SIMD to find candidate positions in bulk, then verifies them. When ripgrep is blazing through a 2GB log file, it's pushing 32 bytes at a time through vectorized comparisons. This is why it outperforms grep by 5–10x on many workloads. It's not magic. It's lanes.
That's also why string search benchmarks look bizarre to anyone who hasn't seen SIMD before. A loop that calls find in a hot path and a SIMD-accelerated version can differ by 8x with identical O() complexity. The algorithm doesn't tell you the constant factor.
JSON Parsing
In 2019, Geoff Langdale and Daniel Lemire published "Parsing Gigabytes of JSON per Second," the paper that gave birth to simdjson by showing that JSON parsing is largely a SIMD problem. The bottleneck in parsing isn't the logic — it's scanning through bytes looking for structural characters ({, }, [, ], :, ,, ").
simdjson processes 64 bytes at a time using AVX-512 (or 32 with AVX2). It classifies every byte simultaneously — is this a structural character? A whitespace? A quote? — using bitwise SIMD operations to produce bitmasks. Then it uses those bitmasks to drive parsing without a byte-at-a-time loop.
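A scalar model of that classification step (simdjson computes it with vectorized lookups and compares rather than a per-byte match; the function name is mine):

```rust
/// One bit per input byte, set where the byte is a JSON structural
/// character. simdjson produces masks like this 64 bytes at a time
/// with SIMD shuffles and compares; here the same function is spelled
/// out one byte at a time.
fn structural_mask(block: &[u8]) -> u64 {
    let mut mask = 0u64;
    for (i, &b) in block.iter().enumerate().take(64) {
        if matches!(b, b'{' | b'}' | b'[' | b']' | b':' | b',' | b'"') {
            mask |= 1 << i;
        }
    }
    mask
}

fn main() {
    let json = br#"{"k":[1,2]}"#;
    let mask = structural_mask(json);
    // Structural characters sit at positions 0,1,3,4,5,7,9,10 here.
    assert_eq!(mask.count_ones(), 8);
    println!("{mask:#b}");
}
```

The parser then walks set bits in the mask (trailing_zeros again) instead of re-examining every byte, which is what kills the byte-at-a-time loop.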
The result: simdjson parses JSON at 2–3 GB/s on a modern CPU. The fastest pure-scalar parser does maybe 300–500 MB/s. The 6x difference is entirely SIMD.
That's why simdjson exists. That's why it's in MongoDB, ClickHouse, and dozens of other systems that care about throughput.
Image Processing
Every pixel is independent. Every channel is independent. This is SIMD's dream workload — no data dependencies, no branches, just arithmetic on contiguous arrays of bytes. SSE2 processes 16 pixels at once with saturating addition (u8x16::saturating_add in portable SIMD). OpenCV, libjpeg-turbo, libpng — they all have SIMD paths for their hot loops. When Photoshop applies a filter to a 24-megapixel image in under a second, this is why.
ML Inference
This is the one that matters most right now.
Neural network inference is fundamentally matrix multiplication: take a weight matrix, multiply by an input vector, pass through an activation function. Repeat. The core operation — multiply-accumulate on large matrices — is exactly what SIMD was built for.
AVX2's fused multiply-add (_mm256_fmadd_ps via std::arch, or f32x8::mul_add in portable SIMD) does a*b + c on 8 floats in one instruction. For a naive matrix multiply loop, this is an 8x multiplier before you've thought about anything else. Add tiling for cache efficiency and you're in the range of what high-performance BLAS libraries actually do.
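In scalar form, the multiply-accumulate core looks like this. Rust's f32::mul_add lowers to a fused multiply-add instruction when the target supports FMA; the AVX2 kernels do the same thing eight lanes at a time:

```rust
/// Dot product via fused multiply-add: one mul_add per element here,
/// eight per _mm256_fmadd_ps in a vectorized kernel.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b.iter())
        .fold(0.0_f32, |acc, (&x, &y)| x.mul_add(y, acc))
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0];
    let b = [4.0_f32, 5.0, 6.0];
    assert_eq!(dot(&a, &b), 32.0);
    println!("{}", dot(&a, &b));
}
```

A matrix multiply is this dot product repeated over rows and columns, which is why one FMA instruction compounds into such a large end-to-end speedup.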
AVX-512 with VNNI (Vector Neural Network Instructions, 2019) goes further — it adds instructions specifically for quantized integer dot products used in 8-bit inference. A single vpdpbusd instruction (exposed as _mm512_dpbusd_epi32 in intrinsics) performs 64 byte-wise multiply-accumulates — 16 dword lanes, four byte products each — in one instruction. llama.cpp, the library that lets you run large language models on consumer hardware, has hand-written AVX2 and AVX-512 kernels for its matrix multiplication. When you run a local model on your laptop, those kernels are running in tight loops for every token you generate.
The Mindset Shift
Here's the insight that changes how you write code even if you never touch an intrinsic.
SIMD forces you to think in batches, not items.
Scalar code says: "for each element, do this." SIMD code says: "take 8 elements, do this to all of them at once, advance 8." The data structure implications are real.
Arrays of Structures vs Structures of Arrays
Consider a particle system. You might model it like this:
```rust
struct Particle {
    x: f32, y: f32, z: f32,    // position
    vx: f32, vy: f32, vz: f32, // velocity
    mass: f32,
}

let particles: Vec<Particle> = Vec::with_capacity(1_000_000);
```
This is AoS — Array of Structures. Each particle's data is packed together. Intuitive. Natural.
The goal: update all x positions — x += vx * dt — for every particle.
The problem: x and vx are separated by 24 bytes in each struct. When you load a SIMD vector of 8 x values, you also pull in y, z, vx, vy, vz, mass — data you don't need. Your cache lines are full of noise. Your SIMD registers require a scatter-gather to populate.
The SIMD-friendly layout is SoA — Structure of Arrays:
```rust
struct Particles {
    x: Vec<f32>,
    y: Vec<f32>,
    z: Vec<f32>,
    vx: Vec<f32>,
    // ...
}
```
With SoA, all x values are contiguous. Loading &particles.x[i..i+8] gives 8 consecutive x values, ready to go. Loading &particles.vx[i..i+8] gives the matching 8 vx values. One fused multiply-add updates 8 particles. No scatter-gather. No cache waste.
This is not a micro-optimization. The difference in a physics simulation inner loop can be 4–8x. The code is otherwise identical.
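The SoA inner loop that results is exactly the shape the auto-vectorizer wants: contiguous, dependency-free, provably non-aliasing. A sketch with just the two fields the update touches:

```rust
struct Particles {
    x: Vec<f32>,
    vx: Vec<f32>,
}

/// Position update over the whole system. Contiguous f32 arrays and
/// independent iterations: this vectorizes without any intrinsics.
fn step(p: &mut Particles, dt: f32) {
    for (x, vx) in p.x.iter_mut().zip(p.vx.iter()) {
        *x += vx * dt;
    }
}

fn main() {
    let mut p = Particles {
        x: vec![0.0, 1.0, 2.0],
        vx: vec![1.0, 1.0, 1.0],
    };
    step(&mut p, 0.5);
    assert_eq!(p.x, vec![0.5, 1.5, 2.5]);
    println!("{:?}", p.x);
}
```

The same update against the AoS Vec<Particle> forces strided loads; here the compiler can pull 8 consecutive x values and 8 consecutive vx values and fuse the whole thing.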
That's why SoA and AoS matter — two data structures with identical asymptotic behavior, identical logical content, identical algorithmic logic. One is auto-vectorizable. One isn't. The difference is 8x. Nobody mentioned this in algorithms class.
This also explains why entity-component systems (ECS) — used in game engines like Unity DOTS and Bevy — look structurally odd until you see SIMD. ECS stores component data in contiguous arrays per component type, not per entity. That's SoA. The performance difference for physics and animation simulations is why the pattern exists.
Alignment
SIMD instructions have opinions about memory alignment. Aligned loads — _mm256_load_ps — require the address to be 32-byte aligned (the address mod 32 == 0). Unaligned loads — _mm256_loadu_ps — work on any address, but may be slower on older hardware.
On modern CPUs (Intel Skylake and later, AMD Zen 2 and later), unaligned loads are as fast as aligned loads — as long as you don't cross a 64-byte cache line boundary. So alignment mostly solves itself if you enforce it on your arrays and use _mm256_loadu_ps in your code.
In Rust, you control alignment with #[repr(align(32))]:
```rust
#[repr(C, align(32))]
struct AlignedBlock {
    data: [f32; 8],
}
```
This is the equivalent of C's __attribute__((aligned(32))) or alignas(32). It means: "I plan to load this with SIMD and I want the first element to be register-friendly."
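A quick way to check the attribute is doing its job: allocate a buffer of aligned blocks and inspect the pointer (a small sketch; the type name is mine):

```rust
#[repr(C, align(32))]
#[derive(Clone, Copy)]
struct AlignedBlock {
    data: [f32; 8],
}

fn main() {
    let blocks = vec![AlignedBlock { data: [0.0; 8] }; 4];
    // Rust's allocator honors the type's alignment, so the first
    // element sits on a 32-byte boundary: safe for an aligned load.
    assert_eq!(blocks.as_ptr() as usize % 32, 0);
    // Each block is padded to a 32-byte multiple, so every element
    // in the Vec lands on a boundary too.
    assert_eq!(std::mem::size_of::<AlignedBlock>(), 32);
    println!("aligned");
}
```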
You Don't Need to Write Intrinsics
The practical message is not "go rewrite your code in intrinsics." It's shorter:
Write in a way the compiler can vectorize. Keep your hot loops simple and branch-free. Lay your data out contiguously in the access order you need it. Prefer SoA over AoS in performance-critical code. Reach for libraries (numpy, simdjson, BLAS, any vectorized BLAS-backed ML framework) before reaching for intrinsics.
That's why numpy is fast and a Python for-loop isn't. numpy's inner loops are SIMD-vectorized C. When you call arr * 2, numpy dispatches to a vectorized multiply kernel operating on the entire array in chunks of 8 or 16 elements. Your Python for-loop multiplies one element per bytecode interpretation cycle.
Understand that when two seemingly equivalent implementations have an 8x performance difference, this is frequently why. Not cache (though that's related). Not branch prediction (though that matters too). The data layout didn't allow the CPU to use seven of its eight lanes.
If you do need explicit SIMD, Rust gives you options before you reach for raw intrinsics:
- std::simd — Rust's portable SIMD API (nightly, progressing toward stable). Type-safe vector types like f32x8 that compile to the best available instructions on any architecture. This is the future.
- wide — a stable crate providing portable SIMD types today. Good for production code that can't wait for std::simd.
- pulp — runtime CPU feature detection with safe SIMD dispatch.
For C++ codebases, highway (Google's portable SIMD abstraction) serves a similar role. Don't write raw _mm256_* calls unless you've exhausted the higher-level options — though in Rust, at least the type system will catch width mismatches at compile time instead of letting you discover them at midnight.
What the CPU Looks Like Now
```
One instruction:      ADD rax, rbx
  → adds two 64-bit integers
  → uses 64 bits of register space

One SIMD instruction: VADDPS ymm0, ymm1, ymm2
  → adds eight 32-bit floats
  → uses 256 bits of register space
  → eight physical adders firing simultaneously

Your loop over 8 million floats:
  Scalar:    8,000,000 add instructions
  AVX2:      1,000,000 add instructions (8x fewer)
  AVX-512:     500,000 add instructions (16x fewer)
```
The lanes are there. They've been there since 1999, getting wider every few years. Every calculation you've ever run in a Python loop touched one lane of a machine that had eight available.
Further Reading
- Intel Intrinsics Guide — The reference. Every intrinsic, its instruction, latency, and throughput. Searchable by operation type. Directly maps to Rust's std::arch function names.
- Rust std::simd tracking issue — The portable SIMD API's path to stabilization. Good overview of the design and current status.
- std::arch module docs — Rust's platform intrinsics. Every _mm256_* function from the Intel guide has a corresponding Rust binding here.
- memchr crate (Rust) — Andrew Gallant's SIMD-accelerated byte/substring search. Read the source and the README for a clear explanation of the Teddy algorithm.
- wide crate — Portable SIMD types on stable Rust. A practical alternative while std::simd stabilizes.
- simdjson paper — Langdale and Lemire, 2019. "Parsing Gigabytes of JSON per Second." The original paper. Section 3 explains the SIMD classification step.
- "What Every Programmer Should Know About Memory" — Ulrich Drepper. Section 6 covers SIMD and its interaction with the cache hierarchy. This was the reference when AVX didn't exist yet; the principles are unchanged.
- Agner Fog's optimization manuals — Tables of instruction latencies and throughputs for every SIMD instruction on every microarchitecture. Dense. Invaluable if you're actually tuning.
I'm writing a book about what makes developers irreplaceable in the age of AI. Join the early access list →
Naz Quadri once hand-wrote AVX2 intrinsics for a function the Rust compiler had already vectorised better. He blogs at nazquadri.dev. Rabbit holes all the way down 🐇🕳️.