I Built Consistent Hashing From Scratch in Go — Here's What I Learned
If you've ever added a server to a cache cluster and watched your database melt, you already know the problem consistent hashing solves. You just might not know it by name.
I built a full implementation from scratch in Go to understand it deeply. This post walks through what I learned — the problem, the fix, and the gotchas nobody tells you about.
The five-minute version
You have 5 cache servers. You route keys with hash(key) % 5. Life is good.
Then traffic spikes and you add a 6th server. Now it's hash(key) % 6. Sounds harmless, right?
Here's what actually happens:
```
Before: hash("user:1001") % 5 = 3  →  Server C
After:  hash("user:1001") % 6 = 1  →  Server A  ← moved!
```
That key was sitting happily on Server C. Now every client thinks it's on Server A, where it doesn't exist. Cache miss. The request hits your database.
And it's not just one key. ~83% of all keys remap. With 100K cached items, that's 83,000 simultaneous cache misses hammering your database. People call this a "thundering herd" or "cache stampede." I call it a bad Monday.
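The remap fraction is easy to check empirically. This standalone sketch (key names are synthetic, just for illustration) hashes 100K keys with FNV-1a and counts how many land on a different server when the cluster grows from 5 to 6 nodes:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashKey maps a key to a 32-bit hash with FNV-1a.
func hashKey(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()
}

// countMoved reports how many of n synthetic keys land on a
// different server when the cluster grows from 5 to 6 nodes.
func countMoved(n int) int {
	moved := 0
	for i := 0; i < n; i++ {
		key := fmt.Sprintf("user:%d", i)
		if hashKey(key)%5 != hashKey(key)%6 {
			moved++
		}
	}
	return moved
}

func main() {
	n := 100_000
	moved := countMoved(n)
	fmt.Printf("%d of %d keys remapped (%.1f%%)\n", moved, n, 100*float64(moved)/float64(n))
}
```

The theoretical expectation is 5/6 ≈ 83.3%: `h%5 == h%6` only when `h%30` falls in 0–4, so just one key in six stays put.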
What consistent hashing actually does
Instead of hash % N, imagine putting everything on a circle — a ring of positions from 0 to 2^32 − 1 that wraps around.
Nodes get hashed onto the ring. Keys get hashed onto the ring. To find which node owns a key, you start at the key's position and walk clockwise until you bump into a node.
```
                0
      Node-C ──┤
                │
2^32 ──────────┼────────── 2^8
                │
                ├── "user:1001" → walks clockwise → Node-A
      Node-A ──┤
                │
              2^16
      Node-B ──┤
```
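The clockwise walk can be sketched with a sorted slice of positions and a binary search. This is a simplified stand-in for the real implementation (node names and the `Ring` type here are illustrative, one position per node, no error handling for an empty ring):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// Ring is a minimal hash ring: node positions are kept sorted so a
// key lookup is a binary search for the first node clockwise.
type Ring struct {
	positions []uint32          // sorted node positions on the ring
	nodes     map[uint32]string // position → node name
}

func NewRing(names ...string) *Ring {
	r := &Ring{nodes: make(map[uint32]string)}
	for _, name := range names {
		p := hash32(name)
		r.positions = append(r.positions, p)
		r.nodes[p] = name
	}
	sort.Slice(r.positions, func(i, j int) bool { return r.positions[i] < r.positions[j] })
	return r
}

// GetNode walks clockwise from the key's position to the next node,
// wrapping around past the top of the ring.
func (r *Ring) GetNode(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.positions), func(i int) bool { return r.positions[i] >= h })
	if i == len(r.positions) {
		i = 0 // wrapped around the top of the ring
	}
	return r.nodes[r.positions[i]]
}

func main() {
	ring := NewRing("node-a", "node-b", "node-c")
	fmt.Println(ring.GetNode("user:1001"))
}
```

The binary search is what makes lookups O(log n) instead of a linear scan of the ring.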
When you add a node, it only takes over the arc between itself and the previous node. Everything else stays put. The math works out to roughly K/N keys moving (K = total keys, N = new node count). So with 100K keys and 6 nodes, about 17K keys move instead of 83K.
I measured this in my implementation:
```
Strategy          │ Keys Remapped (5→6 nodes, 100K keys)
──────────────────│─────────────────────────────────────
Modulo (hash%N)   │ 83,803 (83.8%)
Consistent Hash   │ 15,723 (15.7%)
```
That's not a marginal improvement. That's the difference between "the site went down" and "nobody noticed."
The virtual node trick
There's a catch with the basic approach. If you have 5 nodes, you have 5 points on the ring. And 5 points on a circle don't spread evenly — some nodes end up owning huge arcs while others get almost nothing.
I measured this too. With 5 nodes and 50K keys, one node got 32,014 keys while another got 1,855. That's a 17:1 ratio. Terrible.
The fix is virtual nodes. Instead of putting each physical node at one position, you put it at 150 positions. Node-A becomes Node-A#0, Node-A#1, ... Node-A#149, each hashed to a different spot on the ring.
Here's what that does to the distribution:
```
Virtual Nodes │ Std Deviation │ Worst Node (ratio to fair share)
──────────────│───────────────│─────────────────────────────────
            1 │        11,353 │ 3.20x
           50 │         4,601 │ 1.83x
          150 │         2,824 │ 1.47x
          500 │           976 │ 1.17x
```
150 is the sweet spot most production systems use. Good enough distribution, not too much memory overhead.
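You can reproduce the effect with a quick standalone measurement. This sketch (illustrative node and key names; your exact numbers will differ with a different hash or naming scheme) builds the ring with a configurable replica count and reports the most- and least-loaded node:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// buildRing hashes each node onto the ring `replicas` times
// ("node#0", "node#1", …) and returns the sorted positions plus
// a position → node map.
func buildRing(nodes []string, replicas int) ([]uint32, map[uint32]string) {
	var positions []uint32
	owner := make(map[uint32]string)
	for _, n := range nodes {
		for i := 0; i < replicas; i++ {
			p := hash32(fmt.Sprintf("%s#%d", n, i))
			if _, taken := owner[p]; taken {
				continue // skip the rare 32-bit collision instead of overwriting
			}
			positions = append(positions, p)
			owner[p] = n
		}
	}
	sort.Slice(positions, func(i, j int) bool { return positions[i] < positions[j] })
	return positions, owner
}

// spread assigns `keys` synthetic keys and returns the max and min
// keys-per-node counts.
func spread(nodes []string, replicas, keys int) (int, int) {
	positions, owner := buildRing(nodes, replicas)
	counts := make(map[string]int)
	for i := 0; i < keys; i++ {
		h := hash32(fmt.Sprintf("key:%d", i))
		j := sort.Search(len(positions), func(j int) bool { return positions[j] >= h })
		if j == len(positions) {
			j = 0
		}
		counts[owner[positions[j]]]++
	}
	max, min := 0, keys
	for _, c := range counts {
		if c > max {
			max = c
		}
		if c < min {
			min = c
		}
	}
	return max, min
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3", "node-4", "node-5"}
	for _, r := range []int{1, 150} {
		max, min := spread(nodes, r, 50_000)
		fmt.Printf("replicas=%3d  max=%d keys  min=%d keys\n", r, max, min)
	}
}
```

Note the collision check in `buildRing` — it matters later in this post.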
What happens when a node dies
This is where it gets interesting for real-world systems.
I built a cache simulator on top of the hash ring — basically a miniature distributed Memcached. Five "Redis nodes," 10,000 keys. Then I killed one.
```
redis-1   2,197 keys  ██████████████████████
redis-2   1,731 keys  █████████████████
redis-3   1,559 keys  ███████████████
redis-4   1,730 keys  █████████████████
redis-5   2,783 keys  ████████████████████████████

redis-3 goes down!

Cache hit rate: 84.4%
Cache misses:   1,559 (exactly the keys redis-3 had)
```
Exactly 1,559 misses — the precise number of keys that were on redis-3. Zero keys from other nodes were affected. That's the whole point: the blast radius of a failure is limited to the failed node.
In production, you'd replicate each key to the next 2-3 nodes on the ring, so even that node's keys survive. But that's a story for another post.
Things I got wrong the first time
Hash collisions between virtual nodes. With 150 vnodes per physical node and a 32-bit hash space, collisions are rare but not impossible. My first version silently overwrote one node's position with another's, which corrupted the ring in subtle ways. The fix: check before inserting and skip collisions.
Locking granularity. The ring needs a sync.RWMutex because reads (key lookups) vastly outnumber writes (node additions/removals). My first version used a regular Mutex and paid for it under concurrent load.
Silent error swallowing. When the ring is empty, GetNode returns an error. My first version of the cache simulator just ignored that error and kept going, silently dropping writes. In a real system, that's how you get data loss that takes hours to diagnose.
Testing the hard stuff
Unit tests are straightforward. The interesting part is testing the mathematical guarantees under chaos.
I wrote five "catastrophic tests" that simulate production nightmares:
Mass failure: Remove 5 of 10 nodes simultaneously. Verify that only the dead nodes' keys remapped. Result: 0 keys from surviving nodes moved. Perfect isolation.
Rapid churn: 20 random add/remove cycles. Each individual operation should remap roughly K/N keys. Result: every cycle stayed within 1.5x of the theoretical bound.
Concurrent chaos: 20 goroutines doing lookups while 5 goroutines add and remove nodes for 2 seconds straight, with Go's race detector on. Result: 2.4M reads, zero data races, 0% error rate.
These aren't pass/fail tests — they're property-based tests that verify the algorithm's mathematical guarantees hold under pressure.
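Here's a self-contained sketch of the isolation property from the mass-failure test, checked against a throwaway re-implementation of the ring (150 vnodes each, illustrative node names): build the ring with and without one node, then verify no surviving node's keys moved.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// ring is a throwaway re-implementation just to check the property.
type ring struct {
	pos   []uint32
	owner map[uint32]string
}

func newRing(nodes []string) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		for i := 0; i < 150; i++ {
			p := hash32(fmt.Sprintf("%s#%d", n, i))
			if _, ok := r.owner[p]; ok {
				continue // skip vnode collisions
			}
			r.pos = append(r.pos, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.pos, func(i, j int) bool { return r.pos[i] < r.pos[j] })
	return r
}

func (r *ring) get(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.pos), func(i int) bool { return r.pos[i] >= h })
	if i == len(r.pos) {
		i = 0
	}
	return r.owner[r.pos[i]]
}

// countViolations checks the isolation guarantee: after removing n3,
// every key that was NOT on n3 must still map to the same node.
func countViolations() int {
	before := newRing([]string{"n1", "n2", "n3", "n4", "n5"})
	after := newRing([]string{"n1", "n2", "n4", "n5"}) // n3 removed

	violations := 0
	for i := 0; i < 10_000; i++ {
		key := fmt.Sprintf("key:%d", i)
		b, a := before.get(key), after.get(key)
		if b != "n3" && b != a { // a survivor's key moved: guarantee broken
			violations++
		}
	}
	return violations
}

func main() {
	fmt.Println("violations:", countViolations())
}
```

The guarantee holds by construction: removing n3's positions can't change which surviving position a key reaches first on its clockwise walk.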
The implementation
The full code is on GitHub: soyvural/consistent-hashing
It's about 800 lines of Go with zero external dependencies. The core is a sorted slice of uint32 positions with binary search for O(log n) lookups. Hash function is pluggable via an interface — ships with FNV-1a, MD5, and CRC32.
```bash
# try it yourself
git clone https://github.com/soyvural/consistent-hashing.git
cd consistent-hashing
go run cmd/demo/main.go
```
The demo prints all the numbers I quoted in this post. You can tweak the vnode count, swap hash functions, and see exactly how the distribution changes.
Where consistent hashing shows up
Once you know what to look for, it's everywhere:
- Memcached clients use it to route keys to servers
- Amazon DynamoDB uses it for data partitioning (described in their 2007 Dynamo paper)
- Apache Cassandra uses it for its token ring
- Load balancers use it for sticky sessions that survive backend scaling
The common thread: any time you need to spread work across a changing set of workers without reshuffling everything.
Further reading
If you want to go deeper:
- Karger et al. (1997) — the original paper that started it all
- Amazon's Dynamo paper (2007) — consistent hashing at massive scale
- Jump Consistent Hash (2014) — a faster alternative when your nodes don't need names
The full source code is at github.com/soyvural/consistent-hashing. Stars welcome if you found this useful.
Originally published on DEV Community: https://dev.to/veysi/i-built-consistent-hashing-from-scratch-in-go-heres-what-i-learned-24pj
