I Built Consistent Hashing From Scratch in Go — Here's What I Learned
If you've ever added a server to a cache cluster and watched your database melt, you already know the problem consistent hashing solves. You just might not know it by name.
I built a full implementation from scratch in Go to understand it deeply. This post walks through what I learned — the problem, the fix, and the gotchas nobody tells you about.
The five-minute version
You have 5 cache servers. You route keys with hash(key) % 5. Life is good.
Then traffic spikes and you add a 6th server. Now it's hash(key) % 6. Sounds harmless, right?
Here's what actually happens:
```
Before: hash("user:1001") % 5 = 3  →  Server C
After:  hash("user:1001") % 6 = 1  →  Server A  ← moved!
```
That key was sitting happily on Server C. Now every client thinks it's on Server A, where it doesn't exist. Cache miss. The request hits your database.
And it's not just one key. ~83% of all keys remap. With 100K cached items, that's 83,000 simultaneous cache misses hammering your database. People call this a "thundering herd" or "cache stampede." I call it a bad Monday.
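The remap fraction is easy to check empirically. This standalone sketch (key names are synthetic, just for illustration) hashes 100K keys with FNV-1a and counts how many land on a different server when the cluster grows from 5 to 6 nodes:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// hashKey maps a key to a 32-bit hash with FNV-1a.
func hashKey(key string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(key))
	return h.Sum32()
}

// countMoved reports how many of n synthetic keys land on a
// different server when the cluster grows from 5 to 6 nodes.
func countMoved(n int) int {
	moved := 0
	for i := 0; i < n; i++ {
		key := fmt.Sprintf("user:%d", i)
		if hashKey(key)%5 != hashKey(key)%6 {
			moved++
		}
	}
	return moved
}

func main() {
	n := 100_000
	moved := countMoved(n)
	fmt.Printf("%d of %d keys remapped (%.1f%%)\n", moved, n, 100*float64(moved)/float64(n))
}
```

The theoretical expectation is 5/6 ≈ 83.3%: `h%5 == h%6` only when `h%30` falls in 0–4, so just one key in six stays put.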
What consistent hashing actually does
Instead of hash % N, imagine putting everything on a circle — a ring of positions from 0 to 2^32 − 1 that wraps around.
Nodes get hashed onto the ring. Keys get hashed onto the ring. To find which node owns a key, you start at the key's position and walk clockwise until you bump into a node.
```
                0
      Node-C ──┤
                │
2^32 ──────────┼────────── 2^8
                │
                ├── "user:1001" → walks clockwise → Node-A
      Node-A ──┤
                │
              2^16
      Node-B ──┤
```
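The clockwise walk can be sketched with a sorted slice of positions and a binary search. This is a simplified stand-in for the real implementation (node names and the `Ring` type here are illustrative, one position per node, no error handling for an empty ring):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// Ring is a minimal hash ring: node positions are kept sorted so a
// key lookup is a binary search for the first node clockwise.
type Ring struct {
	positions []uint32          // sorted node positions on the ring
	nodes     map[uint32]string // position → node name
}

func NewRing(names ...string) *Ring {
	r := &Ring{nodes: make(map[uint32]string)}
	for _, name := range names {
		p := hash32(name)
		r.positions = append(r.positions, p)
		r.nodes[p] = name
	}
	sort.Slice(r.positions, func(i, j int) bool { return r.positions[i] < r.positions[j] })
	return r
}

// GetNode walks clockwise from the key's position to the next node,
// wrapping around past the top of the ring.
func (r *Ring) GetNode(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.positions), func(i int) bool { return r.positions[i] >= h })
	if i == len(r.positions) {
		i = 0 // wrapped around the top of the ring
	}
	return r.nodes[r.positions[i]]
}

func main() {
	ring := NewRing("node-a", "node-b", "node-c")
	fmt.Println(ring.GetNode("user:1001"))
}
```

The binary search is what makes lookups O(log n) instead of a linear scan of the ring.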
When you add a node, it only takes over the arc between itself and the previous node. Everything else stays put. The math works out to roughly K/N keys moving (K = total keys, N = new node count). So with 100K keys and 6 nodes, about 17K keys move instead of 83K.
I measured this in my implementation:
```
Strategy          │ Keys Remapped (5→6 nodes, 100K keys)
──────────────────│─────────────────────────────────────
Modulo (hash%N)   │ 83,803 (83.8%)
Consistent Hash   │ 15,723 (15.7%)
```
That's not a marginal improvement. That's the difference between "the site went down" and "nobody noticed."
The virtual node trick
There's a catch with the basic approach. If you have 5 nodes, you have 5 points on the ring. And 5 points on a circle don't spread evenly — some nodes end up owning huge arcs while others get almost nothing.
I measured this too. With 5 nodes and 50K keys, one node got 32,014 keys while another got 1,855. That's a 17:1 ratio. Terrible.
The fix is virtual nodes. Instead of putting each physical node at one position, you put it at 150 positions. Node-A becomes Node-A#0, Node-A#1, ... Node-A#149, each hashed to a different spot on the ring.
Here's what that does to the distribution:
```
Virtual Nodes │ Std Deviation │ Worst Node (ratio to fair share)
──────────────│───────────────│─────────────────────────────────
            1 │        11,353 │ 3.20x
           50 │         4,601 │ 1.83x
          150 │         2,824 │ 1.47x
          500 │           976 │ 1.17x
```
150 is the sweet spot most production systems use. Good enough distribution, not too much memory overhead.
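You can reproduce the effect with a quick standalone measurement. This sketch (illustrative node and key names; your exact numbers will differ with a different hash or naming scheme) builds the ring with a configurable replica count and reports the most- and least-loaded node:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// buildRing hashes each node onto the ring `replicas` times
// ("node#0", "node#1", …) and returns the sorted positions plus
// a position → node map.
func buildRing(nodes []string, replicas int) ([]uint32, map[uint32]string) {
	var positions []uint32
	owner := make(map[uint32]string)
	for _, n := range nodes {
		for i := 0; i < replicas; i++ {
			p := hash32(fmt.Sprintf("%s#%d", n, i))
			if _, taken := owner[p]; taken {
				continue // skip the rare 32-bit collision instead of overwriting
			}
			positions = append(positions, p)
			owner[p] = n
		}
	}
	sort.Slice(positions, func(i, j int) bool { return positions[i] < positions[j] })
	return positions, owner
}

// spread assigns `keys` synthetic keys and returns the max and min
// keys-per-node counts.
func spread(nodes []string, replicas, keys int) (int, int) {
	positions, owner := buildRing(nodes, replicas)
	counts := make(map[string]int)
	for i := 0; i < keys; i++ {
		h := hash32(fmt.Sprintf("key:%d", i))
		j := sort.Search(len(positions), func(j int) bool { return positions[j] >= h })
		if j == len(positions) {
			j = 0
		}
		counts[owner[positions[j]]]++
	}
	max, min := 0, keys
	for _, c := range counts {
		if c > max {
			max = c
		}
		if c < min {
			min = c
		}
	}
	return max, min
}

func main() {
	nodes := []string{"node-1", "node-2", "node-3", "node-4", "node-5"}
	for _, r := range []int{1, 150} {
		max, min := spread(nodes, r, 50_000)
		fmt.Printf("replicas=%3d  max=%d keys  min=%d keys\n", r, max, min)
	}
}
```

Note the collision check in `buildRing` — it matters later in this post.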
What happens when a node dies
This is where it gets interesting for real-world systems.
I built a cache simulator on top of the hash ring — basically a miniature distributed Memcached. Five "Redis nodes," 10,000 keys. Then I killed one.
```
redis-1   2,197 keys  ██████████████████████
redis-2   1,731 keys  █████████████████
redis-3   1,559 keys  ███████████████
redis-4   1,730 keys  █████████████████
redis-5   2,783 keys  ████████████████████████████

redis-3 goes down!

Cache hit rate: 84.4%
Cache misses:   1,559 (exactly the keys redis-3 had)
```
Exactly 1,559 misses — the precise number of keys that were on redis-3. Zero keys from other nodes were affected. That's the whole point: the blast radius of a failure is limited to the failed node.
In production, you'd replicate each key to the next 2-3 nodes on the ring, so even that node's keys survive. But that's a story for another post.
Things I got wrong the first time
Hash collisions between virtual nodes. With 150 vnodes per physical node and a 32-bit hash space, collisions are rare but not impossible. My first version silently overwrote one node's position with another's, which corrupted the ring in subtle ways. The fix: check before inserting and skip collisions.
Locking granularity. The ring needs a sync.RWMutex because reads (key lookups) vastly outnumber writes (node additions/removals). My first version used a regular Mutex and paid for it under concurrent load.
Silent error swallowing. When the ring is empty, GetNode returns an error. My first version of the cache simulator just ignored that error and kept going, silently dropping writes. In a real system, that's how you get data loss that takes hours to diagnose.
Testing the hard stuff
Unit tests are straightforward. The interesting part is testing the mathematical guarantees under chaos.
I wrote five "catastrophic tests" that simulate production nightmares:
Mass failure: Remove 5 of 10 nodes simultaneously. Verify that only the dead nodes' keys remapped. Result: 0 keys from surviving nodes moved. Perfect isolation.
Rapid churn: 20 random add/remove cycles. Each individual operation should remap roughly K/N keys. Result: every cycle stayed within 1.5x of the theoretical bound.
Concurrent chaos: 20 goroutines doing lookups while 5 goroutines add and remove nodes for 2 seconds straight, with Go's race detector on. Result: 2.4M reads, zero data races, 0% error rate.
These aren't pass/fail tests — they're property-based tests that verify the algorithm's mathematical guarantees hold under pressure.
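Here's a self-contained sketch of the isolation property from the mass-failure test, checked against a throwaway re-implementation of the ring (150 vnodes each, illustrative node names): build the ring with and without one node, then verify no surviving node's keys moved.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// ring is a throwaway re-implementation just to check the property.
type ring struct {
	pos   []uint32
	owner map[uint32]string
}

func newRing(nodes []string) *ring {
	r := &ring{owner: map[uint32]string{}}
	for _, n := range nodes {
		for i := 0; i < 150; i++ {
			p := hash32(fmt.Sprintf("%s#%d", n, i))
			if _, ok := r.owner[p]; ok {
				continue // skip vnode collisions
			}
			r.pos = append(r.pos, p)
			r.owner[p] = n
		}
	}
	sort.Slice(r.pos, func(i, j int) bool { return r.pos[i] < r.pos[j] })
	return r
}

func (r *ring) get(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.pos), func(i int) bool { return r.pos[i] >= h })
	if i == len(r.pos) {
		i = 0
	}
	return r.owner[r.pos[i]]
}

// countViolations checks the isolation guarantee: after removing n3,
// every key that was NOT on n3 must still map to the same node.
func countViolations() int {
	before := newRing([]string{"n1", "n2", "n3", "n4", "n5"})
	after := newRing([]string{"n1", "n2", "n4", "n5"}) // n3 removed

	violations := 0
	for i := 0; i < 10_000; i++ {
		key := fmt.Sprintf("key:%d", i)
		b, a := before.get(key), after.get(key)
		if b != "n3" && b != a { // a survivor's key moved: guarantee broken
			violations++
		}
	}
	return violations
}

func main() {
	fmt.Println("violations:", countViolations())
}
```

The guarantee holds by construction: removing n3's positions can't change which surviving position a key reaches first on its clockwise walk.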
The implementation
The full code is on GitHub: soyvural/consistent-hashing
It's about 800 lines of Go with zero external dependencies. The core is a sorted slice of uint32 positions with binary search for O(log n) lookups. Hash function is pluggable via an interface — ships with FNV-1a, MD5, and CRC32.
```bash
# try it yourself
git clone https://github.com/soyvural/consistent-hashing.git
cd consistent-hashing
go run cmd/demo/main.go
```
The demo prints all the numbers I quoted in this post. You can tweak the vnode count, swap hash functions, and see exactly how the distribution changes.
Where consistent hashing shows up
Once you know what to look for, it's everywhere:
- Memcached clients use it to route keys to servers
- Amazon DynamoDB uses it for data partitioning (described in their 2007 Dynamo paper)
- Apache Cassandra uses it for its token ring
- Load balancers use it for sticky sessions that survive backend scaling
The common thread: any time you need to spread work across a changing set of workers without reshuffling everything.
Further reading
If you want to go deeper:
- Karger et al. (1997) — the original paper that started it all
- Amazon's Dynamo paper (2007) — consistent hashing at massive scale
- Jump Consistent Hash (2014) — a faster alternative when your nodes don't need names
The full source code is at github.com/soyvural/consistent-hashing. Stars welcome if you found this useful.
Originally published on DEV Community: https://dev.to/veysi/i-built-consistent-hashing-from-scratch-in-go-heres-what-i-learned-24pj
