MCMC Island Hopping: An Intuitive Guide to the Metropolis-Hastings Algorithm
Imagine you're a politician touring a chain of islands. Each island has a different population, and you want to spend time on each island in proportion to its population — more time on crowded islands, less on quiet ones. The catch: you have no map. You can only see the island you're on and the one next door.
This is the core idea behind the Metropolis-Hastings algorithm, one of the most important algorithms in computational statistics. By the end of this post, you'll understand how a simple random walk can recover any target distribution, and you'll implement it from scratch.
The algorithm was originally developed for simulating equations of state in physics by Metropolis et al. (1953), and later generalised by Hastings (1970).
The Problem
You have a target distribution — a list of values representing how much time you should spend at each location. In our island example, the populations are:
| Island | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|--------|---|---|---|---|---|---|---|
| Population | 2 | 3 | 1 | 5 | 8 | 2 | 9 |
You want to generate samples from this distribution. That is, if you record which island you visit at each time step, the histogram of visits should look like the population distribution — island 6 (population 9) should appear about 9 times more often than island 2 (population 1).
Why not just sample directly? In real applications, the distribution is often a complex posterior in thousands of dimensions. You can evaluate $P(x)$ at any point, but you can't normalise it or sample from it directly. MCMC sidesteps this entirely.
Quick Win: Run the Algorithm
Let's see the island hopper in action.
Here's the complete implementation:
```python
import numpy as np
import matplotlib.pyplot as plt

def metropolis_island_hopping(target, n_samples, start=0, burn_in=1000):
    """
    Metropolis-Hastings sampler for a discrete distribution using
    nearest-neighbour proposals (the 'island hopping' variant).

    Args:
        target: List of unnormalised target values (e.g. island populations)
        n_samples: Total number of steps (including burn-in)
        start: Starting island index
        burn_in: Steps to discard before collecting samples

    Returns:
        Array of visit counts for each island
    """
    n_islands = len(target)
    visit_counts = np.zeros(n_islands)
    current = start

    for step in range(n_samples):
        # PROPOSE: pick a neighbour uniformly at random
        if current == 0:
            proposed = 1
        elif current == n_islands - 1:
            proposed = n_islands - 2
        else:
            proposed = current + np.random.choice([-1, 1])

        # CORRECTION: account for asymmetric proposals at boundaries
        # q(proposed -> current) / q(current -> proposed)
        n_neighbours_current = 2 if 0 < current < n_islands - 1 else 1
        n_neighbours_proposed = 2 if 0 < proposed < n_islands - 1 else 1
        proposal_ratio = n_neighbours_current / n_neighbours_proposed

        # ACCEPT/REJECT: Metropolis-Hastings ratio
        acceptance_ratio = (target[proposed] / target[current]) * proposal_ratio

        if acceptance_ratio >= 1 or np.random.random() < acceptance_ratio:
            current = proposed

        # Record visit (after burn-in)
        if step >= burn_in:
            visit_counts[current] += 1

    return visit_counts

# Target distribution: island populations
target = [2, 3, 1, 5, 8, 2, 9]

# Run the sampler
visits = metropolis_island_hopping(target, n_samples=50_000, burn_in=1000)
```
The result: With few samples the histogram is noisy, but as the chain runs longer the orange bars converge to match the blue target — the sampler spends time on each island in proportion to its population, even though it only ever looks at one neighbour at a time.
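That convergence claim is easy to check directly. Here's a condensed, seeded re-run of the same sampler (same proposal and acceptance logic as above, rewritten with NumPy's `Generator` API for reproducibility) that compares the empirical visit frequencies against the exact normalised target:

```python
import numpy as np

def island_hopper(target, n_samples, burn_in=1000, seed=42):
    """Condensed island-hopping sampler; returns normalised visit frequencies."""
    rng = np.random.default_rng(seed)
    n = len(target)
    counts = np.zeros(n)
    current = 0
    for step in range(n_samples):
        # Nearest-neighbour proposal (forced inward at the two ends)
        if current == 0:
            proposed = 1
        elif current == n - 1:
            proposed = n - 2
        else:
            proposed = current + rng.choice([-1, 1])
        # Hastings correction for the boundary asymmetry
        q_ratio = (2 if 0 < current < n - 1 else 1) / (2 if 0 < proposed < n - 1 else 1)
        if rng.random() < min(1.0, target[proposed] / target[current] * q_ratio):
            current = proposed
        if step >= burn_in:
            counts[current] += 1
    return counts / counts.sum()

target = [2, 3, 1, 5, 8, 2, 9]
freq = island_hopper(target, n_samples=200_000)
exact = np.array(target) / sum(target)
print(np.abs(freq - exact).max())  # small, and shrinks as n_samples grows
```

The maximum deviation between the sampled and exact frequencies should be well under a percentage point at this chain length.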
What Just Happened?
The algorithm uses a beautifully simple loop: propose, evaluate, accept or reject. Let's walk through each part.
The Proposal: Pick a Neighbour
```python
if current == 0:
    proposed = 1
elif current == n_islands - 1:
    proposed = n_islands - 2
else:
    proposed = current + np.random.choice([-1, 1])
```
Think of this as flipping a coin to decide whether to look left or right. At the boundaries (island 0 or island 6), there's only one direction to go. This asymmetry matters — we'll correct for it next.
The Proposal Correction
```python
n_neighbours_current = 2 if 0 < current < n_islands - 1 else 1
n_neighbours_proposed = 2 if 0 < proposed < n_islands - 1 else 1
proposal_ratio = n_neighbours_current / n_neighbours_proposed
```
If you're on island 0 (one neighbour), you'll always propose island 1. But from island 1 (two neighbours), you only propose island 0 half the time. This asymmetry would bias the samples if we didn't correct for it.
The correction factor $q(\text{current} \mid \text{proposed}) / q(\text{proposed} \mid \text{current})$ ensures detailed balance — the fundamental property that makes MCMC work.
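For a chain this small, detailed balance can be verified numerically. The following self-contained sketch (my own check, not from the original post) builds the full 7×7 transition matrix of the island hopper — propose a neighbour uniformly, then accept with the MH probability — and confirms that the probability flow $\pi_i P_{ij}$ is symmetric:

```python
import numpy as np

target = np.array([2, 3, 1, 5, 8, 2, 9], dtype=float)
n = len(target)
pi = target / target.sum()  # normalised target distribution

# Full transition matrix: propose a neighbour uniformly, accept via MH
P = np.zeros((n, n))
for i in range(n):
    neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
    for j in neighbours:
        q_ij = 1.0 / len(neighbours)  # P(propose j | at i)
        q_ji = 1.0 / len([k for k in (j - 1, j + 1) if 0 <= k < n])
        accept = min(1.0, (target[j] / target[i]) * (q_ji / q_ij))
        P[i, j] = q_ij * accept
    P[i, i] = 1.0 - P[i].sum()  # rejected proposals stay put

flow = pi[:, None] * P  # flow[i, j] = pi_i * P_ij
assert np.allclose(flow, flow.T)  # detailed balance: pi_i P_ij == pi_j P_ji
print("detailed balance holds")
```

Without the `q_ji / q_ij` correction, the flow matrix would be asymmetric at the boundaries and the check would fail.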
The Accept/Reject Decision
```python
acceptance_ratio = (target[proposed] / target[current]) * proposal_ratio

if acceptance_ratio >= 1 or np.random.random() < acceptance_ratio:
    current = proposed
```
This is the heart of the algorithm. If the proposed island has a higher population, always move there. If it has a lower population, move with probability equal to the population ratio.
For example, if you're on island 4 (population 8) and propose island 3 (population 5), both islands are interior (two neighbours each), so the proposal correction is 1 and the acceptance ratio is simply $5/8 = 0.625$.
You'd move with 62.5% probability. This means you sometimes visit less popular islands — which is essential for exploring the full distribution.
Why Burn-In?
```python
if step >= burn_in:
    visit_counts[current] += 1
```
The starting position is arbitrary. The first few hundred steps are influenced by where you started, not by the target distribution. Discarding these "burn-in" samples removes this initial bias.
Going Deeper
The Acceptance Ratio
The general Metropolis-Hastings acceptance probability is:

$$\alpha(x' \mid x) = \min\left(1, \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)$$
Where:
- $\pi(x')$ is the target distribution at the proposed state
- $\pi(x)$ is the target distribution at the current state
- $q(x' \mid x)$ is the probability of proposing $x'$ from $x$
- $q(x \mid x')$ is the probability of proposing $x$ from $x'$
The ratio $\pi(x') / \pi(x)$ only needs unnormalised values — the normalising constant cancels out. This is why MCMC is so powerful: you never need to compute the full normalisation integral.
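A quick sanity check of that cancellation: scale the island populations by an arbitrary constant (a stand-in for an unknown normalising constant) and the acceptance ratio is unchanged.

```python
import numpy as np

target = np.array([2, 3, 1, 5, 8, 2, 9], dtype=float)
scaled = 1000.0 * target  # same distribution, different (unknown) normalisation

# The acceptance ratio for a 4 -> 3 move is unchanged by the scaling
r_raw = target[3] / target[4]
r_scaled = scaled[3] / scaled[4]
print(r_raw, r_scaled)  # both 0.625; the constant cancels
```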
When the proposal distribution is symmetric ($q(x' \mid x) = q(x \mid x')$), the correction factor disappears and we get the simpler Metropolis algorithm.
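To see the symmetric case in action, here's a minimal random-walk Metropolis sketch for a continuous target (my own illustration, not from the post): a standard normal sampled with Gaussian proposals. Since $q(x' \mid x) = q(x \mid x')$, the acceptance test is just the density ratio, computed in log space.

```python
import numpy as np

def metropolis_gaussian(n_samples, step=1.0, seed=0):
    """Random-walk Metropolis for a standard normal target.

    The proposal x' = x + Normal(0, step) is symmetric, so the Hastings
    correction factor is 1 and the acceptance test reduces to the plain
    Metropolis ratio pi(x') / pi(x).
    """
    rng = np.random.default_rng(seed)
    x = 0.0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_new = x + rng.normal(0.0, step)
        # Unnormalised standard normal density ratio: exp(-x'^2/2) / exp(-x^2/2)
        if rng.random() < np.exp(0.5 * (x * x - x_new * x_new)):
            x = x_new
        samples[i] = x
    return samples

s = metropolis_gaussian(100_000)
print(s.mean(), s.std())  # close to 0 and 1
```

This is essentially the continuous-target exercise from the notebook: the histogram of `samples` approximates the standard normal density.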
Hyperparameters
| Parameter | What it controls | Guidance |
|-----------|------------------|----------|
| `n_samples` | Total chain length | More is better; 10,000-100,000 typical |
| `burn_in` | Samples discarded from the start | 10-20% of `n_samples`; check trace plots |
| `start` | Initial position | Shouldn't matter after burn-in, but starting near high-probability regions helps |
| Proposal width | How far each proposal jumps | Too small = slow exploration; too large = high rejection rate |
Detailed Balance: Why It Works
The algorithm satisfies detailed balance:

$$\pi(x)\, P(x \to x') = \pi(x')\, P(x' \to x)$$
This says: the probability of being at $x$ and transitioning to $x'$ equals the probability of being at $x'$ and transitioning to $x$. Any Markov chain satisfying this condition has $\pi$ as its stationary distribution — so running the chain long enough guarantees convergence to the target.
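Both claims can be checked numerically for the island chain. The self-contained sketch below (my own check) builds the transition matrix, confirms that one step leaves $\pi$ unchanged, and shows that an arbitrary starting distribution converges to $\pi$ under repeated transitions:

```python
import numpy as np

target = np.array([2, 3, 1, 5, 8, 2, 9], dtype=float)
n = len(target)
pi = target / target.sum()

# Transition matrix of the island hopper (uniform neighbour proposal + MH accept)
P = np.zeros((n, n))
for i in range(n):
    neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
    for j in neighbours:
        q_ij = 1.0 / len(neighbours)
        q_ji = 1.0 / len([k for k in (j - 1, j + 1) if 0 <= k < n])
        P[i, j] = q_ij * min(1.0, (target[j] / target[i]) * (q_ji / q_ij))
    P[i, i] = 1.0 - P[i].sum()

assert np.allclose(pi @ P, pi)  # pi is stationary: one step leaves it unchanged

p = np.zeros(n)
p[0] = 1.0  # start with all probability mass on island 0
for _ in range(5000):
    p = p @ P  # apply the transition kernel repeatedly
assert np.allclose(p, pi, atol=1e-8)  # any start converges to pi
print("stationarity and convergence verified")
```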
When NOT to Use Metropolis-Hastings
- High dimensions — Random walk proposals become inefficient; use Hamiltonian Monte Carlo (HMC) instead
- Multimodal targets — The chain can get stuck in one mode; consider parallel tempering
- Strong correlations — Axis-aligned proposals struggle; use adaptive proposals or Gibbs sampling
- When you need speed — Variational inference is much faster (but approximate)
Alternatives:
- Gibbs sampling for conditionally conjugate models
- HMC/NUTS for continuous high-dimensional posteriors (what Stan uses)
- Variational inference when speed matters more than exactness
Deep Dive: The Papers
Metropolis et al. (1953)
The original paper, "Equation of State Calculations by Fast Computing Machines", was written at Los Alamos Scientific Laboratory (now Los Alamos National Laboratory). The authors — Nicholas Metropolis, Arianna and Marshall Rosenbluth, and Augusta and Edward Teller — needed to simulate the behaviour of molecules to compute thermodynamic properties.
Their key insight: instead of evaluating an intractable integral over all possible molecular configurations, generate representative configurations by random sampling and average over those.
> The purpose of this paper is to describe a general method, suitable for fast computing machines, for investigating such properties as equations of state for substances consisting of interacting individual molecules.
The algorithm they proposed uses symmetric proposals only — a random perturbation of one particle's position. The acceptance rule was:

$$\alpha = \min\left(1, e^{-\Delta E / kT}\right)$$
Where $\Delta E$ is the energy difference between the proposed and current configurations. In our island analogy, "energy" maps to the negative log of the population — lower energy means higher population.
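To connect the two formulations concretely: with $kT = 1$ and energy defined as the negative log population, the Metropolis energy rule reproduces exactly the population-ratio rule we implemented (a small check of my own, using the 4 → 3 move from earlier):

```python
import numpy as np

pop = np.array([2, 3, 1, 5, 8, 2, 9], dtype=float)
E = -np.log(pop)  # "energy" of each island, taking kT = 1

# Metropolis rule in energy form vs population-ratio form, for a 4 -> 3 move
dE = E[3] - E[4]
accept_energy = min(1.0, np.exp(-dE))
accept_ratio = min(1.0, pop[3] / pop[4])
print(accept_energy, accept_ratio)  # both 0.625
```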
Hastings (1970)
W.K. Hastings generalised the algorithm in "Monte Carlo Sampling Methods Using Markov Chains and Their Applications", allowing asymmetric proposal distributions. This is exactly the correction factor we implemented for the boundary islands.
Hastings showed that the acceptance probability:

$$\alpha(x' \mid x) = \min\left(1, \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)$$
guarantees detailed balance for any proposal distribution $q$, not just symmetric ones. This seemingly small generalisation massively expanded the applicability of MCMC — it enabled samplers with directed proposals, mixture proposals, and eventually led to modern algorithms like HMC.
Our Implementation vs the Papers
| Paper concept | Our code |
|---------------|----------|
| Target $\pi(x)$ | `target[i]` (island populations) |
| Proposal $q(x' \mid x)$ | Uniform over neighbours (left/right) |
| Hastings correction | `proposal_ratio` for boundary asymmetry |
| Energy $\Delta E$ | `-log(target[proposed]) + log(target[current])` (implicit) |
Historical Context
- Metropolis et al. (1953) — Original algorithm for molecular simulation
- Hastings (1970) — Generalisation to asymmetric proposals
- Geman & Geman (1984) — Gibbs sampling (special case of MH)
- Gelfand & Smith (1990) — Brought MCMC to mainstream statistics
- Neal (2011) — Hamiltonian Monte Carlo for efficient high-dimensional sampling
Further Reading
- Metropolis et al. (1953) — The original paper from Los Alamos
- Hastings (1970) — The generalisation to asymmetric proposals
- Chib & Greenberg (1995) — Excellent tutorial in The American Statistician
- Bishop's PRML Chapter 11 — Modern textbook coverage of MCMC methods
Try It Yourself
The interactive notebook includes exercises:
- Trace plots — Plot the island index over time to visualise the random walk
- Burn-in sensitivity — How does changing `burn_in` affect the result?
- Chain length — Run with 1,000 vs 100,000 samples and compare accuracy
- Cyclic islands — Modify the code so island 0 and island 6 are neighbours (wrap around)
- Continuous target — Replace the discrete islands with a Gaussian distribution and a continuous proposal
Understanding Metropolis-Hastings opens the door to Bayesian inference, probabilistic programming (Stan, PyMC), and modern sampling methods like HMC. The core insight, that a carefully designed random walk converges to any target distribution, is one of the most powerful ideas in computational statistics.
Interactive Tools
- Markov Chain Simulator — Visualise Markov chain dynamics, transition matrices, and stationary distributions
- Bayes' Theorem Calculator — Compute posterior probabilities that MCMC approximates at scale
Related Posts
- From MLE to Bayesian Inference — The Bayesian framework that MCMC makes computationally tractable
- Hierarchical Bayesian Regression with PyMC — MCMC sampling in action with PyMC
- Bayesian Survival Analysis with PyMC — Custom likelihoods sampled via MCMC
- MCMC Poisson Mixture: Earthquake Regimes — MCMC for real-world regime detection
Frequently Asked Questions
What is MCMC and why do I need it?
MCMC (Markov Chain Monte Carlo) is a family of algorithms for sampling from probability distributions that are difficult to sample from directly. You need it when you want to do Bayesian inference but the posterior distribution has no closed-form solution, which is the case for most real-world models.
How does the Metropolis-Hastings algorithm work?
Start at a random point, propose a new point by adding random noise, and accept or reject the proposal based on how much more probable the new point is. If it is more probable, always accept. If less probable, accept with a probability equal to the ratio. Over many iterations, the chain of accepted points forms samples from the target distribution.
How do I know if my MCMC chain has converged?
Check trace plots for good mixing (the chain explores the full range of plausible values without getting stuck). Run multiple chains from different starting points and check that they agree (R-hat close to 1.0). Discard early samples as burn-in. If the chain looks like white noise around a stable mean, it has likely converged.
What is the burn-in period and how long should it be?
Burn-in is the initial period where the chain is still moving from its arbitrary starting point towards the high-probability region. These samples are discarded because they do not represent the target distribution. The length depends on the problem; monitor the trace plot and discard until the chain appears stationary. Typically 10-50% of total samples.
What is the acceptance rate and what should it be?
The acceptance rate is the fraction of proposed moves that are accepted. Too high (above 50%) means steps are too small and the chain explores slowly. Too low (below 15%) means steps are too large and most are rejected. For a multivariate target, aim for about 23% acceptance rate; for a univariate target, about 44%.
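One way to see those trade-offs is to measure the acceptance rate of a random-walk Metropolis sampler on a standard normal target at different proposal widths (an illustrative sketch of my own, not from the post):

```python
import numpy as np

def acceptance_rate(step, n_steps=20_000, seed=0):
    """Fraction of accepted moves for random-walk Metropolis
    on a standard normal target, with proposal width `step`."""
    rng = np.random.default_rng(seed)
    x, accepted = 0.0, 0
    for _ in range(n_steps):
        x_new = x + rng.normal(0.0, step)
        # Unnormalised N(0, 1) density ratio, computed in log space
        if rng.random() < np.exp(0.5 * (x * x - x_new * x_new)):
            x = x_new
            accepted += 1
    return accepted / n_steps

for step in (0.1, 1.0, 10.0):
    print(f"step={step:>5}: acceptance={acceptance_rate(step):.2f}")
# Tiny steps accept almost everything but barely move;
# huge steps are mostly rejected. Both explore slowly.
```

The acceptance rate falls monotonically as the proposal width grows; the tuning advice above is about landing between the two failure modes.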
DEV Community
https://dev.to/berkan_sesen/mcmc-island-hopping-an-intuitive-guide-to-the-metropolis-hastings-algorithm-531d
