MCMC Island Hopping: An Intuitive Guide to the Metropolis-Hastings Algorithm
Imagine you're a politician touring a chain of islands. Each island has a different population, and you want to spend time on each island in proportion to its population — more time on crowded islands, less on quiet ones. The catch: you have no map. You can only see the island you're on and the one next door.
This is the core idea behind the Metropolis-Hastings algorithm, one of the most important algorithms in computational statistics. By the end of this post, you'll understand how a simple random walk can recover any target distribution, and you'll implement it from scratch.
The algorithm was originally developed for simulating equations of state in physics by Metropolis et al. (1953), and later generalised by Hastings (1970).
The Problem
You have a target distribution — a list of values representing how much time you should spend at each location. In our island example, the populations are:
| Island | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|--------|---|---|---|---|---|---|---|
| Population | 2 | 3 | 1 | 5 | 8 | 2 | 9 |
You want to generate samples from this distribution. That is, if you record which island you visit at each time step, the histogram of visits should look like the population distribution — island 6 (population 9) should appear about 9 times more often than island 2 (population 1).
Why not just sample directly? In real applications, the distribution is often a complex posterior in thousands of dimensions. You can evaluate $P(x)$ at any point, but you can't normalise it or sample from it directly. MCMC sidesteps this entirely.
Quick Win: Run the Algorithm
Let's see the island hopper in action.
Here's the complete implementation:
```python
import numpy as np
import matplotlib.pyplot as plt

def metropolis_island_hopping(target, n_samples, start=0, burn_in=1000):
    """
    Metropolis-Hastings sampler for a discrete distribution using
    nearest-neighbour proposals (the 'island hopping' variant).

    Args:
        target: List of unnormalised target values (e.g. island populations)
        n_samples: Total number of steps (including burn-in)
        start: Starting island index
        burn_in: Steps to discard before collecting samples

    Returns:
        Array of visit counts for each island
    """
    n_islands = len(target)
    visit_counts = np.zeros(n_islands)
    current = start

    for step in range(n_samples):
        # PROPOSE: pick a neighbour uniformly at random
        if current == 0:
            proposed = 1
        elif current == n_islands - 1:
            proposed = n_islands - 2
        else:
            proposed = current + np.random.choice([-1, 1])

        # CORRECTION: account for asymmetric proposals at boundaries
        # q(proposed -> current) / q(current -> proposed)
        n_neighbours_current = 2 if 0 < current < n_islands - 1 else 1
        n_neighbours_proposed = 2 if 0 < proposed < n_islands - 1 else 1
        proposal_ratio = n_neighbours_current / n_neighbours_proposed

        # ACCEPT/REJECT: Metropolis-Hastings ratio
        acceptance_ratio = (target[proposed] / target[current]) * proposal_ratio

        if acceptance_ratio >= 1 or np.random.random() < acceptance_ratio:
            current = proposed

        # Record visit (after burn-in)
        if step >= burn_in:
            visit_counts[current] += 1

    return visit_counts

# Target distribution: island populations
target = [2, 3, 1, 5, 8, 2, 9]

# Run the sampler
visits = metropolis_island_hopping(target, n_samples=50_000, burn_in=1000)
```
The result: With few samples the histogram is noisy, but as the chain runs longer the orange bars converge to match the blue target — the sampler spends time on each island in proportion to its population, even though it only ever looks at one neighbour at a time.
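That convergence claim is easy to check directly. Here's a condensed, seeded re-run of the same sampler (same proposal and acceptance logic as above, rewritten with NumPy's `Generator` API for reproducibility) that compares the empirical visit frequencies against the exact normalised target:

```python
import numpy as np

def island_hopper(target, n_samples, burn_in=1000, seed=42):
    """Condensed island-hopping sampler; returns normalised visit frequencies."""
    rng = np.random.default_rng(seed)
    n = len(target)
    counts = np.zeros(n)
    current = 0
    for step in range(n_samples):
        # Nearest-neighbour proposal (forced inward at the two ends)
        if current == 0:
            proposed = 1
        elif current == n - 1:
            proposed = n - 2
        else:
            proposed = current + rng.choice([-1, 1])
        # Hastings correction for the boundary asymmetry
        q_ratio = (2 if 0 < current < n - 1 else 1) / (2 if 0 < proposed < n - 1 else 1)
        if rng.random() < min(1.0, target[proposed] / target[current] * q_ratio):
            current = proposed
        if step >= burn_in:
            counts[current] += 1
    return counts / counts.sum()

target = [2, 3, 1, 5, 8, 2, 9]
freq = island_hopper(target, n_samples=200_000)
exact = np.array(target) / sum(target)
print(np.abs(freq - exact).max())  # small, and shrinks as n_samples grows
```

The maximum deviation between the sampled and exact frequencies should be well under a percentage point at this chain length.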
What Just Happened?
The algorithm uses a beautifully simple loop: propose, evaluate, accept or reject. Let's walk through each part.
The Proposal: Pick a Neighbour
```python
if current == 0:
    proposed = 1
elif current == n_islands - 1:
    proposed = n_islands - 2
else:
    proposed = current + np.random.choice([-1, 1])
```
Think of this as flipping a coin to decide whether to look left or right. At the boundaries (island 0 or island 6), there's only one direction to go. This asymmetry matters — we'll correct for it next.
The Proposal Correction
```python
n_neighbours_current = 2 if 0 < current < n_islands - 1 else 1
n_neighbours_proposed = 2 if 0 < proposed < n_islands - 1 else 1
proposal_ratio = n_neighbours_current / n_neighbours_proposed
```
If you're on island 0 (one neighbour), you'll always propose island 1. But from island 1 (two neighbours), you only propose island 0 half the time. This asymmetry would bias the samples if we didn't correct for it.
The correction factor $q(\text{current} \mid \text{proposed}) / q(\text{proposed} \mid \text{current})$ ensures detailed balance — the fundamental property that makes MCMC work.
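For a chain this small, detailed balance can be verified numerically. The following self-contained sketch (my own check, not from the original post) builds the full 7×7 transition matrix of the island hopper — propose a neighbour uniformly, then accept with the MH probability — and confirms that the probability flow $\pi_i P_{ij}$ is symmetric:

```python
import numpy as np

target = np.array([2, 3, 1, 5, 8, 2, 9], dtype=float)
n = len(target)
pi = target / target.sum()  # normalised target distribution

# Full transition matrix: propose a neighbour uniformly, accept via MH
P = np.zeros((n, n))
for i in range(n):
    neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
    for j in neighbours:
        q_ij = 1.0 / len(neighbours)  # P(propose j | at i)
        q_ji = 1.0 / len([k for k in (j - 1, j + 1) if 0 <= k < n])
        accept = min(1.0, (target[j] / target[i]) * (q_ji / q_ij))
        P[i, j] = q_ij * accept
    P[i, i] = 1.0 - P[i].sum()  # rejected proposals stay put

flow = pi[:, None] * P  # flow[i, j] = pi_i * P_ij
assert np.allclose(flow, flow.T)  # detailed balance: pi_i P_ij == pi_j P_ji
print("detailed balance holds")
```

Without the `q_ji / q_ij` correction, the flow matrix would be asymmetric at the boundaries and the check would fail.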
The Accept/Reject Decision
```python
acceptance_ratio = (target[proposed] / target[current]) * proposal_ratio

if acceptance_ratio >= 1 or np.random.random() < acceptance_ratio:
    current = proposed
```
This is the heart of the algorithm. If the proposed island has a higher population, always move there. If it has a lower population, move with probability equal to the population ratio.
For example, if you're on island 4 (population 8) and propose island 3 (population 5), both islands are interior (two neighbours each), so the proposal correction is 1 and the acceptance ratio is simply $5/8 = 0.625$.
You'd move with 62.5% probability. This means you sometimes visit less popular islands — which is essential for exploring the full distribution.
Why Burn-In?
```python
if step >= burn_in:
    visit_counts[current] += 1
```
The starting position is arbitrary. The first few hundred steps are influenced by where you started, not by the target distribution. Discarding these "burn-in" samples removes this initial bias.
Going Deeper
The Acceptance Ratio
The general Metropolis-Hastings acceptance probability is:

$$\alpha(x' \mid x) = \min\left(1, \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)$$
Where:
- $\pi(x')$ is the target distribution at the proposed state
- $\pi(x)$ is the target distribution at the current state
- $q(x' \mid x)$ is the probability of proposing $x'$ from $x$
- $q(x \mid x')$ is the probability of proposing $x$ from $x'$
The ratio $\pi(x') / \pi(x)$ only needs unnormalised values — the normalising constant cancels out. This is why MCMC is so powerful: you never need to compute the full normalisation integral.
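A quick sanity check of that cancellation: scale the island populations by an arbitrary constant (a stand-in for an unknown normalising constant) and the acceptance ratio is unchanged.

```python
import numpy as np

target = np.array([2, 3, 1, 5, 8, 2, 9], dtype=float)
scaled = 1000.0 * target  # same distribution, different (unknown) normalisation

# The acceptance ratio for a 4 -> 3 move is unchanged by the scaling
r_raw = target[3] / target[4]
r_scaled = scaled[3] / scaled[4]
print(r_raw, r_scaled)  # both 0.625; the constant cancels
```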
When the proposal distribution is symmetric ($q(x' \mid x) = q(x \mid x')$), the correction factor disappears and we get the simpler Metropolis algorithm.
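To see the symmetric case in action, here's a minimal random-walk Metropolis sketch for a continuous target (my own illustration, not from the post): a standard normal sampled with Gaussian proposals. Since $q(x' \mid x) = q(x \mid x')$, the acceptance test is just the density ratio, computed in log space.

```python
import numpy as np

def metropolis_gaussian(n_samples, step=1.0, seed=0):
    """Random-walk Metropolis for a standard normal target.

    The proposal x' = x + Normal(0, step) is symmetric, so the Hastings
    correction factor is 1 and the acceptance test reduces to the plain
    Metropolis ratio pi(x') / pi(x).
    """
    rng = np.random.default_rng(seed)
    x = 0.0
    samples = np.empty(n_samples)
    for i in range(n_samples):
        x_new = x + rng.normal(0.0, step)
        # Unnormalised standard normal density ratio: exp(-x'^2/2) / exp(-x^2/2)
        if rng.random() < np.exp(0.5 * (x * x - x_new * x_new)):
            x = x_new
        samples[i] = x
    return samples

s = metropolis_gaussian(100_000)
print(s.mean(), s.std())  # close to 0 and 1
```

This is essentially the continuous-target exercise from the notebook: the histogram of `samples` approximates the standard normal density.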
Hyperparameters
| Parameter | What it controls | Guidance |
|-----------|------------------|----------|
| `n_samples` | Total chain length | More is better; 10,000-100,000 typical |
| `burn_in` | Samples discarded from the start | 10-20% of `n_samples`; check trace plots |
| `start` | Initial position | Shouldn't matter after burn-in, but starting near high-probability regions helps |
| Proposal width | How far each proposal jumps | Too small = slow exploration; too large = high rejection rate |
Detailed Balance: Why It Works
The algorithm satisfies detailed balance:

$$\pi(x)\, P(x \to x') = \pi(x')\, P(x' \to x)$$
This says: the probability of being at $x$ and transitioning to $x'$ equals the probability of being at $x'$ and transitioning to $x$. Any Markov chain satisfying this condition has $\pi$ as its stationary distribution — so running the chain long enough guarantees convergence to the target.
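Both claims can be checked numerically for the island chain. The self-contained sketch below (my own check) builds the transition matrix, confirms that one step leaves $\pi$ unchanged, and shows that an arbitrary starting distribution converges to $\pi$ under repeated transitions:

```python
import numpy as np

target = np.array([2, 3, 1, 5, 8, 2, 9], dtype=float)
n = len(target)
pi = target / target.sum()

# Transition matrix of the island hopper (uniform neighbour proposal + MH accept)
P = np.zeros((n, n))
for i in range(n):
    neighbours = [j for j in (i - 1, i + 1) if 0 <= j < n]
    for j in neighbours:
        q_ij = 1.0 / len(neighbours)
        q_ji = 1.0 / len([k for k in (j - 1, j + 1) if 0 <= k < n])
        P[i, j] = q_ij * min(1.0, (target[j] / target[i]) * (q_ji / q_ij))
    P[i, i] = 1.0 - P[i].sum()

assert np.allclose(pi @ P, pi)  # pi is stationary: one step leaves it unchanged

p = np.zeros(n)
p[0] = 1.0  # start with all probability mass on island 0
for _ in range(5000):
    p = p @ P  # apply the transition kernel repeatedly
assert np.allclose(p, pi, atol=1e-8)  # any start converges to pi
print("stationarity and convergence verified")
```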
When NOT to Use Metropolis-Hastings
- High dimensions — Random walk proposals become inefficient; use Hamiltonian Monte Carlo (HMC) instead
- Multimodal targets — The chain can get stuck in one mode; consider parallel tempering
- Strong correlations — Axis-aligned proposals struggle; use adaptive proposals or Gibbs sampling
- When you need speed — Variational inference is much faster (but approximate)
Alternatives:
- Gibbs sampling for conditionally conjugate models
- HMC/NUTS for continuous high-dimensional posteriors (what Stan uses)
- Variational inference when speed matters more than exactness
Deep Dive: The Papers
Metropolis et al. (1953)
The original paper, "Equation of State Calculations by Fast Computing Machines", was written at Los Alamos Scientific Laboratory (now Los Alamos National Laboratory). The authors — Nicholas Metropolis, Arianna and Marshall Rosenbluth, and Augusta and Edward Teller — needed to simulate the behaviour of molecules to compute thermodynamic properties.
Their key insight: instead of evaluating an intractable integral over all possible molecular configurations, generate representative configurations by random sampling and average over those.
> The purpose of this paper is to describe a general method, suitable for fast computing machines, for investigating such properties as equations of state for substances consisting of interacting individual molecules.
The algorithm they proposed uses symmetric proposals only — a random perturbation of one particle's position. The acceptance rule was:

$$\alpha = \min\left(1, e^{-\Delta E / kT}\right)$$
Where $\Delta E$ is the energy difference between the proposed and current configurations. In our island analogy, "energy" maps to the negative log of the population — lower energy means higher population.
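To connect the two formulations concretely: with $kT = 1$ and energy defined as the negative log population, the Metropolis energy rule reproduces exactly the population-ratio rule we implemented (a small check of my own, using the 4 → 3 move from earlier):

```python
import numpy as np

pop = np.array([2, 3, 1, 5, 8, 2, 9], dtype=float)
E = -np.log(pop)  # "energy" of each island, taking kT = 1

# Metropolis rule in energy form vs population-ratio form, for a 4 -> 3 move
dE = E[3] - E[4]
accept_energy = min(1.0, np.exp(-dE))
accept_ratio = min(1.0, pop[3] / pop[4])
print(accept_energy, accept_ratio)  # both 0.625
```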
Hastings (1970)
W.K. Hastings generalised the algorithm in "Monte Carlo Sampling Methods Using Markov Chains and Their Applications", allowing asymmetric proposal distributions. This is exactly the correction factor we implemented for the boundary islands.
Hastings showed that the acceptance probability:

$$\alpha(x' \mid x) = \min\left(1, \frac{\pi(x')\, q(x \mid x')}{\pi(x)\, q(x' \mid x)}\right)$$
guarantees detailed balance for any proposal distribution $q$, not just symmetric ones. This seemingly small generalisation massively expanded the applicability of MCMC — it enabled samplers with directed proposals, mixture proposals, and eventually led to modern algorithms like HMC.
Our Implementation vs the Papers
| Paper concept | Our code |
|---------------|----------|
| Target $\pi(x)$ | `target[i]` (island populations) |
| Proposal $q(x' \mid x)$ | Uniform over neighbours (left/right) |
| Hastings correction | `proposal_ratio` for boundary asymmetry |
| Energy $\Delta E$ | `-log(target[proposed]) + log(target[current])` (implicit) |
Historical Context
- Metropolis et al. (1953) — Original algorithm for molecular simulation
- Hastings (1970) — Generalisation to asymmetric proposals
- Geman & Geman (1984) — Gibbs sampling (special case of MH)
- Gelfand & Smith (1990) — Brought MCMC to mainstream statistics
- Neal (2011) — Hamiltonian Monte Carlo for efficient high-dimensional sampling
Further Reading
- Metropolis et al. (1953) — The original paper from Los Alamos
- Hastings (1970) — The generalisation to asymmetric proposals
- Chib & Greenberg (1995) — Excellent tutorial in The American Statistician
- Bishop's PRML Chapter 11 — Modern textbook coverage of MCMC methods
Try It Yourself
The interactive notebook includes exercises:
- Trace plots — Plot the island index over time to visualise the random walk
- Burn-in sensitivity — How does changing `burn_in` affect the result?
- Chain length — Run with 1,000 vs 100,000 samples and compare accuracy
- Cyclic islands — Modify the code so island 0 and island 6 are neighbours (wrap around)
- Continuous target — Replace the discrete islands with a Gaussian distribution and a continuous proposal
Understanding Metropolis-Hastings opens the door to Bayesian inference, probabilistic programming (Stan, PyMC), and modern sampling methods like HMC. The core insight, that a carefully designed random walk converges to any target distribution, is one of the most powerful ideas in computational statistics.
Interactive Tools
- Markov Chain Simulator — Visualise Markov chain dynamics, transition matrices, and stationary distributions
- Bayes' Theorem Calculator — Compute posterior probabilities that MCMC approximates at scale
Related Posts
- From MLE to Bayesian Inference — The Bayesian framework that MCMC makes computationally tractable
- Hierarchical Bayesian Regression with PyMC — MCMC sampling in action with PyMC
- Bayesian Survival Analysis with PyMC — Custom likelihoods sampled via MCMC
- MCMC Poisson Mixture: Earthquake Regimes — MCMC for real-world regime detection
Frequently Asked Questions
What is MCMC and why do I need it?
MCMC (Markov Chain Monte Carlo) is a family of algorithms for sampling from probability distributions that are difficult to sample from directly. You need it when you want to do Bayesian inference but the posterior distribution has no closed-form solution, which is the case for most real-world models.
How does the Metropolis-Hastings algorithm work?
Start at a random point, propose a new point by adding random noise, and accept or reject the proposal based on how much more probable the new point is. If it is more probable, always accept. If less probable, accept with a probability equal to the ratio. Over many iterations, the chain of accepted points forms samples from the target distribution.
How do I know if my MCMC chain has converged?
Check trace plots for good mixing (the chain explores the full range of plausible values without getting stuck). Run multiple chains from different starting points and check that they agree (R-hat close to 1.0). Discard early samples as burn-in. If the chain looks like white noise around a stable mean, it has likely converged.
What is the burn-in period and how long should it be?
Burn-in is the initial period where the chain is still moving from its arbitrary starting point towards the high-probability region. These samples are discarded because they do not represent the target distribution. The length depends on the problem; monitor the trace plot and discard until the chain appears stationary. Typically 10-50% of total samples.
What is the acceptance rate and what should it be?
The acceptance rate is the fraction of proposed moves that are accepted. Too high (above 50%) means steps are too small and the chain explores slowly. Too low (below 15%) means steps are too large and most are rejected. For a multivariate target, aim for about 23% acceptance rate; for a univariate target, about 44%.
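One way to see those trade-offs is to measure the acceptance rate of a random-walk Metropolis sampler on a standard normal target at different proposal widths (an illustrative sketch of my own, not from the post):

```python
import numpy as np

def acceptance_rate(step, n_steps=20_000, seed=0):
    """Fraction of accepted moves for random-walk Metropolis
    on a standard normal target, with proposal width `step`."""
    rng = np.random.default_rng(seed)
    x, accepted = 0.0, 0
    for _ in range(n_steps):
        x_new = x + rng.normal(0.0, step)
        # Unnormalised N(0, 1) density ratio, computed in log space
        if rng.random() < np.exp(0.5 * (x * x - x_new * x_new)):
            x = x_new
            accepted += 1
    return accepted / n_steps

for step in (0.1, 1.0, 10.0):
    print(f"step={step:>5}: acceptance={acceptance_rate(step):.2f}")
# Tiny steps accept almost everything but barely move;
# huge steps are mostly rejected. Both explore slowly.
```

The acceptance rate falls monotonically as the proposal width grows; the tuning advice above is about landing between the two failure modes.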
DEV Community
https://dev.to/berkan_sesen/mcmc-island-hopping-an-intuitive-guide-to-the-metropolis-hastings-algorithm-531d
