Rewrites.bio: 60x speedup in Genomics QC and AI rewrite principles for Science
Article URL: https://rewrites.bio/ · Comments: https://news.ycombinator.com/item?id=47616595 (7 points, 1 comment)
1.1 The opportunity
Bioinformatics datasets have grown faster than the tools that process them. The cost shows up in compute bills, carbon footprint, and time to results.
- Scale: A pipeline like nf-core/rnaseq runs on millions of samples per year. Ten wasted minutes per sample is decades of compute, annually.
- Environment: Wasted CPU-hours mean kilowatt-hours and CO₂. Rewrites running 10-50× faster can save hundreds of tonnes of emissions.
- Science: Hours waiting for results are hours not spent testing hypotheses or exploring new directions.
AI coding assistants now let a domain expert produce a working reimplementation in days rather than years. Fast code is now cheap. Scientific insight, validation, and trust are not.
Example: RustQC
Time per sample on a 10 GB BAM: traditional tools 15:34:00; RustQC 00:14:54.
At 100k samples/year: 63× faster, 1.5M CPU-hours saved, 98% less energy, ~150 t CO₂e saved.
1.2 The risk
Cheap code is not the same as correct code. AI-generated implementations are fluent and fast, but confidently wrong in ways that are easy to miss.
- Correctness: A rewrite that silently produces different results poisons every analysis built on it.
- Attribution: Ignoring the original authors' contributions damages the incentive structures that made the tool possible.
- Fragmentation: Competing rewrites with different behaviour leave users unsure which to trust. Abandoned projects block better alternatives.
The existing tools are correct, and that correctness is hard-won. Rewrites inherit a responsibility: to match outputs exactly, to credit the work they stand on, and to be honest about what they have and have not validated.
A wave of AI-assisted rewrites is coming. The question is not whether it will happen, but whether it will happen well.
Behind every bioinformatics tool is years of work: algorithm design, edge-case hunting, user feedback, and maintenance. A rewrite stands entirely on that foundation. Without the original, a rewrite cannot even be validated.
Make credit visible:
- README and documentation
- Papers and preprints
- Downstream reports using the rewrite's results
- Credits page with citations, DOIs, author lists, and versions
When users cite your rewrite, they should know to cite the original too. This is how science tracks provenance. Obscuring lineage undermines the incentives that produced the original work.
[Diagram: a paper using fastalign-rs, a rewrite built on the original tool slowalign (DOI 10.xxxx/xxxxxx), must cite the original; citing the rewrite itself is optional.]
The goal is a faster tool that produces the same results. A rewrite that changes outputs, even as "improvements", is a different tool. It cannot inherit the original's validation or be substituted without re-validation.
Emulate exactly means:
- Deterministic tools: byte-for-byte identical output files
- Floating-point tools: results within acceptable numerical precision (defined by scientists, not convenience)
- Everything counts: header formats, column ordering, file naming, summary statistics
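The two comparison modes above can be sketched as one checker. This is a minimal sketch, not tooling from the article: the whitespace-token parsing and the relative-tolerance handling are assumptions to adapt to the real output format.

```python
import math

def outputs_match(original_bytes, rewrite_bytes, float_tol=None):
    """Byte-for-byte by default (deterministic tools); token-by-token
    with a relative tolerance when float_tol is set (floating-point
    tools, with the tolerance defined by scientists, not convenience)."""
    if float_tol is None:
        return original_bytes == rewrite_bytes
    a = original_bytes.decode().split()
    b = rewrite_bytes.decode().split()
    if len(a) != len(b):
        return False
    for ta, tb in zip(a, b):
        try:
            if not math.isclose(float(ta), float(tb), rel_tol=float_tol):
                return False
        except ValueError:
            # non-numeric tokens (chromosome names, headers) must match exactly
            if ta != tb:
                return False
    return True
```

In practice you would read both output files as bytes and feed them in; headers, column ordering, and naming all flow through the same comparison.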
[Diagram: a good rewrite has the exact same shape as the original and fits the pipeline; a bad rewrite is close but the wrong shape and doesn't fit.]
When an AI writes the implementation, say so. Correctness is determined by output comparison, not by who wrote the code - but users should know how it was built.
Document clearly in the README and any paper:
- AI tools: which were used and what role they played
- Verification: how correctness was validated (e.g., "against mytool v1.0 on real sequencing data")
- Gaps: where validation coverage is thin or incomplete
Output comparison catches what you tested, not what you haven't. Commit to ongoing validation as new use cases emerge.
README.md - Provenance Section
AI Assistance Disclosure
This tool was written with the assistance of AI coding agents (Claude, GitHub Copilot).
Correctness is validated by comparing output against mytool v1.0 on a suite of real sequencing datasets - not by manual code review alone. The AI generated the implementation; humans defined the validation criteria and verified the results.
Badges: 🤖 AI-assisted · ✓ Validated vs mytool v1.0 · Validation methodology: github.com/user/mytool/VALIDATION.md
The biggest performance wins come from rethinking architecture, not just porting code. Modern pipelines chain discrete tools, each doing I/O. A straight port to a faster language captures only a fraction of what is available.
Questions to ask:
- Upstream: what preprocessing does this tool assume? Can that be folded in?
- Downstream: what immediately consumes the output? Can that be folded in too?
- Intermediate files: can steps share data in memory instead of writing gigabytes of temporary BAM files?
This doesn't mean building an unvalidatable monolith. But consider what a pipeline designed from scratch would look like, knowing where the bottlenecks are.
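A toy illustration of the fusion idea, assuming a hypothetical stream of sequence records: three statistics that chained tools would each re-read the input to compute, produced in one pass with nothing written to disk in between.

```python
def fused_qc_pass(records):
    """Single pass over an iterable of sequence strings (a stand-in for
    streaming BAM/FASTQ records), computing read count, total bases,
    and GC fraction together instead of in three separate tool runs."""
    n_reads = 0
    n_bases = 0
    n_gc = 0
    for seq in records:
        n_reads += 1
        n_bases += len(seq)
        n_gc += seq.count("G") + seq.count("C")
    gc_frac = n_gc / n_bases if n_bases else 0.0
    return {"reads": n_reads, "bases": n_bases, "gc": gc_frac}
```

The same shape scales to real pipelines: each fused stage consumes records from memory rather than re-parsing gigabytes of intermediate files.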
[Diagram: before - each tool reads the input file independently; after - the architecture is rethought into a single pass over the input.]
Thinking big about architecture does not mean building big in one go. The implementation strategy should be the opposite: small, tight iteration loops, each step validated against the original before moving on.
Why this matters:
- Localised debugging: when something is wrong, you already know where to look
- Continuous validation: AI agents are unreliable for correctness - validate every step
- Productive iteration: "output doesn't match on paired-end data" is actionable; "implement the whole pipeline" is not
Start with the simplest function that produces testable output. Validate it. Then extend.
[Flowchart: 01 pick one function → 02 write rewrite → 03 compare outputs; on mismatch, fix and re-compare; on match, move to the next function.]
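The loop above can be sketched as a small validation harness; in a real rewrite, `reference_fn` would typically wrap a call to the original tool (the names here are hypothetical):

```python
def validate_step(name, rewrite_fn, reference_fn, cases):
    """Run one rewritten function against the original implementation
    on a set of inputs; report the first mismatch so debugging stays
    localised to the step just written."""
    for case in cases:
        got, want = rewrite_fn(case), reference_fn(case)
        if got != want:
            return f"{name}: mismatch on {case!r}: got {got!r}, want {want!r}"
    return None  # all cases match -> safe to move to the next function
```

A returned message like "mismatch on paired-end record" is immediately actionable; a silent pass gates progress to the next function.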
Synthetic data is useful for quick iteration but insufficient for validation. Real sequencing data has error patterns and quality profiles that generators don't replicate.
Test comprehensively:
- Diversity: multiple organisms, platforms, and library preps
- Edge cases: empty files, single-read files, very large files
- Benchmarks: document hardware, dataset, exact commands, and output comparison
Speed claims need evidence. Document your hardware, dataset, and exact commands so anyone can reproduce the comparison. If performance varies across data types, show that too.
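One way to sketch the edge-case battery, assuming a hypothetical tool function that counts reads; a real suite would run the actual binaries on real files and compare full outputs:

```python
def run_edge_cases(tool_fn):
    """Exercise edge cases that synthetic data generators rarely cover.
    `tool_fn` is a hypothetical stand-in taking a list of reads and
    returning a read count; it must neither crash nor mis-count."""
    cases = {
        "empty file": [],
        "single read": ["ACGT"],
        "zero-length read": [""],
    }
    failures = []
    for label, reads in cases.items():
        try:
            if tool_fn(reads) != len(reads):
                failures.append(label)
        except Exception:
            failures.append(label)
    return failures
```

Empty inputs are a classic trap: a tool that is correct on every real dataset can still crash or report one read on a zero-read file.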
Validation coverage (example grid, each cell marked tested or not yet):
- Organisms: Hs, Mm, Dm, Dr, At, Ce
- Platforms: A, B, C, D, E
- Hardware: x86, ARM, HPC, GPU
- Edge cases: empty, 1 read, 50 GB, PE, SE, bad
Feature completeness is not the goal - output correctness on the features that matter is. Most tools have accumulated flags and edge cases over years; not all are used in your target pipeline.
A focused approach:
- Audit usage: identify exactly which capabilities your pipeline uses
- Fail loudly: for unsupported features, error with a clear message pointing to the original
- Never ignore: silently dropping flags or producing wrong output breaks trust
A rewrite that does four things correctly is more valuable than one that claims fifteen and does twelve right. Scope can expand as the project matures; it cannot easily contract once users depend on it.
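A minimal sketch of the fail-loudly pattern; the supported set and the error text are illustrative assumptions, not a real CLI:

```python
import sys

# scope fixed by auditing exactly what the target pipeline uses
SUPPORTED = {"sort", "index", "flagstat", "view"}

def dispatch(subcommand):
    """Accept only audited subcommands; for anything else, error with a
    clear message pointing users to the original tool rather than
    silently ignoring the flag or producing wrong output."""
    if subcommand not in SUPPORTED:
        sys.exit(
            f"error: '{subcommand}' is not supported by this rewrite; "
            f"please use the original tool for this subcommand"
        )
    return subcommand
```

The non-zero exit makes pipelines halt immediately instead of carrying a silently wrong file downstream.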
Scope Definition - samtools rewrite (example)
- ✓ sort: coordinate & name sort
- ✓ index: BAI generation
- ✓ flagstat: alignment statistics
- ✓ view: format conversion
- ○ mpileup: not used in pipeline
- ○ merge: handled upstream
- ○ fasta/fastq: not in critical path
- ○ depth: deferred to v2
Every equivalence claim is implicitly about a specific version. "Compatible with mytool" is meaningless. "Bit-identical output to mytool v1.0, validated on the dataset in benchmarks/" is testable.
Document in the repository (not a blog post):
- Versions: exact version of every original tool validated against
- Commands: exact command lines used for comparison
- Data: datasets, methodology, and date
- Policy: how you handle upstream version updates
When the original releases a new version, you need a process: major bumps often change formats, patches may fix bugs affecting your outputs. Communicate your policy to users.
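The version pin can be enforced in the validation harness itself; a sketch, with the version-string format as an assumption to adapt to the real tool's `--version` output:

```python
import re

def check_validated_version(version_output, expected="1.0"):
    """Refuse to claim equivalence against an upstream version the
    validation suite has not covered. `version_output` is the text a
    tool prints for --version; the regex is a hypothetical format."""
    m = re.search(r"(\d+(?:\.\d+)+)", version_output)
    found = m.group(1) if m else "unknown"
    if found != expected:
        raise RuntimeError(
            f"outputs validated against v{expected}, but found v{found}; "
            "re-run the validation suite before trusting equivalence"
        )
    return found
```

Run at the top of the comparison suite, this turns "compatible with mytool" into the testable claim "bit-identical to mytool v1.0".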
[Diagram: validation matrix - tool-a (v0.9, v1.0, v1.1), tool-b (v2.1, v2.3, v2.4), tool-c (v0.8, v0.9, v1.0), with the validated version of each tool marked.]
Someone needs to be responsible for what comes next, and that should be visible before you release.
Plan for the long term:
- Triage issues as users find edge cases your test suite missed
- Re-validate when the original tool releases updates
- Expand benchmarks as new data types and edge cases surface
- Review PRs and grow a contributor community
Put governance in place before release: CONTRIBUTING.md, CHANGELOG.md, versioned tags with a changelog entry for every release, and a commitment to acknowledge issues.
The strongest argument for adopting a rewrite is zero migration cost. Identical outputs mean swapping into an existing pipeline is a one-line config change.
What identical means:
- File names: same output naming conventions
- Formats: same column headers, same ordering, same summary statistics
- Parsing: every MultiQC module and downstream script works unchanged
This reversibility makes adoption risk-free. Researchers running production pipelines cannot re-validate their entire analysis history every time they update a tool - but they will try a rewrite they can test on one run, confirm the outputs match, and roll back if needed.
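A sketch of the cheapest drop-in check - comparing the file-name manifests of an original run and a rewrite run (the directory-per-run layout is an assumption):

```python
import os

def manifest_diff(original_dir, rewrite_dir):
    """Drop-in compatibility check: both runs must emit exactly the
    same file names, or downstream parsers (e.g. MultiQC modules and
    scripts keyed on naming conventions) will break."""
    orig = set(os.listdir(original_dir))
    rew = set(os.listdir(rewrite_dir))
    return {"missing": sorted(orig - rew), "extra": sorted(rew - orig)}
```

An empty diff is the precondition for the one-line config swap; content comparison of the matched files then completes the test.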
The tools you are rewriting were built by the open-source community, validated against publicly funded data, and distributed freely. A rewrite that closes the source extracts value without contributing back.
Open source matters for adoption:
- Transparency: researchers need to inspect code to understand what it does to their data
- Accessibility: HPC clusters often don't have commercial software available
- Continuity: users can fork if the project is abandoned
Check licenses before you start. Permissive licenses (MIT, Apache-2.0, BSD) let you release under any license; copyleft (GPL) requires compatible licensing. Whether a reimplementation is a "derived work" depends on jurisdiction - err on the side of caution.
License archaeology is far easier before you ship than after.
Reproducibility is a fourth reason: pinned versions require access to the source.
Building a rewrite puts you in an unusual position: you've studied the original closely and found its edge cases. A reproducible bug report is a real contribution to science. But be careful - AI agents are confidently wrong in subtle ways.
Before filing upstream:
- Verify manually with the original tool on clean, well-understood data
- Rule out your reimplementation, malformed input, or a misread spec
- Never use AI-generated test cases as evidence
Do not automate bug reports. Open-source maintainers are usually volunteers - a firehose of issues burns out people whose work benefits everyone.
The bar for filing: you fully understand the problem yourself, you can create a minimal reproducible example using the upstream tool that demonstrates it, and you've considered whether you can also contribute a fix.
Before Reporting a Bug Upstream
[Flowchart: found a potential bug? Reproducible with the original tool? If no, fix it in your rewrite. Verified with real data? If no, verify first. Can you explain why it happens? If no, understand first. Only then report upstream.]