Rewrites.bio: 60x speedup in Genomics QC and AI rewrite principles for Science
Article URL: https://rewrites.bio/ · Comments: https://news.ycombinator.com/item?id=47616595 (7 points, 1 comment)
1.1 The opportunity
Bioinformatics datasets have grown faster than the tools that process them. The cost shows up in compute bills, carbon footprint, and time to results.
- Scale: A pipeline like nf-core/rnaseq runs on millions of samples per year. Ten wasted minutes per sample is decades of compute, annually.
- Environment: Wasted CPU-hours mean kilowatt-hours and CO₂. Rewrites running 10-50× faster can save hundreds of tonnes of emissions.
- Science: Hours waiting for results are hours not spent testing hypotheses or exploring new directions.
AI coding assistants now let a domain expert produce a working reimplementation in days rather than years. Fast code is now cheap. Scientific insight, validation, and trust are not.
Example: RustQC
Time per sample on a 10 GB BAM: traditional tools 15:34:00; RustQC 00:14:54.
At 100k samples/year: 63× faster, 1.5M CPU-hours saved, 98% less energy, ~150 t CO₂e saved.
1.2 The risk
Cheap code is not the same as correct code. AI-generated implementations are fluent and fast, but confidently wrong in ways that are easy to miss.
- Correctness: A rewrite that silently produces different results poisons every analysis built on it.
- Attribution: Ignoring the original authors' contributions damages the incentive structures that made the tool possible.
- Fragmentation: Competing rewrites with different behaviour leave users unsure which to trust. Abandoned projects block better alternatives.
The existing tools are correct, and that correctness is hard-won. Rewrites inherit a responsibility: to match outputs exactly, to credit the work they stand on, and to be honest about what they have and have not validated.
A wave of AI-assisted rewrites is coming. The question is not whether it will happen, but whether it will happen well.
Behind every bioinformatics tool is years of work: algorithm design, edge-case hunting, user feedback, and maintenance. A rewrite stands entirely on that foundation. Without the original, a rewrite cannot even be validated.
Make credit visible:
- README and documentation
- Papers and preprints
- Downstream reports using the rewrite's results
- Credits page with citations, DOIs, author lists, and versions
When users cite your rewrite, they should know to cite the original too. This is how science tracks provenance. Obscuring lineage undermines the incentives that produced the original work.
[Diagram: a paper using fastalign-rs, a rewrite built on the original tool slowalign (DOI 10.xxxx/xxxxxx), must cite the original; citing the rewrite itself is optional.]
The goal is a faster tool that produces the same results. A rewrite that changes outputs, even as "improvements", is a different tool. It cannot inherit the original's validation or be substituted without re-validation.
Emulate exactly means:
- Deterministic tools: byte-for-byte identical output files
- Floating-point tools: results within acceptable numerical precision (defined by scientists, not convenience)
- Everything counts: header formats, column ordering, file naming, summary statistics
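The two comparison modes above can be sketched as one checker. This is a minimal sketch, not tooling from the article: the whitespace-token parsing and the relative-tolerance handling are assumptions to adapt to the real output format.

```python
import math

def outputs_match(original_bytes, rewrite_bytes, float_tol=None):
    """Byte-for-byte by default (deterministic tools); token-by-token
    with a relative tolerance when float_tol is set (floating-point
    tools, with the tolerance defined by scientists, not convenience)."""
    if float_tol is None:
        return original_bytes == rewrite_bytes
    a = original_bytes.decode().split()
    b = rewrite_bytes.decode().split()
    if len(a) != len(b):
        return False
    for ta, tb in zip(a, b):
        try:
            if not math.isclose(float(ta), float(tb), rel_tol=float_tol):
                return False
        except ValueError:
            # non-numeric tokens (chromosome names, headers) must match exactly
            if ta != tb:
                return False
    return True
```

In practice you would read both output files as bytes and feed them in; headers, column ordering, and naming all flow through the same comparison.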
[Diagram: a good rewrite has the exact same shape as the original and fits the pipeline; a bad rewrite is close but the wrong shape and doesn't fit.]
When an AI writes the implementation, say so. Correctness is determined by output comparison, not by who wrote the code - but users should know how it was built.
Document clearly in the README and any paper:
- AI tools: which were used and what role they played
- Verification: how correctness was validated (e.g., "against mytool v1.0 on real sequencing data")
- Gaps: where validation coverage is thin or incomplete
Output comparison catches what you tested, not what you haven't. Commit to ongoing validation as new use cases emerge.
README.md - Provenance Section
AI Assistance Disclosure
This tool was written with the assistance of AI coding agents (Claude, GitHub Copilot).
Correctness is validated by comparing output against mytool v1.0 on a suite of real sequencing datasets - not by manual code review alone. The AI generated the implementation; humans defined the validation criteria and verified the results.
Badges: 🤖 AI-assisted · ✓ Validated vs mytool v1.0 · Validation methodology: github.com/user/mytool/VALIDATION.md
The biggest performance wins come from rethinking architecture, not just porting code. Modern pipelines chain discrete tools, each doing I/O. A straight port to a faster language captures only a fraction of what is available.
Questions to ask:
- Upstream: what preprocessing does this tool assume? Can that be folded in?
- Downstream: what immediately consumes the output? Can that be folded in too?
- Intermediate files: can steps share data in memory instead of writing gigabytes of temporary BAM files?
This doesn't mean building an unvalidatable monolith. But consider what a pipeline designed from scratch would look like, knowing where the bottlenecks are.
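A toy illustration of the fusion idea, assuming a hypothetical stream of sequence records: three statistics that chained tools would each re-read the input to compute, produced in one pass with nothing written to disk in between.

```python
def fused_qc_pass(records):
    """Single pass over an iterable of sequence strings (a stand-in for
    streaming BAM/FASTQ records), computing read count, total bases,
    and GC fraction together instead of in three separate tool runs."""
    n_reads = 0
    n_bases = 0
    n_gc = 0
    for seq in records:
        n_reads += 1
        n_bases += len(seq)
        n_gc += seq.count("G") + seq.count("C")
    gc_frac = n_gc / n_bases if n_bases else 0.0
    return {"reads": n_reads, "bases": n_bases, "gc": gc_frac}
```

The same shape scales to real pipelines: each fused stage consumes records from memory rather than re-parsing gigabytes of intermediate files.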
[Diagram: before - each tool reads the input file independently; after - the architecture is rethought into a single pass over the input.]
Thinking big about architecture does not mean building big in one go. The implementation strategy should be the opposite: small, tight iteration loops, each step validated against the original before moving on.
Why this matters:
- Localised debugging: when something is wrong, you already know where to look
- Continuous validation: AI agents are unreliable for correctness - validate every step
- Productive iteration: "output doesn't match on paired-end data" is actionable; "implement the whole pipeline" is not
Start with the simplest function that produces testable output. Validate it. Then extend.
[Flowchart: 01 pick one function → 02 write rewrite → 03 compare outputs; on mismatch, fix and re-compare; on match, move to the next function.]
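The loop above can be sketched as a small validation harness; in a real rewrite, `reference_fn` would typically wrap a call to the original tool (the names here are hypothetical):

```python
def validate_step(name, rewrite_fn, reference_fn, cases):
    """Run one rewritten function against the original implementation
    on a set of inputs; report the first mismatch so debugging stays
    localised to the step just written."""
    for case in cases:
        got, want = rewrite_fn(case), reference_fn(case)
        if got != want:
            return f"{name}: mismatch on {case!r}: got {got!r}, want {want!r}"
    return None  # all cases match -> safe to move to the next function
```

A returned message like "mismatch on paired-end record" is immediately actionable; a silent pass gates progress to the next function.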
Synthetic data is useful for quick iteration but insufficient for validation. Real sequencing data has error patterns and quality profiles that generators don't replicate.
Test comprehensively:
- Diversity: multiple organisms, platforms, and library preps
- Edge cases: empty files, single-read files, very large files
- Benchmarks: document hardware, dataset, exact commands, and output comparison
Speed claims need evidence. Document your hardware, dataset, and exact commands so anyone can reproduce the comparison. If performance varies across data types, show that too.
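One way to sketch the edge-case battery, assuming a hypothetical tool function that counts reads; a real suite would run the actual binaries on real files and compare full outputs:

```python
def run_edge_cases(tool_fn):
    """Exercise edge cases that synthetic data generators rarely cover.
    `tool_fn` is a hypothetical stand-in taking a list of reads and
    returning a read count; it must neither crash nor mis-count."""
    cases = {
        "empty file": [],
        "single read": ["ACGT"],
        "zero-length read": [""],
    }
    failures = []
    for label, reads in cases.items():
        try:
            if tool_fn(reads) != len(reads):
                failures.append(label)
        except Exception:
            failures.append(label)
    return failures
```

Empty inputs are a classic trap: a tool that is correct on every real dataset can still crash or report one read on a zero-read file.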
Validation coverage (example grid, each cell marked tested or not yet):
- Organisms: Hs, Mm, Dm, Dr, At, Ce
- Platforms: A, B, C, D, E
- Hardware: x86, ARM, HPC, GPU
- Edge cases: empty, 1 read, 50 GB, PE, SE, bad
Feature completeness is not the goal - output correctness on the features that matter is. Most tools have accumulated flags and edge cases over years; not all are used in your target pipeline.
A focused approach:
- Audit usage: identify exactly which capabilities your pipeline uses
- Fail loudly: for unsupported features, error with a clear message pointing to the original
- Never ignore: silently dropping flags or producing wrong output breaks trust
A rewrite that does four things correctly is more valuable than one that claims fifteen and does twelve right. Scope can expand as the project matures; it cannot easily contract once users depend on it.
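A minimal sketch of the fail-loudly pattern; the supported set and the error text are illustrative assumptions, not a real CLI:

```python
import sys

# scope fixed by auditing exactly what the target pipeline uses
SUPPORTED = {"sort", "index", "flagstat", "view"}

def dispatch(subcommand):
    """Accept only audited subcommands; for anything else, error with a
    clear message pointing users to the original tool rather than
    silently ignoring the flag or producing wrong output."""
    if subcommand not in SUPPORTED:
        sys.exit(
            f"error: '{subcommand}' is not supported by this rewrite; "
            f"please use the original tool for this subcommand"
        )
    return subcommand
```

The non-zero exit makes pipelines halt immediately instead of carrying a silently wrong file downstream.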
Scope Definition - samtools rewrite (example)
- ✓ sort: coordinate & name sort
- ✓ index: BAI generation
- ✓ flagstat: alignment statistics
- ✓ view: format conversion
- ○ mpileup: not used in pipeline
- ○ merge: handled upstream
- ○ fasta/fastq: not in critical path
- ○ depth: deferred to v2
Every equivalence claim is implicitly about a specific version. "Compatible with mytool" is meaningless. "Bit-identical output to mytool v1.0, validated on the dataset in benchmarks/" is testable.
Document in the repository (not a blog post):
- Versions: exact version of every original tool validated against
- Commands: exact command lines used for comparison
- Data: datasets, methodology, and date
- Policy: how you handle upstream version updates
When the original releases a new version, you need a process: major bumps often change formats, patches may fix bugs affecting your outputs. Communicate your policy to users.
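The version pin can be enforced in the validation harness itself; a sketch, with the version-string format as an assumption to adapt to the real tool's `--version` output:

```python
import re

def check_validated_version(version_output, expected="1.0"):
    """Refuse to claim equivalence against an upstream version the
    validation suite has not covered. `version_output` is the text a
    tool prints for --version; the regex is a hypothetical format."""
    m = re.search(r"(\d+(?:\.\d+)+)", version_output)
    found = m.group(1) if m else "unknown"
    if found != expected:
        raise RuntimeError(
            f"outputs validated against v{expected}, but found v{found}; "
            "re-run the validation suite before trusting equivalence"
        )
    return found
```

Run at the top of the comparison suite, this turns "compatible with mytool" into the testable claim "bit-identical to mytool v1.0".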
[Diagram: validation matrix - tool-a (v0.9, v1.0, v1.1), tool-b (v2.1, v2.3, v2.4), tool-c (v0.8, v0.9, v1.0), with the validated version of each tool marked.]
Someone needs to be responsible for what comes next, and that should be visible before you release.
Plan for the long term:
- Triage issues as users find edge cases your test suite missed
- Re-validate when the original tool releases updates
- Expand benchmarks as new data types and edge cases surface
- Review PRs and grow a contributor community
Put governance in place before release: CONTRIBUTING.md, CHANGELOG.md, versioned tags with a changelog entry for every release, and a commitment to acknowledge issues.
The strongest argument for adopting a rewrite is zero migration cost. Identical outputs mean swapping into an existing pipeline is a one-line config change.
What identical means:
- File names: same output naming conventions
- Formats: same column headers, same ordering, same summary statistics
- Parsing: every MultiQC module and downstream script works unchanged
This reversibility makes adoption risk-free. Researchers running production pipelines cannot re-validate their entire analysis history every time they update a tool - but they will try a rewrite they can test on one run, confirm the outputs match, and roll back if needed.
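A sketch of the cheapest drop-in check - comparing the file-name manifests of an original run and a rewrite run (the directory-per-run layout is an assumption):

```python
import os

def manifest_diff(original_dir, rewrite_dir):
    """Drop-in compatibility check: both runs must emit exactly the
    same file names, or downstream parsers (e.g. MultiQC modules and
    scripts keyed on naming conventions) will break."""
    orig = set(os.listdir(original_dir))
    rew = set(os.listdir(rewrite_dir))
    return {"missing": sorted(orig - rew), "extra": sorted(rew - orig)}
```

An empty diff is the precondition for the one-line config swap; content comparison of the matched files then completes the test.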
The tools you are rewriting were built by the open-source community, validated against publicly funded data, and distributed freely. A rewrite that closes the source extracts value without contributing back.
Open source matters for adoption:
- Transparency: researchers need to inspect code to understand what it does to their data
- Accessibility: HPC clusters often don't have commercial software available
- Continuity: users can fork if the project is abandoned
Check licenses before you start. Permissive licenses (MIT, Apache-2.0, BSD) let you release under any license; copyleft (GPL) requires compatible licensing. Whether a reimplementation is a "derived work" depends on jurisdiction - err on the side of caution.
License archaeology is far easier before you ship than after.
Reproducibility is a fourth reason: pinned versions require access to the source.
Building a rewrite puts you in an unusual position: you've studied the original closely and found its edge cases. A reproducible bug report is a real contribution to science. But be careful - AI agents are confidently wrong in subtle ways.
Before filing upstream:
- Verify manually with the original tool on clean, well-understood data
- Rule out your reimplementation, malformed input, or a misread spec
- Never use AI-generated test cases as evidence
Do not automate bug reports. Open-source maintainers are usually volunteers - a firehose of issues burns out people whose work benefits everyone.
The bar for filing: you fully understand the problem yourself, you can create a minimal reproducible example using the upstream tool that demonstrates it, and you've considered whether you can also contribute a fix.
Before Reporting a Bug Upstream
[Flowchart: found a potential bug? Reproducible with the original tool? If no, fix it in your rewrite. Verified with real data? If no, verify first. Can you explain why it happens? If no, understand first. Only then report upstream.]