Joseph Rich

Genomics Is Not NLP: A Field Guide for ML Scientists

2026-06-03T00:00:00+00:00

Why a language-model expert’s intuitions misfire

DNA is the most beguiling analogy in all of machine learning. It is a string. It is written in a tiny alphabet. You read it left to right. It has motifs that look like words, genes that look like sentences, and a “grammar” that biologists have spent a century annotating. If you have trained transformers on text, the leap to genomics feels like a short one — same architecture, new corpus.

It is not a short one. The architectures really do carry over (transformers, state-space models, masked-language-model pre-training, increasingly genomic “foundation models”), which is exactly what makes the analogy dangerous: it hides everything that is different. This post is a field guide for machine-learning scientists moving from natural language into genomics and transcriptomics. What changes is not the network. It is the statistics of the signal, the meaning of a “token,” the fact that the entire species shares essentially one sequence, the biology you must encode to read a single variant, and — the part that quietly sinks most projects — the molecule you can cheaply measure is not the molecule that actually does anything.

A running theme, the mirror image of the one I used for radiology: some things here are genuinely easier than language, and a few are catastrophically harder. (For the imaging counterpart of this argument, see Radiology AI Is Not Computer Vision.) Knowing which is which is the difference between a model that tops a benchmark and one that says something true about biology.

The alphabet is tiny — and stranger than text

Start with the surface, because the surface is where the false comfort lives.

Line up like with like before drawing the comparison. The right counterpart to DNA’s four letters ($\{A, C, G, T\}$, with RNA swapping $T$ for $U$) is not an NLP tokenizer’s 32,000–100,000-token vocabulary but the 26 letters of written English: both are the raw character set from which everything else is assembled. On that axis DNA’s alphabet is merely small, not exotic. The comparison only gets interesting one level up, at the word, and there the genome’s closest analogue is the gene, of which humans have only ~20,000 protein-coding ones (Section 7), each with a coding sequence of a thousand or more letters, and a genomic span often far longer, rather than five. So a four-letter alphabet sounds like a gift, and for sheer per-symbol modeling capacity it is. But the comfort is misplaced, because the difficulty was never in the alphabet. Three wrinkles make even the alphabet less clean than it looks.

It is not really four symbols. Real sequence files are littered with N, the “any base” placeholder for positions the sequencer could not call, and with the full IUPAC ambiguity set (R = A or G, Y = C or T, W, S, K, M, and so on) for positions known to be one of a subset. On top of that, reference genomes use case to encode a second channel: lowercase acgt marks soft-masked regions — repeats and low-complexity sequence flagged by tools like RepeatMasker — while uppercase marks the rest. So a and A are the same base carrying different metadata. A tokenizer that uppercases everything silently discards a curated annotation; one that treats a and A as distinct tokens doubles the alphabet for the wrong reason. Neither is obviously right, and the choice is yours to make consciously.
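
One way to make that choice explicit is to keep case as a separate input channel instead of either discarding it or folding it into the vocabulary. The sketch below is only an illustration: the five-symbol vocabulary `BASE_TO_ID` and the helper `encode_with_mask` are hypothetical names, not any library's API, and collapsing every IUPAC ambiguity code to `N` is itself a lossy simplification.

```python
# Hypothetical vocabulary: four bases plus N. All other IUPAC codes are
# collapsed to N here, which throws away the "one of a subset" information.
BASE_TO_ID = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def encode_with_mask(seq: str) -> tuple[list[int], list[int]]:
    """Return (base_ids, softmask_flags) for a DNA string.

    The base identity ignores case; the soft-mask annotation is kept as a
    parallel 0/1 channel so 'a' and 'A' share a token but not their metadata.
    """
    ids, mask = [], []
    for ch in seq:
        ids.append(BASE_TO_ID.get(ch.upper(), BASE_TO_ID["N"]))
        mask.append(1 if ch.islower() else 0)
    return ids, mask


# Mixed-case input with an N and an IUPAC code (R = A or G).
ids, mask = encode_with_mask("ACgtNRa")
print(ids)   # [0, 1, 2, 3, 4, 4, 0]
print(mask)  # [0, 0, 1, 1, 0, 0, 1]
```

Whether a downstream model should consume the mask channel at all is a modeling decision; the point is only that the encoding step does not make it for you silently.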

The hard axis is length, not alphabet. This inverts the usual NLP scaling axis: there is almost nothing to learn about the alphabet; all the difficulty is in the length. The haploid human genome is about $3.2 \times 10^{9}$ base pairs: a single “document” of 3.2 billion characters, more than three orders of magnitude longer than even a million-token LLM context window. This is why tokenization is a live research question in genomics in a way it is not for English: single-nucleotide tokens give you faithful resolution but punishing sequence lengths; $k$-mer tokens (e.g. 6-mers) or byte-pair encodings shorten the sequence but blur the single-base substitutions that often are the signal. The thing you most want to detect, a one-letter change, is the thing aggressive tokenization destroys.
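
The trade-off is easy to see on a toy string. The sketch below assumes non-overlapping $k$-mers (one of several possible schemes; overlapping $k$-mers and learned BPE behave differently), and `kmer_tokens` and `n_changed` are illustrative helper names. It applies one substitution and counts how many tokens change at $k = 1$ and $k = 6$.

```python
def kmer_tokens(seq: str, k: int) -> list[str]:
    """Split seq into non-overlapping k-mers, dropping any trailing remainder."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]


def n_changed(a: str, b: str, k: int) -> int:
    """Number of token positions at which the two tokenizations differ."""
    return sum(x != y for x, y in zip(kmer_tokens(a, k), kmer_tokens(b, k)))


ref = "ACGTACGTACGTACGTAC"          # 18 bases
alt = ref[:7] + "A" + ref[8:]       # single substitution at index 7 (T -> A)

print(len(kmer_tokens(ref, 1)), n_changed(ref, alt, 1))  # 18 1
print(len(kmer_tokens(ref, 6)), n_changed(ref, alt, 6))  # 3 1
print(kmer_tokens(ref, 6)[1], kmer_tokens(alt, 6)[1])    # GTACGT GAACGT
```

At $k = 6$ the sequence is six times shorter, but the variant now appears as a swap between two unrelated vocabulary entries; nothing in the token IDs says that `GTACGT` and `GAACGT` differ by one base, so the model has to learn that relationship from data.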

The string has a symmetry text does not. DNA is double-stranded, and the two strands carry the same information in complementary, reverse-ordered form: the reverse complement of 5'-GATTACA-3' is 5'-TGTAATC-3'. A gene can live on either strand, so a motif and its reverse complement are often biologically equivalent — an equivariance with no analogue in natural language, where reading a sentence backwards through a letter-substitution cipher is gibberish. Good genomic models build in reverse-complement equivariance (or augment with it); naive ones waste capacity relearning each motif twice. Relatedly, strand bias — the two strands accruing mutations or being sequenced at different rates — is a real artifact you must model, not a nuisance you can normalize away.
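
As a minimal sketch of the symmetry, the helper below computes a reverse complement and a "canonical" form (the lexicographically smaller of a sequence and its reverse complement), a convention often used so that both strands map to one key. The function names are illustrative, and the translation table handles only A/C/G/T/N in either case.

```python
# Complement table for the plain bases plus N, preserving case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(seq: str) -> str:
    """Strand-independent key: the smaller of seq and its reverse complement."""
    return min(seq, reverse_complement(seq))


print(reverse_complement("GATTACA"))                 # TGTAATC
print(canonical("GATTACA") == canonical("TGTAATC"))  # True
```

The same function serves as a data augmentation (feed both orientations) or as a check that a model's predictions agree across strands where the biology says they should.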

Almost every genome is the same genome

Here is the single biggest statistical difference from a text corpus, and it runs exactly opposite to intuition. Two English documents pulled at random share almost nothing at the token level. Two human genomes are 99.9% identical.

Pairwise nucleotide diversity in humans is about $\pi \approx 0.001$: roughly one site in a thousand differs between any two people. Across $3.2 \times 10^{9}$ bases, a typical genome carries on the order of 4–5 million sites that differ from the reference — which sounds like a lot until you remember it is $0.1\%$ of the sequence. The other $99.9\%$ is, position for position, the same book. Figure 1a puts this on a log axis against the divergence you would see between species ($\sim 1.2\%$ human–chimpanzee) and against text, where two unrelated documents differ at essentially every token.

Why so uniform? Human genetic variation is shallow because the population that gave rise to everyone alive passed through a long period of small effective population size ($N_e$ on the order of $10^{4}$) and a series of out-of-Africa bottlenecks tens of thousands of years ago (conventionally placed $\sim 50{,}000\text{–}70{,}000$ years ago). The practical consequence for a modeler is profound: the common variants you see are old and shared. They are ancestral polymorphisms that predate the bottleneck, inherited by everyone and merely reshuffled into new combinations by recombination each generation. There is not a fresh, independent draw of variation per person; there is one ancestral deck, dealt and re-dealt. On top of that shared deck, each newborn carries only dozens of brand-new mutations — about 70 de novo single-nucleotide variants per generation, from a per-base mutation rate near $1.2 \times 10^{-8}$. So a person is: the ancestral common variants (shuffled) $+$ a small private set of rare and de novo ones.
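
The two headline numbers in this section follow from one-line products. The sketch below uses only figures quoted above; the factor of 2 for de novo mutations is an assumption that the per-base rate applies to both haploid copies a child inherits.

```python
GENOME_BP = 3.2e9      # haploid genome length quoted above
PI = 1e-3              # pairwise nucleotide diversity
MU = 1.2e-8            # per-base, per-generation mutation rate

# Expected differing sites between two haploid genomes.
pairwise_diffs = PI * GENOME_BP

# Expected new single-nucleotide mutations in one diploid newborn
# (assumption: the rate applies independently to both inherited copies).
de_novo = 2 * GENOME_BP * MU

print(f"{pairwise_diffs:.2e} differing sites")   # 3.20e+06 differing sites
print(f"{de_novo:.0f} de novo mutations")        # 77 de novo mutations
```

The product lands in the high seventies, the same "dozens" scale as the figure of about 70 quoted above; the exact value depends on the callable genome fraction and on parental age, neither of which this arithmetic models.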

There is a second subtlety in which differences you are counting (Figure 1b). Count variation by events and single-nucleotide variants (SNVs) and small indels dominate — millions of them. Count it by bases affected and the picture flips: a typical genome harbors only a couple of thousand structural variants (deletions, duplications, inversions, insertions of mobile elements), but those rearrange roughly 20 Mb of sequence — far more nucleotide content than all the SNVs combined. Structural variants are simultaneously the largest source of differing bases and the least studied, because they are hard to call from short reads and awkward to represent against a single linear reference. Much of the “missing” signal in genomics lives in exactly the variation our tooling sees worst.
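
The flip between counting events and counting bases is simple arithmetic. The sketch below takes the midpoint of the 4–5 million small-variant range, treats each as touching one base (an approximation, since that count includes small indels), and uses the round figures of about 2,000 structural variants and 20 Mb quoted above.

```python
small_variant_events = 4.5e6   # midpoint of the 4-5 million range
sv_events = 2_000              # "a couple of thousand" structural variants
sv_bases = 20e6                # roughly 20 Mb affected by structural variants

# Assumption: one base affected per small variant.
small_variant_bases = small_variant_events * 1

event_ratio = small_variant_events / sv_events   # small variants per SV
base_ratio = sv_bases / small_variant_bases      # SV bases per small-variant base
mean_sv_length = sv_bases / sv_events            # average bases per SV

print(f"events: small variants outnumber SVs {event_ratio:.0f}-fold")
print(f"bases:  SVs cover {base_ratio:.1f}x more sequence")
print(f"mean SV length: {mean_sv_length:.0f} bp")
```

On these round numbers small variants win by three orders of magnitude on events and lose by a factor of about four on bases, with the average structural variant spanning on the order of 10 kb.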

For an ML scientist, the redundancy is not a curiosity — it is a data-leakage hazard that dwarfs anything in NLP:

Your training and test “documents” are near-identical by construction. A random train/test split over individuals leaves the two sets sharing $99.9\%$ of their sequence and the overwhelming majority of their common variants. A model can score brilliantly by memorizing ancestral haplotypes that appear on both sides of the split.
Relatedness and population structure are confounders, not noise. Cryptic relatives, or simply two people of shared ancestry, share long haplotype blocks. This is the genomic analogue of a radiology model learning the scanner instead of the disease: here the model learns ancestry and launders it as signal. Split by family and, where the question demands it, by population; correct for structure with principal components or mixed models; and never trust an evaluation that could be won by recognizing where someone’s grandparents were born.
The reference itself is a bias. Mapping everyone to one linear reference genome systematically mishandles the variants — especially structural ones — that the reference happens not to contain. Pangenome graph references exist to fight this; ignoring it bakes a population-specific blind spot into your inputs.
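
The first two points translate directly into how the split is constructed. The sketch below is a minimal, standard-library version of a group-aware split: whole families go to one side or the other, never both. The sample and family IDs are made up, `split_by_family` is an illustrative name, and the test fraction applies to families, not to individuals. It does nothing about population structure, which needs the ancestry-aware steps described above.

```python
import random
from collections import defaultdict


def split_by_family(sample_to_family, test_fraction=0.2, seed=0):
    """Assign whole families to train or test so relatives never straddle the split."""
    families = defaultdict(list)
    for sample, family in sample_to_family.items():
        families[family].append(sample)

    family_ids = sorted(families)                 # deterministic order before shuffling
    random.Random(seed).shuffle(family_ids)
    n_test = max(1, round(test_fraction * len(family_ids)))

    test = [s for f in family_ids[:n_test] for s in families[f]]
    train = [s for f in family_ids[n_test:] for s in families[f]]
    return train, test


# Hypothetical cohort: six samples in four families.
sample_to_family = {
    "s1": "famA", "s2": "famA", "s3": "famB",
    "s4": "famC", "s5": "famC", "s6": "famD",
}
train, test = split_by_family(sample_to_family, test_fraction=0.25, seed=1)

train_fams = {sample_to_family[s] for s in train}
test_fams = {sample_to_family[s] for s in test}
print(train_fams.isdisjoint(test_fams))  # True
```

Cryptic relatedness is not recorded in a pedigree file, so in practice the family labels would themselves come from a kinship estimate; that step is outside this sketch.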

Most of the genome is unread — and the labels need a wet lab or a patient

In a text corpus, every token means something to a competent human reader; the meaning is in-distribution for the annotator. Genomics is not like this. Only about $1$–$2\%$ of the human genome codes for protein. The rest is introns, regulatory elements, structural and repetitive sequence, and vast stretches whose function is genuinely unknown. The ENCODE project assigned “biochemical activity” to most of the genome, but biochemical activity is not the same as function, and the fraction of the genome whose role we can actually read off remains small. Most of the book is written in a language we have only begun to decode.

This reshapes what “supervised learning” can even mean, because the labels are the bottleneck, just as annotation is in radiology — and for a deeper reason than cost:

Function is not in the sequence the way meaning is in the text. You cannot look at a 200-base enhancer and read its effect the way you can read a sentence. Establishing what a non-coding region does requires a wet-lab experiment — a massively parallel reporter assay, a CRISPR perturbation screen, a knockout — or a population-scale association to a phenotype. The label lives in an experiment or in a patient, not in the characters.
“Ground truth” is often officially uncertain. Clinical variant databases enshrine this honesty: a large share of catalogued variants are Variants of Uncertain Significance (VUS) — we have seen them but cannot say whether they cause disease. Pathogenicity calls follow formal guidelines (ACMG/AMP), yet different labs reach conflicting classifications for the same variant often enough that reconciling them is its own field. If your training labels are pathogenic/benign calls, you are inheriting both the biology’s uncertainty and the curators’ disagreement — the genomic version of inter-reader variability.
You cannot eyeball the data. An NLP engineer can sanity-check a labeling pipeline by reading examples. Almost no one can look at a stretch of intron and tell whether a splice-site annotation is right. As in radiology, you need a biologist in the loop continuously, because the data-cleaning decisions (which transcripts to keep, how to treat a multi-mapping read, what counts as “expressed”) are biological judgments wearing a data-engineering disguise.

Meaning travels: cis, trans, and the limits of a context window

The defining structural fact of language modeling over the last few years has been the expanding context window — from a few thousand tokens to a million and beyond — on the premise that if a dependency exists, a long enough window will span it. Genomics tempts you to apply the same logic, and to a point it works: the state-of-the-art regulatory models read enormous windows. Enformer takes about $200\,\mathrm{kb}$ of sequence; DeepMind’s 2025 AlphaGenome ingests up to $\sim 1\,\mathrm{Mb}$ — a literal one-million-base context window — to predict regulatory activity. The parallel to long-context LLMs is exact, and deliberate.

But a linear window, however long, runs into a wall that has no NLP analogue, because genomic regulation is not confined to a line:

Cis-regulation is long-range but at least on-chromosome. Enhancers routinely act over hundreds of kilobases, skipping past nearer genes to their true targets. A $1\,\mathrm{Mb}$ window is a real attempt to capture this — and it captures a lot of it.
The genome is folded in 3D. Promoters and enhancers are brought into contact by chromatin looping within topologically associating domains. Two elements far apart on the sequence can be physically adjacent in the nucleus. Linear distance in your input is not regulatory distance in the cell.
Trans-regulation breaks the line entirely. A transcription factor encoded on chromosome 1 diffuses through the nucleus and binds targets on every chromosome. Trans-eQTLs — variants that affect the expression of genes far away, often on other chromosomes — are exactly this. No sliding window over a single locus, of any length, can see a regulator that lives on a different chromosome and acts through a protein intermediate that is not in the input sequence at all.

That last clause is the crux. In language, the relevant context is always more text; a bigger window is the right tool. In genomics, the relevant context is frequently a diffusible molecule, a 3D contact, or a cell-state variable that the DNA sequence does not contain. The state that determines what a sequence does is not fully written in the sequence. Stretching the context window from $200\,\mathrm{kb}$ to $1\,\mathrm{Mb}$ is genuine progress on the cis problem and buys nothing on the trans problem. Be precise about which one your model is actually solving.

You cannot read the data without biology

This is the section a language modeler is most tempted to skip and least able to afford skipping. A handful of biological facts are not background color; they change what a model must represent to be correct.

The reading frame and the genetic code. Protein-coding sequence is read in non-overlapping triplets (codons). With four letters, there are $4^3 = 64$ codons mapping onto 20 amino acids plus stop — a degenerate code, so several codons specify the same amino acid (most often differing in the third “wobble” position). Three consequences follow immediately: the code is frame-dependent (an insertion or deletion not divisible by three causes a frameshift that garbles everything downstream); the same protein can be written many ways; and a model operating on raw nucleotides has to learn a triplet structure that is given, not discovered.
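
A few lines of code make the frame dependence concrete. The sketch below builds the standard genetic code from its conventional compact form (bases ordered T, C, A, G) and translates a short made-up coding sequence, then repeats the translation after a one-base deletion and after a three-base deletion. `translate` is an illustrative helper that assumes uppercase A/C/G/T input already in frame.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering; '*' marks stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(cds: str) -> str:
    """Translate in non-overlapping triplets from position 0; drop a trailing partial codon."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


cds = "ATGGCCAAGTGA"                 # toy sequence: Met-Ala-Lys-stop
frameshift = cds[:3] + cds[4:]       # delete one base: everything downstream shifts
in_frame = cds[:3] + cds[6:]         # delete one whole codon: frame preserved

print(len(CODON_TABLE), len(set(CODON_TABLE.values())))  # 64 21
print(translate(cds))         # MAK*
print(translate(frameshift))  # MPS
print(translate(in_frame))    # MK*
```

The one-base deletion rewrites every downstream residue and loses the stop codon, while the three-base deletion removes a single amino acid and leaves the rest intact. A nucleotide-level model has to recover this period-three structure from data unless it is built in.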

Silent does not mean neutral. Because the code is degenerate, a single-base change can be synonymous (“silent”) — it leaves the amino acid unchanged. The naive inference is that synonymous variants do not matter. They often do: they can alter codon-usage and translation efficiency, mRNA folding and stability, and — crucially — they can create or destroy splice signals. The hierarchy a model should encode is synonymous / missense / nonsense, but with the explicit caveat that “synonymous” is a statement about the protein sequence, not about function.
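
The protein-level hierarchy can be computed mechanically, which is exactly why it is tempting to over-trust it. The sketch below classifies a single-base change within one codon using the standard genetic code; `classify` and the `"stop-lost"` label for changes to a stop codon are illustrative choices. Its output is a statement about the encoded amino acid only and says nothing about splicing, mRNA stability, or translation efficiency.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering; '*' marks stop.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify(codon: str, pos: int, alt_base: str) -> str:
    """Label a substitution at codon position pos (0-2) by its effect on the amino acid."""
    alt_codon = codon[:pos] + alt_base + codon[pos + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"      # same amino acid; NOT a claim of no functional effect
    if alt_aa == "*":
        return "nonsense"        # premature stop
    if ref_aa == "*":
        return "stop-lost"       # original stop codon removed
    return "missense"            # different amino acid


print(classify("GAG", 2, "A"))  # synonymous (GAG -> GAA, both Glu)
print(classify("GAG", 0, "T"))  # nonsense   (GAG -> TAG, stop)
print(classify("GAG", 1, "T"))  # missense   (GAG -> GTG, Glu -> Val)
```

A label of `synonymous` from this function is best treated as one input feature, not as a reason to filter the variant out.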

Splicing and RNA processing. The path from gene to message is not a copy. A pre-mRNA is spliced (introns removed, exons joined), then capped at the $5'$ end and polyadenylated at the $3'$ end. A variant deep inside an intron, far from any coding base, can create a cryptic splice site and ruin a protein; a variant at an exon boundary can cause exon skipping. This is why “distance to the nearest coding base” is a terrible proxy for “importance,” and why models that ignore splicing miss an entire mechanism of disease.

Driver versus passenger. In cancer genomics the problem is explicitly a signal-detection one. A tumor genome accumulates thousands of somatic mutations, the vast majority of which are passengers — along for the ride, biologically inert. A handful are drivers that actually confer growth advantage. Distinguishing the few drivers from the many passengers, against a mutational background that varies across the genome, is the central inference task — the genomic needle-in-a-haystack.

Multiple testing is not optional. When you test millions of variants for association with a trait, or tens of thousands of genes for differential expression, the number of hypotheses is so large that uncorrected $p$-values are meaningless. This is why genome-wide association studies adopted a genome-wide significance threshold of $p < 5 \times 10^{-8}$ — essentially a Bonferroni correction for the $\sim 10^{6}$ independent common-variant tests across the genome. We return to the arithmetic in Figure 3b; for now, internalize that a “significant” hit at $p = 10^{-3}$ is, genome-wide, almost certainly noise.
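
The threshold and the warning about $p = 10^{-3}$ are both one-line calculations. The sketch below assumes a family-wise error rate of $0.05$ and the $\sim 10^{6}$ effectively independent tests quoted above, and it computes the expected number of hits at a lenient cutoff under the global null (no true associations, uniformly distributed $p$-values).

```python
alpha = 0.05               # assumed family-wise error rate
n_tests = 1_000_000        # ~10^6 effectively independent common-variant tests

# Bonferroni: divide the error budget evenly across tests.
genome_wide_threshold = alpha / n_tests

# Under the global null, p-values are uniform, so the expected number of
# tests falling below a cutoff is simply n_tests * cutoff.
lenient_cutoff = 1e-3
expected_null_hits = n_tests * lenient_cutoff

print(f"threshold: {genome_wide_threshold:.0e}")                              # threshold: 5e-08
print(f"expected null hits at p < 1e-3: {expected_null_hits:.0f}")  # expected null hits at p < 1e-3: 1000
```

With around a thousand hits expected from noise alone, a single variant at $p = 10^{-3}$ carries essentially no evidence by itself.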

Sequence similarity runs far above chance — and that is the whole point. Here is the calculation that should reframe how a string modeler thinks about DNA. Under a naive model of i.i.d. uniform bases, a specific $k$-mer is expected to occur $G \cdot 4^{-k}$ times in a genome of $G$ bases:

\[\mathbb{E}[\text{occurrences}] = G \cdot 4^{-k}.\]

For $G = 3.2 \times 10^{9}$, this crosses $1$ near $k \approx 16$ and collapses fast (Figure 2). At the $k = 31$ that bioinformatics tooling routinely uses for exact matching, the expected number of chance occurrences of a given 31-mer is

\[3.2\times 10^{9} \cdot 4^{-31} \approx 7 \times 10^{-10},\]

i.e. effectively never. And yet conserved 31-mers are shared constantly — between two people, between human and mouse, across hundreds of millions of years of divergence. The naive random-sequence model predicts these shared long $k$-mers should not exist; they exist anyway, by a factor of a billion. That gap is biology: purifying selection conserving functional sequence, preserved RNA secondary structure constraining which substitutions are tolerated, conserved amino-acid motifs (with synonymous wobble underneath), and repetitive elements copied across the genome. The lesson is that string coincidence is the wrong null. When two sequences match more than chance allows, that excess is the signal — homology, conservation, selection — and a model that treats DNA as a random string will systematically misread it.
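
The numbers in this calculation are easy to reproduce. The sketch below evaluates $G \cdot 4^{-k}$ under the same i.i.d. uniform-base assumption, solves for the $k$ at which the expectation equals one, and evaluates the $k = 31$ case; `expected_occurrences` is an illustrative name.

```python
import math

G = 3.2e9  # haploid genome length used throughout


def expected_occurrences(k: int, genome_size: float = G) -> float:
    """Expected count of one specific k-mer in a random i.i.d. uniform sequence."""
    return genome_size * 4.0 ** (-k)


# Solve G * 4^-k = 1 for k.
crossover_k = math.log(G, 4)

print(f"crossover near k = {crossover_k:.2f}")          # crossover near k = 15.79
print(f"k=15: {expected_occurrences(15):.2f}")          # k=15: 2.98
print(f"k=16: {expected_occurrences(16):.2f}")          # k=16: 0.75
print(f"k=31: {expected_occurrences(31):.1e}")          # k=31: 6.9e-10
```

Any 31-mer observed to be shared between two genomes is therefore about nine orders of magnitude more common than this null predicts, which is the gap the paragraph above attributes to biology.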

The unit problem: long genes, many transcripts, slippery semantics

In language the semantic unit is convenient: a word is a few characters, a sentence a few dozen words, and meaning is reasonably local and human-readable. The genomic “word” is nothing like this.

Genes are long, and their meaning is delocalized. A protein-coding sequence is typically on the order of $1$–$2\,\mathrm{kb}$ (encoding a protein of roughly $375$ amino acids, with wide spread across the proteome), but the gene (exons plus the introns between them) frequently spans tens to hundreds of kilobases; dystrophin spans about $2.2\,\mathrm{Mb}$. The information that specifies one protein is scattered across a huge genomic interval, interrupted by introns, and its realized “meaning” depends on cell type, developmental stage, and regulatory state. Compared with a five-letter English word whose meaning is right there on the page, the semantics of a gene are spread out, context-dependent, and much harder to capture in a fixed embedding.

One gene is many messages. It is common shorthand that humans have “about 20,000 genes” — and the protein-coding count, $\sim 19{,}900\text{–}20{,}000$ in GENCODE, is indeed remarkably small. But that number badly understates the functional vocabulary, because alternative splicing lets a single gene produce many distinct transcripts (isoforms). GENCODE annotates well over $200{,}000$ transcripts — an order of magnitude more than genes — and a single gene can yield dozens of isoforms with different, sometimes opposing, functions. So the mapping from “gene” to “thing that acts” is one-to-many, and a transcriptomic model that collapses expression to the gene level is averaging over functionally distinct products. The right unit is frequently the transcript, not the gene — and transcript-level labels are scarcer and noisier.

The molecule you sequence is not the molecule that acts

Now the deepest mismatch, and the one most likely to invalidate a confident conclusion. Proteins do the work of the cell — they catalyze, signal, transport, and build. DNA is the blueprint and RNA the working copy, but the actors are proteins. And yet the overwhelming majority of “expression” data, and nearly all of the trendy single-cell atlases, measure RNA, not protein. We routinely study the script and infer the performance.

That inference is shakier than the field’s habits suggest. Across many careful studies, the correlation between a gene’s mRNA level and its protein level is moderate at best: typically a Spearman $\rho$ in the $0.4$–$0.6$ range, and lower still when you look at changes over time rather than steady-state across genes. Schwanhäusser and colleagues found mRNA explained well under half the variance in protein abundance; Vogel and Marcotte, Liu, Beyer and Aebersold, and Buccitelli and Selbach all converge on the same message: translation rates, protein half-lives, and post-translational regulation drive a large share of protein levels that mRNA simply does not see. Edfors and colleagues showed the relationship is gene-specific: each gene has roughly its own mRNA-to-protein conversion factor, so a single global conversion factor is wrong for most individual genes. Figure 3a illustrates the consequence: even at the optimistic end of that range, knowing a gene’s mRNA leaves its protein level uncertain across a wide band.
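
To put a rough size on that band, the sketch below converts a correlation into variance explained and residual spread. It makes a loud simplifying assumption: it treats the quoted rank correlation as if it were a Pearson correlation between standardized log-abundances under a linear model, which is not what Spearman's $\rho$ measures. Read it as an order-of-magnitude guide only.

```python
import math


def variance_explained(r: float) -> float:
    """Fraction of protein variance a linear fit on mRNA would explain (r squared)."""
    return r ** 2


def residual_sd_fraction(r: float) -> float:
    """Residual standard deviation as a fraction of the total protein SD."""
    return math.sqrt(1.0 - r ** 2)


for r in (0.4, 0.6):
    print(
        f"r = {r}: explains {variance_explained(r):.0%} of variance, "
        f"residual SD is {residual_sd_fraction(r):.0%} of total"
    )
```

Even at $r = 0.6$ the residual spread is 80% of what you would have without measuring mRNA at all, so under these assumptions the prediction interval barely narrows.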

So why does the field overwhelmingly sequence RNA if protein is what matters? Not because anyone thinks RNA is the better readout — because of technology:

RNA can be amplified; protein cannot. Reverse transcription plus PCR turns a handful of molecules into a sequenceable library, so RNA-seq reaches single-cell and even single-molecule sensitivity. There is no PCR for proteins — no way to exponentially copy a polypeptide — so mass-spectrometry proteomics works with whatever is in the sample.
RNA is genome-templated, so we know what to look for. Every transcript maps back to a sequence we can align against a reference. Proteins must be inferred from fragmentary peptide spectra, and the proteome’s enormous dynamic range means abundant proteins drown out the rare ones we often care about most.
Throughput and cost. RNA-seq is cheap, standardized, and scales to millions of cells; comprehensive single-cell proteomics is still hard, lower-throughput, and far less complete.

The honest framing is the same one that recurs throughout this post: we optimize a convenient proxy. RNA abundance is to protein activity what a radiology label mined from a report is to the underlying pathology — useful, scalable, and systematically wrong in ways you must keep in view. A transcriptomic model that reports “expression” is making a claim about the script; whether the performance followed is a separate, and weaker, inference.

The data: a near-duplicate corpus, batch effects, and who is in it

Genomics is, paradoxically, both data-rich and data-poor. There is an enormous and growing public infrastructure (Table 1), far better than radiology’s. But the redundancy of Section 3, the batch effects below, and the demographics of who has been sequenced mean that effective sample size lags raw counts badly.

Table 1: Major public genomics / transcriptomics resources. Counts are as reported by the source publications; “variants” and “samples” are not comparable units across rows.

Resource	What it is	Reported scale	Citation (DOI)
1000 Genomes	Reference catalogue of human variation	2,504 individuals, 26 populations; ~88M variants	10.1038/nature15393
gnomAD	Aggregated exomes + genomes; constraint metrics	125,748 exomes + 15,708 genomes (v2)	10.1038/s41586-020-2308-7
UK Biobank	Population cohort, genotype + deep phenotype	~500,000 participants	10.1038/s41586-018-0579-z
TCGA	Pan-cancer tumor/normal multi-omics	~11,000 tumors, 33 cancer types	10.1038/ng.2764
GTEx	Genetic regulation of expression across tissues	17,382 RNA-seq samples, 54 tissues, 948 donors	10.1126/science.aaz1776
ENCODE	Functional/regulatory element annotation	Genome-wide assays across many cell types	10.1038/nature11247
GENCODE	Reference gene/transcript annotation	~20,000 coding genes; >200,000 transcripts	10.1093/nar/gkaa1087
Geuvadis	RNA-seq paired to 1000 Genomes genotypes	462 individuals, 5 populations	10.1038/nature12531
Tabula Sapiens	Multi-organ single-cell atlas	~500,000 cells, ~24 tissues	10.1126/science.abl4896
T2T-CHM13	First complete (telomere-to-telomere) human genome	1 gapless assembly	10.1126/science.abj6987

Two structural problems run underneath these numbers.

Batch effects are the scanner-heterogeneity of genomics. A sequencing readout is the end of a long wet-and-dry pipeline, and every stage is a covariate that shifts across labs: library preparation chemistry, sequencing platform (short-read Illumina vs. long-read PacBio/Nanopore), read length and depth, PCR amplification bias, RNA quality (RIN) and degradation, the alignment software, and — easy to forget — the reference build itself (GRCh37 vs. GRCh38 vs. the new T2T-CHM13). Two expression datasets can differ more by batch than by biology, and models readily learn the batch. The discipline that grew up around this — careful normalization, batch-correction methods, mixed models, harmonized pipelines — is the genomic counterpart to vendor-aware augmentation and intensity normalization in imaging. As there, the danger is symmetric: under-correct and you measure the lab; over-correct and you erase the biology.
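
The over-correction half of that symmetry is easy to demonstrate on toy numbers. The sketch below uses per-batch mean-centering, the crudest possible batch correction and a stand-in for the more careful methods mentioned above, on made-up expression values where batch and condition are fully confounded. `center_by_batch` and `group_gap` are illustrative helpers.

```python
from statistics import mean


def center_by_batch(values, batches):
    """Subtract each batch's own mean from its samples."""
    batch_means = {
        b: mean(v for v, bb in zip(values, batches) if bb == b) for b in set(batches)
    }
    return [v - batch_means[b] for v, b in zip(values, batches)]


def group_gap(values, labels, a, b):
    """Difference in mean value between label a and label b."""
    return (
        mean(v for v, l in zip(values, labels) if l == a)
        - mean(v for v, l in zip(values, labels) if l == b)
    )


# Made-up data: every control was run in batch A, every case in batch B.
values = [10, 11, 12, 20, 21, 22]
batches = ["A", "A", "A", "B", "B", "B"]
condition = ["ctrl", "ctrl", "ctrl", "case", "case", "case"]

corrected = center_by_batch(values, batches)
print(group_gap(values, condition, "case", "ctrl"))     # 10
print(group_gap(corrected, condition, "case", "ctrl"))  # 0
```

With this design the ten-unit gap could be biology or batch, and no correction applied after the fact can tell them apart; centering removes it either way. The remedy is in the experimental design (conditions spread across batches), not in the normalization code.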

The corpus is not representative of humanity. A large majority of participants in genome-wide studies are of European ancestry. This is the genomic version of the subgroup-power trap: a polygenic risk score trained predominantly on European-ancestry data transfers poorly to people of other ancestries, because the tag variants, allele frequencies, and linkage structure differ. A model can post excellent aggregate metrics and still be least accurate for the populations most underserved by existing tools. Splitting and evaluating by ancestry, and stating plainly which groups you are and are not powered to serve, is not optional diligence; it is the difference between a fair tool and an inequitable one.
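
Reporting by group takes only a few lines, so there is little excuse to skip it. The sketch below computes accuracy per ancestry group alongside the aggregate; the labels, predictions, and group names are invented purely to show how a strong overall number can hide a weak subgroup, and `accuracy_by_group` is an illustrative helper.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} plus an 'ALL' entry for the aggregate."""
    hits, totals = defaultdict(int), defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        for key in (group, "ALL"):
            totals[key] += 1
            hits[key] += int(truth == pred)
    return {key: hits[key] / totals[key] for key in totals}


# Invented toy data: 8 samples in a well-represented group, 2 in another.
groups = ["EUR"] * 8 + ["AFR"] * 2
y_true = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 1, 1, 1]

report = accuracy_by_group(y_true, y_pred, groups)
print(report)  # {'EUR': 0.875, 'ALL': 0.8, 'AFR': 0.5}
```

With two samples the smaller group's estimate is also too noisy to trust, which is the power half of the same problem: per-group metrics need per-group sample sizes and confidence intervals reported next to them.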

The famous models — and what they do and don’t solve

The reason this analogy is everywhere right now is that the transformer toolkit has produced genuinely landmark genomics results. It is worth knowing the map, and being precise about what each model does and does not address from the list above.

AlphaFold2 (Jumper et al., 2021) predicts protein 3D structure from amino-acid sequence at near-experimental accuracy — arguably the field’s defining success. Note what it sidesteps: it operates on the protein, taking the molecule that acts as a given, and says nothing about whether or how much of that protein the cell makes.
Enformer (Avsec et al., 2021) and AlphaGenome (DeepMind, 2025) attack the cis-regulatory problem head-on, predicting expression and chromatin readouts from sequence across $\sim 200\,\mathrm{kb}$ and up to $\sim 1\,\mathrm{Mb}$ windows respectively. They are the state of the art on long-range cis effects — and, per Section 5, structurally blind to trans regulation that acts through diffusible proteins or other chromosomes.
DNABERT (Ji et al., 2021), the Nucleotide Transformer (Dalla-Torre et al., 2024), and Evo (Nguyen et al., 2024) are DNA “language models” — masked or autoregressive pre-training over genomic sequence, transferred to downstream tasks. They inherit, and must confront, every tokenization and redundancy issue in Sections 2–3.
scGPT (Cui et al., 2024) and Geneformer (Theodoris et al., 2023) bring the foundation-model recipe to single-cell transcriptomics, learning representations of cell state from large RNA-expression atlases — which means they live entirely on the RNA side of the proxy gap in Section 8.

The pattern across the map is the through-line of this post: these models are spectacular within the slice of the problem they address, and it is on the modeler to know which slice that is. AlphaFold takes the protein as input; the regulatory models see only cis; the single-cell models see only RNA. None of that diminishes them — it just means the honest question is never “does the benchmark go up,” but “which part of the biology did this actually capture, and which part is still missing.”

Takeaways

If you remember five things moving from natural language to genomics:

The alphabet is a trap, not a gift. Four letters (plus N, IUPAC codes, and case-as-metadata), but the difficulty is the 3.2-billion-character length, the reverse-complement symmetry, and a tokenization choice that can destroy the single-base signal you came for.
The whole species is one near-duplicate corpus. Two genomes differ at $0.1\%$ of sites; common variants are old and shared, private variation is dozens of mutations, and most differing bases hide in understudied structural variants. Plan your splits around leakage, relatedness, and population structure from day one.
Most of the genome is unread, and labels live in experiments or patients. Function is not in the sequence the way meaning is in text; ground truth is often an official “uncertain,” and you cannot eyeball it. Keep a biologist in the loop.
Regulation defeats the context window. A $1\,\mathrm{Mb}$ window is real progress on cis and no progress on trans: the determining context is often a protein, a 3D contact, or a cell state that the sequence does not contain.
You are usually modeling a proxy. RNA is not protein, and the correlation is only $\sim 0.4$–$0.6$; “expression” is the script, not the performance. Encode the biology — codons, splicing, silent-but-not-neutral, drivers vs. passengers, multiple testing, similarity-beyond-chance — or your string model will confidently misread the genome.

See the accompanying notebook.ipynb for the redundancy arithmetic, the $k$-mer calculation, the proxy simulation, the multiple-testing counts behind Figures 1–3, and an automated check that every citation below resolves.

References

Auton A, Brooks LD, Durbin RM, et al. A global reference for human genetic variation. Nature. 2015;526(7571):68–74. doi:10.1038/nature15393
Sudmant PH, Rausch T, Gardner EJ, et al. An integrated map of structural variation in 2,504 human genomes. Nature. 2015;526(7571):75–81. doi:10.1038/nature15394
Karczewski KJ, Francioli LC, Tiao G, et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature. 2020;581(7809):434–443. doi:10.1038/s41586-020-2308-7
Bycroft C, Freeman C, Petkova D, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature. 2018;562(7726):203–209. doi:10.1038/s41586-018-0579-z
Weinstein JN, Collisson EA, Mills GB, et al. The Cancer Genome Atlas Pan-Cancer analysis project. Nat Genet. 2013;45(10):1113–1120. doi:10.1038/ng.2764
GTEx Consortium. The GTEx Consortium atlas of genetic regulatory effects across human tissues. Science. 2020;369(6509):1318–1330. doi:10.1126/science.aaz1776
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012;489(7414):57–74. doi:10.1038/nature11247
Frankish A, Diekhans M, Jungreis I, et al. GENCODE 2021. Nucleic Acids Res. 2021;49(D1):D916–D923. doi:10.1093/nar/gkaa1087
Lappalainen T, Sammeth M, Friedländer MR, et al. Transcriptome and genome sequencing uncovers functional variation in humans. Nature. 2013;501(7468):506–511. doi:10.1038/nature12531
Tabula Sapiens Consortium. The Tabula Sapiens: a multiple-organ, single-cell transcriptomic atlas of humans. Science. 2022;376(6594):eabl4896. doi:10.1126/science.abl4896
Nurk S, Koren S, Rhie A, et al. The complete sequence of a human genome. Science. 2022;376(6588):44–53. doi:10.1126/science.abj6987
Kong A, Frigge ML, Masson G, et al. Rate of de novo mutations and the importance of father’s age to disease risk. Nature. 2012;488(7412):471–475. doi:10.1038/nature11396
Schwanhäusser B, Busse D, Li N, et al. Global quantification of mammalian gene expression control. Nature. 2011;473(7347):337–342. doi:10.1038/nature10098
Vogel C, Marcotte EM. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat Rev Genet. 2012;13(4):227–232. doi:10.1038/nrg3185
Liu Y, Beyer A, Aebersold R. On the dependency of cellular protein levels on mRNA abundance. Cell. 2016;165(3):535–550. doi:10.1016/j.cell.2016.03.014
Edfors F, Danielsson F, Hallström BM, et al. Gene-specific correlation of RNA and protein levels in human cells and tissues. Mol Syst Biol. 2016;12(10):883. doi:10.15252/msb.20167144
Buccitelli C, Selbach M. mRNAs, proteins and the emerging principles of gene expression control. Nat Rev Genet. 2020;21(10):630–644. doi:10.1038/s41576-020-0258-4
Jumper J, Evans R, Pritzel A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589. doi:10.1038/s41586-021-03819-2
Avsec Ž, Agarwal V, Visentin D, et al. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods. 2021;18(10):1196–1203. doi:10.1038/s41592-021-01252-x
Avsec Ž, Latysheva N, Cheng J, et al. AlphaGenome: advancing regulatory variant effect prediction with a unified DNA sequence model. bioRxiv. 2025. doi:10.1101/2025.06.25.661532. See also https://deepmind.google/blog/alphagenome-ai-for-better-understanding-the-genome/
Ji Y, Zhou Z, Liu H, Davuluri RV. DNABERT: pre-trained Bidirectional Encoder Representations from Transformers model for DNA-language in genome. Bioinformatics. 2021;37(15):2112–2120. doi:10.1093/bioinformatics/btab083
Dalla-Torre H, Gonzalez L, Mendoza-Revilla J, et al. Nucleotide Transformer: building and evaluating robust foundation models for human genomics. Nat Methods. 2025;22(2):287–297. doi:10.1038/s41592-024-02523-z
Nguyen E, Poli M, Durrant MG, et al. Sequence modeling and design from molecular to genome scale with Evo. Science. 2024;386(6723):eado9336. doi:10.1126/science.ado9336
Cui H, Wang C, Maan H, et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods. 2024;21(8):1470–1480. doi:10.1038/s41592-024-02201-0
Theodoris CV, Xiao L, Chopra A, et al. Transfer learning enables predictions in network biology. Nature. 2023;618(7965):616–624. doi:10.1038/s41586-023-06139-9

Reproduce all analyses in this post here.

Radiology AI Is Not Computer Vision: A Field Guide for ML Scientists

2026-06-02T00:00:00+00:00

Why a computer-vision expert’s intuitions misfire

If you have trained a model on ImageNet, COCO, or a few hundred million Instagram photos, you have excellent instincts for natural-image vision. Most of those instincts are wrong — or at least dangerously incomplete — the moment you point them at a chest CT or a screening mammogram.

This post is a field guide for machine-learning scientists moving into radiology. It is not a survey of architectures; the architectures are mostly the ones you already know (CNNs, U-Nets, vision transformers, increasingly foundation models). What changes is everything around the architecture: the statistics of the signal, the cost and meaning of a label, the data you can actually get, and — the part that quietly sinks most projects — generalization across the bewildering heterogeneity of how medical images are produced. I will end with the two things ML scientists most often discover too late: how the FDA actually regulates these models, and why the model in the paper is rarely the model that ships.

A running theme: medical imaging is in some ways easier than natural-image vision, and leaning on those advantages is the difference between a model that demos well and one that survives contact with a second hospital.

What is genuinely easier than natural images

Start with the good news, because it is real and underexploited.

Canonical pose and framing. A street scene can contain a cat at any scale, any orientation, anywhere in the frame, against any background. A PA chest radiograph is, by protocol, a patient standing upright, facing the detector, arms positioned to rotate the scapulae off the lung fields. The heart is on the left.¹ The aortic knob is where the aortic knob goes. This is a strong spatial prior that natural-image models simply do not get for free — and it is why registration, atlas-based priors, and even fixed positional encodings work far better here than they would on web images.

One channel, calibrated. Most modalities are grayscale, and — crucially — the gray values often mean something physical. CT is quantitative: each voxel is a Hounsfield unit, a linear transform of the X-ray attenuation coefficient $\mu$ relative to water,

\[\mathrm{HU} = 1000 \times \frac{\mu - \mu_{\text{water}}}{\mu_{\text{water}} - \mu_{\text{air}}},\]

so water is $0$, air is $-1000$, fat is around $-100$, and cortical bone is $+1000$ or more. Fat is fat in every CT scanner on Earth. Nothing in RGB is calibrated like this; “how blue is the sky” is not a physical constant. You can and should exploit it — windowing, HU-based preprocessing, and physically motivated augmentations all follow from it.
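As a minimal sketch of the Hounsfield definition and of windowing, the snippet below maps an attenuation coefficient to HU and then clips and rescales to a display window. The water coefficient and the lung-window settings are illustrative assumptions, not values from this post, and `attenuation_to_hu` and `apply_window` are hypothetical helper names.

```python
def attenuation_to_hu(mu, mu_water, mu_air=0.0):
    """Map a linear attenuation coefficient to Hounsfield units."""
    return 1000.0 * (mu - mu_water) / (mu_water - mu_air)


def apply_window(hu, center, width):
    """Clip an HU value to the window and rescale it to [0, 1]."""
    low = center - width / 2.0
    high = center + width / 2.0
    clipped = min(max(hu, low), high)
    return (clipped - low) / (high - low)


# Assumed, illustrative coefficient for water (not a scanner calibration).
MU_WATER = 0.19

hu_water = attenuation_to_hu(MU_WATER, MU_WATER)  # water sits at the origin
hu_air = attenuation_to_hu(0.0, MU_WATER)         # air, taking mu_air as 0

# Assumed lung-style window: center -600 HU, width 1500 HU.
print(hu_water, hu_air, apply_window(hu_air, center=-600.0, width=1500.0))
```

Because the scale is physical, the same window can be applied to every scan before any learned preprocessing, subject to the calibration drift discussed later.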

The suspected disease localizes attention. Clinical imaging arrives with a reason for exam. “Rule out pneumothorax” tells you to look at the pleural line; “rule out stroke” sends you to the brain parenchyma and vessels. The organ of interest is usually known, which is a luxury object detection never has.

But each of these advantages has a barb:

The canonical pose breaks for portable/supine films, pediatric patients, body habitus, and post-surgical anatomy.
HU calibration drifts with scanner, kVp, and contrast timing (more on this below), and MRI intensities are not standardized at all: a T1-weighted intensity is only meaningful relative to the rest of that one acquisition.
“The organ of interest is known” is a trap: incidental findings in the other organs are often what matter most clinically. The lung-nodule model that ignores the adrenal mass at the edge of the field has failed the patient even if its AUC is perfect.

So: use the priors, but treat every one of them as a covariate that can shift.

The needle in the haystack: subtlety and extreme imbalance

Here is the single biggest statistical difference from natural images. In COCO, the object you care about typically occupies a meaningful fraction of the frame. In radiology, the finding is often a handful of voxels in a sea of normal tissue, and the difference between malignant and benign — between call the patient back and see you in two years — can come down to a few millimeters of spiculation or a subtle change in density.

Make it concrete with geometry. A chest CT of roughly $512 \times 512 \times 320$ voxels at $0.7 \times 0.7 \times 1.0\,\text{mm}$ contains about $8.4 \times 10^7$ voxels, each $0.49\,\text{mm}^3$. A clinically important pulmonary nodule $5\,\text{mm}$ in diameter ($r = 2.5\,\text{mm}$) is a sphere of volume $\tfrac{4}{3}\pi r^3 \approx 65\,\text{mm}^3$, or about $134$ voxels. The lesion is therefore

\[\frac{134}{8.4\times 10^7} \approx 1.6 \times 10^{-6}\]

of the volume — roughly one in six hundred thousand voxels. Shrink it to a $3\,\text{mm}$ nodule and you are at one in three million. Figure 1 puts several findings on the same axis as natural-image objects; note the five-to-six order-of-magnitude gap.
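The arithmetic above can be re-derived in a few lines. This is a sketch that treats the nodule as an ideal sphere and uses the grid and spacing quoted in the text; `lesion_fraction` is a hypothetical helper name.

```python
import math


def lesion_fraction(diameter_mm, grid=(512, 512, 320), spacing_mm=(0.7, 0.7, 1.0)):
    """Return (lesion voxels, fraction of the volume) for a spherical lesion."""
    total_voxels = grid[0] * grid[1] * grid[2]
    voxel_volume = spacing_mm[0] * spacing_mm[1] * spacing_mm[2]
    radius = diameter_mm / 2.0
    lesion_volume = (4.0 / 3.0) * math.pi * radius ** 3
    lesion_voxels = lesion_volume / voxel_volume
    return lesion_voxels, lesion_voxels / total_voxels


for diameter in (5.0, 3.0):
    voxels, fraction = lesion_fraction(diameter)
    print(f"{diameter} mm nodule: {voxels:.0f} voxels, about 1 in {1 / fraction:,.0f}")
```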

The consequences for an ML scientist are direct:

Accuracy is meaningless and pixel-wise loss is treacherous. A segmentation model that predicts “no lesion” everywhere achieves $1 - 1.6\times10^{-6} \approx 99.9998\%$ voxel accuracy. Use overlap and detection metrics built for imbalance — Dice / $F_1$, where for prediction $P$ and ground truth $G$, $\mathrm{Dice} = \frac{2|P \cap G|}{|P| + |G|},$ free-response ROC (FROC) for detection, and class-balanced or region-based losses (Dice loss, Tversky, focal). The focal loss down-weights the easy negatives that otherwise dominate the gradient: $\mathrm{FL}(p_t) = -(1-p_t)^{\gamma}\log p_t$.
Most of the volume is uninteresting, and uninteresting in a structured way. Hard-negative mining, lesion-aware patch sampling, and two-stage candidate-then-classify pipelines exist because uniformly sampling voxels wastes almost all of your compute on obvious lung parenchyma.
Resolution is not negotiable. Downsampling a natural image to $224^2$ loses a cat’s whiskers; downsampling a CT slice can erase the lesion entirely. The signal you are hunting may be at the Nyquist limit of the acquisition.
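To make the first point concrete, here is a small pure-Python sketch of voxel accuracy, Dice, and the focal loss on a toy volume. The mask sizes are invented for illustration, and returning Dice of 1.0 when both masks are empty is a convention I am assuming, not something the post specifies.

```python
import math


def dice(pred, truth):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    intersection = sum(p * t for p, t in zip(pred, truth))
    size_sum = sum(pred) + sum(truth)
    # Assumed convention: two empty masks count as perfect agreement.
    return 1.0 if size_sum == 0 else 2.0 * intersection / size_sum


def focal_loss(p, y, gamma=2.0):
    """Focal loss for one voxel: p is the predicted P(lesion), y is 0 or 1."""
    p_t = p if y == 1 else 1.0 - p
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


# Toy volume: 1,000,000 voxels with a 2-voxel lesion, and a model that predicts nothing.
truth = [0] * 1_000_000
truth[10] = truth[11] = 1
pred = [0] * 1_000_000

accuracy = sum(p == t for p, t in zip(pred, truth)) / len(truth)
print(accuracy, dice(pred, truth))

# An easy negative (p = 0.01) contributes far less than a missed positive (p = 0.01, y = 1).
print(focal_loss(0.01, 0), focal_loss(0.01, 1))
```

The all-negative prediction scores near-perfect accuracy and zero Dice, which is the gap the overlap metrics are meant to expose.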

Annotation is the bottleneck, not the model

In natural-image land, labels are cheap: crowdworkers draw boxes, and “is this a dog” needs no credential. Radiology inverts this completely, and it reshapes what is feasible.

A bounding box is the wrong primitive, and often impossible. Many findings have no crisp boundary. Where exactly does a ground-glass opacity end and normal lung begin? What is the bounding box of diffuse interstitial disease, or of “the lungs look hyperinflated”? The pathology is frequently a texture or a global property, not a localizable object. Even when a lesion is discrete, it lives in 3D: a box becomes a volume, and a radiologist scrolling 320 slices to contour a tumor is spending clinical time that costs orders of magnitude more than a crowdworker’s.

Ground truth is noisy and sometimes unobtainable from the image alone. The honest label often is not in the pixels. Is that lung nodule malignant? The image cannot say; you need the biopsy, or two years of follow-up showing growth. This is why so many “labels” in public datasets are actually NLP-extracted from the radiology report (MIMIC-CXR, CheXpert, ChestX-ray14, PadChest all do this) — which means your labels inherit both the radiologist’s error rate and the text-mining model’s error rate.

Inter-reader variability is a hard ceiling. Radiologists disagree. The LIDC-IDRI lung-nodule database was annotated by four thoracic radiologists precisely because no single read is ground truth; of 2,669 lesions marked as nodules $\geq 3\,\text{mm}$ by at least one reader, only about 35% were marked by all four. If your “ground truth” is one radiologist, your evaluation noise floor may be larger than the improvement you are claiming. Model the labels as noisy: capture annotator agreement (e.g. Cohen’s / Fleiss’ $\kappa$), train against multi-reader consensus where you can, and report performance relative to the inter-reader band, not to an imagined perfect oracle.
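As one way to capture annotator agreement, Cohen's $\kappa$ for two readers can be computed directly from their labels. The sketch below is a from-scratch version with invented reads, not LIDC-IDRI data; `cohens_kappa` is a hypothetical helper name.

```python
from collections import Counter


def cohens_kappa(reader_a, reader_b):
    """Cohen's kappa: agreement between two readers, corrected for chance."""
    n = len(reader_a)
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    counts_a = Counter(reader_a)
    counts_b = Counter(reader_b)
    # Chance agreement if each reader labelled independently at their own base rates.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both readers used a single identical label
    return (observed - expected) / (1.0 - expected)


# Invented reads of ten candidate lesions (1 = nodule, 0 = not a nodule).
reader_1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
reader_2 = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(cohens_kappa(reader_1, reader_2))
```

Here raw agreement is 80%, yet $\kappa$ is much lower because most of that agreement is on the abundant negatives.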

You cannot read the data without domain knowledge. A computer-vision engineer can sanity-check an ImageNet pipeline by eye. Almost no ML scientist can look at a FLAIR hyperintensity and tell whether the label is right. This has a practical implication that teams underestimate: you need a radiologist in the loop continuously, not just at the start, because data-cleaning decisions (which views to keep, how to handle priors, what counts as positive) are clinical judgments in disguise.

The data scarcity problem

Natural-image research rides on ImageNet ($1.4$M images) and web-scale sets in the billions. Radiology has nothing remotely comparable that is public, and the reasons are structural: images are protected health information, they must be de-identified (including burned-in pixel annotations and faces reconstructable from head CT/MRI), and the expert labels are expensive. What we do have is a handful of landmark public collections, summarized in Table 1.

Table 1: Major public medical-imaging datasets. “Images” counts vary by modality (a CT/MRI “study” is a 3D volume of many slices). Sizes are as reported by the source publications.

| Dataset | Modality | Scale | Notes |
| --- | --- | --- | --- |
| TCIA (The Cancer Imaging Archive) | CT/MR/PET, many | Umbrella of 100+ collections | The host for most public oncology imaging, incl. LIDC-IDRI, BraTS sources |
| MIMIC-CXR | Chest X-ray | 377,110 images / 227,835 studies / 65,379 patients | Single US center; paired free-text reports |
| CheXpert | Chest X-ray | 224,316 images / 65,240 patients | Stanford; 14 NLP-mined labels with uncertainty |
| ChestX-ray14 (NIH) | Chest X-ray | 112,120 images / 30,805 patients | 14 labels mined from reports |
| PadChest | Chest X-ray | 160,868 images / ~67,000 patients | Spanish; 174 findings, multi-view |
| LIDC-IDRI | Chest CT | 1,018 scans | 4-radiologist nodule annotations |
| BraTS / TCGA glioma | Brain MRI (4 sequences) | Hundreds of cases | Expert tumor segmentations; the benchmark for glioma |
| RSNA ICH | Head CT | >25,000 exams | Intracranial hemorrhage, 60+ radiologist labelers |
| EMBED | Mammography (2D/DBT) | 3.4M images / ~110,000 patients | Racially balanced; 20% public via AWS |
| fastMRI | Knee/brain MRI | >1,500 knee + ~7,000 brain raw studies | Raw k-space, for reconstruction research |
| UK Biobank imaging | Whole-body MRI/DXA | 100,000 participants | Population cohort, healthy-skewed; access-controlled |

Two things to internalize. First, the largest labeled sets are 2D chest radiographs, because they are the cheapest to acquire and the easiest to label from reports; 3D, multi-sequence, and rarer-modality data are one to three orders of magnitude smaller. Second — and this is the setup for the rest of the post — a big total $N$ is not the same as a big $N$ where it counts. EMBED has 3.4M images, but if you want to evaluate performance for, say, architectural distortion in dense breasts of women under 40 scanned on one vendor’s tomosynthesis unit, you are suddenly working with a few dozen cases.

Heterogeneity and generalization: the part everyone underestimates

Everyone says medical-imaging AI “doesn’t generalize.” Fewer people say why, mechanistically. The reason is that a medical image is the output of a long physical and human pipeline, and every stage of that pipeline is a covariate that differs across hospitals. A natural image has confounders too (lighting, camera), but nothing like this stack.

Formally, the trouble is distribution shift. Your model learns $P_{\text{train}}(Y \mid X)$ over inputs drawn from $P_{\text{train}}(X)$, and is deployed where both can differ:

\[P_{\text{train}}(X, Y) \;\neq\; P_{\text{test}}(X, Y).\]

Decompose it. Covariate shift is $P(X)$ changing while $P(Y\mid X)$ holds: a different scanner renders the same pathology with different texture. Label shift is $P(Y)$ changing while $P(X\mid Y)$ holds: disease prevalence differs between a referral center and a screening clinic, which (via Bayes) moves every predicted probability and every PPV even if the imaging is identical. Concept shift is the genuinely dangerous one, $P(Y\mid X)$ itself changing: the imaging appearance of a disease differs by population, or the label definition differs by institution. Here is the catalogue of what actually shifts:

Scanner vendor and model. GE, Siemens, Philips, Canon detectors and reconstruction software impose vendor-specific texture and noise signatures. Models readily learn the scanner, not the disease.
Acquisition physics. CT: tube voltage (kVp), tube current (mAs), pitch, slice thickness, and especially the reconstruction kernel (sharp vs. smooth) dramatically change texture — reconstruction kernel alone can render the majority of radiomic features non-reproducible across settings. MRI: field strength (1.5T vs 3T), pulse sequence and vendor implementation, TR/TE, and the fact that intensities are not standardized at all.
Contrast and timing. With vs. without IV contrast, and when in the contrast bolus the scan was captured, can change a structure’s appearance more than disease does.
Imaging noise and dose. Low-dose protocols (and the shift toward them) raise quantum noise; denoising and dose vary by site and by patient size.
Patient demographics and disease spectrum. Age, sex, body habitus, ancestry, comorbidity mix, and disease prevalence and severity all vary by catchment. A model tuned where pneumothoraces are large and obvious degrades where they are small and subtle.
Protocol and positioning. Portable vs. fixed units, supine vs. upright, inspiration depth, pediatric protocols, post-surgical hardware.
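The label-shift point is easy to see numerically. The sketch below holds sensitivity and specificity fixed and changes only prevalence; the 0.90 operating point and the two prevalences are hypothetical values chosen for illustration.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same model, same imaging, two hypothetical deployment sites.
referral_ppv = ppv(0.90, 0.90, prevalence=0.30)
screening_ppv = ppv(0.90, 0.90, prevalence=0.01)
print(f"referral center: {referral_ppv:.3f}, screening clinic: {screening_ppv:.3f}")
```

Nothing about the classifier changed between the two lines; only $P(Y)$ did, and the meaning of a positive call moved with it.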

The canonical demonstration is Zech et al. (2018): CNNs trained to detect pneumonia on chest radiographs generalized worse to outside hospitals than internal test performance suggested, and the models had learned to detect the hospital system and even the department — exploiting that a portable scanner marker or a prevalence difference correlated with disease. The same pattern shows up in segmentation: AlBadawy et al. (2018) found glioma-segmentation performance dropped measurably when training and test institutions differed. This is shortcut learning, and it is rampant precisely because the spurious features (scanner, view, burned-in markers) are so predictable.

What this means for your workflow:

Internal test performance is an upper bound, not an estimate. The only trustworthy evaluation is external — a held-out site, ideally a held-out vendor and time period. Split by hospital, not by image.
Audit for shortcuts. Saliency maps that point at the corner marker, an AUC that survives when you black out the anatomy, a model that can classify scanner from the image — all are red flags.
Harmonize deliberately. Intensity normalization, resampling to common spacing, vendor-aware augmentation, and even learned kernel/stain-style conversion exist to fight covariate shift; use them, but verify they did not erase the signal.
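A minimal sketch of "split by hospital, not by image": hold out whole sites so that no site contributes to both train and test. The record schema (a dict with a `site` key) and the helper name `split_by_site` are assumptions for illustration.

```python
import random


def split_by_site(records, test_fraction=0.3, seed=0):
    """Hold out entire sites; every record from a test site goes to the test set."""
    sites = sorted({r["site"] for r in records})
    rng = random.Random(seed)
    rng.shuffle(sites)
    n_test = max(1, round(test_fraction * len(sites)))
    test_sites = set(sites[:n_test])
    train = [r for r in records if r["site"] not in test_sites]
    test = [r for r in records if r["site"] in test_sites]
    return train, test


# Hypothetical records: four sites with five images each.
records = [{"site": s, "image_id": f"{s}-{i}"} for s in "ABCD" for i in range(5)]
train, test = split_by_site(records)
print(sorted({r["site"] for r in train}), sorted({r["site"] for r in test}))
```

The same idea extends to vendor and time period by grouping on those fields instead of, or in addition to, `site`.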

The statistical-power trap, in numbers

Now combine the previous two sections — heterogeneity and scarcity — and you get the quietest failure mode in the field. To prove a model generalizes, you must evaluate it in each clinically relevant subgroup. But every stratification you add slices your sample, and because disease is rare, it is the positive cases that vanish first.

Walk it down for a chest-radiograph model, anchored to MIMIC-CXR’s 377,110 images (Figure 2). Keep frontal views only ($\times 0.65$). Keep the positives for your target finding — pneumothorax, prevalence $\approx 3\%$ ($\times 0.03$); already you are at ~7,000 positive cases, not 377,110. Now ask the generalization questions clinicians will ask: how does it do in women ($\times 0.47$), specifically those aged 18–40 ($\times 0.16$), specifically scanned on vendor B ($\times 0.30$), specifically with the moderate-to-large, actionable subtype ($\times 0.40$)? You land on about 66 positive cases. From 377,110 to 66 — and 66 is the number that actually governs what you can conclude about that subgroup.
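The waterfall is just a running product, sketched here with the fractions quoted in the paragraph above. Treating the filters as independent multipliers is the text's simplification, and `waterfall` is a hypothetical helper name.

```python
def waterfall(start, steps):
    """Apply each retention fraction in turn and record the running count."""
    count = float(start)
    rows = [("all images", count)]
    for label, fraction in steps:
        count *= fraction
        rows.append((label, count))
    return rows


steps = [
    ("frontal views", 0.65),
    ("pneumothorax positive", 0.03),
    ("women", 0.47),
    ("aged 18-40", 0.16),
    ("vendor B", 0.30),
    ("moderate-to-large subtype", 0.40),
]

rows = waterfall(377_110, steps)
for label, count in rows:
    print(f"{label:<28}{count:>12,.0f}")
```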

Why 66 is a problem is pure sampling theory. Estimate a subgroup sensitivity (true positive rate) $\hat{p}$ from $n$ positive cases; its standard error is $\sqrt{p(1-p)/n}$, so the 95% confidence half-width is about

\[1.96\sqrt{\frac{p(1-p)}{n}}.\]

At a true sensitivity of $0.85$ and $n = 66$, that half-width is $\pm 0.086$: your estimate is “somewhere between $0.76$ and $0.94$.” You cannot distinguish a clinically excellent $0.90$ from a borderline $0.78$. (For small $n$ use the Wilson interval rather than this normal approximation — the qualitative story is the same, and at these counts it matters.) Figure 3a shows the half-width shrinking only as $1/\sqrt{n}$; the subgroup strata are marked.
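Both intervals mentioned above fit in a few lines. This is a sketch using the standard normal-approximation half-width and the standard Wilson score interval; the function names are my own.

```python
import math


def wald_half_width(p, n, z=1.96):
    """Normal-approximation half-width of a 95% interval for a proportion."""
    return z * math.sqrt(p * (1.0 - p) / n)


def wilson_interval(p_hat, n, z=1.96):
    """Wilson score interval, better behaved at small n and near 0 or 1."""
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2.0 * n)) / denom
    half = z * math.sqrt(p_hat * (1.0 - p_hat) / n + z * z / (4.0 * n * n)) / denom
    return center - half, center + half


half = wald_half_width(0.85, 66)
low, high = wilson_interval(0.85, 66)
print(f"Wald: 0.85 +/- {half:.3f}; Wilson: [{low:.3f}, {high:.3f}]")
```

At $n = 66$ the Wilson interval is shifted toward 0.5 relative to the symmetric one, which is the small-sample behavior the parenthetical warns about.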

Worse, suppose you want to detect a real subgroup gap — say sensitivity drops from $0.85$ overall to $0.75$ in young women on vendor B. The number of positives per group needed for a two-sided test at $\alpha = 0.05$ with power $1-\beta$ is

\[n = \frac{\left(z_{1-\alpha/2}\sqrt{2\bar{p}(1-\bar{p})} + z_{1-\beta}\sqrt{p_1(1-p_1)+p_2(1-p_2)}\right)^2}{(p_1 - p_2)^2},\]

which for $p_1=0.85,\, p_2=0.75$ works out to about 250 positive cases per group for 80% power. Your subgroup has 66, which buys roughly 30% power (Figure 3b): about a 70% chance of missing a real, clinically meaningful degradation. And if you honestly test across, say, ten subgroups, a Bonferroni correction to $\alpha = 0.005$ pushes the requirement to ~425 per group, while not correcting at all means some of your “significant” subgroup findings are noise. You are squeezed from both sides.
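The sample-size formula and its inversion for power can be checked with the standard library alone. A sketch, assuming Python 3.8+ for `statistics.NormalDist`; the function names are hypothetical, and the power expression is simply the same formula solved for $z_{1-\beta}$.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Positives per group for a two-sided two-proportion test."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    z_beta = STD_NORMAL.inv_cdf(power)
    p_bar = (p1 + p2) / 2.0
    numerator = (
        z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
        + z_beta * math.sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))
    ) ** 2
    return numerator / (p1 - p2) ** 2


def approx_power(p1, p2, n, alpha=0.05):
    """Power at n positives per group, from the same formula solved for z_beta."""
    z_alpha = STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    p_bar = (p1 + p2) / 2.0
    z_beta = (
        abs(p1 - p2) * math.sqrt(n) - z_alpha * math.sqrt(2.0 * p_bar * (1.0 - p_bar))
    ) / math.sqrt(p1 * (1.0 - p1) + p2 * (1.0 - p2))
    return STD_NORMAL.cdf(z_beta)


print(round(n_per_group(0.85, 0.75)))               # 80% power, alpha = 0.05
print(round(n_per_group(0.85, 0.75, alpha=0.005)))  # Bonferroni across ten subgroups
print(round(approx_power(0.85, 0.75, 66), 2))       # what 66 positives buys
```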

The lesson is not “give up.” It is to plan evaluation as a power calculation from day one: decide which subgroups are non-negotiable, estimate the positive counts you will actually have, and either acquire enough cases (often via multi-site collaboration) or state honestly which subgroups you are not powered to certify. Silent truncation, reporting one headline AUC computed over a population you never stratified, is how models that look publication-ready fail in deployment.

How these models are actually regulated

If your model will touch patient care in the US, it is almost certainly a medical device, and the FDA’s framework shapes your engineering. A few facts ML scientists are routinely surprised by:

Radiology dominates. Of all the AI/ML-enabled devices the FDA authorized from the 1990s through the mid-2020s, roughly three-quarters are in radiology, by far the largest category. This is your field.
Almost everything clears via 510(k), not clinical trials. The dominant path is the 510(k), which establishes “substantial equivalence” to a legally marketed predicate device — not a randomized trial. (Genuinely novel devices use the De Novo path; the highest-risk ones need full premarket approval, PMA, which is rare for imaging AI.) A consequence: fewer than a third of FDA-authorized radiology AI devices have published prospective clinical testing. Substantial equivalence is a regulatory claim, not evidence your model helps patients — keep those separate in your head.
Models had to be “locked.” Historically the FDA cleared locked algorithms — same input, same output, no learning in the field — because a continuously adapting model breaks the entire premarket paradigm.

What changed recently is worth knowing, because it directly affects how you can plan model updates. In December 2024 the FDA finalized guidance on the Predetermined Change Control Plan (PCCP). The idea: in your original submission, you pre-specify what you will be allowed to change (e.g. retrain on new sites, recalibrate a threshold), the methodology you will use to develop and validate each change, and an impact assessment — and then you can ship those pre-authorized modifications without a new marketing submission. For an ML scientist this is the bridge from “frozen forever” toward “responsibly updatable,” and it explicitly asks you to think up front about intended-use populations (ethnicity, sex, disease severity) and deployment environments. In practice it means your monitoring and revalidation plan is part of the product, not an afterthought.

The academic model is not the deployed model

Finally, the gap that ends the most promising projects. The model in the paper and the model in the hospital are different artifacts, optimized against different objectives.

| Dimension | Academic / benchmark model | Deployed clinical model |
| --- | --- | --- |
| Objective | Maximize AUC/Dice on a fixed test set | Improve a clinical workflow at a fixed, safe operating point |
| Metric that matters | Discrimination (AUROC) | Sensitivity/specificity at a chosen threshold; calibration; PPV at local prevalence |
| Data | Curated, deduplicated, clean labels | Messy PACS feed: priors, wrong views, artifacts, truncation |
| Generalization | Random split, often single site | Must hold across vendors, sites, time, demographics |
| Failure cost | A lower number in a table | A missed cancer or a false alarm that fatigues the radiologist |
| Lifecycle | Frozen at publication | Monitored, drifts, must be revalidated and re-cleared |
| Integration | A `.ipynb` and a checkpoint | DICOM in/out, PACS + reporting integration, latency budget, audit trail |

Concretely, what bites teams crossing this gap:

Operating point, not the whole curve. A clinician runs your model at one threshold. A great ROC curve with no defensible, calibrated operating point is not deployable. And because prevalence differs by site (label shift), the threshold that gives the right PPV in your lab is wrong in the clinic; plan to recalibrate, e.g. with Platt scaling or isotonic regression, per site.
The long tail is the job. Benchmarks delete the ambiguous and corrupted cases that dominate a real PACS queue. In deployment those are the workload: the lateral mistakenly sent as frontal, the patient with prior surgery, the motion-degraded study. Your model needs a calibrated “I don’t know.”
Prospective $\neq$ retrospective. Retrospective AUC routinely overstates prospective performance; the few prospective and randomized radiology-AI studies have repeatedly come in below their retrospective hype.
Automation bias and workflow effects. A deployed model changes radiologist behavior — sometimes it catches misses, sometimes it anchors the reader to a wrong call. The endpoint that matters is reader + model, not the model in isolation.
Drift and monitoring. Scanners get replaced, protocols change, populations shift. A model that was validated in 2024 is not automatically valid in 2027. The PCCP framework above exists precisely because this drift is inevitable.
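For the per-site recalibration mentioned under the operating point, isotonic regression can be sketched from scratch with the pool-adjacent-violators algorithm. This is an illustrative implementation, not any library's API; it assumes distinct scores, and the six scored cases are invented.

```python
from bisect import bisect_right


def isotonic_fit(scores, labels):
    """Pool-adjacent-violators: a non-decreasing fit of labels against scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    xs = [scores[i] for i in order]
    ys = [float(labels[i]) for i in order]
    blocks = []  # each block is [sum of labels, count]
    for y in ys:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return xs, fitted


def calibrate(score, xs, fitted):
    """Step-function lookup: the fitted value at the nearest score at or below."""
    idx = bisect_right(xs, score) - 1
    return fitted[max(idx, 0)]


# Invented local validation set: raw model scores and confirmed outcomes.
site_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
site_labels = [0, 1, 0, 0, 1, 1]
xs, fitted = isotonic_fit(site_scores, site_labels)
print(fitted, calibrate(0.35, xs, fitted))
```

A real recalibration would need far more local positives than this toy set, which ties back to the power discussion earlier in the post.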

Takeaways

If you remember five things moving from natural images to radiology:

Exploit the priors, distrust them. Canonical pose, calibrated intensities, and a known organ of interest are real gifts — but each is a covariate that shifts, and the finding may be in the organ you weren’t told to look at.
Your signal is a needle. Lesions are $10^{-7}$–$10^{-5}$ of the image. Abandon accuracy and pixel-wise loss; use detection/overlap metrics, imbalance-aware losses, and lesion-aware sampling, and don’t downsample away the disease.
Labels are the bottleneck. They are expensive, noisy, NLP-mined, and bounded by inter-reader disagreement. Keep a radiologist in the loop and model the label noise explicitly.
Generalization is the whole game. Split by site/vendor/time, hunt for shortcuts, and treat internal test numbers as upper bounds.
Power your evaluation before you train. Stratification destroys positive counts; decide which subgroups you can certify, and say so honestly. Then remember the deployed model lives at one calibrated operating point, under FDA rules, drifting over time — design for that from the start.

See the accompanying notebook.ipynb for the geometry, the stratification waterfall, the power calculations behind Figures 1–3, and an automated check that every citation below resolves.

References

Clark K, Vendt B, Smith K, et al. The Cancer Imaging Archive (TCIA): maintaining and operating a public information repository. J Digit Imaging. 2013;26(6):1045–1057. doi:10.1007/s10278-013-9622-7
Johnson AEW, Pollard TJ, Berkowitz SJ, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports. Sci Data. 2019;6:317. doi:10.1038/s41597-019-0322-0
Irvin J, Rajpurkar P, Ko M, et al. CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison. Proc AAAI. 2019;33(01):590–597. doi:10.1609/aaai.v33i01.3301590
Wang X, Peng Y, Lu L, et al. ChestX-ray8: hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. CVPR. 2017:3462–3471. doi:10.1109/CVPR.2017.369
Bustos A, Pertusa A, Salinas J-M, de la Iglesia-Vayá M. PadChest: a large chest x-ray image dataset with multi-label annotated reports. Med Image Anal. 2020;66:101797. doi:10.1016/j.media.2020.101797
Armato SG III, McLennan G, Bidaut L, et al. The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Med Phys. 2011;38(2):915–931. doi:10.1118/1.3528204
Bakas S, Akbari H, Sotiras A, et al. Advancing The Cancer Genome Atlas glioma MRI collections with expert segmentation labels and radiomic features. Sci Data. 2017;4:170117. doi:10.1038/sdata.2017.117
Menze BH, Jakab A, Bauer S, et al. The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS). IEEE Trans Med Imaging. 2015;34(10):1993–2024. doi:10.1109/TMI.2014.2377694
Knoll F, Zbontar J, Sriram A, et al. fastMRI: a publicly available raw k-space and DICOM dataset of knee images for accelerated MR image reconstruction using machine learning. Radiol Artif Intell. 2020;2(1):e190007. doi:10.1148/ryai.2020190007
Jeong JJ, Vey BL, Bhimireddy A, et al. The EMory BrEast imaging Dataset (EMBED): a racially diverse, granular dataset of 3.4 million screening and diagnostic mammographic images. Radiol Artif Intell. 2023;5(1):e220047. doi:10.1148/ryai.220047
Littlejohns TJ, Holliday J, Gibson LM, et al. The UK Biobank imaging enhancement of 100,000 participants: rationale, data collection, management and future directions. Nat Commun. 2020;11:2624. doi:10.1038/s41467-020-15948-9
Zech JR, Badgeley MA, Liu M, et al. Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: a cross-sectional study. PLoS Med. 2018;15(11):e1002683. doi:10.1371/journal.pmed.1002683
AlBadawy EA, Saha A, Mazurowski MA. Deep learning for segmentation of brain tumors: impact of cross-institutional training and testing. Med Phys. 2018;45(3):1150–1158. doi:10.1002/mp.12752

Reproduce all analyses in this post here.

¹ Except in situs inversus (~1 in 10,000), which is exactly the kind of rare but catastrophic edge case a model trained on the canonical prior will get confidently wrong. Hold that thought; it returns under heterogeneity.

How This Blog Is Built: A Reproducible Pipeline for Scientific Writing

2026-06-01T00:00:00+00:00

Why a blog deserves a build system

Most of what I write here makes a quantitative claim, and a quantitative claim is only as trustworthy as the analysis behind it. In a paper, the apparatus that makes a result believable — version control, a pinned environment, a test that re-runs the analysis end-to-end — lives off to the side, in a supplement nobody reads. I wanted the blog to put that apparatus first. Every figure here should be regenerable from a notebook, every notebook should run in a known environment, and every claim that survives to the published page should have passed a test on the way there.

That goal sounds heavy, but the day-to-day is the opposite. My entire workflow as an author is three commands:

# write posts//main.md, then:
git add -A
git commit -m "Add post: ..."
git push

Everything downstream — rendering the PDF, publishing the web page, running the analysis, testing that it all still works — is automated. This post is a tour of that automation: what each piece does, why I chose it, and how they compose into the pipeline in Figure 1.

[Figure 1: `main.md` plus its notebook fans out to a PDF, a tested analysis, and a live web page. The top lane is everything I touch by hand; the bottom lane is automatic.]

The author’s-eye view: one source of truth

Each post is a self-contained directory:

posts//
  main.md            # the article: Markdown + YAML front matter + LaTeX math
  notebook.ipynb     # the analysis that generates every figure
  figures/           # generated plots (git-ignored)
  scripts/           # plotting / analysis code, runnable standalone
  data/              # datasets + a README describing each source (git-ignored)
  environment.yml    # the conda environment for *this* post
  Dockerfile         # a container that reproduces *this* post

The article itself is a plain Markdown file. Prose is Markdown; math is LaTeX, delimited by $…$ for inline symbols and $$…$$ for display equations. So a sentence can carry a real claim: for a diagnostic test with sensitivity $\mathrm{Se}$ and specificity $\mathrm{Sp}$ applied to a population with disease prevalence $\pi$, the post-test probability of disease given a positive result is

\[\Pr(D^{+}\mid T^{+}) \;=\; \frac{\mathrm{Se}\,\pi}{\mathrm{Se}\,\pi + (1-\mathrm{Sp})(1-\pi)},\]

— and that same source file renders to a typeset PDF and to a web page, with the math intact in both. Writing in Markdown rather than HTML or a CMS means the post is diffable, greppable, reviewable in a pull request, and outlives any particular renderer.

I write the posts in Markdown rather than full LaTeX for the same reason: a blog post is prose with the occasional equation, not a precisely typeset document. I don’t need fine control over page breaks, floats, and layout here — I need to get words and math down quickly and let the site’s theme handle how they look. Markdown keeps the source close to the rendered blog layout and stays readable on its own. When I write a manuscript, where layout, figure placement, and typesetting precision actually matter, I reach for LaTeX in Overleaf instead; Markdown is the right altitude for a blog, LaTeX for a paper.

The rest of the directory exists so that the numbers in that prose are defensible. The notebook produces the figures; the scripts hold any analysis worth reusing; the environment.yml and Dockerfile pin exactly what it takes to run them. Nothing in the published page is hand-drawn or hand-typed from a result I can’t reproduce.

The website: Jekyll, a theme, and Vercel

Jekyll for the site, academicpages for the theme

The site is a Jekyll static site. Jekyll turns a folder of Markdown into a fast, dependency-free set of HTML pages, and — the reason I chose it over a hand-rolled framework — it has a deep ecosystem of ready-made themes. I use academicpages, a fork of Minimal Mistakes built for academics, trimmed down to the three things I actually need: an About page, a Blog with tag filtering, and a Publications list. Because the publications list can be generated from a BibTeX export of my Google Scholar profile, the academic furniture of the site maintains itself.

A static site is the right tool here for the same reason a simple model often beats a complex one: there is no server to run, no database to corrupt, no attack surface to patch. The output is just files.

I started out on WordPress, which is a capable platform — but for a blog that is really a pile of version-controlled text and notebooks, a database-backed CMS was more moving parts than the job called for. Switching to a static site let the writing live in the same Git repository as the analysis, diffable and reviewable alongside the code, with nothing to keep patched or running between posts.

Vercel for hosting

Those files are served by Vercel. I point Vercel at the GitHub repository, set the root directory to site/, and it does the rest: on every push to main it runs bundle exec jekyll build and deploys the result to joseph-rich.com behind a global CDN, with HTTPS and the custom domain handled for me. There is no deploy step in my workflow — “deploy” is “merge to main.”

The appeal is simplicity. I never think about web servers. A push becomes a live site in under a minute, every pull request gets its own preview URL so I can see a draft exactly as it will appear before it goes public, and committing Gemfile.lock pins Vercel's build to the same gem versions as my local one.

The domain name itself lives at Cloudflare, which is my registrar and DNS provider; Cloudflare’s nameservers simply point joseph-rich.com at Vercel. Keeping the domain deliberately separate from the host buys two things. First, Cloudflare registers domains at wholesale cost with no markup and includes WHOIS privacy for free, so the registration is cheap and my contact details stay out of the public record. Second, the domain isn’t captive to any one platform: because DNS lives with the registrar rather than the host, I can repoint joseph-rich.com at a different provider by editing a single record, with no migration and no downtime. The host is replaceable; the address is mine.

giscus for comments

Comments are powered by giscus, which stores each discussion thread in this repository’s GitHub Discussions. I chose it for three reasons:

It’s built on GitHub. The comments live next to the code, in the same account that already hosts everything else — no third-party comment database to own or migrate.
It requires a GitHub login. Commenting means authenticating with GitHub, which by itself filters out essentially all drive-by spam. The barrier is low for the technical audience this blog is written for and high for bots.
No ads, no tracking, free. Unlike hosted comment widgets, giscus serves no advertising and sells no data. It’s an open-source script talking to the GitHub API.

Setup is a one-time affair: enable Discussions, install the giscus GitHub app, and drop the repository and category IDs into the Jekyll config.

The analysis: notebooks you can actually re-run

Jupyter for the figures, Colab for zero-install access

Every figure starts life in a Jupyter notebook. The notebook is the interactive workbench — load the data, fit the model, plot it, see the result inline, iterate — and it doubles as the record of how each figure was made. Crucially, the notebook writes the figures into figures/, so the article and the analysis can never silently drift apart: regenerate the figure and the post updates.

For readers who don’t want to install anything, each notebook also opens directly in Google Colab from a badge at the top. A curious reader can re-run my analysis in their browser, change a parameter, and watch the figure move — no local setup at all. Interactivity is the point: a static PNG asserts a result; a runnable notebook lets you check it.

conda and Docker: one environment per post

“It runs on my machine” is not reproducibility. Each post therefore pins its own environment two ways:

conda. A per-post environment.yml lists exact versions of Python and every library the notebook imports. The environment is named after the post, so posts never share a dependency set. A two-year-old post can pin an old numpy while a new one uses the latest, and neither breaks the other.
Docker. A per-post Dockerfile builds that conda environment inside a container and registers it as a Jupyter kernel, so the notebook runs identically on any machine with Docker — no conda required, nothing touching the host.

Isolating environments per post is deliberate. A single shared environment is a slow-motion dependency crisis: every new library risks an upgrade that quietly changes an old figure. Per-post environments make each article a sealed unit that reproduces on its own, indefinitely.
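One way to keep "exact versions" honest is a check that flags any dependency without a version pin. The sketch below is a heuristic under stated assumptions: it takes the dependency strings as a plain list (parsing environment.yml itself is left out), and the helper name `unpinned` is hypothetical. It treats `name=1.2.3` and `name==1.2.3` as pinned, even though conda reads a single `=` as a prefix match rather than a strictly exact one.

```python
import re

# Pinned, for this sketch: a package name, "=" or "==", then a literal version,
# optionally followed by "=<build string>". Ranges and wildcards do not count.
_PINNED = re.compile(r"^[A-Za-z0-9_.\-]+={1,2}\d[\w.]*(=\S+)?$")


def unpinned(dependencies: list[str]) -> list[str]:
    """Return the dependency specs that do not fix a literal version."""
    return [spec for spec in dependencies if not _PINNED.match(spec.strip())]


if __name__ == "__main__":
    # Hypothetical dependency list, not taken from any post's environment.yml.
    specs = ["python=3.11.9", "numpy==1.26.4", "pandas", "scipy>=1.10"]
    print(unpinned(specs))
```

A test that asserts this list is empty for every post would turn the pinning policy into something CI can enforce.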

Quality control: tests, CI, and a publish hook

This is the part most personal sites skip, and it’s the part I care about most. A blog that makes numerical claims should be tested like software that makes numerical claims.

pytest discovers and exercises every post

A pytest suite walks posts/, discovers every post automatically, and runs three independent checks against each one:

The PDF builds. main.md renders to PDF through pandoc and the Eisvogel LaTeX template. If an equation or a figure path is broken, this fails.
The notebook runs (lax). The notebook executes top to bottom and must complete without raising — using nbval in --nbval-lax mode, which ignores the stored outputs and only checks that nothing errors.
The notebook reproduces its outputs (strict). The notebook re-runs and each cell’s output must match what’s committed, exactly.
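The discovery step and the three checks above could be sketched roughly as follows. This is not the suite itself. The function names are hypothetical, and the command lines are assumptions about how pandoc and nbval might be invoked (`--template`, `--nbval-lax`, and `--nbval` are real flags, but the exact arguments used in this repository are not shown in the post).

```python
from pathlib import Path


def discover_posts(root: Path) -> list[Path]:
    """Return each directory directly under root that contains a main.md."""
    return sorted(p.parent for p in root.glob("*/main.md"))


def planned_checks(post: Path) -> dict[str, list[str]]:
    """Map each check name to an assumed command line for one post."""
    notebook = str(post / "notebook.ipynb")
    return {
        # PDF build through pandoc with the Eisvogel template.
        "pdf": ["pandoc", str(post / "main.md"), "-o", str(post / "main.pdf"),
                "--template", "eisvogel"],
        # Lax: the notebook must run without raising; stored outputs are ignored.
        "notebook_lax": ["pytest", "--nbval-lax", notebook],
        # Strict: each cell's output must match what is committed.
        "notebook_strict": ["pytest", "--nbval", notebook],
    }


if __name__ == "__main__":
    for post in discover_posts(Path("posts")):
        for name, command in planned_checks(post).items():
            print(post.name, name, " ".join(command))
```

Because discovery is a glob rather than a hand-maintained list, a new post directory is picked up by the checks without touching the test code.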

Running both a lax and a strict notebook check is intentional, and it’s the diagnostic trick I’d most recommend borrowing. The two failures mean very different things:

A lax failure means the code is broken — an exception, a missing import, an API that changed under me.
A strict failure means the code still runs but the result moved — a new library version nudged a number, or a computation wasn’t as deterministic as I thought.

Separating “it crashed” from “the answer changed” turns a red checkmark into an actual diagnosis. Genuinely non-deterministic cells (timestamps, random draws, plot objects) are marked to be ignored by the strict check, so a strict failure is always a real signal, never noise.
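The lax/strict diagnosis reduces to a small decision table. A minimal sketch, assuming each check has already been collapsed to a pass/fail boolean; the `Verdict` enum and `diagnose` function are illustrative names, not part of the repository.

```python
from enum import Enum


class Verdict(Enum):
    GREEN = "runs and reproduces"
    CODE_BROKEN = "the notebook raised: exception, missing import, changed API"
    RESULT_MOVED = "the notebook runs, but an output no longer matches"


def diagnose(lax_passed: bool, strict_passed: bool) -> Verdict:
    """Turn the two notebook checks into one diagnosis."""
    if not lax_passed:
        # If the code does not run at all, the strict result carries no information.
        return Verdict.CODE_BROKEN
    if not strict_passed:
        return Verdict.RESULT_MOVED
    return Verdict.GREEN


if __name__ == "__main__":
    print(diagnose(lax_passed=True, strict_passed=False).value)
```

Checking the lax result first matters: a crashed notebook also fails the strict check, so reading the strict result alone would misreport a crash as a moved answer.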

GitHub for version control, GitHub Actions for CI

The whole repository lives on GitHub, which gives me version history, pull requests, and Discussions (the same Discussions that back the comments). On top of that, GitHub Actions runs the entire pytest suite — PDF builds and both notebook checks, across every post — automatically on every push and every pull request. It spins up a clean Ubuntu machine, installs the conda environments and a LaTeX toolchain from scratch, regenerates the figures, and runs the tests. Because the runner starts empty, “passes in CI” means “reproduces on a machine that has never seen my files” — exactly the property I want.

The payoff is that I can’t quietly ship a broken post. If a notebook stops running or a figure stops reproducing, the check goes red before anything reaches the site.

Feature branches keep the live site stable

New posts are written on a feature branch, never on main. Vercel only deploys main, so a half-finished draft can be committed, pushed, and run through CI as many times as I like without ever touching the public site. When the branch is green and the writing is done, I merge to main — and that merge is what publishes. The branch is the draft; main is print.

A pre-commit hook publishes automatically

The bridge from posts//main.md to a Jekyll page is a committed pre-commit hook. On every commit it runs a small script (sync_posts.py) that:

maps the post’s front matter into the Jekyll format the theme expects,
copies the referenced figures into the site’s image folder and rewrites the paths,
translates inline $…$ math into the $$…$$ form the site’s MathJax renders, and
appends a footer linking back to the post’s source folder on GitHub, so any reader can reproduce the analysis.

Because this runs at commit time, the website copy is always in sync with the authoritative main.md — I never edit the published page by hand, and I can never forget to. Authoring and publishing collapse into a single git commit.
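As an illustration of the math-translation step only, here is one way such a rewrite could be done with a regular expression. It is a sketch, not the contents of sync_posts.py: the function name is hypothetical, and it assumes inline math never spans lines or contains a literal dollar sign.

```python
import re

# A single "$" that is not part of "$$", then content with no "$" or newline,
# then a closing single "$" that is also not part of "$$".
_INLINE_MATH = re.compile(r"(?<!\$)\$(?!\$)([^$\n]+?)\$(?!\$)")


def inline_to_double_dollar(markdown: str) -> str:
    """Rewrite $x$ as $$x$$ and leave existing $$...$$ spans untouched."""
    # A function replacement keeps LaTeX backslashes in the match literal.
    return _INLINE_MATH.sub(lambda m: "$$" + m.group(1) + "$$", markdown)


if __name__ == "__main__":
    print(inline_to_double_dollar(r"prevalence $\pi$ and $$E = mc^2$$"))
```

A pattern this simple will misread prose that uses dollar signs for currency, so a real script would need to skip code spans and escaped dollars as well.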

Details that keep the repository rigorous

A few smaller choices do disproportionate work.

Citations have to resolve. I manage references in Zotero, which keeps a single library of everything I’ve cited across posts and papers and exports clean BibTeX on demand. Before I reference a paper, I check its DOI against doi2bib (https://doi2bib.org/bib/); if the DOI doesn’t return a valid bib entry, the citation doesn’t go in. It’s a cheap, mechanical guard against the broken or imaginary references that creep into informal writing — and the final notebook cell of a data-driven post re-checks that every DOI still resolves.
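The DOI guard could be automated along these lines. This is a sketch under loud assumptions: that appending the DOI to the doi2bib URL above returns the BibTeX entry as plain text, and that a valid entry starts with `@`. Neither is confirmed here. The function names are hypothetical, and the fetcher is injectable so the logic can be exercised without network access.

```python
import re
from typing import Callable
from urllib.request import urlopen

# Common DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def doi2bib_url(doi: str) -> str:
    """Build the lookup URL (assumed layout: base URL followed by the DOI)."""
    return "https://doi2bib.org/bib/" + doi


def _fetch(url: str) -> str:
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def has_bib_entry(doi: str, fetch: Callable[[str], str] = _fetch) -> bool:
    """True if the DOI is well formed and the lookup returns something BibTeX-like."""
    if not _DOI_SHAPE.match(doi):
        return False
    try:
        body = fetch(doi2bib_url(doi))
    except OSError:  # urllib's URLError and HTTPError are OSError subclasses
        return False
    # Assumption: a successful lookup yields text beginning with "@article{...".
    return body.lstrip().startswith("@")
```

Rejecting malformed DOIs before any request is made keeps typos from turning into slow network failures in the final notebook cell.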

Data and figures are git-ignored. The repository tracks code and prose, not the artifacts they produce. Generated figures and downloaded datasets are excluded from version control, which keeps the repo small and fast to clone and avoids committing large or redistribution-restricted files. CI regenerates the figures from the notebooks before testing, so nothing is lost — the recipe is versioned, the output is disposable. (The data/README.md still documents every source and its license, so the provenance survives even though the bytes don’t.)

Lean for proofs that have to be right. When a post leans on a piece of mathematics I want to be certain of — not just plausible — I can formalize it in Lean, a proof assistant that mechanically verifies each step. Most posts never need it, but for a subtle inequality or a correctness argument it’s the difference between “I checked it carefully” and “a theorem prover checked it.”

The whole loop, in one breath

Put together, the system means my job is to write. I open main.md, write prose and equations in Markdown and LaTeX, build the figures in a notebook, and then commit and push. From there:

the pre-commit hook converts the post into a web page and stages it;
GitHub stores the history and opens the pull request;
GitHub Actions rebuilds the PDF and re-runs every notebook, lax and strict, on a clean machine;
once it’s green I merge to main, and
Vercel builds the Jekyll site and deploys it to joseph-rich.com.

No manual build, no manual deploy, no copy-paste into a CMS, and — most importantly — no published claim that hasn’t survived a test. The blog is held to the same standard as the research it describes: reproducible, version controlled, and continuously verified. That’s the whole point. If a result is worth publishing, it’s worth being able to run again.

Recommendations, if you’re building something similar

A few things I’d tell a past version of myself:

Pick a static site with a theme ecosystem. The fastest path to a site you won’t fight is a mature theme on Jekyll, Hugo, or Astro. Don’t hand-roll the CSS for a blog.
Write in Markdown, not in your CMS. Plain text files are diffable, reviewable, and portable across renderers. Your words should outlive your tooling.
Test the analysis, not just the prose. The lax/strict split is worth adopting wholesale: it tells you whether your code broke or your answer moved, which are different problems with different fixes.
Isolate environments aggressively. One pinned environment per post (or per project) is the cheapest insurance against bit-rot you can buy.
Make publishing a side effect of committing. A commit hook plus a push-to-deploy host removes the two steps most likely to go stale: the manual build and the manual upload.
Use branches as drafts. Gating deploys on main lets you commit freely, run CI repeatedly, and publish only when you mean to.

None of these pieces is exotic. The leverage is in composing them so that the boring, error-prone work — building, deploying, testing, keeping copies in sync — happens on its own, and the only thing left for me to do is the part that actually matters: the writing.

Reproduce all analyses in this post here.