No background in biology or statistics required. This deck walks through one idea at a time: how to describe a dividing cell in the language of probability, and how to ask data whether that description holds up — before any results are in.
Put a single bacterium under a microscope and watch: it grows longer, then splits into two smaller cells. Each of those grows and splits again. Followed over time, this produces a lineage — a family tree of cells, where every branch is one cell cycle: the stretch from one cell's birth to the moment it divides.
Four numbers describe each cell cycle in the data we'll eventually fit against:
In words: a cell is born at some mass mb, grows at rate α, and after a duration t splits into two daughters. One daughter keeps a fraction f of the mother's final mass; the other keeps the rest, 1−f.
A cell has no clock. So how does it "know" when to divide? One long-standing idea, the accumulator hypothesis, says: some internal quantity — call it protein, or just "stuff" — builds up as the cell grows. Once enough of it has piled up, the odds of dividing shoot up.
Since a cell's length grows roughly exponentially, the natural stand-in for "how much has piled up" is simply how much bigger the cell has gotten since it was born:
Read it piece by piece: eαt is "how many times bigger the cell has gotten" after time t. Subtract the 1 it started with, and multiply by its birth mass, and p(t) is just the raw amount of new growth — nothing exotic, just size added since birth.
To turn "risk of dividing" into math, we borrow a toolkit built for any "time until an event" question. Two functions matter:
Survival, S(t): the probability the cell has not yet divided by time t. It starts at S(0) = 1 and falls toward 0.
Hazard, h(t): the risk of dividing right now, given it hasn't divided yet. Not "what fraction of all cells ever divide at time t" — but "of the cells still standing at time t, what fraction divide in the next instant."
Know the hazard, and survival follows automatically. Multiply them together and you get something new — the division-time density, written fdiv(t). It answers: "how probable was it, exactly, that this cell divided at this precise moment t?"
Remember the name fdiv(t) — it reappears, unchanged, a few slides from now as the actual number we compute for every real cell in the dataset.
Our hypothesis (slide 3) says the real driver isn't time directly — it's the accumulated growth, p. So we write survival as a function of p instead, using a flexible curve shape that starts near 1 (safe) and drops toward 0 (division near-certain) once p passes some threshold:
Survival alone only tells half the story — the coral dashed curve in the figure is the hazard, h(p): the instantaneous risk of dividing at accumulated growth p, given the cell hasn't divided yet. It falls straight out of S(p) by differentiation, and it's exactly what powers the "risk right now" curve from a few slides back:
u (scale): roughly, how much growth is typically needed before division becomes likely.
v (sharpness): how abruptly risk switches on near that point — big v means "safe... then division!" small v means a slow ramp.
ω2 (flexibility): a second knob that bends the curve independently of v, so the model isn't locked into one fixed shape.
All three are just numbers we don't know yet — the whole second half of this deck is about how we estimate them from data.
So far we've built a story for when a cell divides. But two more numbers from slide 2 — growth rate α and split ratio f — also wobble from cell to cell, even in an identical environment. We need a probability distribution for each: a shape describing which values are common and which are rare.
Growth rate α is always positive and tends to cluster around a typical value with a long right tail. A Gamma distribution — a flexible shape used for exactly this kind of "positive, skewed" quantity — fits naturally.
Split ratio f is always between 0 and 1 (a daughter can't get more than 100% or less than 0% of the mother's mass). A Beta distribution, built specifically to live on the interval from 0 to 1, is the natural fit.
The "~" reads as "is drawn from." Both distributions are described here by an everyday mean (μ) and standard deviation (σ) so the parameters stay easy to reason about.
Time for a detour. Flip a coin 10 times and see: H H T H H H T H H H — 8 heads, 2 tails. Pick any candidate value for θ, the coin's true chance of heads, and ask: "if θ were the truth, how probable was exactly this sequence?"
That number, computed for a given θ, is the likelihood. It is not the probability that θ is correct — it's a score for how well a candidate θ explains data we've already fixed and observed. Slide the control below and watch which values score highest.
A coin had one unknown (θ) and 10 data points. Our cell model has seven unknowns — u, v, ω2, μα, σα, μf, σf — and thousands of observed cell cycles. The mechanics don't change at all, only the scale.
For one cell cycle, three separate scores get multiplied together: how probable its division time was (that's fdiv(t) from slide 4), how probable its growth rate was (the Gamma), and how probable its split ratio was (the Beta):
Treat every cell cycle as independent, and the likelihood of the whole dataset is just all those numbers multiplied together. Because multiplying thousands of tiny fractions causes computer underflow, we take logarithms first, turning the giant product into a sum:
One candidate set of seven numbers in → one single score out. That score is what the rest of this deck searches for the best (or most plausible) values of.
Before any fitting happens, each of the seven unknowns gets an allowed range — broad, and shaped loosely around the data's scale, but flat inside it. Every value inside the range starts out equally plausible; nothing outside it is considered at all. This starting assumption is called a prior.
Here the prior is mainly just a fence: it rules out impossible values (a negative growth rate, for instance) but rarely changes which candidate looks better once you're inside the fence — that ranking job belongs almost entirely to the likelihood.
Think of the prior as drawing the boundaries of a map before exploring it. It says where you're allowed to look, not which spot is best.
Maximum Likelihood Estimation (MLE) searches for the single combination of seven numbers that scores highest on the likelihood. Picture searching a hilly landscape blindfolded: feel the ground's slope under your feet, and always take a step uphill. Keep going until the ground is flat in every direction — that's a peak.
A landscape can have more than one hill. A blind climber can get stuck on a small local hill and never find the tallest one nearby.
To avoid that trap, the search is run from several different starting points, and only the highest peak found across all of them is kept.
MLE hands back exactly one best guess, with no sense of how confident to be in it. Its real job here is narrower: give the next technique, MCMC, a sensible place to start looking.
MLE only finds one point. But we don't just want a best guess — we want to know how confident to be in it. MCMC (Markov Chain Monte Carlo) answers that by building up a whole cloud of plausible parameter combinations instead of a single one.
The idea: propose a small random jump from the current guess to a nearby one, and score it with likelihood × prior. A jump to a better spot is (almost) always taken. A jump to a worse spot is taken sometimes, in proportion to how much worse it is — never simply thrown away. Do this thousands of times, and the trail of visited points ends up dense exactly where the data says plausibility is high, and sparse where it's low.
A random walk like this needs some quality control before its output can be trusted as a fair map of plausibility. Three checks matter most:
The first stretch of steps is still shaking off wherever the walk started. Like letting your eyes adjust before really looking around a dark room — those early steps are run, then thrown away entirely.
Once settled, every step from here on is kept. Across 40 parallel walkers running for 3,000 steps each, that's 120,000 recorded parameter combinations.
Consecutive steps resemble each other, like footprints close together. Keeping only every 5th step trims some of that redundancy before summarizing.
After the walk finishes, we have a giant cloud of recorded points in the seven-dimensional parameter space. Look at just one parameter's values across that whole cloud — ignore the other six — and its histogram is the answer for that parameter: the posterior distribution, meaning "what we believe about this parameter after (post) seeing the data."
From that histogram, two summaries matter most:
The median — the middle value — becomes the single best point estimate.
The 95% credible interval — the range between the 2.5th and 97.5th percentile of all recorded values — supports a direct statement: "given this model and this data, there's a 95% chance the true value falls in this range."
We hypothesized an accumulator, wrote it as a survival curve, gave the other traits their own distributions, learned what "likelihood" means from a coin, set boundaries with a prior, and built two ways to search the resulting seven-dimensional space — one for a quick best guess, one for the full picture of uncertainty.
Next: running the real dataset through this pipeline, checking whether the fitted model's simulated lineages actually resemble the real ones, and reporting what the posterior says about u, v, ω2, and the rest.
We unleashed 40 walkers for 3,000 steps each, discarding the first 1,000 as burn-in. We did this across 9 different nutrient environments (sugars).
Looking at the Glucose (Glu) dataset as an example: the walkers accepted proposed jumps ~46.5% of the time — right in the sweet spot between getting stuck (too low) and aimless wandering (too high).
The "trace plot" on the right shows the paths of all 40 walkers over time. We want to see a dense, overlapping "fuzzy caterpillar" with no single walker drifting away from the pack. This confirms they have successfully mapped the core plausibility zone.
The "Corner Plot" projects our 7D cloud of samples down into 2D slices. The diagonal shows the final posterior histogram for each parameter. The off-diagonal contours show how parameters correlate.
Notice the tilted egg shapes? For example, u and v are strongly negatively correlated. If the model assumes a slightly higher growth threshold (higher u), it must compensate by making the threshold sharper (higher v) to still fit the data well.
Scale (u): 1.53 [1.51, 1.55]
Sharpness (v): 8.37 [8.06, 8.69]
Flexibility (ω2): 0.61 [0.57, 0.65]
We take the best-fit parameters and run the lineage simulator for thousands of generations. We then overlay the simulated histograms (orange) on top of the real biological data (blue). If our model is a good representation of biology, the shapes should overlap almost perfectly.
We ran this entire process for 9 different carbon sources. By plotting the peak (mode) values of the shape parameters against each other, we see how the cell's underlying logic adapts to nutrient quality.
There is a clear structure here. Sugars that support very fast growth (like Glucose and Sucrose) tend to cluster with lower threshold scales (u) and higher flexibility (ω2). Cells in slower, poorer environments (like Rhamnose or Lactose) push the threshold u higher, meaning they demand more relative accumulated growth before committing to division.
The autocorrelation function (ACF) measures how much a step in our MCMC chain depends on the steps before it. High autocorrelation means the chain is slow to explore the parameter space because consecutive proposals are closely related.
For a reliable MCMC run, the correlation should decay toward zero as the lag increases. A rapid drop indicates that the chain is collecting independent samples from the target posterior distribution.
A fast decay suggests a higher effective sample size (ESS), meaning the posterior estimates are based on well-mixed, independent chains.