The Big Soup Problem

Your faction may be reported at 48%, but it may well sit closer to the 52% faction at the top of the standings than the published gap suggests.

Every major Warhammer 40,000 results aggregator reports faction strength the same way. Take some window — a weekend, a month, a points-update cycle — count every game a faction played in it, count the wins, and divide. Total wins over total games. It is the universal unit of the competitive conversation, and it feels like the obviously correct one: a win rate is a rate, and more games can only sharpen it.

That intuition is sound, but it rests on an assumption — and the single-number form gives a reader no way to see that the assumption is there. When you pool every game into one fraction, you are treating each game as an independent draw from a single, stable meta — one global pot of competitive 40k that every event ladles from. Statisticians call this a fixed-effects assumption. It is a real modelling choice, and like every modelling choice it can be right, wrong, or somewhere in between. The trouble is that the pooled win rate never announces that the choice was made.

Metas come in clusters

The fixed-effects assumption asks you to believe that a game played at a US open weekend and a game played at a European GT are draws from the same underlying process. They usually aren’t. Regional list pools diverge — the armies that are popular in one scene are not the armies popular in another. Terrain conventions differ between circuits, and terrain is one of the largest levers on which factions thrive. The skill distribution of the room differs: a 350-player major and a 28-player RTT are not sampling the same population of players. None of this is measurement noise. It is real, structural variation between events — and pooling is built to pretend it isn’t there.

When the variation between events is real, the pooled number is still an average, but it is an average of things that were never the same quantity. It has a value. What it doesn’t have is the stability that its single-number form implies.

Where the confidence interval goes wrong

Here is the part that does the quiet damage. A confidence interval around a pooled win rate shrinks as games accumulate — that is the whole appeal of a large sample. A faction at 53% across 4,000 pooled games carries a much tighter interval than the same 53% across 200 games.
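The shrinkage is easy to reproduce. The sketch below is a minimal illustration, assuming the common normal-approximation (Wald) interval at 95%; an aggregator may use a different interval formula, and the function name is hypothetical.

```python
import math

def wald_half_width(rate: float, games: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation (Wald) interval for a win rate.

    Assumes every game is an independent draw with the same win probability,
    which is exactly the pooling assumption under discussion.
    """
    return z * math.sqrt(rate * (1.0 - rate) / games)

for games in (200, 4000):
    half_width = wald_half_width(0.53, games)
    print(f"{games:>5} games: 53% +/- {100 * half_width:.1f} points")
```

Under these assumptions the 200-game interval is about ±6.9 points and the 4,000-game interval about ±1.5 points: the same 53%, with very different apparent precision.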

But that interval is only honest about one kind of uncertainty: the sampling noise within the pool. It answers the question “if this one big homogeneous meta really sits at 53%, how much would the observed number wobble?” It does not answer — cannot answer — “how much do events differ from each other?” If events are not exchangeable, the between-event variation is simply absent from the interval. The interval looks precise because it has been computed as though the messiest source of variation does not exist.

The consequence is direct: a pooled interval is, by construction, a floor on the true uncertainty, not an estimate of it. The honest interval is wider — sometimes a little, sometimes a lot — and the pooled report gives the reader no way to tell which.

Treating each event as a replicate

There is a standard alternative, and it is not exotic. It is the ordinary machinery of meta-analysis, the field built precisely around combining results from studies that were run in different places under different conditions.

Instead of one global rate, you assume the faction has a grand mean win rate, and each event has its own true rate scattered around that grand mean. The scatter has a size — the between-event standard deviation, conventionally written τ (“tau”). The honest summary of “how will this faction do” is then not a confidence interval over games but a prediction interval over events: given everything we’ve seen, where would the faction’s rate plausibly land at the next event? Two standard tools quantify the scatter directly — Cochran’s Q tests whether the between-event variation is larger than sampling noise alone would produce, and I² reports what fraction of the total variation is between-event rather than within.
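As a rough sketch of how Q and I² can be computed from per-event results, the code below works on raw win proportions with inverse-variance weights. Both choices are simplifying assumptions (published meta-analyses often work on a transformed scale), and the event counts and the function name are hypothetical.

```python
from typing import List, Tuple

def heterogeneity(events: List[Tuple[int, int]]) -> Tuple[float, float]:
    """Return (Cochran's Q, I-squared) for a list of (wins, games) events.

    Assumes every event's rate is strictly between 0 and 1, so that its
    binomial sampling variance is non-zero.
    """
    rates = [wins / games for wins, games in events]
    # Binomial sampling variance of each event's observed rate.
    variances = [r * (1 - r) / games for r, (_, games) in zip(rates, events)]
    weights = [1 / v for v in variances]
    # Inverse-variance weighted mean (close to, but not identical to,
    # total wins divided by total games).
    pooled = sum(w * r for w, r in zip(weights, rates)) / sum(weights)
    q = sum(w * (r - pooled) ** 2 for w, r in zip(weights, rates))
    df = len(events) - 1
    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    return q, i2

# Hypothetical counts, invented purely for illustration.
hypothetical_events = [(90, 200), (110, 200), (100, 200)]
q, i2 = heterogeneity(hypothetical_events)
print(f"Q = {q:.2f} on {len(hypothetical_events) - 1} df, I^2 = {100 * i2:.1f}%")
```

To turn Q into a p-value, compare it with a χ² distribution on one fewer degrees of freedom than there are events, for instance with `scipy.stats.chi2.sf(q, df)` if SciPy is available.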

This is still a modelling choice. A random-effects model has its own assumptions, chiefly that each event’s true rate is a draw from one common distribution around the grand mean, and they can also be wrong. The point is not that one method is correct and the other fraudulent. The point is that the two methods answer different questions, and only one of them is the question most readers think they are asking.

A worked example

Take a synthetic faction — call it Faction Alpha — across one points-update cycle with four large events. Suppose Alpha posts 43%, 49%, 55%, and 61% across the four, each event contributing seven to eight hundred games — about three thousand games in all.

Pool them and you get 52% over those three thousand games, with a confidence interval of roughly ±1.8 points: 52% ± 1.8, an interval that stays clear of 50%. The natural reading is “Alpha is an above-average faction, and we’re confident of it.”

Now look at the four events as four replicates. Their rates run from 43% to 61%, scattering around the 52% mean with a standard deviation of roughly eight points. At seven-hundred-odd games apiece, each event’s own rate is pinned down to within about ±4 points, so that scatter is not sampling noise; it is real, structural variation between the events. A plain prediction interval for the next event, taken as two standard deviations either side of the mean, runs roughly 52% ± 15: somewhere from the high 30s to the high 60s. With only four events that is a rough figure, and a formal small-sample interval would be wider still, not narrower.

Same data. Two summaries. The pooled view says “above 50%, and we’re confident.” The replicate view says “the central estimate is 52%, but the next event could reasonably hand Alpha a losing weekend or a dominant one, and a careful reader cannot rule either out.” The pooled “significantly above 50%” claim has quietly become “indistinguishable from 50% at the level of a single event.”
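The arithmetic behind the two summaries can be checked in a few lines. This is a sketch under one stated assumption: each event is given exactly 750 games, a stand-in for the “seven to eight hundred” in the example.

```python
import math
import statistics

rates = [0.43, 0.49, 0.55, 0.61]   # Alpha's four event rates from the example
games = [750, 750, 750, 750]       # assumption: 750 games at each event

# Pooled view: every game goes into one pot.
total_games = sum(games)
pooled_rate = sum(r * g for r, g in zip(rates, games)) / total_games
pooled_half_width = 1.96 * math.sqrt(pooled_rate * (1 - pooled_rate) / total_games)

# Replicate view: each event is one observation of the faction's rate.
mean_rate = statistics.mean(rates)
event_sd = statistics.stdev(rates)        # sample standard deviation
prediction_half_width = 2 * event_sd      # the plain two-SD rule used above

print(f"pooled:    {100 * pooled_rate:.0f}% +/- {100 * pooled_half_width:.1f}")
print(f"replicate: {100 * mean_rate:.0f}% +/- {100 * prediction_half_width:.1f}")
```

With these inputs the pooled half-width comes out near 1.8 points and the two-SD spread near 15.5 points, so the second summary is more than eight times as wide as the first.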

A constructed example will always look tidy. The effect it illustrates is not constructed: any time a faction’s per-event rates genuinely scatter, pooling reports a precision that the event-to-event behaviour does not have.

What this means for reading the standings

A significance test built on a pooled rate inherits the pooled rate’s assumption. When you read “Faction A is statistically ahead of Faction B,” the strength of that claim depends entirely on whether the events behind it were exchangeable — and a pooled report is structurally unable to show you whether they were. A claim resting on a pooled test is exposed, by the structure of the test, to exactly the between-event variation the pooling removed from view. That is not an accusation aimed at anyone. It is a property of the reporting unit.

So the takeaway is not “stop trusting aggregators.” Aggregators do careful, valuable work, and the pooled rate is a reasonable summary for many purposes. The takeaway is narrower and more useful: when two factions sit a few points apart in a pooled table, the gap between them is one of the least stable things on the page. The ranking is real; the precision of the ranking is mostly an artifact of the arithmetic.

One of the structural reasons events differ — terrain — is itself not fixed. The move to 11th edition is expected to push the competitive scene toward a more standardized terrain format, with more events sharing one set of conventions rather than spreading across several. If that consolidation happens, it would remove one of the reasons events vary from each other, and the fixed-effects assumption — pooling as though every game came from one meta — would rest on firmer ground than it does today.

That would not make pooling automatically safe. Regional list pools and the skill distribution of a given room still differ from event to event; terrain is only one input. And the effect itself is a hypothesis, not a result: whether a more uniform format actually shrinks between-event variation is an empirical question with an empirical answer. Once 11th-edition events accumulate, the same tools this article points at — Cochran’s Q, I² — measure it directly. If the between-event scatter shrinks, pooling earns back some of the trust this article has been spending; if it doesn’t, the format changed and the heterogeneity didn’t, which is worth knowing too. Either way, the methodology should follow the data rather than assume it.

The Archive’s job here is not to declare random-effects the one true method. “Big soup” — one global meta — is a defensible choice. “Localized metas” is also a defensible choice. They lead to different numbers and different intervals, and a reader deserves to know which one they are looking at. Knowing which choice was made, and what it costs, is the methodology. The number is just where the methodology comes out.

Sidebar — the machinery

For the reader who wants the technical layer:

Fixed-effects (pooled): assumes one true rate; the only uncertainty is binomial sampling within the pool. Interval width scales with 1/√N over the total game count.
Random-effects: assumes a grand mean μ and per-event true rates distributed around it with between-event standard deviation τ. The summary uncertainty combines the within-event standard errors with τ.
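Cochran’s Q: a heterogeneity test. It sums the inverse-variance-weighted squared deviations of the event rates from the pooled rate; compared against a χ² distribution with k − 1 degrees of freedom (k being the number of events), it asks whether the events vary by more than sampling noise alone predicts.
I²: the share of total variation attributable to between-event heterogeneity rather than within-event sampling, computed as (Q − (k − 1)) / Q and floored at 0%. I² near 0% means pooling is roughly safe; by the usual rule of thumb, values around 50% indicate moderate heterogeneity and values of 75% or more indicate heavy heterogeneity, where the pooled interval badly understates the uncertainty.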
Prediction interval vs confidence interval: a confidence interval bounds the grand mean; a prediction interval bounds the next event’s true rate. The prediction interval is the one that answers “what should I expect at a tournament,” and it is always the wider of the two.
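For a concrete version of the random-effects entries above, here is a minimal sketch using the DerSimonian-Laird moment estimator for τ². That estimator is one common choice, not one the article prescribes. The sketch also assumes raw proportions, 750 games per event, and a normal quantile of 1.96 for both intervals; with only a handful of events a t quantile is the more usual choice for the prediction interval and gives a wider result. The function name is hypothetical.

```python
import math
from typing import Dict, List

def random_effects_summary(rates: List[float], games: List[int]) -> Dict[str, float]:
    """DerSimonian-Laird random-effects summary of per-event win rates."""
    k = len(rates)
    # Within-event (binomial) sampling variance of each observed rate.
    variances = [r * (1 - r) / n for r, n in zip(rates, games)]
    w = [1 / v for v in variances]
    fixed_mean = sum(wi * r for wi, r in zip(w, rates)) / sum(w)
    q = sum(wi * (r - fixed_mean) ** 2 for wi, r in zip(w, rates))
    # Moment estimate of the between-event variance tau^2, floored at zero.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Random-effects weights add tau^2 to every event's sampling variance.
    w_star = [1 / (v + tau2) for v in variances]
    mu = sum(wi * r for wi, r in zip(w_star, rates)) / sum(w_star)
    se_mu = math.sqrt(1 / sum(w_star))
    return {
        "mu": mu,                                   # grand mean
        "tau": math.sqrt(tau2),                     # between-event SD
        "ci_half_width": 1.96 * se_mu,              # bounds the grand mean
        "pi_half_width": 1.96 * math.sqrt(tau2 + se_mu ** 2),  # next event
    }

summary = random_effects_summary([0.43, 0.49, 0.55, 0.61], [750, 750, 750, 750])
for name, value in summary.items():
    print(f"{name}: {100 * value:.1f} points")
```

On Faction Alpha’s numbers this gives a τ of roughly seven to eight points, and both intervals come out far wider than the pooled ±1.8, with the prediction interval the wider of the two.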
