Swiss Isn't Random — The Infinite Archive

Going 5–0 means you played the top of the bracket; going 1–4 means you played the bottom. A tournament isn’t a stack of independent games — it’s a sorting machine, and the standard statistics quietly assume it isn’t.

A win rate is built by counting games. A significance test built on top of it, the kind that decides whether “Faction A beats Faction B 58% of the time” is a real edge or just noise, does something more specific: it treats each of those games as one independent data point. A proportion test on 240 recorded games reads them as 240 independent draws. That independence is the load-bearing assumption underneath the result.

Swiss pairings quietly violate it.

Pairing is conditioned on performance

Everyone who has played a tournament knows the mechanic without thinking of it as one. After round one, you are paired against someone on your own record. Win, and round two is against another winner; lose, and it’s against another player who lost. Round by round, the bracket sorts the room — and who you play next is decided by how you have done so far.

That is the whole problem in one sentence. A win makes your next opponent stronger; a loss makes them weaker. The five games on your sheet are not five independent draws from “the field.” They are a path through the field, and each step of the path depends on the steps before it.

Consequence one: difficulty isn’t constant

Because Swiss sorts by record, a winning faction’s late-round games are played, on average, against stronger opposition than its early-round games. The faction that keeps winning keeps climbing into a tougher bracket. The faction that stumbles slides into a softer one.

So when a matchup is reported as “Faction A at 58%,” that 58% is not 58% against a fixed-difficulty field. It is 58% against a field whose difficulty rose as Faction A kept winning and fell as it lost. The pooled number averages together easy early-round games and hard late-round games as though they were the same kind of observation. They are not. This doesn’t make the number wrong — it makes it a blend, and a blend reported as a single rate hides the fact that its ingredients were unequal. (This is the thread Player or Army? picks up: separating the faction’s contribution from the strength of the bracket it climbed.)

Consequence two: the games cluster

The deeper issue is what clustering does to confidence.

Games from the same event share conditions — the same terrain, the same room of players, the same weekend’s metagame. Games from the same bracket of the same event share even more. Observations that share conditions are correlated: knowing the outcome of one tells you a little about the next, which is exactly what “independent” rules out.

Correlated observations carry less information than the same number of independent ones. Two hundred and forty games drawn from twenty events are not 240 independent facts about the matchup — they are something closer to 240 facts with a discount applied. Statisticians call the discounted figure the effective sample size, and the ratio between the nominal count and the effective count the design effect.

This is where the overstatement enters, and it enters by construction. A confidence interval narrows as the sample size grows — interval width scales with one over the square root of N. A significance test that plugs in the raw game count is using an N larger than the data actually supports. Its interval therefore comes out narrower than it should, and its p-value smaller. The test does not know the games were clustered; it reports a confidence the clustering does not entitle it to.
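The effect is easy to see in a small simulation. The sketch below is a toy model, not a model of Swiss pairing itself: it assumes each event has its own underlying win rate drawn from a Beta distribution (a beta-binomial setup), which is one simple way to make games within an event positively correlated. The function name simulate_coverage, the layout of twenty events with twelve games each, the true rate of 0.58, and the correlation of 1/11 (chosen so that twelve-game events give a design effect of 2) are all assumptions made for illustration. It counts how often a naive 95% interval, computed on the raw game count, actually contains the true rate.

```python
import math
import random

def simulate_coverage(rho, true_p=0.58, events=20, games_per_event=12,
                      trials=2000, seed=0):
    """Share of naive 95% intervals (raw game count) that contain true_p."""
    rng = random.Random(seed)
    n = events * games_per_event
    if rho > 0:
        # Beta-binomial: the intra-cluster correlation is 1 / (a + b + 1),
        # so a + b = 1 / rho - 1, split to keep the mean at true_p.
        total = 1.0 / rho - 1.0
        a, b = true_p * total, (1.0 - true_p) * total
    covered = 0
    for _ in range(trials):
        wins = 0
        for _ in range(events):
            # Assumed model: every game in an event shares that event's rate.
            p_event = rng.betavariate(a, b) if rho > 0 else true_p
            wins += sum(rng.random() < p_event for _ in range(games_per_event))
        p_hat = wins / n
        half_width = 1.96 * math.sqrt(p_hat * (1.0 - p_hat) / n)
        covered += abs(p_hat - true_p) <= half_width
    return covered / trials

independent = simulate_coverage(rho=0.0)
clustered = simulate_coverage(rho=1.0 / 11.0)
print(f"independent games: {independent:.3f}")
print(f"clustered games:   {clustered:.3f}")
```

If the reasoning above holds, the independent run should land near the nominal 95% and the clustered run noticeably below it: the same formula, the same number of games, and an interval that misses the truth more often than it claims to.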

A worked example

Take a synthetic matchup — Faction A against Faction B — observed at 58% across 240 games.

Run a standard single-proportion calculation (a 95% normal-approximation interval on the raw count) and the interval comes out at roughly ±6 points: 58% ± 6, an interval of about [52%, 64%]. It clears 50% comfortably. The natural reading is “Faction A holds a real edge in this matchup, and we’re confident of it.”

Now suppose those 240 games came from twenty events, twelve games apiece, and that games within an event carry a modest positive correlation, an intra-cluster correlation of about 0.09. The design effect is then roughly 1 + (12 − 1) × 0.09 ≈ 2, which means the effective sample size is not 240 but around 120. Recompute the interval on the honest effective count and it widens to about ±9 points: 58% ± 9, an interval of about [49%, 67%].

Same 240 games. The naive interval excludes 50% and calls the matchup a settled edge. The cluster-aware interval includes 50% and says the edge, while still the best estimate, cannot be cleanly distinguished from no edge at all. Nothing about the games changed — only whether the arithmetic was told they came from a sorting machine.
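The arithmetic behind both intervals fits in a few lines. This is a minimal sketch assuming a 95% normal-approximation (Wald) interval, which reproduces the ±6 and ±9 figures; the function name wald_interval is illustrative, and the design effect of 2 is the assumed value from the example, not something estimated from data.

```python
import math

def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation interval for a proportion observed over n games."""
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

p_hat = 0.58   # observed win rate from the synthetic example
games = 240    # nominal game count
deff = 2.0     # assumed design effect from the example

naive = wald_interval(p_hat, games)
cluster_aware = wald_interval(p_hat, games / deff)  # effective n = 120

print(f"naive:         [{naive[0]:.3f}, {naive[1]:.3f}]")                  # [0.518, 0.642]
print(f"cluster-aware: [{cluster_aware[0]:.3f}, {cluster_aware[1]:.3f}]")  # [0.492, 0.668]
```

Halving the effective sample size widens the interval by a factor of the square root of two, which is just enough here to pull the lower bound under 50%.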

Closing

None of this is a case against Swiss. Swiss is a good tournament format: it sorts a large room into a fair finishing order quickly, and it does that job well. The point is narrower. Swiss is not a random sample of games, and its output should not be read as one.

Independence is a convenience the standard tests assume because it makes the arithmetic clean — it is not a fact about tournament data. A careful significance claim has to do one of two things: earn the assumption, or account for its absence. A sorting machine, read honestly — with the clustering acknowledged and the effective sample size used — still tells you a great deal. It just tells you less, and less precisely, than a test that pretends the machine isn’t there.

Footnote — back to Six Bins

There is a second, subtler way Swiss breaks independence, and it runs the other direction. Inside a single player’s five-game run, the pairing correlation is negative: winning raises the next opponent’s strength, which pulls the player back toward the middle. The format actively fights streaks. This is why the independent-binomial picture in Six Bins is an approximation — the true distribution of a player’s record is slightly narrower than the binomial predicts, because Swiss is built to be self-correcting. The clustering described above (positive correlation, across the dataset) and this within-player effect (negative correlation, along one player’s path) are different consequences of the same pairing rule, at different scales.
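The within-player effect can be illustrated with a toy simulation. Everything in the sketch below is an assumption made for illustration: player strengths drawn from a normal distribution, a logistic win probability, a 32-player field, and a simplified Swiss that sorts by wins and pairs neighbours (rematches allowed, no byes). It follows one average-strength player and compares the variance of their five-round win count under record-based pairing and under fully random pairing.

```python
import math
import random
from statistics import pvariance

def win_prob(s_a, s_b):
    """Assumed model: logistic win probability for strength s_a against s_b."""
    return 1.0 / (1.0 + math.exp(s_b - s_a))

def focal_wins(rng, swiss, players=32, rounds=5, spread=1.0):
    """Play one event and return the win count of player 0 (strength 0)."""
    strength = [0.0] + [rng.gauss(0.0, spread) for _ in range(players - 1)]
    wins = [0] * players
    for _ in range(rounds):
        order = list(range(players))
        rng.shuffle(order)  # random pairing, and random tiebreak for Swiss
        if swiss:
            # Stable sort: players on the same record keep their shuffled order.
            order.sort(key=lambda i: -wins[i])
        for a, b in zip(order[0::2], order[1::2]):
            if rng.random() < win_prob(strength[a], strength[b]):
                wins[a] += 1
            else:
                wins[b] += 1
    return wins[0]

def record_variance(swiss, trials=2000, seed=1):
    """Variance of the focal player's win count across simulated events."""
    rng = random.Random(seed)
    return pvariance([focal_wins(rng, swiss) for _ in range(trials)])

random_var = record_variance(swiss=False)
swiss_var = record_variance(swiss=True)
print(f"random pairing: {random_var:.2f}")
print(f"Swiss pairing:  {swiss_var:.2f}")
```

Under random pairing the variance should sit close to the binomial value of 5 × 0.5 × 0.5 = 1.25; under the simplified Swiss it should come out lower. How much lower depends heavily on the assumed spread of strengths, so the size of the gap in this toy model says nothing about any real field.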

Sidebar — the machinery

For the reader who wants the technical layer:

Independence / IID: standard proportion and two-proportion tests assume every observation is an independent, identically distributed draw. It is the assumption that makes the interval formula valid.
Intra-cluster correlation (ρ): the degree to which observations within the same cluster (here, an event or a bracket) resemble each other beyond chance.
Design effect (DEFF): the factor by which clustering inflates the variance, approximately 1 + (m̄ − 1)·ρ, where m̄ is the average cluster size. A design effect of 2 means the variance is twice what an independent sample would give.
Effective sample size: the nominal sample size divided by the design effect — the count of independent observations the clustered data is actually worth. Confidence intervals and p-values should be computed on this figure, not the raw count.
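
The sidebar quantities can be estimated from per-event results. The sketch below uses the one-way ANOVA estimator of ρ, which is one common choice among several and is not prescribed by anything above. The function names and the (wins, games) input format are hypothetical, and the sample numbers are made up for illustration rather than taken from real events.

```python
def anova_icc(events):
    """One-way ANOVA estimate of rho from a list of (wins, games) per event.

    Undefined (division by zero) when every game has the same result.
    """
    k = len(events)
    n = sum(g for _, g in events)
    p_bar = sum(w for w, _ in events) / n
    # Between-event and within-event sums of squares for 0/1 outcomes.
    ss_between = sum(g * (w / g - p_bar) ** 2 for w, g in events)
    ss_within = sum(w * (1.0 - w / g) for w, g in events)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    # Adjusted cluster size; equals the common size when all events match.
    m0 = (n - sum(g * g for _, g in events) / n) / (k - 1)
    return (ms_between - ms_within) / (ms_between + (m0 - 1.0) * ms_within)

def design_effect(mean_cluster_size, rho):
    """DEFF = 1 + (mean cluster size - 1) * rho."""
    return 1.0 + (mean_cluster_size - 1.0) * rho

def effective_sample_size(n, deff):
    """Nominal count divided by the design effect."""
    return n / deff

# Made-up per-event results as (wins, games); not real tournament data.
events = [(9, 12), (5, 12), (8, 12), (6, 12), (10, 12), (4, 12)]
rho = max(0.0, anova_icc(events))  # clamping negative estimates to 0 is a choice
n = sum(g for _, g in events)
deff = design_effect(n / len(events), rho)
print(f"rho={rho:.3f}  DEFF={deff:.2f}  effective n={effective_sample_size(n, deff):.1f}")
```

With only a handful of events the estimate of ρ is itself noisy, so the resulting effective sample size is better read as a rough correction than as a precise figure.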