Why consensus matters

The Humans of Humaniti · 2026-04-22 · 6 min

I. The problem consensus exists to solve

Most AI labeling pipelines verify quality by inspection. A vendor labels the data. A second person at the vendor — the QA reviewer — checks a sample. The buyer trusts the sample. This is fine when the labels are easy and the QA reviewer is sober. It is corrosive everywhere else.

Inspection has three failure modes that are not fixable from inside the inspection model. The first is that the QA reviewer sees the original answer. They are anchoring on it. Their disagreement rate with the original labeler is suppressed because they are not labeling fresh — they are reviewing, which is psychologically different work. The second is that the QA reviewer is paid by the same vendor. There is no version of the relationship where the reviewer's incentive is to find their own employer's mistakes at scale. The third is that the QA reviewer is one person. One person does not have a calibration on a dataset. One person has an opinion.

Consensus inverts all three. Multiple Humans label the same item independently, blind to each other's answers. Disagreement is computed between independent labels, not between an answer and its reviewer. The Humans are not paid by the buyer or by each other. They are paid by the Network, which has no stake in any single label being correct. The consensus aggregator works on the population of labels, not on a single judgment. The signal is the distribution, not the opinion.

II. The shape of the math

The first instinct is plain majority vote. Three Humans label, the answer is whichever label gets two votes, done. This is a bad aggregator and the literature has known it is a bad aggregator since 1979.

Plain majority vote treats every Human as equally accurate and every disagreement as noise. Both assumptions are wrong. Some Humans are systematically more accurate on this skill than others — that is what per-skill Elo measures, and it is a fact about the population, not a moral judgment about the Human. And disagreement is rarely random. It is correlated with item difficulty, with prompt ambiguity, with edge cases the schema did not anticipate. Throwing it away as noise is throwing away the most informative thing in the dataset.

Three families of methods do real work.

Dawid-Skene (1979) estimates each Human's confusion matrix from their full label history. If a Human is biased toward "yes" on borderline cases, the confusion matrix knows it. The aggregator then weighs the Human's vote accordingly. A Human whose confusion matrix is close to diagonal, meaning they rarely mistake one label for another, counts more than a Human whose confusion matrix is skewed.
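To make the confusion-matrix idea concrete, here is a minimal sketch of the Dawid-Skene EM loop using NumPy. It is an illustration under stated assumptions, not the Network's aggregator: the function name `dawid_skene`, the `smoothing` constant, and the fixed iteration count are all hypothetical choices, and the sketch assumes every item index has at least one label.

```python
import numpy as np

def dawid_skene(labels, n_classes, n_iter=50, smoothing=0.01):
    """Hypothetical sketch of Dawid-Skene EM.

    labels: list of (item, annotator, label) integer triples, 0-indexed.
    Returns (post, conf) where
      post[i, k]    = estimated P(true class of item i is k)
      conf[a, k, l] = estimated P(annotator a reports l | true class is k)
    """
    n_items = 1 + max(i for i, _, _ in labels)
    n_annotators = 1 + max(a for _, a, _ in labels)

    # Initialise the posteriors from per-item vote fractions (soft majority vote).
    post = np.zeros((n_items, n_classes))
    for i, _, l in labels:
        post[i, l] += 1.0
    post /= post.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class priors and one confusion matrix per annotator.
        prior = post.mean(axis=0)
        conf = np.full((n_annotators, n_classes, n_classes), smoothing)
        for i, a, l in labels:
            conf[a, :, l] += post[i]
        conf /= conf.sum(axis=2, keepdims=True)

        # E-step: posterior over the true class, in log space for stability.
        log_post = np.tile(np.log(prior + 1e-12), (n_items, 1))
        for i, a, l in labels:
            log_post[i] += np.log(conf[a, :, l])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)

    return post, conf
```

Given triples from three Humans where one always answers 1, the returned `conf` for that Human puts almost all of its mass in the "reports 1" column regardless of the true class, which is exactly the bias the aggregator then discounts.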

MACE (Hovy et al. 2013) explicitly models spamming: the case where a Human is not really labeling, just clicking through. MACE estimates, for each Human, how often they are genuinely trying and how often they are spamming, and discounts their votes in proportion. This matters because in any open marketplace some fraction of participants are not really participating, and the aggregator needs to be robust to that.

CAZ peer prediction (Cai, Agarwal, Zhang 2015) rewards Humans for being informative about what other Humans will say. The mechanism is designed so that random clicking is a losing strategy in expectation — the only way to maximise expected reward is to label honestly. This is the property that makes the system gaming-resistant in the limit. We do not just check labels. We check that the labeler's strategy was the honest one.
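The "random clicking loses in expectation" property can be illustrated with a much simpler score than the mechanism named above. The sketch below is not CAZ. It is a plain multi-task agreement score: reward for agreeing with a peer on the same item, minus the rate at which the two labelers agree on unrelated items. The function name and the dict-based input format are assumptions made for the example.

```python
from itertools import product

def agreement_score(mine, peer):
    """Hypothetical agreement-minus-baseline score for one labeler against one peer.

    mine, peer: dicts mapping item id -> label.
    Returns (agreement rate on the same item) minus
            (agreement rate across pairs of different items).
    """
    shared = sorted(set(mine) & set(peer))
    if len(shared) < 2:
        raise ValueError("need at least two shared items")

    # Reward term: how often the two labelers agree on the same item.
    same_item = sum(mine[i] == peer[i] for i in shared) / len(shared)

    # Baseline term: how often my label on item i matches the peer's label
    # on a different item j. Blind strategies score the same here as above.
    cross_pairs = [(i, j) for i, j in product(shared, shared) if i != j]
    cross_item = sum(mine[i] == peer[j] for i, j in cross_pairs) / len(cross_pairs)

    return same_item - cross_item

peer = {0: "a", 1: "b", 2: "a", 3: "b"}
honest = dict(peer)                    # agrees with the peer on every item
constant = {i: "a" for i in peer}      # clicks "a" without looking

print(agreement_score(honest, peer))   # about 0.667
print(agreement_score(constant, peer)) # 0.0
```

A labeler who always clicks the same answer matches the peer on unrelated items exactly as often as on the shared item, so the two terms cancel and the score is zero. Only labels that carry information about the item itself can beat the baseline.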

The aggregator combines all three. We do not run them independently and pick the winner. We compose them. Dawid-Skene gives the per-Human weighting. MACE gives the spam discount. CAZ gives the strategy-honesty signal. The output is a consensus label and a confidence score. The confidence score is what trips dispute, gold-question audits, and Steward escalation.

III. Why disagreement is the most useful number

The temptation in any verification system is to optimise for agreement. We want every label to be unanimous. Disagreement is failure. This is exactly backwards.

In a well-calibrated dataset, the disagreement rate tells you the difficulty distribution of the items. Items that every Human labels the same way are easy items. Items that split the pool are the items at the boundary of the schema, the items that the model will struggle with at inference time, the items that need either a richer schema or an expert adjudicator. Throwing those items out — or worse, picking a winner by majority and pretending the disagreement did not happen — is throwing away the part of the dataset that actually shapes the model's behavior on hard cases.

The Network treats disagreement as a routing signal, not a problem. High inter-Human-agreement items ship into the dataset directly. Low-agreement items route to T3 Stewards for adjudication. The Steward's outcome is recorded, the disagreement is recorded, the original labels are recorded. None of it is thrown away. The downstream training pipeline gets to see the full distribution of labels per item, not just the consensus answer. That is far more useful than a clean dataset where the dirty parts have been quietly hidden.
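A minimal sketch of routing on disagreement, assuming agreement is measured as the Shannon entropy of the raw label distribution for an item. The function names, the record layout, and the `max_entropy_bits` threshold are hypothetical. A real pipeline would presumably route on the aggregator's confidence score rather than on raw vote counts.

```python
from collections import Counter
from math import log2

def label_entropy(labels):
    """Shannon entropy, in bits, of the empirical label distribution for one item."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def route(labels, max_entropy_bits=0.5):
    """Decide where an item goes, keeping every original label in the record.

    max_entropy_bits is an illustrative threshold, not a recommended value.
    """
    entropy = label_entropy(labels)
    return {
        "labels": list(labels),                 # nothing is thrown away
        "distribution": dict(Counter(labels)),  # full per-item distribution
        "entropy_bits": entropy,
        "route": "ship" if entropy <= max_entropy_bits else "steward",
    }

print(route(["cat", "cat", "cat"])["route"])  # ship
print(route(["cat", "cat", "dog"])["route"])  # steward
```

The record carries the labels and the distribution alongside the routing decision, so a downstream consumer sees the split and not only the outcome.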

IV. Why a single inspector cannot replace this

It is tempting to think you could replace the consensus pool with one very good labeler. You cannot, for two reasons.

The first is the calibration problem. One labeler does not have a calibration on a dataset. The labeler has biases, which they cannot measure from inside their own head. The only way to measure a labeler's biases is to compare their labels against an independent population's labels. The consensus pool is the population. A single labeler is a sample of size one.

The second is the accountability problem. A single labeler can be wrong, and there is no way to know. A pool of independent labelers can be confidently wrong only when its errors are correlated: they all anchor on the same misreading of the prompt, or they all share the same cultural assumption. Unless every labeler shares the error, it shows up in the disagreement structure, and the aggregator can detect it because it is comparing across the population. A single labeler emits errors that look identical to correct answers from outside.

V. Why this is not a luxury feature

A buyer can choose N=1 in our system. We support it. It is the cheapest tier and is appropriate for low-stakes work — content moderation drafts, exploratory tagging, internal experiments. We make the choice available because not every dataset deserves the cost of consensus.

But for any work where the model's outputs route a real-world decision (a court filing, a medical recommendation, a self-driving lane change, a credit determination), the math says you cannot trust a single labeler. You also cannot trust two: when two labels disagree, there is nothing to break the tie. The minimum useful N is three, and for high-stakes work the right N is five or seven. The cost scales linearly with N. The confidence scales nonlinearly. The confidence curve as a function of N will appear in the methodology appendix, which we will publish as a separate note when the curve has more data behind it than it has today.
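The linear-cost, nonlinear-confidence claim can be illustrated with a toy model. The sketch below assumes binary items, plain majority vote, and labelers who are independent and each correct with the same probability p. None of these assumptions hold exactly in practice, the value p = 0.8 is illustrative rather than measured, and this is not the confidence curve from the methodology appendix.

```python
from math import comb

def majority_accuracy(n, p):
    """Probability that a strict majority of n independent labelers,
    each correct with probability p, picks the right answer on a binary item."""
    if n % 2 == 0:
        raise ValueError("n must be odd so that ties cannot occur")
    # Sum the binomial probabilities of every outcome with more than n/2 correct votes.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )

for n in (1, 3, 5, 7):
    print(n, round(majority_accuracy(n, 0.8), 3))
# 1 0.8
# 3 0.896
# 5 0.942
# 7 0.967
```

Under these assumptions each additional pair of labelers costs the same but buys a smaller gain than the last: roughly 10 points going from one to three, under 5 going from three to five, and about 2.5 going from five to seven. Correlated errors flatten the curve further, which is one reason the toy model should not be read as a guarantee.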

The buyer who wants only the cheap tier is a buyer we are happy to serve. The buyer who wants the cheap tier on stakes-bearing work is a buyer we are happy to lose. The Network is built so that the choice is visible, the math is in the open, and the consequences of the choice are auditable in the dataset itself.

— The Humans of Humaniti