§ How it works

How a label becomes verified.

A Builder publishes a task. Verified Humans label it. Consensus aggregates the labels. Per-skill Elo weights each Human's vote. A Steward adjudicates disagreement. Gold questions catch drift. The signed dataset ships. The Humans get paid. That is the entire loop, end to end.

§ 01 — Lifecycle

From request to receipt.

  1. Builder publishes
  2. Assigned to N Humans
  3. Humans label
  4. Consensus aggregates
  5. Dispute → Steward
  6. Signed dataset
  7. Humans paid

A Builder defines the task (its prompt, label schema, consensus N, and quality thresholds) and publishes it. The qualified pool is filtered by tier, skill, language, and region. Each item routes to N Humans, who label it independently. The aggregator weights each label by the Human's per-skill Elo, accounts for known per-Human bias, and returns a consensus label plus a confidence score.

If confidence is high, the label ships into the signed dataset and the Humans get paid. If confidence is low or a Human appeals, the item enters dispute and, if it stays unresolved, routes to a Steward. The Steward's outcome is final and is written to the append-only audit log.
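
The aggregation and routing step can be sketched in a few lines. This is an illustration under stated assumptions, not the Network's aggregator: the Elo-to-weight mapping, the 1500 baseline, the 0.8 confidence threshold, and the names `elo_weight`, `aggregate`, and `route` are all hypothetical, and per-Human bias correction is left out.

```python
from collections import defaultdict

def elo_weight(elo: float, baseline: float = 1500.0, scale: float = 400.0) -> float:
    """Map an Elo rating to a positive vote weight (assumed mapping)."""
    return 10 ** ((elo - baseline) / scale)

def aggregate(votes: list[tuple[str, float]]) -> tuple[str, float]:
    """votes holds (label, elo) pairs. Returns (consensus label, confidence)."""
    if not votes:
        raise ValueError("at least one vote is required")
    totals: dict[str, float] = defaultdict(float)
    for label, elo in votes:
        totals[label] += elo_weight(elo)
    winner = max(totals, key=totals.get)
    # Confidence here is the winner's share of the total vote weight.
    return winner, totals[winner] / sum(totals.values())

def route(confidence: float, appealed: bool, threshold: float = 0.8) -> str:
    """Ship confident, unappealed labels; send everything else to dispute."""
    if appealed or confidence < threshold:
        return "dispute"
    return "ship"

label, confidence = aggregate([("cat", 1900.0), ("dog", 1500.0), ("dog", 1500.0)])
print(label, round(confidence, 3), route(confidence, appealed=False))
```

In this toy run one high-Elo Human outweighs two baseline Humans, which is the behaviour plain majority vote cannot express.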

§ 02 — The Humans

Quality is a ladder.

T0

Signed up

Account created. Network browseable. No tasks until phone verified.

T1

Phone verified

Low-stakes work unlocked. Per-skill Elo seeded by qualification arena.

T2

ID verified

Standard work and higher per-label rates. Sumsub liveness + ID check.

T3

Reputation verified

Sensitive work, Steward eligibility, dispute adjudication. Highest rates.

On top of the four tiers, each Human carries a per-skill Elo for skills such as bounding boxes (bbox), text NER, audio transcription, RLHF preference, and so on. New Humans seed their Elo through a qualification arena of gold-graded items. Verified accuracy raises the score. Drift lowers it. Higher Elo unlocks higher-paying work in that specific skill.
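
One way to picture the per-skill score is the standard Elo update, with each gold-graded item treated as the opponent. The sketch below assumes that framing; the item difficulty rating, the K-factor of 32, the 1500 seed, and the function names are hypothetical and not taken from the Network.

```python
def expected_score(human_elo: float, item_difficulty: float) -> float:
    """Standard Elo expectation that the Human answers the item correctly."""
    return 1.0 / (1.0 + 10 ** ((item_difficulty - human_elo) / 400.0))

def update_elo(human_elo: float, item_difficulty: float, correct: bool, k: float = 32.0) -> float:
    """Move the rating toward the observed outcome on one gold-graded item."""
    outcome = 1.0 if correct else 0.0
    return human_elo + k * (outcome - expected_score(human_elo, item_difficulty))

# Ratings are kept per skill, so a miss in one skill does not touch another.
ratings = {"bbox": 1500.0, "text_ner": 1500.0}
ratings["bbox"] = update_elo(ratings["bbox"], item_difficulty=1500.0, correct=True)
ratings["text_ner"] = update_elo(ratings["text_ner"], item_difficulty=1500.0, correct=False)
print(ratings)
```

A correct answer on an evenly matched item raises the rating by half of K, and a miss lowers it by the same amount.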

§ 03 — The Builders

What a Builder controls.

  1. N-Human consensus.
    1 (cheap-draft) / 3 (default) / 5 (high-stakes) / 7 (max). Cost scales linearly with N. Confidence scales nonlinearly.
  2. Gold-question density.
    Percentage of items in each batch that are known-answer plants. Higher density tightens calibration; lower density saves cost.
  3. Target agreement threshold.
    The inter-Human agreement (IAA) score below which a label routes to dispute. Tunable per project. Defaults tuned per task type.
  4. Language and region constraints.
    Restrict the qualified pool by language, region, or device. Required for locale-sensitive work.
  5. Steward escalation thresholds.
    Disagreement count, time-on-task, or gold-question failure rate: any of these can trip Steward review. All defaults configurable.
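
The five controls above map naturally onto a small configuration object. The following is a sketch only: the `TaskConfig` class, its field names, and every default except the consensus N of 3 are hypothetical, and the Network's actual task schema may differ.

```python
from dataclasses import dataclass

ALLOWED_N = (1, 3, 5, 7)  # cheap-draft, default, high-stakes, max

@dataclass(frozen=True)
class TaskConfig:
    prompt: str
    label_schema: tuple[str, ...]
    consensus_n: int = 3
    gold_density: float = 0.05           # hypothetical default
    agreement_threshold: float = 0.7     # hypothetical default
    languages: tuple[str, ...] = ()      # empty means no restriction
    regions: tuple[str, ...] = ()
    max_disagreements: int = 1           # hypothetical Steward trigger
    max_gold_failure_rate: float = 0.2   # hypothetical Steward trigger

    def __post_init__(self) -> None:
        if self.consensus_n not in ALLOWED_N:
            raise ValueError(f"consensus_n must be one of {ALLOWED_N}")
        if not 0.0 <= self.gold_density < 1.0:
            raise ValueError("gold_density must be in [0, 1)")
        if not 0.0 <= self.agreement_threshold <= 1.0:
            raise ValueError("agreement_threshold must be in [0, 1]")

    def cost_per_item(self, rate_per_label: float) -> float:
        """Cost scales linearly with N: one paid label per assigned Human."""
        return self.consensus_n * rate_per_label

config = TaskConfig(prompt="Is this a cat?", label_schema=("cat", "dog"), consensus_n=5)
print(config.cost_per_item(rate_per_label=0.10))
```

Validating in `__post_init__` keeps an invalid N from ever reaching the router.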

§ 04 — Consensus

Disagreement is data.

Plain majority vote is a bad aggregator. It treats every Human as equally accurate and every disagreement as noise. The Network treats disagreement as signal. Three families of methods do the work. Dawid-Skene estimates each Human's confusion matrix from their full label history and weights their vote accordingly. MACE separates random spamming from honest disagreement. CAZ peer-prediction rewards Humans for being informative about what other Humans will say, which makes random clicking a losing strategy.

The aggregator returns two outputs: the consensus label and a confidence score. The confidence score is what triggers dispute, gold-question audits, and Steward escalation. The math stays inside the Network. The Builder sees the label, the confidence, the audit trail, and the receipts.
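
For readers who want a feel for the first family, here is a compact textbook-style Dawid-Skene EM loop in plain Python. It is a sketch, not the Network's implementation: the function name, the smoothing constant, the fixed iteration count, and the toy data are assumptions, and MACE and peer prediction are not shown.

```python
def dawid_skene(labels: dict[str, dict[str, str]], classes: list[str],
                iterations: int = 20, smoothing: float = 0.01) -> dict[str, dict[str, float]]:
    """labels maps item -> {worker: label}. Returns item -> {class: posterior}.
    Assumes every item has at least one label drawn from `classes`."""
    # Initialise each item's posterior from its raw vote shares.
    post = {item: {k: sum(1 for l in votes.values() if l == k) / len(votes) for k in classes}
            for item, votes in labels.items()}
    workers = {w for votes in labels.values() for w in votes}
    for _ in range(iterations):
        # M-step: class priors and one confusion matrix per worker.
        prior = {k: sum(p[k] for p in post.values()) / len(post) for k in classes}
        confusion = {w: {k: {l: smoothing for l in classes} for k in classes} for w in workers}
        for item, votes in labels.items():
            for w, l in votes.items():
                for k in classes:
                    confusion[w][k][l] += post[item][k]
        for w in workers:
            for k in classes:
                row_total = sum(confusion[w][k].values())
                for l in classes:
                    confusion[w][k][l] /= row_total
        # E-step: recompute posteriors from the priors and confusion matrices.
        for item, votes in labels.items():
            scores = {}
            for k in classes:
                score = prior[k]
                for w, l in votes.items():
                    score *= confusion[w][k][l]
                scores[k] = score
            total = sum(scores.values())
            post[item] = {k: scores[k] / total for k in classes}
    return post

toy = {
    "i1": {"a": "cat", "b": "cat", "c": "cat"},
    "i2": {"a": "cat", "b": "cat", "c": "cat"},
    "i3": {"a": "dog", "b": "dog", "c": "dog"},
    "i4": {"a": "dog", "b": "dog", "c": "dog"},
    "i5": {"a": "cat", "b": "cat", "c": "dog"},
}
posteriors = dawid_skene(toy, ["cat", "dog"])
print(max(posteriors["i5"], key=posteriors["i5"].get))
```

The per-item posterior doubles as a confidence score, which is why a model of this kind can return a label and a confidence together.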

§ 05 — Anti-gaming

Why this can't be cheated.

  • Behavioral telemetry — keystroke timing, pointer paths, dwell, scroll. Anomalies route to review.
  • Text-detection ensemble — multi-model classification of generated text in free-response fields.
  • Gold-question injection — known-answer items mixed continuously into every batch, blind to the Human.
  • Wallet and device clustering — fingerprint and payout-rail signals collapse Sybil rings.
  • Steward audits — T3 Humans periodically re-grade samples and review flagged sessions.
  • Phase 3 slashing — staked credits are slashable for confirmed gaming once the credit unit ships.

§ 06 — Disputes

What happens when Humans disagree.

  1. Mandatory rejection reason.
    Any Human who disagrees with a peer must record the reason. Free-text plus a structured taxonomy.
  2. 72-hour appeal window.
    The original Human is notified. They can accept, revise, or escalate within 72 hours. Silence accepts the consensus.
  3. Blind re-annotation.
    If escalated, the item routes to a fresh pool of Humans who see the prompt without prior labels or rejection notes.
  4. Steward adjudication.
    A T3 Steward reviews the item, the consensus history, and the appeal. The outcome is final and posts to the audit log.
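
The appeal-window step above behaves like a small state machine. The sketch below models only that step, under assumptions: the state names, the `Response` enum, and the `next_step` function are hypothetical, and blind re-annotation and Steward adjudication are represented only by the state an escalation leads to.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional

APPEAL_WINDOW = timedelta(hours=72)

class Response(Enum):
    ACCEPT = "accept"
    REVISE = "revise"
    ESCALATE = "escalate"

def next_step(notified_at: datetime, now: datetime,
              response: Optional[Response] = None,
              responded_at: Optional[datetime] = None) -> str:
    """Decide what happens to a disputed label after the Human is notified."""
    deadline = notified_at + APPEAL_WINDOW
    in_time = response is not None and responded_at is not None and responded_at <= deadline
    if not in_time:
        # Silence accepts the consensus, but only once the window has closed.
        return "consensus_accepted" if now > deadline else "awaiting_response"
    if response is Response.ESCALATE:
        return "blind_reannotation"
    if response is Response.REVISE:
        return "label_revised"
    return "consensus_accepted"

notified = datetime(2024, 1, 1, 9, 0)
print(next_step(notified, notified + timedelta(hours=73)))
print(next_step(notified, notified + timedelta(hours=10),
                Response.ESCALATE, notified + timedelta(hours=10)))
```

Passing `now` explicitly instead of reading the clock keeps the rule easy to test and to replay from an audit log.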

§ 07 — Methodology

Read deeper.

The shape above is the public model. The implementation has more — confidence calibration curves, Elo prior selection, gold-question generation policy. The Field notes go further into the math.

Read the methodology →