AHD · Methodology

How we measure.

The headline numbers on the home page deserve a plain-English explanation. This is that.

The pairing

Every measured run takes one brief, resolves it against one style token, and runs it through a set of models in two conditions. Both conditions send the same brief as the user prompt. The difference is what goes in the system slot.

Example brief: briefs/landing.yml

The brief used in the 22 April run, rendered field-by-field. Each field's string value is what the model receives: intent, audience, and surfaces are flattened into the user prompt string verbatim. mustInclude and mustAvoid become bulleted lists inside the same string. The token field names which compiled system prompt to pair the brief with under the compiled condition.

Snapshot caveat: the brief is reproduced verbatim so verify-replay still validates the published 22 April hashes. The framework has grown since: today the source linter runs 38 rules (35 HTML/CSS + 3 SVG), 14 vision-only rules and 6 mobile-audit rules, with 10 shipped tokens. A re-run today would point at a fresh brief with current numbers.

intent

A one-page landing for AHD itself (Artificial Human Design). The visitor is a working designer, a senior engineer, or a founder who has seen too many AI-generated sites in the last month and already suspects the problem. They do not need to be sold on the slop thesis; they need to be shown that AHD takes it seriously and does something about it. One screen of reading. A terminal-aesthetic code block showing the one-liner that runs the eval. A manifest section that lists exactly what ships today versus what is gated on external resources. No testimonials, no pricing tier, no "Trusted by" strip.

audience

Working designers, senior engineers, and early-stage founders who already have an opinion about AI-generated design and would like a framework rather than a lecture.

token

swiss-editorial

surfaces

web

mustInclude

A link to github (or the forgejo mirror) labelled as the read-the-code CTA.
The command ahd eval-live swiss-editorial --brief briefs/landing.yml --models <specs> --n 10 rendered in a monospace code block with the language tag bash.
A clearly separated manifest section titled 'What ships' that enumerates: brief compiler, 28-rule linter, eval harness, vision-critic scaffold, MCP server, editor plugins, eight tokens.
A clearly separated manifest section titled 'What is gated' that enumerates: live frontier-model calls (needs API keys), live vision critique (needs multimodal key + screenshot pipeline), standalone npm packages for the editor plugins, additional tokens.
A footer that names the licence explicitly: FSL-1.1-Apache-2.0 for code, CC-BY-4.0 for tokens and artwork.

mustAvoid

any variant of 'build the future of'
any variant of 'ship faster' or 'AI-native'
gradient text on the word AI
three equal feature cards
centred hero stack with pill badge + two CTAs
fake testimonials
a Trusted by logo bar
emoji bullets
Lucide icons in gradient tiles
purple-to-blue hero gradient
AI shimmer
Corporate Memphis illustration
an iridescent 3D blob
a canonical 4-column Product / Company / Resources / Legal footer

The canonical YAML source is in briefs/landing.yml in the framework repository. The rendering above shows the same values a parser would produce; the repo is authoritative.

The raw condition sends the brief as the user prompt and puts nothing in the system slot. The brief itself is the intent, the audience, the surfaces, the things it explicitly wants and the things it explicitly bans, serialised as plain prose. No style token, no AHD system prompt, no forbidden list, no role framing.

The compiled condition sends the same brief prose in the user slot, unchanged, and AHD's token-specific system prompt in the system slot. The system prompt names the style direction (grid, typography, palette, motion policy), names the forbidden patterns from the taxonomy and asks the model to cite the rule it is following in an inline comment per decision.

The only thing that differs between raw and compiled is what AHD puts in the system slot. The brief is identical, the model is identical, the token budget is identical and the seed, where the provider exposes one, is identical. This is the controlled comparison the framework exists to support.
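The pairing can be sketched in a few lines. This is an illustration only, not the AHD runner: the `Request` type, the `build_pair` function and the placeholder strings are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    """One model call: a condition label, the system slot, the user slot."""
    condition: str
    system: Optional[str]  # None means nothing is sent in the system slot
    user: str


def build_pair(brief_prose: str, compiled_system_prompt: str) -> tuple[Request, Request]:
    """Return the (raw, compiled) requests for one brief.

    Both requests carry the same user prompt; only the system slot differs.
    """
    raw = Request(condition="raw", system=None, user=brief_prose)
    compiled = Request(
        condition="compiled",
        system=compiled_system_prompt,
        user=brief_prose,
    )
    return raw, compiled


# Placeholder strings stand in for the flattened brief and the token's prompt.
raw, compiled = build_pair("<brief prose>", "<compiled system prompt>")
```

Keeping the user prompt as a single shared value makes it hard for the two conditions to drift apart by accident.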

A cell

A cell is a specific combination of one model and one condition. The 22 April 2026 n=30 run used ten models and two conditions, so twenty cells. Each cell received thirty samples, written as n=30 per cell. Ten models times two conditions times thirty samples is six hundred generated HTML pages. Each was linted. The tell counts were averaged per cell. The delta is the per-model difference between the raw-condition mean and the compiled-condition mean.
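The arithmetic above, as a small sketch. The function names and the sample tell counts are made up for illustration, and the sign convention (raw minus compiled, so a positive delta means fewer tells under the compiled prompt) is an assumption rather than something the reports are confirmed to use.

```python
from statistics import mean

MODELS = 10
CONDITIONS = ("raw", "compiled")
SAMPLES_PER_CELL = 30

cells = MODELS * len(CONDITIONS)   # 10 models x 2 conditions = 20 cells
pages = cells * SAMPLES_PER_CELL   # 20 cells x 30 samples = 600 pages


def cell_mean(tell_counts: list[int]) -> float:
    """Average number of linter tells per page within one cell."""
    return mean(tell_counts)


def model_delta(raw_counts: list[int], compiled_counts: list[int]) -> float:
    """Raw-condition mean minus compiled-condition mean for one model."""
    return cell_mean(raw_counts) - cell_mean(compiled_counts)


# Hypothetical tell counts for three pages per condition, not real data.
example_delta = model_delta([4, 6, 5], [2, 3, 1])
```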

Why n=30 is the credible baseline

With five samples per cell, the statistical precision is poor. The Wilson confidence interval on each per-cell percentage lands at roughly plus or minus thirty-five points. That is enough to see a direction, not enough to quote a number. The 21 April n=5 runs are preserved as directional signal; the quotable numbers come from the 22 April n=30 run.

At n=30 per cell the Wilson interval tightens to roughly plus or minus eighteen points. The top four cells in the 22 April run (gpt-oss-120b at seventy-eight percent, Mistral Small and Kimi K2.6 both at sixty-three percent, Gemini 3.1 Pro at sixty-two percent) all sit outside that band, so each cell's signal survives the interval independently.
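For readers who want to check the interval widths, here is a minimal sketch of the Wilson score interval. The page does not state the confidence level or the proportion behind the rounded figures, so the 95% level (z = 1.96) and the worst-case proportion of 0.5 are assumptions. Under those assumptions the half-width comes out at about 33 points for n=5 and about 17 points for n=30, in line with the rough figures above.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for an observed proportion p_hat over n samples.

    z = 1.96 corresponds to a 95% interval (an assumption; the page does not
    name the level it uses).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= p_hat <= 1.0:
        raise ValueError("p_hat must be between 0 and 1")
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def half_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half the width of the Wilson interval, as a proportion."""
    low, high = wilson_interval(p_hat, n, z)
    return (high - low) / 2


# Worst-case proportion 0.5, where the interval is widest.
hw_n5 = half_width(0.5, 5)
hw_n30 = half_width(0.5, 30)
```

The interval is widest at 0.5 and narrows as the observed proportion moves toward 0 or 1, so these are upper bounds on the half-width for a given n.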

The run used zero API budget. Frontier cells (Claude Opus, gpt-5.4, Gemini 3.1 Pro) were routed through their providers' subscription CLIs, the path most humans actually use for these models today. OSS cells used the Cloudflare Workers AI free tier. The "n=30 is on the roadmap when budget permits" line that used to live here is no longer true; subscription CLIs removed the budget gate. The remaining cost of an n=30 run is wall-clock time and careful runner plumbing, not dollars.

Attempted, extracted, scored

Every published eval report lists four counts per cell: attempted (runs initiated), errored (API or runtime failures), extraction failed (the response contained no usable HTML), and scored (samples that actually reached the linter). A large gap between attempted and scored is a signal that a model is struggling with the instruction, not that it passed the taxonomy.

An earlier version of the runner silently dropped failed samples and reported only the scored count. We changed it because survivorship bias of that shape made the headline flattering for the wrong reasons. The four counts are now reported separately for exactly that reason.
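A sketch of how the four counts can be kept separate. The outcome labels and the `CellCounts` and `tally` names are hypothetical; the runner's real data model may differ.

```python
from collections import Counter
from dataclasses import dataclass

# Every attempted sample ends in exactly one of these outcomes.
OUTCOMES = ("errored", "extraction_failed", "scored")


@dataclass(frozen=True)
class CellCounts:
    attempted: int
    errored: int
    extraction_failed: int
    scored: int


def tally(outcomes: list[str]) -> CellCounts:
    """Count outcomes for one cell without dropping any failed sample."""
    unknown = set(outcomes) - set(OUTCOMES)
    if unknown:
        raise ValueError(f"unknown outcomes: {sorted(unknown)}")
    counts = Counter(outcomes)
    return CellCounts(
        attempted=len(outcomes),
        errored=counts["errored"],
        extraction_failed=counts["extraction_failed"],
        scored=counts["scored"],
    )


# Hypothetical cell: 30 attempts, of which 27 reached the linter.
cell = tally(["scored"] * 27 + ["errored"] * 2 + ["extraction_failed"])
```

Because attempted is taken from the full list rather than from the scored subset, a gap between the two stays visible in the report.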

Mean tells per page is a proxy, not a verdict

Fewer tells per page is not identical to better design. A page with almost no content has nothing for the linter to fire on and will score near zero regardless of intent. A page with genuine ambition exposes more decision surface and can legitimately trip more rules than a thin page. Tells-per-page is a useful proxy for aggregate slop fingerprint, not a single-sample judgement call. Always read a per-cell number alongside the rendered output, not in isolation. The published reports link to the samples.

The linter as scorer

The scoring engine is the same deterministic ruleset that ships with the ahd lint CLI. Thirty-five rules decide from HTML and CSS. Three decide from SVG. Fourteen vision-only rules live behind a multimodal critic and only run when we've rendered the sample. The vision critic today runs through the Claude Code CLI by default, so the cost of a vision pass is zero for anyone with a Claude Code subscription; the Anthropic HTTP API path remains available as a fallback. A per-release inventory of exactly which rules ran is stamped into the report header so an older run remains auditable after the ruleset changes.

Negative results are first-class

The 22 April n=30 run reports Llama 3.3 70B regressing one hundred and seventeen percent under the compiled prompt, which reproduces the same-direction regression measured at n=5 on two independent serving paths. Qwen 3 30B moved only eight percent, inside the Wilson interval and therefore flat, not a win. The earlier image-gen pipeline run reported SDXL Lightning ignoring the compiled negative entirely. We publish these because a framework that only surfaced wins would be ornamental, and because the comparative shape of who-benefits is more actionable than any single aggregate number. If a model does not improve under the compiled brief for a given token, the correct move is to route around that model for that token, not to blame the framework.

What compiled does and does not add

The compiled system prompt is a structured document. The style direction comes from the token's prose fragment. The forbidden list is drawn from the token's forbidden: array merged with any brief-level mustAvoid. The required quirks come from the token's required-quirks array. The full spec is also serialised to JSON and included so a model that responds better to structured input has it available. The final working rules ask the model to cite the rule it follows per decision, to return a single HTML output (in mode: final, used by the eval), and to favour subtraction when in doubt.
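The merge and the JSON serialisation might look roughly like this. It is a sketch under assumptions: the function names, the `direction` key and the output key names are hypothetical, the example token contents are invented, and dropping exact duplicates while preserving order is one plausible reading of "merged", not a confirmed behaviour of the compiler.

```python
import json


def merge_forbidden(token_forbidden: list[str], brief_must_avoid: list[str]) -> list[str]:
    """Token-level entries first, then brief-level ones, exact duplicates dropped."""
    return list(dict.fromkeys([*token_forbidden, *brief_must_avoid]))


def compile_spec(token: dict, brief: dict) -> str:
    """Serialise the merged spec to JSON for inclusion in the system prompt."""
    spec = {
        "direction": token["direction"],
        "forbidden": merge_forbidden(
            token.get("forbidden", []), brief.get("mustAvoid", [])
        ),
        "requiredQuirks": token.get("required-quirks", []),
    }
    return json.dumps(spec, indent=2)


# Invented token contents; the brief entries are taken from the example brief.
token = {
    "direction": "<style direction prose>",
    "forbidden": ["emoji bullets", "AI shimmer"],
    "required-quirks": [],
}
brief = {"mustAvoid": ["fake testimonials", "emoji bullets"]}
spec_json = compile_spec(token, brief)
```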

What the compiled prompt does not add: API keys, per-model tuning, any hidden example shots, or any automation that silently rejects and regenerates a bad response. The runner calls the model once per sample. If the response is bad, it shows up bad in the report.

The serving layer is not the model

A benchmark cell is not a model. It is a model as served by a specific host in a specific release window with a specific chat template. Cloudflare Workers AI serving Kimi K2.6 is a different target from Moonshot serving the same Kimi K2.6 weights, which is a different target again from a self-hosted vLLM deployment of the same weights. The weights are Moonshot's release. The quantization, chat template, default generation parameters, load balancer, and batching are Cloudflare's. Findings against one are not findings against the others.

Every report cell on this site names the serving path, not just the model. "Kimi K2.6 via Cloudflare Workers AI, 22 April 2026" carries the information a reader needs. The date matters because hosts change defaults between releases. The 22 April n=30 run caught exactly this: Cloudflare's 20 April Kimi K2.6 release turned thinking-mode on by default and renamed the suppression knob, and AHD's runner had to ship a K2.6-specific patch to produce any visible output at all under the compiled condition. See the serving tells catalog for named patterns of this shape.
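One way to make the serving path part of a cell's identity is to key results on model, host and date together. The `ServedCell` type below is a hypothetical illustration, not the report format itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ServedCell:
    """A benchmark target: the model as served by one host on one date."""
    model: str
    host: str
    run_date: date

    def label(self) -> str:
        # Month names come from strftime, so this assumes an English locale.
        return f"{self.model} via {self.host}, {self.run_date.day} {self.run_date:%B %Y}"


cloudflare = ServedCell("Kimi K2.6", "Cloudflare Workers AI", date(2026, 4, 22))
moonshot = ServedCell("Kimi K2.6", "Moonshot", date(2026, 4, 22))
cell_label = cloudflare.label()
```

Because the dataclass compares on all three fields, the same weights behind two hosts never collapse into one entry.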

This means natural-language prompt instructions cannot always override provider defaults. "No reasoning, no prose commentary" in a prompt is parsed as an instruction about the visible response; it does not reach the chat-template layer where thinking-mode defaults live. The eval is prompt-controlled given whatever chat template the model host ships. Template defaults are infrastructure, not prompt, and they belong alongside the prompt in any honest writeup.

Reproducing a run

Every measurement published on this site points at a dated report in docs/evals/ in the repository. Each report carries its run manifest, which records the exact brief path, the exact model specifications (including canonical model identifiers like @cf/mistralai/mistral-small-3.1-24b-instruct, not just 'Mistral'), the n per cell, and the ISO timestamp. Given the manifest and a current version of the framework, the run is reproducible in one command:

ahd eval-live <token> --brief <brief.yml> \
  --models <specs> --n <N> \
  --report docs/evals/<date>-<token>.md

Model versions do change, so a re-run against the same canonical identifier is only guaranteed to be close, not identical. The manifest records the identifier precisely so that a re-run which ends up hitting a different model version can be reported honestly as such; that is what the manifest is for.
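As a sketch, the replay command can be rebuilt from a manifest like this. The manifest is represented as a plain dict with hypothetical key names; the real manifest layout is not shown on this page. The model specification and report path are passed through verbatim as placeholders, since their exact format is not given here. Only the flags that appear in the command above are used.

```python
import shlex


def replay_command(manifest: dict) -> str:
    """Build the eval-live command line from a run manifest.

    Key names (token, brief, models, n, report) are assumptions made for
    this sketch, not the manifest's documented schema.
    """
    args = [
        "ahd", "eval-live", manifest["token"],
        "--brief", manifest["brief"],
        "--models", manifest["models"],   # opaque spec string, passed through
        "--n", str(manifest["n"]),
        "--report", manifest["report"],
    ]
    # shlex.join quotes any argument the shell would otherwise interpret.
    return shlex.join(args)


manifest = {
    "token": "swiss-editorial",
    "brief": "briefs/landing.yml",
    "models": "<specs>",
    "n": 30,
    "report": "docs/evals/<date>-<token>.md",
}
command = replay_command(manifest)
```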

Last updated 22 April 2026. Updated as the methodology evolves. For the list of rules the linter actually enforces, see the taxonomy. For the code, see the framework repository.