AHD · Methodology
How we measure.
The headline numbers on the home page deserve a plain-English explanation. This is that.
The pairing
Every measured run takes
one brief,
resolves it against one style token, and runs it through a set
of models in two conditions. Both conditions send the same
brief as the user prompt. The difference is what
goes in the system slot.
Example brief: briefs/landing.yml
The brief used in the 22 April run, rendered field-by-field.
Each field's string value is what the model receives:
intent, audience, and
surfaces are flattened into the
user prompt string verbatim.
mustInclude and mustAvoid become
bulleted lists inside the same string. The
token field names which compiled system prompt
to pair the brief with under the compiled condition.
Snapshot caveat: the brief is reproduced
verbatim so verify-replay still validates the
published 22 April hashes. The framework has grown since:
today the source linter runs 38 rules (35 HTML/CSS + 3 SVG),
14 vision-only rules and 6 mobile-audit rules, with 10
shipped tokens. A re-run today would point at a fresh brief
with current numbers.
- intent
- A one-page landing for AHD itself (Artificial Human Design). The visitor is a working designer, a senior engineer, or a founder who has seen too many AI-generated sites in the last month and already suspects the problem. They do not need to be sold on the slop thesis; they need to be shown that AHD takes it seriously and does something about it. One screen of reading. A terminal-aesthetic code block showing the one-liner that runs the eval. A manifest section that lists exactly what ships today versus what is gated on external resources. No testimonials, no pricing tier, no "Trusted by" strip.
- audience
- Working designers, senior engineers, and early-stage founders who already have an opinion about AI-generated design and would like a framework rather than a lecture.
- token
- swiss-editorial
- surfaces
- web
- mustInclude
- A link to github (or the forgejo mirror) labelled as the read-the-code CTA.
- The command ahd eval-live swiss-editorial --brief briefs/landing.yml --models <specs> --n 10 rendered in a monospace code block with the language tag bash.
- A clearly separated manifest section titled 'What ships' that enumerates: brief compiler, 28-rule linter, eval harness, vision-critic scaffold, MCP server, editor plugins, eight tokens.
- A clearly separated manifest section titled 'What is gated' that enumerates: live frontier-model calls (needs API keys), live vision critique (needs multimodal key + screenshot pipeline), standalone npm packages for the editor plugins, additional tokens.
- A footer that names the licence explicitly: FSL-1.1-Apache-2.0 for code, CC-BY-4.0 for tokens and artwork.
- mustAvoid
- any variant of 'build the future of'
- any variant of 'ship faster' or 'AI-native'
- gradient text on the word AI
- three equal feature cards
- centred hero stack with pill badge + two CTAs
- fake testimonials
- a Trusted by logo bar
- emoji bullets
- Lucide icons in gradient tiles
- purple-to-blue hero gradient
- AI shimmer
- Corporate Memphis illustration
- an iridescent 3D blob
- a canonical 4-column Product / Company / Resources / Legal footer
The canonical YAML source is in
briefs/landing.yml
in the framework repository. The rendering above shows the
same values a parser would produce; the repo is
authoritative.
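The flattening described above can be sketched in a few lines. This is a minimal illustration, not the framework's serialiser: the function name flatten_brief, the "key: value" labels and the blank-line separators are all assumptions. Only the behaviour comes from the text: intent, audience and surfaces pass through verbatim, mustInclude and mustAvoid become bulleted lists, and the token stays out of the user prompt.

```python
# Hypothetical sketch: labels and layout are assumptions, not AHD's real format.
def flatten_brief(brief: dict) -> str:
    """Serialise a parsed brief into a single user-prompt string."""
    parts = []
    # intent, audience and surfaces are passed through verbatim.
    for key in ("intent", "audience", "surfaces"):
        value = brief[key]
        if isinstance(value, list):
            value = ", ".join(value)
        parts.append(f"{key}: {value}")
    # mustInclude and mustAvoid become bulleted lists inside the same string.
    for key in ("mustInclude", "mustAvoid"):
        items = brief.get(key, [])
        if items:
            bullets = "\n".join(f"- {item}" for item in items)
            parts.append(f"{key}:\n{bullets}")
    # The token is left out on purpose: it only selects the system prompt.
    return "\n\n".join(parts)
```

Because the token never enters the string, the same output can be sent unchanged in both conditions.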
The raw condition sends the brief as the
user prompt and puts nothing in the
system slot. The brief itself is the intent, the
audience, the surfaces, the things it explicitly wants and the
things it explicitly bans, serialised as plain prose. No style
token, no AHD system prompt, no forbidden list, no role framing.
The compiled condition sends the same brief
prose in the user slot, unchanged, and AHD's
token-specific system prompt in the system slot.
The system prompt names the style direction (grid, typography,
palette, motion policy), names the forbidden patterns from the
taxonomy and asks the model to cite the rule it is following
in an inline comment per decision.
The only thing that differs between raw and compiled is what
AHD puts in the system slot. The brief is identical,
the model is identical, the token budget is identical and the
seed, where available, is identical. This is the controlled
comparison the framework exists to support.
A cell
A cell is a specific combination of one model and one condition. The 22 April 2026 n=30 run used ten models and two conditions, so twenty cells. Each cell received thirty samples, written as n=30 per cell. Ten models times two conditions times thirty samples is six hundred generated HTML pages. Each was linted. The tell counts were averaged per cell. The delta is the per-model difference between the raw-condition mean and the compiled-condition mean.
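The arithmetic is small enough to write down. The sketch below assumes that the delta is reported as a percentage change relative to the raw mean, which is our reading of the percentages quoted on this page rather than a documented formula; the function names are hypothetical.

```python
from statistics import mean

def cell_mean(tell_counts: list[int]) -> float:
    """Average tells per page across the scored samples of one cell."""
    return mean(tell_counts)

def percent_reduction(raw_mean: float, compiled_mean: float) -> float:
    """Assumed delta: positive means fewer tells under compiled,
    negative means the model regressed."""
    if raw_mean == 0:
        raise ValueError("raw mean is zero; relative change is undefined")
    return 100.0 * (raw_mean - compiled_mean) / raw_mean

# The run's size, from the counts in the text.
models, conditions, samples_per_cell = 10, 2, 30
cells = models * conditions               # 20 cells
pages = cells * samples_per_cell          # 600 generated pages
```

Under this definition a regression shows up as a negative reduction, so a model whose compiled mean is more than double its raw mean lands below minus one hundred percent.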
Why n=30 is the credible baseline
With five samples per cell, the statistical precision is poor. The Wilson confidence interval on each per-cell percentage lands at roughly plus or minus thirty-five points. That is enough to see a direction, not enough to quote a number. The 21 April n=5 runs are preserved as directional signal; the quotable numbers come from the 22 April n=30 run.
At n=30 per cell the Wilson interval tightens to roughly plus or minus eighteen points. The top four cells in the 22 April run (gpt-oss-120b at seventy-eight percent, Mistral Small and Kimi K2.6 both at sixty-three percent, Gemini 3.1 Pro at sixty-two percent) all sit outside that band, so each cell's signal survives the interval independently.
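For readers who want to check the interval widths, here is a minimal sketch of the standard Wilson score interval at 95 percent confidence. The function name is ours, and evaluating at a proportion of 0.5 (the widest case) is an assumption about how the rounded figures above were obtained.

```python
import math

def wilson_half_width(p_hat: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wilson score interval for a proportion p_hat
    observed over n samples (z = 1.96 gives roughly 95% confidence)."""
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    spread = math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return z * spread / denom

# Widest case, p_hat = 0.5, at the two sample sizes discussed above.
half_n5 = wilson_half_width(0.5, 5)
half_n30 = wilson_half_width(0.5, 30)
```

At a proportion of 0.5 this gives a half-width of about 33 points for n=5 and about 17 points for n=30, close to the rounded figures quoted above.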
The run used zero API budget. Frontier cells (Claude Opus, gpt-5.4, Gemini 3.1 Pro) routed through their provider's subscription CLI, the path most humans actually use for these models today. OSS cells used the Cloudflare Workers AI free tier. The "n=30 is on the roadmap when budget permits" line that used to live here is no longer true; subscription CLIs removed the budget gate. The remaining cost of an n=30 run is wall-clock time and careful runner plumbing, not dollars.
Attempted, extracted, scored
Every published eval report lists four counts per cell: attempted (runs initiated), errored (API or runtime failures), extraction failed (the response contained no usable HTML), and scored (samples that actually reached the linter). A large gap between attempted and scored is a signal that a model is struggling with the instruction, not that it passed the taxonomy.
An earlier version of the runner silently dropped failed samples and reported only the scored count. We changed it because survivorship bias of that shape made the headline flattering for the wrong reasons. The four counts are now reported separately for exactly that reason.
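One way to keep the four counts honest is to store them together and refuse a record that does not add up. The sketch below is illustrative: the class name, the field names and the invariant that the three outcomes sum to the attempted count are assumptions, not the runner's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CellCounts:
    attempted: int          # runs initiated
    errored: int            # API or runtime failures
    extraction_failed: int  # response contained no usable HTML
    scored: int             # samples that reached the linter

    def __post_init__(self) -> None:
        # Assumed invariant: every attempt ends in exactly one outcome.
        if self.errored + self.extraction_failed + self.scored != self.attempted:
            raise ValueError("counts do not add up to attempted")

    @property
    def scored_fraction(self) -> float:
        """Share of attempts that were actually scored."""
        return self.scored / self.attempted if self.attempted else 0.0
```

A low scored_fraction is the gap the paragraph above warns about: the mean for that cell is computed over survivors only.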
Mean tells per page is a proxy, not a verdict
Fewer tells per page is not identical to better design. A page with almost no content has nothing for the linter to fire on and will score near zero regardless of intent. A page with genuine ambition exposes more decision surface and can legitimately trip more rules than a thin page. Tells-per-page is a useful proxy for aggregate slop fingerprint, not a single-sample judgement call. Always read a per-cell number alongside the rendered output, not in isolation. The published reports link to the samples.
The linter as scorer
The scoring engine is the same deterministic ruleset that ships
with the ahd lint CLI. Thirty-five rules decide from
HTML and CSS. Three decide from SVG. Fourteen vision-only rules
live behind a multimodal critic and only run when we've rendered
the sample. The vision critic today runs through the Claude Code
CLI by default, so the cost of a vision pass is zero for anyone
with a Claude Code subscription; the Anthropic HTTP API path
remains available as a fallback. A per-release inventory of
exactly which rules ran is stamped into the report header so an
older run remains auditable after the ruleset changes.
Negative results are first-class
The 22 April n=30 run reports Llama 3.3 70B regressing one hundred and seventeen percent under the compiled prompt, which reproduces the same-direction regression measured at n=5 on two independent serving paths. Qwen 3 30B moved only eight percent, inside the Wilson interval and therefore flat, not a win. The earlier image-gen pipeline run reported SDXL Lightning ignoring the compiled negative entirely. We publish these because a framework that only surfaced wins would be ornamental, and because the comparative shape of who-benefits is more actionable than any single aggregate number. If a model does not improve under the compiled brief for a given token, the correct move is to route around that model for that token, not to blame the framework.
What compiled does and does not add
The compiled system prompt is a structured document. The style
direction comes from the token's prose fragment. The forbidden
list is drawn from the token's forbidden: array
merged with any brief-level mustAvoid. The required
quirks come from the token's required-quirks array.
The full spec is also serialised to JSON and included so a model
that responds better to structured input has it available. The
final working rules ask the model to cite the rule it follows
per decision, to return a single HTML output (in mode: final,
used by the eval), and to favour subtraction when in doubt.
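The assembly just described can be sketched as follows. Only the forbidden and required-quirks keys and the merge with mustAvoid come from the text; the prose key, the section labels, the wording of the working rules and the function name are assumptions made for illustration.

```python
import json

def compile_system_prompt(token: dict, brief: dict) -> str:
    """Hypothetical sketch of the compiled system prompt's structure."""
    # Merge the token's forbidden list with the brief's mustAvoid,
    # dropping duplicates while keeping first-seen order.
    forbidden = list(dict.fromkeys(
        token.get("forbidden", []) + brief.get("mustAvoid", [])
    ))
    quirks = token.get("required-quirks", [])
    sections = [
        "Style direction:\n" + token.get("prose", ""),  # "prose" key is assumed
        "Forbidden:\n" + "\n".join(f"- {item}" for item in forbidden),
        "Required quirks:\n" + "\n".join(f"- {item}" for item in quirks),
        # Full spec as JSON for models that respond better to structured input.
        "Spec (JSON):\n" + json.dumps(token, indent=2, sort_keys=True),
        "Working rules:\n"
        "- Cite the rule you are following in an inline comment per decision.\n"
        "- Return a single HTML output.\n"
        "- When in doubt, favour subtraction.",
    ]
    return "\n\n".join(sections)
```

Deduplicating the merged list keeps a pattern banned by both the token and the brief from appearing twice in the prompt.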
What the compiled prompt does not add: API keys, per-model tuning, any hidden example shots, or any automation that silently rejects and regenerates a bad response. The runner calls the model once per sample. If the response is bad, it shows up bad in the report.
The serving layer is not the model
A benchmark cell is not a model. It is a model as served by a specific host in a specific release window with a specific chat template. Cloudflare Workers AI serving Kimi K2.6 is a different target than Moonshot serving the same Kimi K2.6 weights, which is a different target again from a self-hosted vLLM deployment of the same weights. The weights are Moonshot's release. The quantization, chat template, default generation parameters, load balancer, and batching are Cloudflare's. Findings against one are not findings against the others.
Every report cell on this site names the serving path, not just the model. "Kimi K2.6 via Cloudflare Workers AI, 22 April 2026" carries the information a reader needs. The date matters because hosts change defaults between releases. The 22 April n=30 run caught exactly this: Cloudflare's 20 April Kimi K2.6 release turned thinking-mode on by default and renamed the suppression knob, and AHD's runner had to ship a K2.6-specific patch to produce any visible output at all under the compiled condition. See the serving tells catalog for named patterns of this shape.
This means natural-language prompt instructions cannot always override provider defaults. "No reasoning, no prose commentary" in a prompt is parsed as instruction about the visible response; it does not reach the chat-template layer where thinking-mode defaults live. The eval is prompt-controlled given whatever chat-template the model host ships. Template defaults are infrastructure, not prompt, and they belong alongside the prompt in any honest writeup.
Reproducing a run
Every measurement published on this site points at a dated report
in docs/evals/ in the repository. Each report carries
its run manifest, which records the exact brief path, the exact
model specifications (including canonical model identifiers like
@cf/mistralai/mistral-small-3.1-24b-instruct, not
just 'Mistral'), the n per cell, and the ISO timestamp. Given the
manifest and a current version of the framework, the run is
reproducible in one command:
ahd eval-live <token> --brief <brief.yml> \
--models <specs> --n <N> \
--report docs/evals/<date>-<token>.md
Model versions do change, so an exact re-run against the same canonical identifier is only guaranteed to be close, not identical. The manifest records the identifier, so a re-run against a different model stays honest as long as it is reported as one; that is what the manifest is for.
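A manifest can be turned back into that command mechanically. The sketch below assumes a manifest shaped as a flat dictionary with hypothetical keys token, brief, models and n, where models holds the literal string originally passed to --models. Only the flags themselves come from the command shown above.

```python
import shlex

def replay_command(manifest: dict, report_path: str) -> list[str]:
    """Rebuild the eval-live invocation from a run manifest.
    The manifest keys used here are assumptions, not the real schema."""
    return [
        "ahd", "eval-live", manifest["token"],
        "--brief", manifest["brief"],
        "--models", manifest["models"],  # literal spec string, passed through
        "--n", str(manifest["n"]),
        "--report", report_path,
    ]

def as_shell(argv: list[str]) -> str:
    """Quote the argument list so it can be pasted into a shell."""
    return shlex.join(argv)
```

Keeping the command as an argument list until the last moment avoids quoting mistakes when a model spec contains characters the shell would otherwise interpret.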
Last updated 22 April 2026. Updated as the methodology evolves. For the list of rules the linter actually enforces, see the taxonomy. For the code, see the framework repository.