Machine Legibility Dimensions: A Framework for Notations Designed for AI Readers

The Cognitive Dimensions of Notations framework evaluates languages from a human reader's perspective. As AI systems become primary authors and readers of formal notation, a different set of questions applies. Nine dimensions — among them tokenisation alignment, ambiguity surface, schema availability, and error signal fidelity — and why one honest Neutral matters more than nine Positives.

Michael Shatny · 10 min read

The Gap

The Cognitive Dimensions of Notations framework — Green & Petre, 1996 — gives notation designers a vocabulary for thinking about how humans read and write formal languages. Viscosity. Visibility. Error-proneness. Role-expressiveness. Thirteen dimensions, grounded in cognitive science, that have shaped programming language design for thirty years.

The framework was designed for human readers. Its dimensions measure cognitive load, learning curves, the mental effort required to read and modify notation. These are the right questions when the reader is human.

They are not the only questions when the reader is an AI system.

A large language model reading a notation does not experience working memory constraints in the way a human does. It does not get fatigued, cannot scroll back to check a declaration, and does not rely on peripheral vision to notice misaligned structure. But it has its own constraints: it operates on tokenised sequences, its attention is bounded by context windows, its training distribution shapes which vocabulary it handles confidently, and its generation process accumulates errors without the self-correction that human programmers apply mid-keystroke.

These constraints produce a different set of questions about notation design. No existing framework asks them. Machine Legibility Dimensions (MLD) is an attempt to name and define them.

Three Foundational Claims

Before the dimensions: three claims that distinguish this framework from its predecessor.

1. The primary reader changes the evaluation criteria.

A notation optimised for human readers minimises cognitive load and tolerates ambiguity that skilled authors resolve through implicit knowledge. A notation optimised for AI readers minimises inferential state, maximises formal signal per token, and eliminates ambiguity that a language model would resolve by sampling — producing outputs that are statistically plausible but structurally wrong. The optimal design is different.

2. AI generation errors compound; human errors are caught mid-production.

A human programmer writing invalid syntax notices it immediately — through editor highlighting, failed compilation, or the visual signal of broken structure. An AI system generating invalid syntax at position N of a 200-token output does not self-correct — it continues generating from the invalid state, and subsequent tokens are conditioned on the error. Error recovery is a property of the notation itself, not just the tooling.
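
A back-of-the-envelope sketch of the compounding, with illustrative numbers rather than measured rates: if each generated token is independently valid with probability p, an n-token output survives intact with probability p^n.

def survival_probability(p_token, n_tokens):
    # Probability that all n_tokens are valid, assuming independence.
    # Real models are not independent per token; the simplification
    # only illustrates the direction of the effect.
    return p_token ** n_tokens

for p in (0.999, 0.995, 0.99):
    print(f"p={p}: P(200 tokens all valid) = {survival_probability(p, 200):.1%}")
# Roughly 81.9%, 36.7%, and 13.4%: small per-token error rates become
# large whole-output failure rates.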

3. The compiler is an AI runtime component, not just a validator.

In a notation designed for AI composition, the compiler's schema, symbol table, and diagnostic output are inputs to the AI compositor's generation process — not downstream artifacts that happen after AI involvement ends. A notation that exposes its schema as a queryable resource is fundamentally different from one that does not.
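
A minimal sketch of that loop shape. The names here (query_schema, compile_draft, generate, repair) are hypothetical stand-ins, not RECALL's actual API; the point is where the compiler sits in the loop.

def compose(intent, compiler, model, max_rounds=3):
    # The schema is an input to generation, not documentation
    # consulted after the fact.
    schema = compiler.query_schema()
    source = model.generate(intent, schema=schema)
    for _ in range(max_rounds):
        diagnostics = compiler.compile_draft(source)
        if not diagnostics:
            return source
        # Structured diagnostics condition the next round directly.
        source = model.repair(source, diagnostics, schema=schema)
    return source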

The Nine Dimensions

Each dimension is defined, rated on a three-point scale (Positive / Neutral / Weak), and applied to RECALL with evidence from compiler behaviour and source examples. The nine: tokenisation alignment, ambiguity surface, state surface, schema availability, error signal fidelity, intent density, round-trip fidelity, constraint completeness, and decomposability.

Eight of nine Positive, one Neutral. The Neutral on Tokenisation Alignment is not a concession — it is the result that makes the rest credible. A framework where the reference implementation scores perfectly on every dimension indicates a framework built backwards.

The Honest Neutral

Tokenisation Alignment deserves a direct treatment because it is the dimension most likely to be waved away.

RECALL's structural keywords are strong. DISPLAY, WITH, PROCEDURE, VALUE are single-token English words, well-represented in LLM training distributions from general language and from COBOL corpora. The probability of generating a correct keyword is concentrated at one decision point.

The identifier convention is the cost. Every hyphenated name — WORKING-STORAGE, HERO-HEADING, PAGE-TITLE, PROGRAM-ID — splits into 3–5 subword tokens in most LLM vocabularies. Every additional token in a name is another decision point at which generation can fail, multiplying the failure probability. For a program with 14 named fields and a consistent hyphenated convention, that multiplication compounds across every reference.

DISPLAY HEADING-1 PAGE-TITLE    ← HEADING-1, PAGE-TITLE split
   WITH STYLE MONO.              ← WITH, STYLE, MONO do not

The SCREAMING convention is a recognisable pattern even when tokens split — structural positions constrain the generation space significantly — so this is not a fatal flaw. But it is a real cost, and rating it Positive would misrepresent the actual tokenisation profile. The resolution would require either empirical measurement or a naming convention change — neither is on the v1.x roadmap.
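
The split is straightforward to measure. A minimal sketch using OpenAI's tiktoken library (one tokenizer among many; exact counts vary by vocabulary, but the hyphenated identifiers should consistently come out longer than the bare keywords):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
names = ["DISPLAY", "WITH", "VALUE", "PROCEDURE",
         "WORKING-STORAGE", "HERO-HEADING", "PAGE-TITLE", "PROGRAM-ID"]
for name in names:
    # Longer token sequences mean more decision points per reference.
    print(f"{name:16} {len(enc.encode(name))} token(s)")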

How MLD and the Cognitive Dimensions Relate

MLD is a companion framework, not a replacement. They evaluate different properties of the same notation from the perspective of different readers.

A notation can score well on Cognitive Dimensions and poorly on MLD: easy for experienced humans to write quickly, but relying on implicit conventions and abbreviations that reduce intent density and increase ambiguity surface for an AI compositor.

A notation can score well on MLD and poorly on Cognitive Dimensions: highly explicit, formally constrained, verbose — high intent density, low ambiguity surface, but weak on diffuseness and progressive evaluation. RECALL is this case. The CD analysis rates Diffuseness as an intentional tradeoff and Progressive Evaluation as the one genuine weakness.

Cognitive Dimension   MLD Counterpart
Visibility            Intent Density: Are intent signals visible to the AI?
Error-Proneness       Ambiguity Surface + Error Signal Fidelity: Does the notation invite mistakes? Can the compiler catch them?
Abstraction           Decomposability: Can abstractions be generated independently?
Hidden Dependencies   State Surface: Does generation require tracking implicit state?
Consistency           Tokenisation Alignment: Consistent vocabulary maps to consistent token paths.

The tension the two frameworks reveal together is instructive. CD rates Progressive Evaluation — the ability to compile and check partial programs — as RECALL's one genuine weakness. MLD does not have a directly corresponding dimension, but the concern is equally real for AI compositors: an AI that cannot check a partial program cannot iterate incrementally. Both frameworks point to the same planned mitigation: --draft mode. The frameworks arrived at the same diagnosis from different directions.

Notation Profile Comparison

The comparison below is analytical, not empirical. It applies MLD to four notations for the same problem domain: structured web content.

Dimension                HTML     JSX/React  Markdown  RECALL
Tokenisation Alignment   Neutral  Neutral    Positive  Neutral
Ambiguity Surface        Weak     Weak       Neutral   Positive
State Surface            Weak     Weak       Positive  Positive
Schema Availability      Neutral  Neutral    None      Positive
Error Signal Fidelity    Weak     Neutral    None      Positive
Intent Density           Weak     Neutral    Weak      Positive
Round-trip Fidelity      Weak     Weak       Neutral   Positive
Constraint Completeness  Weak     Neutral    Weak      Positive
Decomposability          Neutral  Weak       Positive  Positive

HTML and JSX score weakly on most dimensions, and the weak scores explain observable failure modes: AI-generated HTML frequently produces plausible structure with semantically incorrect attributes, over-engineered class hierarchies, and layout decisions that compile but fail design intent.

Markdown scores well on tokenisation alignment, state surface, and decomposability — it is simple, stateless, and chunks naturally. It scores weakly on constraint completeness and intent density. Markdown is well-suited to AI generation of prose. It is not suited to AI composition of structured interfaces.

What This Means for Notation Designers

The practical implication is direct. Each MLD dimension points to a concrete design decision that either exists or does not.

Expose your schema as a queryable API — not a README
Make your diagnostic system return structured JSON with stable codes (a sketch follows this list)
Encode intent at the field level, not just the structural level
Close your element vocabulary — make the compiler reject unknown elements
Separate mutable state from declaration — give AI compositors a fixed contract to generate against
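
For the diagnostics item, a hedged sketch of what a structured diagnostic might look like. The field names and code format are illustrative, not RECALL's actual diagnostic schema.

import json

diagnostic = {
    "code": "E-UNKNOWN-ELEMENT",  # stable across compiler versions
    "severity": "error",
    "location": {"line": 12, "column": 9},
    "message": "Unknown element HERO-HEADNG.",
    "suggestion": "Did you mean HERO-HEADING?",  # a repair the AI can apply
}
print(json.dumps(diagnostic, indent=2))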

These are not abstract properties; each corresponds to a shippable feature. The framework makes them visible as a category — and naming a category is the first step to being able to design for it.

The full framework document — including the complete dimension definitions, foundational claims, limitations, and the relationship to Cognitive Dimensions — is published in the RECALL compiler repository under docs/MACHINE-LEGIBILITY-FRAMEWORK.md. It is proposed, not validated. Empirical follow-up — controlled generation studies, tokenisation analysis, multi-analyst review — is the natural next step.

Michael Shatny is a software developer, methodology engineer, and founding contributor to .netTiers (2005–2010), one of the earliest schema-driven code generation frameworks for .NET. His work spans 28 years of the same architectural pattern: structured input, generated output, auditable artifacts. RECALL is the latest expression of that instinct — applied to the question of what a web language looks like when it is designed for intent first.

ORCID: 0009-0006-2011-3258