From Crawled Pages to Canonical Facts

How LLMs Compress the Web Into Knowledge

Why understanding "canonical facts" matters

Most people imagine that large language models (LLMs) remember pages, articles, or websites. They do not. What an LLM retains is a compressed representation of the facts and patterns those pages express.

During training, billions of pages are reduced into a much smaller set of stable knowledge units. These units are what models rely on when answering questions, explaining concepts, or citing sources.

If your content does not contribute to these stable units, it may be crawled, indexed, and even read, but it will not meaningfully influence what the model knows. This article explains how crawled pages are transformed into canonical facts, why most brand content fails to survive this compression, and what kinds of content reliably become part of a model's long-term knowledge.

What a Canonical Fact Actually Is

A canonical fact is not a sentence. It is not a paragraph. It is an abstracted assertion that survives repeated exposure across sources.

Examples include:

  • A clear definition of a concept
  • A stable explanation of how a system works
  • A widely agreed mechanism or process
  • A commonly accepted relationship between variables

Canonical facts emerge when multiple sources express the same idea with enough similarity that the model can compress them into a single internal representation. The key word here is agreement.

Why LLMs Must Compress Aggressively

Training an LLM is constrained by compute, memory, and time. Models cannot store the web verbatim. They must compress.

Compression Methods

  • Deduplication
  • Abstraction
  • Generalization
  • Removal of outliers

The Knowledge Layer Favors

  • Consistency over novelty
  • Stability over creativity
  • Explanation over persuasion

This is not a philosophical choice. It is an engineering necessity.

The Journey From Page to Fact

Step 1: Crawling

Content enters the pipeline through large-scale crawls. Only a subset of the web is crawled, and an even smaller subset of the crawled pages is retained after filtering.

Step 2: Extraction

Pages are converted to plain text. Layout, visuals, and interactive elements disappear. If meaning depends on design, it is lost.
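To make the extraction step concrete, here is a minimal sketch using only Python's standard library. It is not the extractor any particular training pipeline uses; the names TextExtractor and extract_text, and the choice of which tags count as block-level, are assumptions made for illustration. It shows how class names, images, and scripts vanish while the visible text survives.

```python
from html.parser import HTMLParser

# Tags whose boundaries should become whitespace in the output (illustrative choice).
BLOCK_TAGS = {"p", "div", "h1", "h2", "h3", "li", "br", "tr"}
# Tags whose contents are never visible text.
SKIP_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collects visible text and drops markup, scripts, and styles."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse all runs of whitespace into single spaces.
    return " ".join("".join(parser.parts).split())


page = (
    '<div class="hero"><h1>What is X?</h1>'
    '<img src="chart.png" alt="growth">'
    "<p>X is a <b>tool</b>.</p><script>track()</script></div>"
)
print(extract_text(page))  # prints: What is X? X is a tool.
```

The chart, the styling hook, and the tracking script leave no trace in the output. Anything the page communicated only through those elements is gone.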

Step 3: Deduplication

Repeated content is collapsed. Near-identical explanations are merged. Originality without clarity becomes a liability.
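One common way to detect near-identical text is to compare overlapping word sequences (shingles) with Jaccard similarity. The sketch below is a toy version under stated assumptions: the function names, the shingle size of 3, and the 0.5 threshold are illustrative, and production pipelines typically use scalable approximations such as MinHash rather than pairwise comparison.

```python
import re


def shingles(text: str, n: int = 3) -> set:
    """Return the set of n-word sequences in the text, lowercased."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs, threshold: float = 0.5, n: int = 3):
    """Keep the first document of every group of near-duplicates."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc, n)
        if any(jaccard(s, k) >= threshold for k in kept_shingles):
            continue  # too similar to something already kept
        kept.append(doc)
        kept_shingles.append(s)
    return kept


d1 = "A cache stores recent results so repeated requests are served faster."
d2 = "A cache stores recent results so that repeated requests are served faster."
d3 = "Our revolutionary platform unlocks synergy at scale."

unique_docs = deduplicate([d1, d2, d3])  # d2 is collapsed into d1
```

Here d1 and d2 share 7 of 12 distinct shingles, so d2 is treated as a repeat and only d1 and d3 remain.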

Step 4: Abstraction

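During training, the model is exposed to the remaining content and picks up its recurring patterns. Statements that say the same thing in similar ways reinforce one another, so the model effectively learns what is stable and what is noise.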

Step 5: Canonicalization

Stable patterns become canonical facts. Unstable patterns are discarded. Billions of pages become a small set of knowledge representations.
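The idea of agreement across sources can be illustrated with a deliberately simple analogy. In a real model this happens implicitly in the learned weights, not through an explicit grouping step, so the sketch below should be read as a thought experiment. The function names, the stopword list, and the min_sources parameter are all hypothetical.

```python
import re
from collections import defaultdict

# Small illustrative stopword list; a real one would be much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "that", "which", "so", "by"}


def normalize(statement: str) -> frozenset:
    """Reduce a statement to its set of content words."""
    words = re.findall(r"[a-z0-9]+", statement.lower())
    return frozenset(w for w in words if w not in STOPWORDS)


def canonicalize(claims, min_sources: int = 2):
    """claims: iterable of (source, statement) pairs.

    Returns one example statement for every normalized claim that is
    supported by at least min_sources distinct sources.
    """
    support = defaultdict(set)
    example = {}
    for source, statement in claims:
        key = normalize(statement)
        support[key].add(source)
        example.setdefault(key, statement)
    return [example[k] for k, srcs in support.items() if len(srcs) >= min_sources]


claims = [
    ("site-a", "A cache is a store of recent results."),
    ("site-b", "The cache is a store of the recent results"),
    ("site-c", "Caching is pure magic for your business!"),
]
stable = canonicalize(claims)
```

The first two statements normalize to the same set of content words and come from different sources, so they survive as one claim. The third stands alone and is dropped.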

Why Most Brand Content Fails

Brand content is rarely written with compression in mind. From a human marketing perspective, differentiation makes sense. From a compression perspective, that same differentiation makes content harder to stabilize.

Feature      | Hard to Compress (Fails)           | Easy to Compress (Succeeds)
Phrasing     | Unique phrasing for common ideas   | Standard, shared vocabulary
Definitions  | Avoiding explicit definitions      | Explicit, clear definitions
Language     | Mixing marketing with explanation  | Neutral, explanatory tone
Terminology  | Changing terminology across pages  | Consistent terminology
Priority     | Differentiation over clarity       | Clarity over differentiation

If the model cannot confidently align your explanation with others, it cannot form a canonical fact around it.
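Changing terminology across pages is one of the easier problems to check for in your own content. The following sketch assumes you already have each page as plain text and a hand-written list of competing names for the same concept. Both the page contents and the variant list here are made up for illustration, and the matching is simple substring counting.

```python
def term_usage(pages: dict, variants: list) -> dict:
    """Map each page name to the variants it uses and how often."""
    usage = {}
    for name, text in pages.items():
        lowered = text.lower()
        counts = {v: lowered.count(v.lower()) for v in variants}
        usage[name] = {v: c for v, c in counts.items() if c > 0}
    return usage


def variants_in_use(usage: dict) -> list:
    """All variants that appear on at least one page, sorted."""
    return sorted({v for per_page in usage.values() for v in per_page})


pages = {
    "home": "Our vector database stores embeddings. The vector database scales.",
    "docs": "The embedding store is fast.",
}
variants = ["vector database", "embedding store"]

usage = term_usage(pages, variants)
in_use = variants_in_use(usage)
```

If more than one variant is in use, the site is describing a single concept under several names, which is the pattern the table above flags as hard to compress.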

The Importance of Linguistic Alignment

When multiple sources describe the same concept using similar structure and wording, the model gains confidence. When sources use wildly different language, the model treats the concept as unstable.

Why Alignment Matters

  • Industry-standard terminology matters
  • Plain language often outperforms clever language
  • Repeated phrasing across pages increases influence

"Alignment is not plagiarism. It is participation in a shared vocabulary."
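A rough way to see how closely a draft tracks the shared vocabulary is to compare word-count vectors with cosine similarity. This is only a proxy under stated assumptions: the function names and the example sentences are invented for illustration, and nothing here claims to reproduce how a model measures agreement internally.

```python
import math
import re
from collections import Counter


def term_vector(text: str) -> Counter:
    """Bag-of-words counts for the lowercased text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def alignment_score(draft: str, references: list) -> float:
    """Average similarity between a draft and reference explanations."""
    if not references:
        return 0.0
    d = term_vector(draft)
    return sum(cosine(d, term_vector(r)) for r in references) / len(references)


references = [
    "A load balancer distributes incoming requests across multiple servers.",
    "A load balancer spreads incoming traffic across several servers.",
]
plain = "A load balancer distributes incoming traffic across multiple servers."
clever = "Our traffic maestro orchestrates digital harmony."

plain_score = alignment_score(plain, references)
clever_score = alignment_score(clever, references)
```

The plain sentence scores far higher than the clever one because it reuses the terms the reference explanations already share.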

Definitions Are the Strongest Anchors

Across training datasets, explicit definitions play an outsized role. Pages that clearly answer "What is X?" tend to contribute more strongly than pages that jump directly to benefits.

A definition provides:

  • Scope
  • Boundaries
  • Disambiguation

Without it, the model struggles to anchor the concept.

Mechanisms Beat Outcomes

Mechanisms survive compression better than outcomes. Mechanisms reduce ambiguity.

Outcome (vague), often discarded:

"X improves efficiency."

Mechanism (stable), likely to survive:

"X improves efficiency by reducing Y through Z."
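The difference between an outcome statement and a mechanism statement is visible in the connecting words. As a crude editing aid, a draft can be scanned for sentences that lack any such cue. The cue list below is an assumption chosen for illustration, and the check will produce false positives (for example, "used by many teams"). It says nothing about how a model itself weighs a sentence.

```python
import re

# Hypothetical cue words that often introduce a mechanism.
MECHANISM_CUES = re.compile(r"\b(by|through|because|via)\b", re.IGNORECASE)


def states_mechanism(sentence: str) -> bool:
    """True if the sentence contains at least one mechanism cue word."""
    return MECHANISM_CUES.search(sentence) is not None


outcome = "X improves efficiency"
mechanism = "X improves efficiency by reducing Y through Z"

flags = [states_mechanism(outcome), states_mechanism(mechanism)]
```

Sentences flagged False are candidates for a rewrite that names the how as well as the what.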

A Useful Empirical Observation

"Analysis of AI-generated explanations across technical domains shows that models tend to reuse similar phrasing when describing mechanisms."