3. Document Model

3.1 Overview

An MdFlow document is modeled as an ordered tree of nodes. A conforming implementation MUST produce a tree semantically equivalent to the one defined in this chapter, though it MAY use any internal representation.

The pipeline from source to output is:

source (UTF-8 bytes)
  → chunks (§5)
  → tokens (§3.3)
  → AST nodes (§3.4)
  → DOM nodes (§3.5)

All four stages are specified so that intermediate representations can be inspected for testing and for plugin integration.

3.2 Definitions

  • A node is a tree vertex with a kind, an optional content payload, and an ordered sequence of child nodes.
  • The root is the topmost node. Its kind is Document. It has no parent.
  • A block is any node whose kind is listed in §6.1.
  • An inline is any node whose kind is listed in §7.1.
  • A leaf is a node with no children. Every Text node is a leaf.
  • A container is a non-leaf node. Document and every block that accepts children are containers.
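
The definitions above can be pictured with a small data structure. This is a minimal sketch in Python rather than a required representation; the class name Node and the helper methods is_leaf and is_container are illustrative assumptions, not names defined by this specification.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Illustrative tree vertex: a kind, an optional payload, ordered children."""
    kind: str
    content: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # A leaf is a node with no children.
        return not self.children

    def is_container(self) -> bool:
        # A container is a non-leaf node.
        return not self.is_leaf()


# The root has kind Document and no parent.
root = Node("Document", children=[
    Node("Paragraph", children=[Node("Text", content="hello")]),
])
```

Here root.is_container() is true, and the Text node at root.children[0].children[0] is a leaf.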

3.3 Token layer

During parsing, source is first reduced to a linear sequence of tokens. A token has:

  • a kind (one of the block or inline kinds),
  • a slice — the half-open byte range [start, end) in the source,
  • a state — one of pending or complete (§5),
  • zero or more sub-tokens (used for streaming — e.g., a table row contains sub-tokens for each cell).

The token layer is the unit of incremental processing. Re-parsing a chunk MUST produce a token sequence that is a suffix-addition of the previous sequence: tokens may be appended, but no previously complete token may change its kind, its slice start, or its existing sub-token structure (see §5.4).
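
One way to test the suffix-addition rule is to compare the token sequence before and after a re-parse. The sketch below is an assumption-laden illustration, not the normative algorithm of §5.4: the Token class, its field names, and the function is_suffix_addition are hypothetical, and "existing sub-token structure" is interpreted here as "existing sub-tokens keep their kind and slice start, and new sub-tokens may only be appended".

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Token:
    kind: str
    start: int                # slice is the half-open byte range [start, end)
    end: int
    state: str = "pending"    # "pending" or "complete"
    sub_tokens: List["Token"] = field(default_factory=list)


def _extends(old: Token, new: Token) -> bool:
    """True if `new` keeps the kind, slice start, and sub-token prefix of `old`."""
    if old.kind != new.kind or old.start != new.start:
        return False
    # Assumed reading: sub-tokens may be appended but never dropped or altered.
    if len(new.sub_tokens) < len(old.sub_tokens):
        return False
    return all(_extends(o, n) for o, n in zip(old.sub_tokens, new.sub_tokens))


def is_suffix_addition(prev: List[Token], curr: List[Token]) -> bool:
    """Check that `curr` leaves every complete token of `prev` intact."""
    if len(curr) < len(prev):
        return False
    for old, new in zip(prev, curr):
        if old.state == "complete" and not _extends(old, new):
            return False
    return True
```

A harness could call is_suffix_addition(previous_tokens, current_tokens) after each chunk and fail the run when it returns False. Pending tokens are deliberately unconstrained in this sketch, since the text above only restricts tokens that were already complete.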

3.4 AST layer

Tokens are assembled into an Abstract Syntax Tree (AST) whose nodes are defined by this specification.

3.4.1 Document root

Document {
  children: Block*
}

3.4.2 Block node types

See §6.1 for the authoritative catalogue. Summary:

  • Paragraph { children: Inline* }
  • Heading { level: 1–6, children: Inline* }
  • ThematicBreak {} (leaf)
  • CodeBlock { lang: string?, content: string } (leaf)
  • BlockQuote { children: Block* }
  • List { ordered: boolean, start: integer?, tight: boolean, children: ListItem* }
  • ListItem { children: Block* }
  • Table { align: Align[], header: TableRow, rows: TableRow* }
  • TableRow { children: TableCell* }
  • TableCell { children: Inline* }
  • CustomBlock { tag: string, attrs: Attr[], children: Block* } (see §8)

3.4.3 Inline node types

  • Text { content: string } (leaf)
  • Emphasis { children: Inline* }
  • Strong { children: Inline* }
  • Strikethrough { children: Inline* }
  • Code { content: string } (leaf)
  • Link { url: URL, title: string?, children: Inline* }
  • Image { url: URL, alt: string, title: string? } (leaf)
  • LineBreak { hard: boolean } (leaf)
  • CustomInline { tag: string, attrs: Attr[], children: Inline* }

3.4.4 Attributes

Attr { name: string, value: string }

Attribute value types are strings; coercion (boolean / number / enum) is consumer-specific and out of scope for the AST layer.
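
Because coercion is left to consumers, a plugin would typically wrap attribute access in its own helpers. The following sketch shows what such helpers might look like; the coercion rules (only the literal "true" is truthy, non-numeric values fall back to a default) and the function names coerce_bool and coerce_int are hypothetical choices, not behavior defined by this specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attr:
    name: str
    value: str   # always a string at the AST layer


def coerce_bool(attr: Attr) -> bool:
    """Hypothetical consumer rule: only the literal "true" counts as true."""
    return attr.value == "true"


def coerce_int(attr: Attr, default: int = 0) -> int:
    """Hypothetical consumer rule: fall back to `default` on non-numeric input."""
    try:
        return int(attr.value)
    except ValueError:
        return default
```

Keeping these helpers on the consumer side means the AST stays identical across consumers that disagree about what, say, "yes" or "1" should mean.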

3.4.5 Invariants

The following invariants MUST hold in every AST produced by a conforming parser:

  • INV-1. A block contains only blocks, only inlines, or (for CodeBlock) raw text, as declared for its kind in §3.4.2. A block whose children are declared as Inline* MUST NOT contain a block.
  • INV-2. The Heading.level field is in the range 1–6.
  • INV-3. Every Link.url and Image.url has a scheme admitted by §10.3.1 or is the empty URL "" (which MUST render as an inert anchor).
  • INV-4. CustomBlock and CustomInline tags MUST be in the registered whitelist for their context (§8.2).
  • INV-5. Trees are well-formed: every non-root node has exactly one parent; no cycles.
  • INV-6. Inside CodeBlock and Code, the content field is treated as opaque text — no inline parsing, no escape interpretation other than the escaping performed at the lexical layer.
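
Some of these invariants lend themselves to a mechanical check. The sketch below covers only INV-2 and INV-5 and is meant as an illustration of how a test harness might verify them; the Node class and the function check_invariants are assumptions made for this example. INV-3 and INV-4 depend on §10.3.1 and §8.2 and are not attempted here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    kind: str
    content: Optional[str] = None
    level: Optional[int] = None          # only meaningful for Heading
    children: List["Node"] = field(default_factory=list)


def check_invariants(root: Node) -> List[str]:
    """Return violation messages for INV-2 and INV-5 (empty list if none)."""
    errors: List[str] = []
    seen = set()        # ids of nodes already reached from the root
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            # Reached twice: the node has two parents or lies on a cycle.
            errors.append(f"INV-5: {node.kind} is reachable more than once")
            continue
        seen.add(id(node))
        if node.kind == "Heading" and node.level not in range(1, 7):
            errors.append(f"INV-2: Heading.level {node.level!r} is outside 1-6")
        stack.extend(node.children)
    return errors
```

The traversal tracks object identity rather than value equality, so two distinct Text nodes with the same content are not mistaken for a shared node.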

3.5 DOM layer

When rendering to a DOM, each AST node maps to a DOM subtree by the rules in the block and inline chapters. The renderer MUST NOT introduce any DOM node, attribute, or text content beyond what the AST specifies, with two exceptions:

  • The renderer MAY add a data-mdflow-plugin="<name>" attribute to the root element of each plugin-rendered subtree, for debugging and for ownership-tracking in incremental updates.
  • The renderer MAY add a data-mdflow-token-id="<id>" attribute to elements corresponding to a token, for diff tracking.

Both data attributes are informational, not normative output. Vectors that compare DOM output MUST ignore attributes in the data-mdflow-* namespace.
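
A comparison that ignores the data-mdflow-* namespace might look like the sketch below. It uses Python's xml.etree.ElementTree as a stand-in for a real DOM, which is an assumption of this example; the helper names normative_attrs and dom_equal are likewise illustrative.

```python
import xml.etree.ElementTree as ET

PREFIX = "data-mdflow-"


def normative_attrs(el: ET.Element) -> dict:
    """Attributes of `el` with the informational data-mdflow-* ones removed."""
    return {k: v for k, v in el.attrib.items() if not k.startswith(PREFIX)}


def dom_equal(a: ET.Element, b: ET.Element) -> bool:
    """Compare two subtrees, ignoring attributes in the data-mdflow-* namespace."""
    if a.tag != b.tag or normative_attrs(a) != normative_attrs(b):
        return False
    if (a.text or "") != (b.text or "") or (a.tail or "") != (b.tail or ""):
        return False
    if len(a) != len(b):
        return False
    return all(dom_equal(x, y) for x, y in zip(a, b))
```

With this helper, two renderings that differ only in data-mdflow-token-id or data-mdflow-plugin compare as equal, while a difference in any other attribute, tag, or text is still reported.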

3.6 Serialization to HTML

A renderer targeting HTML string output MUST produce markup semantically equivalent to the DOM output. Whitespace SHOULD be minimal. Attribute values MUST be enclosed in double quotes and MUST be escaped as specified in §10.3.2.
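
As an illustration of the quoting requirement, the sketch below serializes an opening tag with double-quoted attribute values. The function name open_tag is hypothetical, and Python's html.escape is used only as a placeholder for the escaping rules of §10.3.2, which are not reproduced in this chapter and may differ.

```python
from html import escape
from typing import Dict


def open_tag(name: str, attrs: Dict[str, str]) -> str:
    """Serialize an opening tag with double-quoted attribute values."""
    parts = [name]
    for key, value in attrs.items():
        # html.escape(quote=True) is a stand-in, NOT the §10.3.2 rules.
        parts.append(f'{key}="{escape(value, quote=True)}"')
    return "<" + " ".join(parts) + ">"


print(open_tag("a", {"href": "https://example.com/?a=1&b=2", "title": 'say "hi"'}))
# prints: <a href="https://example.com/?a=1&amp;b=2" title="say &quot;hi&quot;">
```

Attributes are emitted in insertion order with a single space between them, which keeps whitespace minimal.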

3.7 Canonical form

For comparison purposes (in vectors and in diffs), the canonical HTML form of an AST is defined by the block-by-block and inline-by-inline mapping in this specification. Two renderers that produce the same canonical form are considered to produce equivalent output.

The canonical form does not include data-mdflow-* attributes.