4. Lexical Structure

4.1 Character set

Source text is a sequence of Unicode characters encoded as UTF-8. A conforming implementation MUST decode UTF-8. Invalid UTF-8 byte sequences MUST be replaced with U+FFFD, following the Unicode Standard's recommended practice of substituting U+FFFD for each maximal subpart of an ill-formed sequence.

Source bytes that encode U+0000 MUST be replaced with U+FFFD before any further processing, in both the parsing and rendering paths.
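
The two replacement rules above can be sketched in a few lines of Python. This is a minimal illustration, not a normative implementation: the function name decode_source is hypothetical, and the sketch assumes that Python's built-in "replace" error handler is an acceptable stand-in for the substitution behavior this section requires.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode_source(raw: bytes) -> str:
    """Decode source bytes as UTF-8 (hypothetical helper, not part of the spec).

    Invalid byte sequences become U+FFFD via Python's "replace" error
    handler; any decoded U+0000 is then replaced with U+FFFD as well.
    """
    text = raw.decode("utf-8", errors="replace")
    return text.replace("\x00", REPLACEMENT)
```

Because the NUL replacement runs on the decoded text, every later stage (parsing or rendering) sees the same sanitized string.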

4.2 Line endings

A line ending is one of:

  • U+000D U+000A (CR LF)
  • U+000A (LF)
  • U+000D (CR)

All three are treated as equivalent and are internally normalized to U+000A before block-level analysis. A line ending at the end of the final line of source (before flush()) is OPTIONAL.
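
As an illustration, the normalization can be written as a single regular-expression substitution. The sketch below assumes the whole source is available as one decoded string; the function name normalize_line_endings is hypothetical. The alternation lists CR LF first so that a CR LF pair is consumed as one line ending rather than two.

```python
import re

# CR LF must be tried before a lone CR so the pair counts as one line ending.
_LINE_ENDING = re.compile(r"\r\n|\r|\n")


def normalize_line_endings(text: str) -> str:
    """Rewrite every CR LF, CR, or LF line ending as a single LF (U+000A)."""
    return _LINE_ENDING.sub("\n", text)
```

In streaming mode a CR at the very end of one chunk may be followed by an LF at the start of the next, so a chunk-at-a-time version would need to hold back a trailing CR; this sketch does not attempt that.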

4.3 Lines

A line is a sequence of source characters delimited by line endings, exclusive of the line ending. The number of lines in source S is one more than the number of line endings in S.

A blank line is a line containing only the characters U+0009, U+000B, U+000C, U+0020 (tab, vertical tab, form feed, space).
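
The line-counting rule and the blank-line test can be sketched as follows. The helper names split_lines and is_blank are hypothetical. The sketch treats a line with no characters at all as blank, which is an assumption about the intended reading of the definition above.

```python
import re

LINE_ENDING = re.compile(r"\r\n|\r|\n")
# Tab, vertical tab, form feed, space.
BLANK_CHARS = frozenset("\t\x0b\x0c ")


def split_lines(source: str) -> list[str]:
    """Split source into lines, excluding the line endings themselves.

    The result always has one more element than there are line endings,
    so a source that ends with a line ending yields a final empty line.
    """
    return LINE_ENDING.split(source)


def is_blank(line: str) -> bool:
    """Return True if the line holds only tab, VT, FF, or space characters.

    Assumption: an empty line counts as blank.
    """
    return all(ch in BLANK_CHARS for ch in line)
```

For example, split_lines("a\n \nb\n") returns four lines: "a", " ", "b", and a final empty line, of which the second and fourth are blank under this sketch.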

4.4 Whitespace

Whitespace in MdFlow source is any sequence of U+0009, U+000A, U+000B, U+000C, U+000D, U+0020. Within blocks, whitespace MAY be significant (e.g., in code blocks and tables); within inlines, whitespace collapses according to the block and inline rules in §6 and §7.

4.4.1 Tab expansion

A tab (U+0009) at the start of a block context is expanded with spaces up to the next tab stop; tab stops fall every 4 columns. Tabs elsewhere are preserved.
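
One possible reading of this rule is sketched below. It assumes that "the start of a block context" means the leading whitespace of a line and that columns are counted from 0; both are assumptions made for illustration, and the function name expand_leading_tabs is hypothetical.

```python
TAB_STOP = 4


def expand_leading_tabs(line: str) -> str:
    """Expand tabs in a line's leading whitespace to 4-column tab stops.

    Assumption: only leading spaces and tabs are considered; any tab
    after the first other character is preserved unchanged.
    """
    out = []
    column = 0
    i = 0
    while i < len(line) and line[i] in " \t":
        if line[i] == "\t":
            # Pad with spaces up to the next multiple of TAB_STOP.
            width = TAB_STOP - (column % TAB_STOP)
            out.append(" " * width)
            column += width
        else:
            out.append(" ")
            column += 1
        i += 1
    return "".join(out) + line[i:]
```

Under these assumptions a tab that follows two leading spaces contributes only two columns, because it advances to the tab stop at column 4 rather than adding a fixed four spaces.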

4.5 ASCII categories

This specification uses the following ASCII categories:

  • ASCII letter: U+0041–U+005A, U+0061–U+007A.
  • ASCII digit: U+0030–U+0039.
  • ASCII alphanumeric: letter or digit.
  • ASCII punctuation: any of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~ (U+0021–U+002F, U+003A–U+0040, U+005B–U+0060, U+007B–U+007E).
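
The four categories map directly onto small predicates. The sketch below uses hypothetical function names and relies on Python's string.punctuation constant, which contains the same 32 ASCII punctuation characters. Explicit range checks are used instead of str.isalpha() or str.isdigit(), since those methods also accept non-ASCII letters and digits.

```python
import string

# string.punctuation holds the 32 ASCII punctuation characters.
ASCII_PUNCTUATION = frozenset(string.punctuation)


def is_ascii_letter(ch: str) -> bool:
    """U+0041-U+005A or U+0061-U+007A."""
    return len(ch) == 1 and ("A" <= ch <= "Z" or "a" <= ch <= "z")


def is_ascii_digit(ch: str) -> bool:
    """U+0030-U+0039."""
    return len(ch) == 1 and "0" <= ch <= "9"


def is_ascii_alphanumeric(ch: str) -> bool:
    """ASCII letter or ASCII digit."""
    return is_ascii_letter(ch) or is_ascii_digit(ch)


def is_ascii_punctuation(ch: str) -> bool:
    """One of the 32 ASCII punctuation characters."""
    return ch in ASCII_PUNCTUATION
```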

4.6 Escapes

A backslash (\) followed by an ASCII punctuation character escapes that character: in inline contexts the pair is treated as a single literal character with its Markdown significance removed. A backslash not followed by an ASCII punctuation character is treated as a literal backslash.

In code blocks and code spans, backslashes are literal.
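
For inline contexts, the escape rule can be sketched as a scan that emits each character together with a flag saying whether it was escaped. The function name resolve_escapes and the (character, escaped) pair representation are hypothetical choices for illustration; a real inline parser would carry this information in its own token type. Code blocks and code spans would bypass this step entirely, since backslashes there are literal.

```python
import string

_ASCII_PUNCTUATION = frozenset(string.punctuation)


def resolve_escapes(text: str) -> list[tuple[str, bool]]:
    """Scan inline text and return (character, escaped) pairs.

    A backslash followed by ASCII punctuation yields that character with
    escaped=True, meaning later stages should treat it as plain text.
    Any other backslash, including one at the end of the text, is
    emitted as a literal backslash.
    """
    result = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in _ASCII_PUNCTUATION:
            result.append((text[i + 1], True))
            i += 2
        else:
            result.append((ch, False))
            i += 1
    return result
```

Because the backslash is itself ASCII punctuation, two consecutive backslashes resolve to one escaped backslash under this sketch.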

4.7 Chunk definition (streaming)

For the purpose of §5 Streaming Model, a chunk is any contiguous byte range of source passed to push(). A chunk MAY split a character, a line, or a token; see §5.9 for the byte-boundary handling rule.
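
To illustrate why a chunk that splits a character needs care, the sketch below buffers incomplete UTF-8 sequences across chunks using Python's standard incremental decoder. It is only one possible approach and does not restate §5.9, which remains the governing rule. The class name ChunkDecoder is hypothetical, and its push() and flush() methods merely mirror the names used in this specification without claiming to match their actual signatures.

```python
import codecs


class ChunkDecoder:
    """Hypothetical helper that decodes UTF-8 delivered in arbitrary chunks."""

    def __init__(self) -> None:
        # The incremental decoder keeps trailing bytes of an incomplete
        # sequence until the next chunk (or the final flush) arrives.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def push(self, chunk: bytes) -> str:
        """Decode one chunk, returning only the characters completed so far."""
        return self._decoder.decode(chunk)

    def flush(self) -> str:
        """Signal end-of-input; a dangling partial sequence becomes U+FFFD."""
        return self._decoder.decode(b"", final=True)
```

A two-byte character split across two push() calls is returned whole by the second call, and nothing is emitted for it by the first.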

4.8 Comments

MdFlow source has no comment syntax. Sequences that might appear to be HTML comments (<!-- ... -->) are not recognized and are processed as literal text subject to the inline rules.

4.9 Directives (deferred)

This specification reserves ::: fences and {key=value} attribute blocks syntactically. They MUST NOT be parsed in v1.0; implementations MUST treat them as paragraph text. This reservation protects future extension.

4.10 End-of-input

End-of-input is signaled by flush() in streaming mode and by the end of the source in batch mode. Any pending tokens are finalized according to the finalization rule for their kind.