CodeProject.Syntax.LALR
2.1.0
Moved to diff org
dotnet add package CodeProject.Syntax.LALR --version 2.1.0
NuGet\Install-Package CodeProject.Syntax.LALR -Version 2.1.0
<PackageReference Include="CodeProject.Syntax.LALR" Version="2.1.0" />
<PackageVersion Include="CodeProject.Syntax.LALR" Version="2.1.0" />
<PackageReference Include="CodeProject.Syntax.LALR" />
paket add CodeProject.Syntax.LALR --version 2.1.0
#r "nuget: CodeProject.Syntax.LALR, 2.1.0"
#:package CodeProject.Syntax.LALR@2.1.0
#addin nuget:?package=CodeProject.Syntax.LALR&version=2.1.0
#tool nuget:?package=CodeProject.Syntax.LALR&version=2.1.0
CodeProject.Syntax.LALR
A modernized LALR(1) parser-table generator and runtime for C#, originally adapted
from Phillip Voyle's CodeProject article
LALR Parse Table Generation in C#.
The parsing loop follows the GOLD parser pseudocode; the lexer pipeline was rewritten
to consume UTF-8 bytes end-to-end via System.IO.Pipelines and a typed, byte-level
DFA.
Grammars can be defined inline in C# or declaratively in *.lalr.yaml files
that a Roslyn source generator turns into a typed schema, AST records, and a
visitor surface at build time.
Status (May 2026): .NET 10, C# 14, Native AOT-clean, xUnit v3 test suite (283 tests). Library code is allocation-light, reflection-free, and trim-/AOT- compatible. Single NuGet package ships the runtime and the source generator.
Quick start (NuGet)
<ItemGroup>
<PackageReference Include="CodeProject.Syntax.LALR" Version="2.0.0" />
<AdditionalFiles Include="grammar.lalr.yaml" />
</ItemGroup>
A minimal YAML grammar — a tiny calculator:
# grammar.lalr.yaml
# Index 0 is the start symbol. The parser returns as soon as the stack
# settles on the start symbol, so make it distinct from the recursive
# expression — otherwise the first reduction terminates the parse before
# any operator gets matched.
symbols: [S, E, '+', n, WS]
productions:
- derivation: none
rules:
- { lhs: S, rhs: [E] } # start production; default Reduction passes E through
- derivation: leftmost
rules:
- { lhs: E, rhs: [E, '+', E], action: add }
- { lhs: E, rhs: [n], action: number }
lexer:
root:
- { symbol: n, match: '[0-9]+' }
- { symbol: '+', match: '\+' }
- { symbol: WS, match: '[ \t]+', action: ignore }
The generator emits a partial class named after the YAML file
(grammar.lalr.yaml → Grammar), with a populated Schema property, one
record per action, and an IVisitor interface:
using CodeProject.Syntax.LALR;
using CodeProject.Syntax.LALR.LexicalGrammar;
using CodeProject.Syntax.LALR.Schema;
class Calc : Grammar.IVisitor<int>
{
public int Visit(Grammar.Number node) => int.Parse((string)node.Arg0.Content);
public int Visit(Grammar.Add node) => (int)node.Arg0.Content + (int)node.Arg2.Content;
}
var (g, lex) = SchemaCompiler.Compile(Grammar.Schema, Grammar.BuildActions(new Calc()));
var parser = new Parser(g);
using var lexer = PipeBytesLexer.FromString("1 + 2 + 3 + 4", lex);
using var tokens = new AsyncLATokenIterator(lexer);
var result = await parser.ParseInputAsync(tokens);
Console.WriteLine(result.Content); // 10
The full working version is in examples/Calculator/.
For a non-toy example see examples/Json/ — a real JSON
parser in ~50 visitor lines that builds Dictionary<string,object> /
List<object> / primitives.
Design goals
The original 2011 article was a teaching codebase for parse-table construction. This fork keeps the algorithmic core but pursues a different set of properties:
- AOT-first. The library is annotated
IsAotCompatible=trueandIsTrimmable=true. No reflection, nodynamic, no runtime code generation.dotnet publishon the demo executables produces a single ~2.5 MB native exe with zero AOT analyzer warnings. - Bytes-to-tokens, no UTF-16 detour. Input flows from a
PipeReaderdirectly into a UTF-8 byte DFA. There's no per-characterstring, noSystem.Text.RegularExpressions.Regex, and noStreamReaderin the hot loop. UTF-8 decoding happens once per token, at a token boundary, into the token'sContent. - Async at I/O boundaries, sync in hot loops.
PipeReader.ReadAsyncis the onlyawaitper buffer. The byte DFA loops sync acrossReadOnlySequence<byte>segments; the parser loop awaits only when fetching the next token. - Immutable data structures. Every grammar/parser-table value type
(
Production,PrecedenceGroup,Grammar,LR0Item,LR1Item,Action,ParseTable,SymbolName,LexRule, the regex-AST nodes…) is areadonly structwith primary constructors andIEquatable<T>where equality matters. - Typed regex AST. Lexer patterns are built from
IRxcombinators (CharRx,CharRangeRx,CharClassRx,CharSequenceRx,GroupRx,Multiplicity) instead of opaque regex strings. The AST is compiled once, at lexer construction, into a UTF-8 byte DFA — no runtime regex engine. - Grammars as data. YAML files hold the grammar shape; a build-time source generator emits the schema, AST records, and a typed visitor. Inline C# grammars still work, but the YAML path is the recommended one — you write data, the generator does the boilerplate.
- Small public surface. The library targets
net10.0, depends solely on the BCL at runtime (YamlDotNet runs only inside the build-time generator), and exposesGrammar,Parser,PipeBytesLexer,IRxcombinators,LexRule,Item, theSchema/types, and the source-generated YAML surface.
Solution layout
| Project | Type | Purpose |
|---|---|---|
CodeProject.Syntax.LALR/ |
Library (net10.0, AOT-compatible, trimmable) | Grammar model, LALR(1) parse-table generator, runtime parser, lexer infrastructure (PipeBytesLexer, IRx combinators, byte-DFA compiler), Item value type, Schema/ POCOs + compiler. The published NuGet package. |
CodeProject.Syntax.LALR.SourceGenerators/ |
Roslyn analyzer (netstandard2.0, IsRoslynComponent=true) |
YAML grammar source generator. Reads *.lalr.yaml AdditionalFiles at build time, emits <ClassName>.g.cs (schema), <ClassName>.Ast.g.cs (record per action), and <ClassName>.Visitor.g.cs (typed IVisitor interface + BuildActions). YamlDotNet is build-time only. |
Bootstrap/ |
Exe (PublishAot=true) |
Stage 0: hand-codes the BNF meta-grammar in C# and parses a BNF source string with the resulting parser. Reference implementation — depends only on the runtime library, no generator, no YAML. |
Bootstrap.Stage1/ |
Exe (PublishAot=true) |
Stage 1: same BNF meta-grammar, but defined in bnf.lalr.yaml and consumed via the source generator + visitor pipeline. CI diffs stage0 ↔ stage1 Accept output for byte-identical parity. |
TestProject/ |
Exe (PublishAot=true) |
Arithmetic-expression demo with operator precedence and constant folding during reduction. Inline C# grammar; uses an inline tokenizer instead of PipeBytesLexer to show that any IAsyncIterator<Item> plugs in. |
examples/Calculator/ |
Exe | Smallest end-to-end YAML-pipeline demo: a 5-symbol calculator grammar (1 + 2 + 3 + 4 = 10) with two visitor methods. Mirrors the README quick-start. |
examples/Json/ |
Exe | Real JSON parser via the YAML pipeline. ~50-line IVisitor implementation builds Dictionary<string,object> / List<object> / primitives. |
CodeProject.Syntax.LALR.Tests/ |
xUnit v3 (Microsoft.Testing.Platform) | 283 tests covering the regex-AST builders, byte/codepoint DFAs, lexer/parser pipeline, diagnostics, schema layer, the source generator (incl. end-to-end "emit → compile → load → parse"), and parser semantics regressions. |
Shared MSBuild settings (TargetFramework=net10.0, LangVersion=14, deterministic
build, etc.) live in Directory.Build.props. NuGet metadata, symbol packages,
SourceLink, and the bundled-analyzer pack target live on CodeProject.Syntax.LALR.csproj.
Architecture: how a parse happens
Three stages plus the parser. Each implements or consumes
IAsyncIterator<T> / IAsyncLAIterator<T> (the LA variant adds one-token
lookahead).
┌──────────────────────────────────────────────┐
UTF-8 bytes ──▶ │ PipeBytesLexer │ ──▶ Item tokens
(Pipe) │ • per-state byte DFA │
│ • IRx pattern → codepoint NFA → codepoint │
│ DFA → UTF-8 byte DFA, all at construction │
│ • longest match, first-rule-wins on ties │
│ • #pop / #ignore / push-state instructions │
│ • UTF-8 decode once, at token boundary │
└──────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ AsyncLATokenIterator │ ──▶ Item + 1-token LA
│ • adapts any IAsyncIterator<Item> to LA │
└──────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────┐
│ Parser │ ──▶ Item (root reduction)
│ • parse table built once in constructor │
│ • ParseInputAsync runs the standard │
│ shift/reduce/accept/error loop │
│ • applies per-production rewriters │
└──────────────────────────────────────────────┘
Stage 1: PipeBytesLexer
LexicalGrammar/PipeBytesLexer.cs. Owns a System.IO.Pipelines.PipeReader,
emits Item tokens.
The lexer is configured with a state table:
IReadOnlyDictionary<string, LexRule[]> where LexRule = (int symbolId, IRx pattern, string instruction).
At construction, each state's rules are compiled into a single byte DFA:
DfaCompiler.CompileManybuilds a codepoint NFA via Thompson construction from the typedIRxAST, then determinizes it with subset construction. Each rule's index becomes its DFA pattern id; on overlap the smaller id wins (so the rule listed first wins on equal-length matches).Utf8DfaLowering.Lowerconverts each codepoint transition[lo..hi] → sinto one or more UTF-8 byte chains, splitting along the 1/2/3/4-byte boundaries and using the standard three-way split (Russ Cox) when leading bytes differ. Surrogate codepoints D800..DFFF are excluded — they have no valid UTF-8 encoding.
The hot loop in MoveNextAsync reads ReadOnlySequence<byte> segments from the
pipe and steps the DFA byte-by-byte (sync). On a longest accepting prefix, the
matched bytes are UTF-8-decoded once into Item.Content; the rule's
Instruction drives the state stack:
| Instruction | Effect |
|---|---|
null / empty |
stay in the current state |
PipeBytesLexer.Ignore ("#ignore") |
drop the matched token, restart scanning |
PipeBytesLexer.PopState ("#pop") |
pop one state off the stack |
| any other string | push that state name |
Factories: PipeBytesLexer.FromBytes(ReadOnlyMemory<byte>, …),
FromStream(Stream, …), FromString(string, …).
Errors fail fast. When input bytes don't match any rule in the current
state, the lexer's LexerErrorMode decides what to do:
| Mode | Effect |
|---|---|
Throw (default) |
Raise LexerException carrying the offending byte, SourcePosition, and lexer state name. |
EmitAndStop |
Emit one error Item (IsError==true, Content is the hex byte), then return false from subsequent MoveNextAsync calls. |
EmitAndSkip |
Emit an error Item, advance the cursor by one byte, continue scanning. |
EmitAndStop and EmitAndSkip require a non-negative errorSymbolId argument
so the emitted Item is identifiable by the consumer. Clean EOF (no bytes
remaining) is always silent — it never triggers any error mode.
Stage 1.5: PipeRuneIterator (alternative input path)
LexicalGrammar/PipeRuneIterator.cs. Standalone codepoint iterator
(IAsyncLAIterator<int>) that decodes UTF-8 bytes directly to Unicode scalars
via Rune.DecodeFromUtf8. Not used by the parse pipeline — PipeBytesLexer
goes from bytes to tokens directly — but available for callers that want
codepoints.
Stage 2: AsyncLATokenIterator
Wraps any IAsyncIterator<Item> and adds one-token lookahead so the parser can
decide between shift and reduce at any state.
Stage 3: Parser
Parser.cs. The Parser constructor builds the LALR(1) parse table eagerly
from the Grammar (failing fast with GrammarConflictException on any
unresolved S/R or R/R conflict). ParseInputAsync(IAsyncLAIterator<Item>, Debug?, …)
runs the standard parse loop, consulting ParseTable.Actions[state, tokenId+1]
for every shift/reduce/accept/error decision. After each reduce the matching
Production's rewriter (Func<int, Item[], object>) builds the reduced node's
Content; productions without a rewriter fall back to a default Reduction.
Visitors that legitimately want to return C# null are honoured —
Production.HasRewriter distinguishes "rewriter returned null" from "no
rewriter at all".
The parse loop honors a ParserErrorMode and a CancellationToken:
| Parameter | Effect |
|---|---|
errorMode: ParserErrorMode.Throw (default) |
Raise ParseErrorException on the first parse error. The exception carries the offending Item, the LALR(1) state, and the set of symbol ids that would have been valid (ExpectedSymbolIds). |
errorMode: ParserErrorMode.Return |
Legacy: return the offending Item (with IsError==true) so the caller can produce diagnostics or feed it into a partial parse tree. |
cancellationToken: ct |
Checked once per parse-loop iteration via ThrowIfCancellationRequested. |
debugger: null (default) |
No tracing — methods on Debug are [Conditional("DEBUG")] anyway, but this skips even the call site. |
Item is the unifying value type for terminal tokens and reduced
non-terminals. Its ContentType is Scalar (raw lexer string), Reduction
(default), or Nested. Tokens that bubble up through IsError propagate error
state through reductions automatically. Every Item carries a SourcePosition
(1-based Line, byte-based 1-based Column, absolute ByteOffset).
Two ways to define a grammar
Inline C# (Bootstrap/Program.cs, TestProject/Main.cs)
new Grammar(
symbolTable, // SymbolName[] — id == index
new PrecedenceGroup(Derivation.None, // tightest binding first
new Production((int)S.Result, (_, x) => x[0], (int)S.Expr)),
new PrecedenceGroup(Derivation.LeftMost,
new Production((int)S.Expr, RewriteBinary, (int)S.Expr, (int)S.Plus, (int)S.Expr)));
| Type | Purpose |
|---|---|
SymbolName(int id, string name) |
Terminal/non-terminal entry; id == index in the symbol table; symbol id 0 is the start symbol. |
Production(int left, Func<int, Item[], object> rewriter, params int[] right) |
A production with an optional semantic rewriter. The rewriter receives the LHS id and the matched right-hand Items; whatever it returns becomes the reduced item's Content. |
PrecedenceGroup(Derivation, params Production[]) |
A bundle of productions sharing precedence. Earlier groups bind more tightly; Derivation (None / LeftMost / RightMost) decides shift-vs-reduce and reduce-vs-reduce conflicts. |
Inline grammars are still useful for small, static cases (and are how stage0
Bootstrap and TestProject are written), but they entangle grammar shape
with action code.
YAML + source generator (recommended)
Author <name>.lalr.yaml next to your code, list it under <AdditionalFiles>
in your csproj, and the source generator emits three companion files:
| Emitted file | Contents |
|---|---|
<Name>.g.cs |
public static partial class <Name> { public static GrammarSchema Schema { get; } = new() { … }; } |
<Name>.Ast.g.cs |
One public sealed record <Action>(Item Arg0, …) per distinct action: name in the YAML. |
<Name>.Visitor.g.cs |
A nested public interface IVisitor<out T> with one T Visit(<Record>) overload per record, plus public static IReadOnlyDictionary<string, Func<int, Item[], object>> BuildActions<T>(IVisitor<T> visitor) that constructs records from the parser's reduction frame and dispatches by C# overload resolution. Use IVisitor<int> for an evaluator, IVisitor<object> when methods need different shapes per production. |
Your code implements IVisitor, calls Schema/SchemaCompiler.Compile(Schema, BuildActions(visitor)),
and feeds the result to a Parser. See the Quick start above and examples/Json/
for a worked example. bnf.lalr.yaml (Bootstrap.Stage1) is a larger reference
showing a multi-state lexer (push-pop quoted strings) and 17 actions.
The YAML schema is documented inline in CodeProject.Syntax.LALR/Schema/GrammarSchema.cs.
The regex dialect for match: patterns is described in
CodeProject.Syntax.LALR/Schema/IRxParser.cs — it's deliberately small (no
alternation; express alternatives as multiple lexer rules).
Examples
Bootstrap — self-describing BNF (stage 0)
Hand-codes a grammar in C# that describes the BNF notation itself, then parses a BNF source string with the resulting parser. Reference implementation that depends only on the runtime library — no source generator, no YAML, always buildable from a clean checkout. Useful as a regression baseline.
Bootstrap.Stage1 — same grammar, YAML-driven (stage 1)
Same BNF meta-grammar, but defined in bnf.lalr.yaml and consumed via the
generator + a typed IVisitor implementation. Demonstrates a full multi-state
lexer (push state on ", pop on the matching close). CI diffs stage0 ↔ stage1
Accept output to catch any regression in the YAML/schema/generator stack.
TestProject — arithmetic with constant folding
+ − × ÷ with parentheses, a per-production rewriter that does constant
folding during reduction (e.g. (24 / 12) + 2 * (3-4) reduces to 0 at parse
time), and an inline tokenizer instead of PipeBytesLexer to show that any
IAsyncIterator<Item> plugs in.
examples/Calculator — minimal YAML demo
Three productions, one lexer state, two visitor methods. Mirrors the README quick-start. The smallest grammar that exercises every interesting bit of the YAML pipeline (start production, leftmost-derivation conflict resolution, ignored whitespace, an emitted record per action).
examples/Json — real JSON via the YAML pipeline
json.lalr.yaml (16 productions) plus a ~50-line IVisitor implementation
that builds Dictionary<string, object> / List<object> / primitives. The
canonical "non-toy grammar end-to-end via the new pipeline" reference. Lexer
limitations (no backslash escapes inside strings) are documented in the YAML
header.
Building, testing, releasing
# Build the whole solution
dotnet build CodeProject.Syntax.LALR.sln -c Debug # or -c Release
# Run all tests (xUnit v3 on Microsoft.Testing.Platform — 283 tests)
dotnet test CodeProject.Syntax.LALR.Tests/CodeProject.Syntax.LALR.Tests.csproj -c Debug
# Run a subset of tests (MTP --filter-method, glob-syntax)
dotnet run --project CodeProject.Syntax.LALR.Tests/CodeProject.Syntax.LALR.Tests.csproj -c Debug -- --filter-method "*PipeBytesLexer*"
# Run the demos / examples end-to-end
dotnet run --project Bootstrap/Bootstrap.csproj -c Release # stage 0 (inline grammar)
dotnet run --project Bootstrap.Stage1/Bootstrap.Stage1.csproj -c Release # stage 1 (YAML pipeline)
dotnet run --project TestProject/TestProject.csproj -c Release # arithmetic + constant folding
dotnet run --project examples/Calculator/Examples.Calculator.csproj -c Release # minimal YAML demo
dotnet run --project examples/Json/Examples.Json.csproj -c Release # real JSON via visitor
# Native AOT publish (verifies library + demos stay AOT-clean)
dotnet publish Bootstrap/Bootstrap.csproj -c Release
dotnet publish Bootstrap.Stage1/Bootstrap.Stage1.csproj -c Release
dotnet publish TestProject/TestProject.csproj -c Release
# Local NuGet pack (runtime + bundled source generator + YamlDotNet)
dotnet pack CodeProject.Syntax.LALR/CodeProject.Syntax.LALR.csproj -c Release -o packages
The demo binaries print step-by-step parse traces in Debug; in Release the
[Conditional("DEBUG")] traces drop out and only the final accept/reject line
is printed.
CI/CD
.github/workflows/dotnet.yml runs on every push and PR:
- Restore, build (Release), test, stage0-vs-stage1 Accept-output diff.
- On master push: also pack the runtime nupkg + snupkg and upload as a build artifact.
- On
v*tag push: verify the tag matches the csproj<Version>, then publish to NuGet using theNUGET_SECRETrepo secret (--skip-duplicatefor idempotent re-runs). The.snupkgis auto-routed to symbols.nuget.org.
To cut a release: bump <Version> in CodeProject.Syntax.LALR/CodeProject.Syntax.LALR.csproj,
commit, tag vX.Y.Z matching the version, push the tag.
Roadmap and known limitations
The project is usable as-is and on NuGet. Items still on the list, ranked:
- Generator-time grammar validation. S/R + R/R conflicts and unknown-symbol
references surface at runtime today (
GrammarConflictException,SchemaCompilationException). LinkingSchemaCompilersource into the generator (same trick asGrammarSchema.cs/Derivation.cs) lets the generator runCompileat build time and emitLALR0003/LALR0004Roslyn diagnostics with file/line locators. Surface is bigger than it looks —SchemaCompilertransitively pulls inLexicalGrammar/(LexRule,IRx- impls,
IRxParser).
- impls,
- Generic typed visitor return.
IVisitor.Visit(…)returnsobject. ATgeneric on the visitor (andBuildActions<T>) would remove the cast at the call site. Defer until a user complains about the cost. - Codepoint columns on
Item.SourcePosition.Columnis byte-based (documented). An optional codepoint-column decoder would help diagnostics- quality positions for non-ASCII grammars. - More example grammars. TOML config, C declaration syntax (the famous
"Lexer Hack"), an arithmetic-with-unary-ops upgrade for
TestProject. Each ships underexamples/. - No alternation inside a single
IRxpattern. Use multipleLexRules in the same state to express alternatives. (Adding|to the AST is small; hasn't been needed yet.) - Async-per-token at the parser/iterator boundary. The lexer's inner loop
is sync and the only
awaitper byte-buffer isPipeReader.ReadAsync. The remaining async-per-token cost lives inIAsyncIterator<Item>callers and the parser loop. A pure-sync,ref struct-style parser is possible; not a priority.
Provenance & license
Originally adapted from Phillip Voyle's LALR Parse Table Generation in C#. The core LALR algorithm follows the GOLD parser pseudocode.
Licensed under the
Code Project Open License (CPOL) 1.02.
See LICENSE.html for the full text and LICENSE.md for a non-binding
summary.
| Product | Versions Compatible and additional computed target framework versions. |
|---|---|
| .NET | net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
-
net10.0
- No dependencies.
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.