Introduction Lexical analysis is the foundational first step in building compilers, template engines, and high-throughput data parsers. Writing a lexer by hand using nested loops and if statements is tedious and error-prone. Standard tools like Flex often introduce performance overhead due to virtual function calls and generic table-driven lookups.
When absolute speed is your primary metric, re2c is the industry-standard choice. Unlike traditional table-driven lexers, re2c compiles regular expressions directly into optimized, hard-coded C/C++ conditional jumps (goto statements). This approach minimizes CPU branch mispredictions and eliminates memory lookups. High-profile, performance-critical projects like PHP, Ninja, and SpamAssassin rely on re2c for their scanning needs. Why Choose re2c Over Flex?
Hard-Coded State Machines: Instead of navigating an array-based lookup table at runtime, re2c generates pure C/C++ code. The state machine is built natively into the execution flow.
Zero Overhead: It creates no external library dependencies. The generated code uses basic arrays, pointers, and CPU-friendly conditional jumps.
Flexible Input Models: It does not force you to use a specific buffer structure. It works directly on memory buffers, null-terminated strings, streams, or custom iterator classes.
Storable States: It natively supports lookahead, push-parsing, and asynchronous data streams via storable state configurations. Core Architecture of an re2c Lexer
An re2c source file mixes standard C++ logic with blocks of re2c directives. The directives are wrapped inside special comment blocks: /I@re2c …/.
The engine depends on a set of core API macros or variables that you must define in your code environment to track buffer boundaries:
YYCURSOR: A pointer to the current character being scanned. The engine increments this automatically. YYLIMIT: A pointer to the end of the input buffer.
YYMARKER: A pointer used to save a position for backtracking when a match is ambiguous. Step-by-Step Implementation: Building a JSON-like Tokenizer
Let us build a production-grade integer, identifier, and string tokenizer to demonstrate re2c in action. 1. Define the Token Types
First, establish an enumeration to represent the tokens your scanner will output.
#include Use code with caution. 2. Write the Scanner Function
Next, write the tokenization loop incorporating re2c directives. Save this file as lexer.re.
Token scan(const char &cursor, const char* limit) { const char* start = cursor; const char* marker; /!re2c re2c:api:style = free-form; re2c:define:YYCURSOR = cursor; re2c:define:YYLIMIT = limit; re2c:define:YYMARKER = marker; // Regular expression definitions whitespace = [ ]+; digit = [0-9]+; letter = [a-zA-Z_]; (letter | [0-9]); str = ‘“’ [^”]* ‘“’; // Matching rules * { return Token::Unknown; } “