Regular expressions are handy. Who hasn't written a quick and dirty "parser" here or there that just throws a regex at some input and checks if it matches. In most modern languages there are builtin regex tools with excellent features and of good quality. In C/C++ things are a little more difficult.
You can use the shiny new regex library from the C++11 standard if your compiler supports it, or you can use Boost.Regex. Both are ... let's call it "surprisingly complex". And on top of it, the runtimes are of varying quality.
Also, sometimes you can only use C which would then force you to use another external library.
But there is an alternative. If you have worked with regexes more deeply, you will know that they basically are tiny programs in a weird programming language. The regex engine is the interpreter. Some regex engines are even interpreters with an optimizing JIT compiler. But ... if you can compile stuff during runtime, you can also compile it ahead of time - given you know the program at compile time.
So if you have a fixed regex in your program, there should be tools that can compile your regex into executable code. One such tool is re2c.
re2c is a code generator that takes a bunch of regular expressions and generates a C function that you can compile and call. This can be used for writing the lexer stage in a parser/compiler and thus re2c is also sometimes called a lexer generator. But this is nothing else then a more fancy regex compiler. So let's have a look at an example.
Given this C++ program with re2c annotations:
enum Token {
TK_None,
TK_PositiveDecimal,
TK_NegativeDecimal
};
Token regex_match_decimal(const char *YYCURSOR) {
const char *YYMARKER;
/*!re2c
re2c:define:YYCTYPE = char;
re2c:yyfill:enable = 0;
end = "\x00";
dec = [0-9]+;
sign = "-";
* { return TK_None; }
dec end { return TK_PositiveDecimal; }
sign dec end { return TK_NegativeDecimal; }
*/
}
re2c will generate this lexer:
/* Generated by re2c 1.0.2 on Mon Oct 9 20:41:14 2017 */
enum Token {
TK_None,
TK_PositiveDecimal,
TK_NegativeDecimal
};
Token regex_match_decimal(const char *YYCURSOR) {
const char *YYMARKER;
{
char yych;
yych = *YYCURSOR;
switch (yych) {
case '-': goto yy4;
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9': goto yy5;
default: goto yy2;
}
yy2:
++YYCURSOR;
yy3:
{ return TK_None; }
yy4:
yych = *(YYMARKER = ++YYCURSOR);
switch (yych) {
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9': goto yy6;
default: goto yy3;
}
yy5:
yych = *(YYMARKER = ++YYCURSOR);
switch (yych) {
case 0x00: goto yy9;
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9': goto yy11;
default: goto yy3;
}
yy6:
yych = *++YYCURSOR;
switch (yych) {
case 0x00: goto yy13;
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9': goto yy6;
default: goto yy8;
}
yy8:
YYCURSOR = YYMARKER;
goto yy3;
yy9:
++YYCURSOR;
{ return TK_PositiveDecimal; }
yy11:
yych = *++YYCURSOR;
switch (yych) {
case 0x00: goto yy9;
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9': goto yy11;
default: goto yy8;
}
yy13:
++YYCURSOR;
{ return TK_NegativeDecimal; }
}
}
What you can see is that:
- The generated code doesn't need a runtime.
- The generated code is just raw C.
- It's a simple statemachine implementation.
If you like that, have a look at the manual and examples on re2c.org. In the latest versions it even supports utf-8, push matchers and a bunch of compiler extensions to generate even faster matchers.