The Lemon Parser Generator
|
|
|
|
Lemon is an LALR(1) parser generator for C. It does the same job as
|
|
"bison" and "yacc". But Lemon is not a bison or yacc clone. Lemon uses a
|
|
different grammar syntax which is designed to reduce the number of coding
|
|
errors. Lemon also uses a parsing engine that is faster than yacc and
|
|
bison and which is both reentrant and threadsafe. (Update: Since the
|
|
previous sentence was written, bison has also been updated so that it too
|
|
can generate a reentrant and threadsafe parser.) Lemon also implements
|
|
features that can be used to eliminate resource leaks, making it suitable
|
|
for use in long-running programs such as graphical user interfaces or
|
|
embedded controllers.
|
|
|
|
This document is an introduction to the Lemon parser generator.
|
|
|
|
1.0 Table of Contents
|
|
|
|
* Introduction
|
|
* 1.0 Table of Contents
|
|
* 2.0 Security Note
|
|
* 3.0 Theory of Operation
|
|
* 3.1 Command Line Options
|
|
* 3.2 The Parser Interface
|
|
* 3.2.1 Allocating The Parse Object On Stack
|
|
* 3.2.2 Interface Summary
|
|
* 3.3 Differences With YACC and BISON
|
|
* 3.4 Building The "lemon" Or "lemon.exe" Executable
|
|
* 4.0 Input File Syntax
|
|
* 4.1 Terminals and Nonterminals
|
|
* 4.2 Grammar Rules
|
|
* 4.3 Precedence Rules
|
|
* 4.4 Special Directives
|
|
* 5.0 Error Processing
|
|
* 6.0 History of Lemon
|
|
* 7.0 Copyright
|
|
|
|
2.0 Security Note
|
|
|
|
The language parser code created by Lemon is very robust and is
|
|
well-suited for use in internet-facing applications that need to safely
|
|
process maliciously crafted inputs.
|
|
|
|
The "lemon.exe" command-line tool itself works great when given a valid
|
|
input grammar file and almost always gives helpful error messages for
|
|
malformed inputs. However, it is possible for a malicious user to craft a
|
|
grammar file that will cause lemon.exe to crash. We do not see this as a
|
|
problem, as lemon.exe is not intended to be used with hostile inputs. To
|
|
summarize:
|
|
|
|
* Parser code generated by lemon → Robust and secure
|
|
* The "lemon.exe" command line tool itself → Not so much
|
|
|
|
3.0 Theory of Operation
|
|
|
|
Lemon is a computer program that translates a context free grammar (CFG) for
|
|
a particular language into C code that implements a parser for that
|
|
language. The Lemon program has two inputs:
|
|
|
|
* The grammar specification.
|
|
* A parser template file.
|
|
|
|
Typically, only the grammar specification is supplied by the programmer.
|
|
Lemon comes with a default parser template ("lempar.c") that works fine
|
|
for most applications. But the user is free to substitute a different
|
|
parser template if desired.
|
|
|
|
Depending on command-line options, Lemon will generate up to three output
|
|
files.
|
|
|
|
* C code to implement a parser for the input grammar.
|
|
* A header file defining an integer ID for each terminal symbol (or
|
|
"token").
|
|
* An information file that describes the states of the generated parser
|
|
automaton.
|
|
|
|
By default, all three of these output files are generated. The header file
|
|
is suppressed if the "-m" command-line option is used and the report file
|
|
is omitted when "-q" is selected.
|
|
|
|
The grammar specification file uses a ".y" suffix, by convention. In the
|
|
examples used in this document, we'll assume the name of the grammar file
|
|
is "gram.y". A typical use of Lemon would be the following command:
|
|
|
|
lemon gram.y
|
|
|
|
This command will generate three output files named "gram.c", "gram.h" and
|
|
"gram.out". The first is C code to implement the parser. The second is the
|
|
header file that defines numerical values for all terminal symbols, and
|
|
the last is the report that explains the states used by the parser
|
|
automaton.
|
|
|
|
3.1 Command Line Options
|
|
|
|
The behavior of Lemon can be modified using command-line options. You can
|
|
obtain a list of the available command-line options together with a brief
|
|
explanation of what each does by typing
|
|
|
|
lemon "-?"
|
|
|
|
As of this writing, the following command-line options are supported:
|
|
|
|
* -b Show only the basis for each parser state in the report file.
|
|
* -c Do not compress the generated action tables. The parser will be a
|
|
little larger and slower, but it will detect syntax errors sooner.
|
|
* -ddirectory Write all output files into directory. Normally, output
|
|
files are written into the directory that contains the input grammar
|
|
file.
|
|
* -Dname Define C preprocessor macro name. This macro is usable by
"%ifdef", "%ifndef", and "%if" lines in the grammar file.
|
|
* -E Run the "%if" preprocessor step only and print the revised grammar
|
|
file.
|
|
* -g Do not generate a parser. Instead write the input grammar to
|
|
standard output with all comments, actions, and other extraneous text
|
|
removed.
|
|
* -l Omit "#line" directives in the generated parser C code.
|
|
* -m Cause the output C source code to be compatible with the
|
|
"makeheaders" program.
|
|
* -p Display all conflicts that are resolved by precedence rules.
|
|
* -q Suppress generation of the report file.
|
|
* -r Do not sort or renumber the parser states as part of optimization.
|
|
* -s Show parser statistics before exiting.
|
|
* -Tfile Use file as the template for the generated C-code parser
|
|
implementation.
|
|
* -x Print the Lemon version number.
|
|
|
|
3.2 The Parser Interface
|
|
|
|
Lemon doesn't generate a complete, working program. It only generates a
|
|
few subroutines that implement a parser. This section describes the
|
|
interface to those subroutines. It is up to the programmer to call these
|
|
subroutines in an appropriate way in order to produce a complete system.
|
|
|
|
Before a program begins using a Lemon-generated parser, the program must
|
|
first create the parser. A new parser is created as follows:
|
|
|
|
void *pParser = ParseAlloc( malloc );
|
|
|
|
The ParseAlloc() routine allocates and initializes a new parser and
|
|
returns a pointer to it. The actual data structure used to represent a
|
|
parser is opaque — its internal structure is not visible or usable by the
|
|
calling routine. For this reason, the ParseAlloc() routine returns a
|
|
pointer to void rather than a pointer to some particular structure. The
|
|
sole argument to the ParseAlloc() routine is a pointer to the subroutine
|
|
used to allocate memory. Typically this means malloc().
|
|
|
|
After a program is finished using a parser, it can reclaim all memory
|
|
allocated by that parser by calling
|
|
|
|
ParseFree(pParser, free);
|
|
|
|
The first argument is the same pointer returned by ParseAlloc(). The
|
|
second argument is a pointer to the function used to release bulk memory
|
|
back to the system.
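
Because ParseAlloc() and ParseFree() take function pointers, an application is not tied to malloc() and free(). The following is a minimal sketch, assuming the allocator signatures shown in the interface summary. The names counting_malloc, counting_free, and g_live_allocs are hypothetical and are not part of Lemon.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical bookkeeping: number of blocks currently outstanding. */
static long g_live_allocs = 0;

/* Same shape as malloc(), so it can be passed where malloc would be. */
static void *counting_malloc(size_t n){
  void *p = malloc(n);
  if( p ) g_live_allocs++;
  return p;
}

/* Same shape as free(), so it can be passed where free would be. */
static void counting_free(void *p){
  if( p ) g_live_allocs--;
  free(p);
}

/* Intended use with a generated parser (not compiled here):
**
**   void *pParser = ParseAlloc( counting_malloc );
**   ...
**   ParseFree(pParser, counting_free);
*/
```

If g_live_allocs is not zero after ParseFree() returns, something allocated through the wrapper was never released.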
|
|
|
|
After a parser has been allocated using ParseAlloc(), the programmer must
|
|
supply the parser with a sequence of tokens (terminal symbols) to be
|
|
parsed. This is accomplished by calling the following function once for
|
|
each token:
|
|
|
|
Parse(pParser, hTokenID, sTokenData, pArg);
|
|
|
|
The first argument to the Parse() routine is the pointer returned by
|
|
ParseAlloc(). The second argument is a small positive integer that tells
|
|
the parser the type of the next token in the data stream. There is one
|
|
token type for each terminal symbol in the grammar. The gram.h file
|
|
generated by Lemon contains #define statements that map symbolic terminal
|
|
symbol names into appropriate integer values. A value of 0 for the second
|
|
argument is a special flag to the parser to indicate that the end of input
|
|
has been reached. The third argument is the value of the given token. By
|
|
default, the type of the third argument is "void*", but the grammar will
|
|
usually redefine this type to be some kind of structure. Typically the
|
|
second argument will be a broad category of tokens such as "identifier" or
|
|
"number" and the third argument will be the name of the identifier or the
|
|
value of the number.
|
|
|
|
The Parse() function may have either three or four arguments, depending on
|
|
the grammar. If the grammar specification file requests it (via the
|
|
%extra_argument directive), the Parse() function will have a fourth
|
|
parameter that can be of any type chosen by the programmer. The parser
|
|
doesn't do anything with this argument except to pass it through to action
|
|
routines. This is a convenient mechanism for passing state information
|
|
down to the action routines without having to use global variables.
|
|
|
|
A typical use of a Lemon parser might look something like the following:
|
|
|
|
    1 ParseTree *ParseFile(const char *zFilename){
    2    Tokenizer *pTokenizer;
    3    void *pParser;
    4    Token sToken;
    5    int hTokenId;
    6    ParserState sState;
    7
    8    pTokenizer = TokenizerCreate(zFilename);
    9    pParser = ParseAlloc( malloc );
   10    InitParserState(&sState);
   11    while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){
   12       Parse(pParser, hTokenId, sToken, &sState);
   13    }
   14    Parse(pParser, 0, sToken, &sState);
   15    ParseFree(pParser, free );
   16    TokenizerFree(pTokenizer);
   17    return sState.treeRoot;
   18 }
|
|
|
|
This example shows a user-written routine that parses a file of text and
|
|
returns a pointer to the parse tree. (All error-handling code is omitted
|
|
from this example to keep it simple.) We assume the existence of some kind
|
|
of tokenizer which is created using TokenizerCreate() on line 8 and
|
|
deleted by TokenizerFree() on line 16. The GetNextToken() function on line
|
|
11 retrieves the next token from the input file and puts its type in the
|
|
integer variable hTokenId. The sToken variable is assumed to be some kind
|
|
of structure that contains details about each token, such as its complete
|
|
text, what line it occurs on, etc.
|
|
|
|
This example also assumes the existence of a structure of type ParserState
|
|
that holds state information about a particular parse. An instance of such
|
|
a structure is created on line 6 and initialized on line 10. A pointer to
|
|
this structure is passed into the Parse() routine as the optional 4th
|
|
argument. The action routine specified by the grammar for the parser can
|
|
use the ParserState structure to hold whatever information is useful and
|
|
appropriate. In the example, we note that the treeRoot field of the
|
|
ParserState structure is left pointing to the root of the parse tree.
|
|
|
|
The core of this example as it relates to Lemon is as follows:
|
|
|
|
   ParseFile(){
      pParser = ParseAlloc( malloc );
      while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){
         Parse(pParser, hTokenId, sToken);
      }
      Parse(pParser, 0, sToken);
      ParseFree(pParser, free );
   }
|
|
|
|
Basically, what a program has to do to use a Lemon-generated parser is
|
|
first create the parser, then send it lots of tokens obtained by
|
|
tokenizing an input source. When the end of input is reached, the Parse()
|
|
routine should be called one last time with a token type of 0. This step
|
|
is necessary to inform the parser that the end of input has been reached.
|
|
Finally, we reclaim memory used by the parser by calling ParseFree().
|
|
|
|
There is one other interface routine that should be mentioned before we
|
|
move on. The ParseTrace() function can be used to generate debugging
|
|
output from the parser. A prototype for this routine is as follows:
|
|
|
|
ParseTrace(FILE *stream, char *zPrefix);
|
|
|
|
After this routine is called, a short (one-line) message is written to the
|
|
designated output stream every time the parser changes states or calls an
|
|
action routine. Each such message is prefaced using the text given by
|
|
zPrefix. This debugging output can be turned off by calling ParseTrace()
|
|
again with a first argument of NULL (0).
|
|
|
|
3.2.1 Allocating The Parse Object On Stack
|
|
|
|
If all calls to the Parse() interface are made from within %code
|
|
directives, then the parse object can be allocated from the stack rather
|
|
than from the heap. These are the steps:
|
|
* Declare a local variable of type "yyParser"
|
|
* Initialize the variable using ParseInit()
|
|
* Pass a pointer to the variable in calls to Parse()
|
|
* Deallocate substructure in the parse variable using ParseFinalize().
|
|
|
|
The following code illustrates how this is done:
|
|
|
|
   ParseFile(){
      yyParser x;
      ParseInit( &x );
      while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){
         Parse(&x, hTokenId, sToken);
      }
      Parse(&x, 0, sToken);
      ParseFinalize( &x );
   }
|
|
|
|
3.2.2 Interface Summary
|
|
|
|
Here is a quick overview of the C-language interface to a Lemon-generated
|
|
parser:
|
|
|
|
   void *ParseAlloc( void *(*malloc)(size_t) );
   void ParseFree(void *pParser, void (*free)(void*) );
   void Parse(void *pParser, int tokenCode, ParseTOKENTYPE token, ...);
   void ParseTrace(FILE *stream, char *zPrefix);
|
|
|
|
Notes:
|
|
|
|
* Use the %name directive to change the "Parse" prefix names of the
|
|
procedures in the interface.
|
|
* Use the %token_type directive to define the "ParseTOKENTYPE" type.
|
|
* Use the %extra_argument directive to specify the type and name of the
|
|
4th parameter to the Parse() function.
|
|
|
|
3.3 Differences With YACC and BISON
|
|
|
|
Programmers who have previously used the yacc or bison parser generator
|
|
will notice several important differences between yacc and/or bison and
|
|
Lemon.
|
|
|
|
* In yacc and bison, the parser calls the tokenizer. In Lemon, the
|
|
tokenizer calls the parser.
|
|
* Lemon uses no global variables. Yacc and bison use global variables to
|
|
pass information between the tokenizer and parser.
|
|
* Lemon allows multiple parsers to be running simultaneously. Yacc and
|
|
bison do not.
|
|
|
|
These differences may cause some initial confusion for programmers with
|
|
prior yacc and bison experience. But after years of experience using
|
|
Lemon, I firmly believe that the Lemon way of doing things is better.
|
|
|
|
Updated as of 2016-02-16: The text above was written in the 1990s. We are
|
|
told that Bison has lately been enhanced to support the
|
|
tokenizer-calls-parser paradigm used by Lemon, eliminating the need for
|
|
global variables.
|
|
|
|
3.4 Building The "lemon" Or "lemon.exe" Executable
|
|
|
|
The "lemon" or "lemon.exe" program is built from a single file of C-code
|
|
named "lemon.c". The Lemon source code is generic C89 code that uses no
|
|
unusual or non-standard libraries. Any reasonable C compiler should
|
|
suffice to compile the lemon program. A command-line like the following
|
|
will usually work:
|
|
|
|
cc -o lemon lemon.c
|
|
|
|
On Windows machines with Visual C++ installed, bring up a "VS20NN x64
|
|
Native Tools Command Prompt" window and enter:
|
|
|
|
cl lemon.c
|
|
|
|
Compiling Lemon really is that simple. Additional compiler options such as
|
|
"-O2" or "-g" or "-Wall" can be added if desired, but they are not
|
|
necessary.
|
|
|
|
4.0 Input File Syntax
|
|
|
|
The main purpose of the grammar specification file for Lemon is to define
|
|
the grammar for the parser. But the input file also specifies additional
|
|
information Lemon requires to do its job. Most of the work in using Lemon
|
|
is in writing an appropriate grammar file.
|
|
|
|
The grammar file for Lemon is, for the most part, a free format. It does
|
|
not have sections or divisions like yacc or bison. Any declaration can
|
|
occur at any point in the file. Lemon ignores whitespace (except where it
|
|
is needed to separate tokens), and it honors the same commenting
|
|
conventions as C and C++.
|
|
|
|
4.1 Terminals and Nonterminals
|
|
|
|
A terminal symbol (token) is any string of alphanumeric and/or underscore
|
|
characters that begins with an uppercase letter. A terminal can contain
|
|
lowercase letters after the first character, but the usual convention is
|
|
to make terminals all uppercase. A nonterminal, on the other hand, is any
|
|
string of alphanumeric and underscore characters that begins with a
|
|
lowercase letter. Again, the usual convention is to make nonterminals use
|
|
all lowercase letters.
|
|
|
|
In Lemon, terminal and nonterminal symbols do not need to be declared or
|
|
identified in a separate section of the grammar file. Lemon is able to
|
|
generate a list of all terminals and nonterminals by examining the grammar
|
|
rules, and it can always distinguish a terminal from a nonterminal by
|
|
checking the case of the first character of the name.
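
The case rule is simple enough to state as a few lines of C. This is only an illustrative sketch of the rule as described above, not code taken from Lemon itself. The function name symbol_is_terminal is hypothetical.

```c
#include <ctype.h>

/* Classify a grammar symbol by the case of its first character.
** Returns 1 for a terminal (uppercase first letter), 0 for a
** nonterminal (lowercase first letter), and -1 if the name does
** not begin with a letter at all. */
static int symbol_is_terminal(const char *name){
  unsigned char c = (unsigned char)name[0];
  if( isupper(c) ) return 1;
  if( islower(c) ) return 0;
  return -1;
}
```

Under this rule "PLUS" and "Value" are both terminals, while "expr" is a nonterminal.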
|
|
|
|
Yacc and bison allow terminal symbols to have either alphanumeric names or
|
|
to be individual characters included in single quotes, like this: ')' or
|
|
'$'. Lemon does not allow this alternative form for terminal symbols. With
|
|
Lemon, all symbols, terminals and nonterminals, must have alphanumeric
|
|
names.
|
|
|
|
4.2 Grammar Rules
|
|
|
|
The main component of a Lemon grammar file is a sequence of grammar rules.
|
|
Each grammar rule consists of a nonterminal symbol followed by the special
|
|
symbol "::=" and then a list of terminals and/or nonterminals. The rule is
|
|
terminated by a period. The list of terminals and nonterminals on the
|
|
right-hand side of the rule can be empty. Rules can occur in any order,
|
|
except that the left-hand side of the first rule is assumed to be the
|
|
start symbol for the grammar (unless specified otherwise using the
|
|
%start_symbol directive described below.) A typical sequence of grammar
|
|
rules might look something like this:
|
|
|
|
expr ::= expr PLUS expr.
|
|
expr ::= expr TIMES expr.
|
|
expr ::= LPAREN expr RPAREN.
|
|
expr ::= VALUE.
|
|
|
|
There is one non-terminal in this example, "expr", and five terminal
|
|
symbols or tokens: "PLUS", "TIMES", "LPAREN", "RPAREN" and "VALUE".
|
|
|
|
Like yacc and bison, Lemon allows the grammar to specify a block of C code
|
|
that will be executed whenever a grammar rule is reduced by the parser. In
|
|
Lemon, this action is specified by putting the C code (contained within
|
|
curly braces {...}) immediately after the period that closes the rule. For
|
|
example:
|
|
|
|
expr ::= expr PLUS expr. { printf("Doing an addition...\n"); }
|
|
|
|
In order to be useful, grammar actions must normally be linked to their
|
|
associated grammar rules. In yacc and bison, this is accomplished by
|
|
embedding a "$$" in the action to stand for the value of the left-hand
|
|
side of the rule and symbols "$1", "$2", and so forth to stand for the
|
|
value of the terminal or nonterminal at position 1, 2 and so forth on the
|
|
right-hand side of the rule. This idea is very powerful, but it is also
|
|
very error-prone. The single most common source of errors in a yacc or
|
|
bison grammar is to miscount the number of symbols on the right-hand side
|
|
of a grammar rule and say "$7" when you really mean "$8".
|
|
|
|
Lemon avoids the need to count grammar symbols by assigning symbolic names
|
|
to each symbol in a grammar rule and then using those symbolic names in
|
|
the action. In yacc or bison, one would write this:
|
|
|
|
expr : expr PLUS expr { $$ = $1 + $3; };
|
|
|
|
But in Lemon, the same rule becomes the following:
|
|
|
|
expr(A) ::= expr(B) PLUS expr(C). { A = B+C; }
|
|
|
|
In the Lemon rule, any symbol in parentheses after a grammar rule symbol
|
|
becomes a place holder for that symbol in the grammar rule. This place
|
|
holder can then be used in the associated C action to stand for the value
|
|
of that symbol.
|
|
|
|
The Lemon notation for linking a grammar rule with its reduce action is
|
|
superior to yacc/bison on several counts. First, as mentioned above, the
|
|
Lemon method avoids the need to count grammar symbols. Secondly, if a
|
|
terminal or nonterminal in a Lemon grammar rule includes a linking symbol
|
|
in parentheses but that linking symbol is not actually used in the reduce
|
|
action, then an error message is generated. For example, the rule
|
|
|
|
expr(A) ::= expr(B) PLUS expr(C). { A = B; }
|
|
|
|
will generate an error because the linking symbol "C" is used in the
|
|
grammar rule but not in the reduce action.
|
|
|
|
The Lemon notation for linking grammar rules to reduce actions also
|
|
facilitates the use of destructors for reclaiming memory allocated by the
|
|
values of terminals and nonterminals on the right-hand side of a rule.
|
|
|
|
4.3 Precedence Rules
|
|
|
|
Lemon resolves parsing ambiguities in exactly the same way as yacc and
|
|
bison. A shift-reduce conflict is resolved in favor of the shift, and a
|
|
reduce-reduce conflict is resolved by reducing whichever rule comes first
|
|
in the grammar file.
|
|
|
|
Just like in yacc and bison, Lemon allows a measure of control over the
|
|
resolution of parsing conflicts using precedence rules. A precedence value
|
|
can be assigned to any terminal symbol using the %left, %right or
|
|
%nonassoc directives. Terminal symbols mentioned in earlier directives
|
|
have a lower precedence than terminal symbols mentioned in later
|
|
directives. For example:
|
|
|
|
%left AND.
|
|
%left OR.
|
|
%nonassoc EQ NE GT GE LT LE.
|
|
%left PLUS MINUS.
|
|
%left TIMES DIVIDE MOD.
|
|
%right EXP NOT.
|
|
|
|
In the preceding sequence of directives, the AND operator is defined to
|
|
have the lowest precedence. The OR operator is one precedence level
|
|
higher. And so forth. Hence, the grammar would attempt to group the
|
|
ambiguous expression
|
|
|
|
a AND b OR c
|
|
|
|
like this
|
|
|
|
a AND (b OR c).
|
|
|
|
The associativity (left, right or nonassoc) is used to determine the
|
|
grouping when the precedence is the same. AND is left-associative in our
|
|
example, so
|
|
|
|
a AND b AND c
|
|
|
|
is parsed like this
|
|
|
|
(a AND b) AND c.
|
|
|
|
The EXP operator is right-associative, though, so
|
|
|
|
a EXP b EXP c
|
|
|
|
is parsed like this
|
|
|
|
a EXP (b EXP c).
|
|
|
|
The nonassoc precedence is used for non-associative operators. So
|
|
|
|
a EQ b EQ c
|
|
|
|
is an error.
|
|
|
|
The precedence of terminal symbols is transferred to rules as follows: The
|
|
precedence of a grammar rule is equal to the precedence of the left-most
|
|
terminal symbol in the rule for which a precedence is defined. This is
|
|
normally what you want, but in those cases where you want the precedence
|
|
of a grammar rule to be something different, you can specify an
|
|
alternative precedence symbol by putting the symbol in square brackets after
|
|
the period at the end of the rule and before any C-code. For example:
|
|
|
|
expr ::= MINUS expr. [NOT]
|
|
|
|
This rule has a precedence equal to that of the NOT symbol, not the MINUS
|
|
symbol as would have been the case by default.
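
The way a rule obtains its precedence can be sketched in C. This is a minimal illustration of the description above, not Lemon's implementation. The names rule_precedence and NO_PREC are hypothetical, and precedence levels are assumed to be small integers where a larger number means a higher precedence.

```c
#define NO_PREC (-1)   /* marker: no precedence assigned */

/* rhsPrec[i] is the precedence of the i-th right-hand-side symbol,
** or NO_PREC for nonterminals and for terminals that were never
** named in a %left, %right, or %nonassoc directive.
** overridePrec is the precedence of the symbol written in [...]
** after the rule, or NO_PREC when the rule has no such override. */
static int rule_precedence(const int *rhsPrec, int nRhs, int overridePrec){
  int i;
  if( overridePrec!=NO_PREC ) return overridePrec;
  for(i=0; i<nRhs; i++){
    /* The left-most symbol that has a precedence decides. */
    if( rhsPrec[i]!=NO_PREC ) return rhsPrec[i];
  }
  return NO_PREC;
}
```

If the six directives in the earlier example are numbered 1 (AND) through 6 (EXP and NOT), then the right-hand side "MINUS expr" yields 4 by default and 6 when the rule is followed by [NOT].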
|
|
|
|
With the knowledge of how precedence is assigned to terminal symbols and
|
|
individual grammar rules, we can now explain precisely how parsing
|
|
conflicts are resolved in Lemon. Shift-reduce conflicts are resolved as
|
|
follows:
|
|
|
|
* If either the token to be shifted or the rule to be reduced lacks
|
|
precedence information, then resolve in favor of the shift, but report
|
|
a parsing conflict.
|
|
* If the precedence of the token to be shifted is greater than the
|
|
precedence of the rule to reduce, then resolve in favor of the shift.
|
|
No parsing conflict is reported.
|
|
* If the precedence of the token to be shifted is less than the
|
|
precedence of the rule to reduce, then resolve in favor of the reduce
|
|
action. No parsing conflict is reported.
|
|
* If the precedences are the same and the shift token is
|
|
right-associative, then resolve in favor of the shift. No parsing
|
|
conflict is reported.
|
|
* If the precedences are the same and the shift token is
|
|
left-associative, then resolve in favor of the reduce. No parsing
|
|
conflict is reported.
|
|
* Otherwise, resolve the conflict by doing the shift, and report a
|
|
parsing conflict.
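
The shift-reduce rules above can be written as a single decision function. This sketch follows the list exactly as written and is not taken from Lemon's source. All type and function names are hypothetical, and a larger integer is assumed to mean a higher precedence.

```c
#define NO_PREC (-1)   /* marker: no precedence information */

typedef enum { ASSOC_LEFT, ASSOC_RIGHT, ASSOC_NONE } Assoc;
typedef enum { DO_SHIFT, DO_REDUCE } Action;

/* Decide between shifting a token and reducing a rule.
** *pConflict is set to 1 when a parsing conflict should be
** reported and to 0 otherwise. */
static Action resolve_shift_reduce(
  int tokPrec,      /* precedence of the token to be shifted */
  Assoc tokAssoc,   /* associativity of that token */
  int rulePrec,     /* precedence of the rule to be reduced */
  int *pConflict
){
  *pConflict = 0;
  if( tokPrec==NO_PREC || rulePrec==NO_PREC ){
    *pConflict = 1;             /* missing precedence: shift, but complain */
    return DO_SHIFT;
  }
  if( tokPrec>rulePrec ) return DO_SHIFT;
  if( tokPrec<rulePrec ) return DO_REDUCE;
  if( tokAssoc==ASSOC_RIGHT ) return DO_SHIFT;
  if( tokAssoc==ASSOC_LEFT ) return DO_REDUCE;
  *pConflict = 1;               /* the "otherwise" case */
  return DO_SHIFT;
}
```

For example, after seeing "a PLUS b" with TIMES as the next token, the token outranks the rule and the parser shifts, which groups the input as a PLUS (b TIMES c).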
|
|
|
|
Reduce-reduce conflicts are resolved this way:
|
|
|
|
* If either reduce rule lacks precedence information, then resolve in
|
|
favor of the rule that appears first in the grammar, and report a
|
|
parsing conflict.
|
|
* If both rules have precedence and the precedence is different, then
|
|
resolve the dispute in favor of the rule with the higher precedence,
|
|
and do not report a conflict.
|
|
* Otherwise, resolve the conflict by reducing by the rule that appears
|
|
first in the grammar, and report a parsing conflict.
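
The reduce-reduce rules can be sketched the same way. Again this only restates the list above in C and is not Lemon's own code. The names are hypothetical, and a larger integer is assumed to mean a higher precedence.

```c
#define NO_PREC (-1)   /* marker: rule has no precedence */

/* Choose between two rules that could both be reduced.
** firstPrec belongs to the rule that appears earlier in the grammar
** file and secondPrec to the one that appears later.
** Returns 0 to reduce the earlier rule or 1 to reduce the later one.
** *pConflict is set to 1 when a conflict should be reported. */
static int resolve_reduce_reduce(int firstPrec, int secondPrec, int *pConflict){
  *pConflict = 0;
  if( firstPrec==NO_PREC || secondPrec==NO_PREC ){
    *pConflict = 1;
    return 0;
  }
  if( firstPrec!=secondPrec ){
    return secondPrec>firstPrec ? 1 : 0;   /* higher precedence wins quietly */
  }
  *pConflict = 1;                          /* equal precedence */
  return 0;
}
```

Only the middle case is silent. In the other two cases the earlier rule wins and the conflict is reported.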
|
|
|
|
4.4 Special Directives
|
|
|
|
The input grammar to Lemon consists of grammar rules and special
|
|
directives. We've described all the grammar rules, so now we'll talk about
|
|
the special directives.
|
|
|
|
Directives in Lemon can occur in any order. You can put them before the
|
|
grammar rules, or after the grammar rules, or in the midst of the grammar
|
|
rules. It doesn't matter. The relative order of directives used to assign
|
|
precedence to terminals is important, but other than that, the order of
|
|
directives in Lemon is arbitrary.
|
|
|
|
Lemon supports the following special directives:
|
|
|
|
* %code
|
|
* %default_destructor
|
|
* %default_type
|
|
* %destructor
|
|
* %else
|
|
* %endif
|
|
* %extra_argument
* %extra_context
|
|
* %fallback
|
|
* %if
|
|
* %ifdef
|
|
* %ifndef
|
|
* %include
|
|
* %left
|
|
* %name
|
|
* %nonassoc
|
|
* %parse_accept
|
|
* %parse_failure
|
|
* %right
|
|
* %stack_overflow
|
|
* %stack_size
|
|
* %start_symbol
|
|
* %syntax_error
|
|
* %token_class
|
|
* %token_destructor
|
|
* %token_prefix
|
|
* %token_type
|
|
* %type
|
|
* %wildcard
|
|
|
|
Each of these directives will be described separately in the following
|
|
sections:
|
|
|
|
4.4.1 The %code directive
|
|
|
|
The %code directive is used to specify additional C code that is added to
|
|
the end of the main output file. This is similar to the %include directive
|
|
except that %include is inserted at the beginning of the main output file.
|
|
|
|
%code is typically used to include some action routines or perhaps a
|
|
tokenizer or even the "main()" function as part of the output file.
|
|
|
|
There can be multiple %code directives. The arguments of all %code
|
|
directives are concatenated.
|
|
|
|
4.4.2 The %default_destructor directive
|
|
|
|
The %default_destructor directive specifies a destructor to use for
|
|
non-terminals that do not have their own destructor specified by a
|
|
separate %destructor directive. See the documentation on the %destructor
|
|
directive below for additional information.
|
|
|
|
In some grammars, many different non-terminal symbols have the same data
|
|
type and hence the same destructor. This directive is a convenient way to
|
|
specify the same destructor for all those non-terminals using a single
|
|
statement.
|
|
|
|
4.4.3 The %default_type directive
|
|
|
|
The %default_type directive specifies the data type of non-terminal
|
|
symbols that do not have their own data type defined using a separate
|
|
%type directive.
|
|
|
|
4.4.4 The %destructor directive
|
|
|
|
The %destructor directive is used to specify a destructor for a
|
|
non-terminal symbol. (See also the %token_destructor directive which is
|
|
used to specify a destructor for terminal symbols.)
|
|
|
|
A non-terminal's destructor is called to dispose of the non-terminal's
|
|
value whenever the non-terminal is popped from the stack. This includes
|
|
all of the following circumstances:
|
|
|
|
* When a rule reduces and the value of a non-terminal on the right-hand
|
|
side is not linked to C code.
|
|
* When the stack is popped during error processing.
|
|
* When the ParseFree() function runs.
|
|
|
|
The destructor can do whatever it wants with the value of the
|
|
non-terminal, but its purpose is to deallocate memory or other resources
|
|
held by that non-terminal.
|
|
|
|
Consider an example:
|
|
|
|
%type nt {void*}
|
|
%destructor nt { free($$); }
|
|
nt(A) ::= ID NUM. { A = malloc( 100 ); }
|
|
|
|
This example is a bit contrived, but it serves to illustrate how
|
|
destructors work. The example shows a non-terminal named "nt" that holds
|
|
values of type "void*". When the rule for an "nt" reduces, it sets the
|
|
value of the non-terminal to space obtained from malloc(). Later, when the
|
|
nt non-terminal is popped from the stack, the destructor will fire and
|
|
call free() on this malloced space, thus avoiding a memory leak. (Note
|
|
that the symbol "$$" in the destructor code is replaced by the value of
|
|
the non-terminal.)
|
|
|
|
It is important to note that the value of a non-terminal is passed to the
|
|
destructor whenever the non-terminal is removed from the stack, unless the
|
|
non-terminal is used in a C-code action. If the non-terminal is used by
|
|
C-code, then it is assumed that the C-code will take care of destroying
|
|
it. More commonly, the value is used to build some larger structure, and
|
|
we don't want to destroy it, which is why the destructor is not called in
|
|
this circumstance.
|
|
|
|
Destructors help avoid memory leaks by automatically freeing allocated
|
|
objects when they go out of scope. To do the same using yacc or bison is
|
|
much more difficult.
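
The ownership rule for destructors can be pictured with a small simulation. This is not how the generated parser is written. It is only a sketch of the rule "destroy on pop unless an action took the value", and every name in it (StackEntry, pop_entry, nt_destructor, g_destroyed) is hypothetical.

```c
#include <stdlib.h>

static int g_destroyed = 0;   /* counts destructor calls, for illustration */

/* Plays the role of:  %destructor nt { free($$); }  */
static void nt_destructor(void *value){
  free(value);
  g_destroyed++;
}

typedef struct StackEntry {
  void *value;                /* value of an "nt" non-terminal */
} StackEntry;

/* Pop one entry from the simulated stack.  If usedByAction is nonzero,
** the value was linked to C code in the reducing rule, so that code
** now owns it and the destructor is skipped. */
static void pop_entry(StackEntry *e, int usedByAction){
  if( !usedByAction ) nt_destructor(e->value);
  e->value = 0;
}
```

A value popped during error recovery corresponds to usedByAction being zero, so it is freed. A value consumed by an action is left alone, and the action is then responsible for it.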
|
|
|
|
4.4.5 The %extra_argument directive
|
|
|
|
The %extra_argument directive instructs Lemon to add a 4th parameter to
|
|
the parameter list of the Parse() function it generates. Lemon doesn't do
|
|
anything itself with this extra argument, but it does make the argument
|
|
available to C-code action routines, destructors, and so forth. For
|
|
example, if the grammar file contains:
|
|
|
|
%extra_argument { MyStruct *pAbc }
|
|
|
|
Then the Parse() function generated will have a 4th parameter of type
|
|
"MyStruct*" and all action routines will have access to a variable named
|
|
"pAbc" that is the value of the 4th parameter in the most recent call to
|
|
Parse().
|
|
|
|
The %extra_context directive works the same except that it is passed in on
|
|
the ParseAlloc() or ParseInit() routines instead of on Parse().
|
|
|
|
4.4.6 The %extra_context directive
|
|
|
|
The %extra_context directive instructs Lemon to add a 2nd parameter to the
|
|
parameter list of the ParseAlloc() and ParseInit() functions. Lemon
|
|
doesn't do anything itself with this extra argument, but it does store
the value and make it available to C-code action routines, destructors, and so
|
|
forth. For example, if the grammar file contains:
|
|
|
|
%extra_context { MyStruct *pAbc }
|
|
|
|
Then the ParseAlloc() and ParseInit() functions will have a 2nd parameter
|
|
of type "MyStruct*" and all action routines will have access to a variable
|
|
named "pAbc" that is the value of that 2nd parameter.
|
|
|
|
The %extra_argument directive works the same except that it is passed in
|
|
on the Parse() routine instead of on ParseAlloc()/ParseInit().
|
|
|
|
4.4.7 The %fallback directive
|
|
|
|
The %fallback directive specifies an alternative meaning for one or more
|
|
tokens. The alternative meaning is tried if the original token would have
|
|
generated a syntax error.
|
|
|
|
The %fallback directive was added to support robust parsing of SQL syntax
|
|
in SQLite. The SQL language contains a large assortment of keywords, each
|
|
of which appears as a different token to the language parser. SQL contains
|
|
so many keywords that it can be difficult for programmers to keep up with
|
|
them all. Programmers will, therefore, sometimes mistakenly use an obscure
|
|
language keyword for an identifier. The %fallback directive provides a
|
|
mechanism to tell the parser: "If you are unable to parse this keyword,
|
|
try treating it as an identifier instead."
|
|
|
|
The syntax of %fallback is as follows:
|
|
|
|
%fallback ID TOKEN... .
|
|
|
|
In words, the %fallback directive is followed by a list of token names
|
|
terminated by a period. The first token name is the fallback token — the
|
|
token to which all the other tokens fall back. The second and
|
|
subsequent arguments are tokens which fall back to the token identified by
|
|
the first argument.
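As a rough model of this behavior (not Lemon's actual implementation), the
lookup can be pictured as a table from each token code to its fallback.
The token codes, the fallback table, and the can_accept callbacks below
are all hypothetical names chosen for the sketch; the table corresponds to
a grammar containing "%fallback ID ABORT."

```c
#include <assert.h>

/* Hypothetical token codes for illustration. */
enum { TK_ID = 1, TK_ABORT = 2, TK_SELECT = 3, TK_COUNT = 4 };

/* fallback[t] is the token that t falls back to, or 0 if it has none.
** Models "%fallback ID ABORT." */
static const int fallback[TK_COUNT] = { 0, 0, TK_ID, 0 };

/* Stand-in for "can the parser act on this token in its current
** state?" */
typedef int (*accept_fn)(int token);

/* Return the token code the parser would act on, or 0 if neither the
** token nor its fallback is acceptable (a syntax error). */
int resolve_token(int token, accept_fn can_accept){
  assert( token>0 && token<TK_COUNT );
  if( can_accept(token) ) return token;          /* original meaning */
  if( fallback[token]!=0 && can_accept(fallback[token]) ){
    return fallback[token];                      /* alternative meaning */
  }
  return 0;
}

/* Example state in which only an identifier is legal. */
int only_id(int token){ return token==TK_ID; }

/* Example state in which ABORT is legal as a keyword. */
int abort_ok(int token){ return token==TK_ABORT; }
```

The original meaning is always tried first, so ABORT is still treated as a
keyword wherever the grammar allows the keyword.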
|
|
|
|
4.4.8 The %if directive and its friends
|
|
|
|
The %if, %ifdef, %ifndef, %else, and %endif directives are similar to #if,
|
|
#ifdef, #ifndef, #else, and #endif in the C-preprocessor, just not as
|
|
general. Each of these directives must begin at the left margin. No
|
|
whitespace is allowed between the "%" and the directive name.
|
|
|
|
Grammar text in between "%ifdef MACRO" and the next nested "%endif" is
|
|
ignored unless the "-DMACRO" command-line option is used. Grammar text
|
|
between "%ifndef MACRO" and the next nested "%endif" is included except
|
|
when the "-DMACRO" command-line option is used.
|
|
|
|
The text in between "%if CONDITIONAL" and its corresponding %endif is
|
|
included only if CONDITIONAL is true. The CONDITIONAL is one or more macro
|
|
names, optionally connected using the "||" and "&&" binary operators, the
|
|
"!" unary operator, and grouped using balanced parentheses. Each term is
|
|
true if the corresponding macro exists, and false if it does not exist.
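To make the rules for CONDITIONAL concrete, here is a small evaluator
sketch. It is not taken from Lemon. It assumes C-like precedence ("!"
binds tightest, then "&&", then "||"), which the text above does not
specify, so Lemon's own evaluation order may differ for expressions that
mix "&&" and "||" without parentheses. The macro names in demo_eval() are
made up, and malformed input is not diagnosed.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical set of macros "defined" by -D options. */
typedef struct MacroSet { const char **names; int n; } MacroSet;
typedef struct Cursor { const char *z; const MacroSet *defs; } Cursor;

static void skip_space(Cursor *c){
  while( isspace((unsigned char)*c->z) ) c->z++;
}

/* A term is true if the named macro exists, false otherwise. */
static int is_defined(const MacroSet *m, const char *z, size_t len){
  int i;
  for(i=0; i<m->n; i++){
    if( strlen(m->names[i])==len && strncmp(m->names[i], z, len)==0 ){
      return 1;
    }
  }
  return 0;
}

static int eval_or(Cursor *c);

/* term := "!" term | "(" expr ")" | NAME */
static int eval_term(Cursor *c){
  const char *start;
  skip_space(c);
  if( *c->z=='!' ){ c->z++; return !eval_term(c); }
  if( *c->z=='(' ){
    int v;
    c->z++;
    v = eval_or(c);
    skip_space(c);
    if( *c->z==')' ) c->z++;
    return v;
  }
  start = c->z;
  while( isalnum((unsigned char)*c->z) || *c->z=='_' ) c->z++;
  return is_defined(c->defs, start, (size_t)(c->z - start));
}

/* and := term ( "&&" term )* */
static int eval_and(Cursor *c){
  int v = eval_term(c);
  for(;;){
    int r;
    skip_space(c);
    if( c->z[0]!='&' || c->z[1]!='&' ) return v;
    c->z += 2;
    r = eval_term(c);   /* always parse the operand, then combine */
    v = v && r;
  }
}

/* expr := and ( "||" and )* */
static int eval_or(Cursor *c){
  int v = eval_and(c);
  for(;;){
    int r;
    skip_space(c);
    if( c->z[0]!='|' || c->z[1]!='|' ) return v;
    c->z += 2;
    r = eval_and(c);
    v = v || r;
  }
}

int eval_conditional(const char *expr, const MacroSet *defs){
  Cursor c;
  assert( expr!=0 && defs!=0 );
  c.z = expr;
  c.defs = defs;
  return eval_or(&c);
}

/* Evaluate expr as if "-DDEBUG -DSQLITE_ENABLE_FTS" had been given. */
int demo_eval(const char *expr){
  static const char *names[] = { "DEBUG", "SQLITE_ENABLE_FTS" };
  MacroSet defs;
  defs.names = names;
  defs.n = 2;
  return eval_conditional(expr, &defs);
}
```

Because the precedence is an assumption, parenthesizing mixed expressions
in a real grammar file is the safer choice.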
|
|
|
|
An optional "%else" directive can occur anywhere in between a %ifdef,
|
|
%ifndef, or %if directive and its corresponding %endif.
|
|
|
|
Note that the argument to %ifdef and %ifndef is intended to be a single
|
|
preprocessor symbol name, not a general expression. Use the "%if"
|
|
directive for general expressions.
|
|
|
|
4.4.9 The %include directive
|
|
|
|
The %include directive specifies C code that is included at the top of the
|
|
generated parser. You can include any text you want — the Lemon parser
|
|
generator copies it blindly. If you have multiple %include directives in
|
|
your grammar file, their values are concatenated so that all %include code
|
|
ultimately appears near the top of the generated parser, in the same order
|
|
as it appeared in the grammar.
|
|
|
|
The %include directive is very handy for getting some extra #include
|
|
preprocessor statements at the beginning of the generated parser. For
|
|
example:
|
|
|
|
%include {#include <unistd.h>}
|
|
|
|
This might be needed, for example, if some of the C actions in the grammar
|
|
call functions that are prototyped in unistd.h.
|
|
|
|
Use the %code directive to add code to the end of the generated parser.
|
|
|
|
4.4.10 The %left directive
|
|
|
|
The %left directive is used (along with the %right and %nonassoc
|
|
directives) to declare precedences of terminal symbols. Every terminal
|
|
symbol whose name appears after a %left directive but before the next
|
|
period (".") is given the same left-associative precedence value.
|
|
Subsequent %left directives have higher precedence. For example:
|
|
|
|
%left AND.
|
|
%left OR.
|
|
%nonassoc EQ NE GT GE LT LE.
|
|
%left PLUS MINUS.
|
|
%left TIMES DIVIDE MOD.
|
|
%right EXP NOT.
|
|
|
|
Note the period that terminates each %left, %right or %nonassoc directive.
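The effect of these declarations on a shift-reduce conflict can be
sketched as a small C function. This follows the usual yacc-style
resolution described in the section on precedence rules; the type names
and the numeric levels (1 for the first directive in the example above, 6
for the last) are illustrative choices, not identifiers from Lemon.

```c
#include <assert.h>

typedef enum { ASSOC_LEFT, ASSOC_RIGHT, ASSOC_NONE } Assoc;
typedef enum { DO_SHIFT, DO_REDUCE, DO_ERROR } Resolution;

/* Precedence level (higher binds tighter) plus associativity. */
typedef struct Prec { int level; Assoc assoc; } Prec;

/* Levels follow the order of the directives in the example above. */
const Prec P_EQ    = { 3, ASSOC_NONE  };
const Prec P_NE    = { 3, ASSOC_NONE  };
const Prec P_PLUS  = { 4, ASSOC_LEFT  };
const Prec P_MINUS = { 4, ASSOC_LEFT  };
const Prec P_TIMES = { 5, ASSOC_LEFT  };
const Prec P_EXP   = { 6, ASSOC_RIGHT };

/* Decide between reducing a rule and shifting the lookahead token.
** "rule" is the precedence of the rule that could be reduced. */
Resolution resolve_conflict(Prec rule, Prec lookahead){
  assert( rule.level>0 && lookahead.level>0 );
  if( lookahead.level > rule.level ) return DO_SHIFT;
  if( lookahead.level < rule.level ) return DO_REDUCE;
  switch( lookahead.assoc ){     /* equal precedence: use associativity */
    case ASSOC_LEFT:  return DO_REDUCE;
    case ASSOC_RIGHT: return DO_SHIFT;
    default:          return DO_ERROR;
  }
}
```

For instance, after "a PLUS b" with TIMES as the lookahead the sketch
shifts, so that "b TIMES c" groups first, while after "a PLUS b" with
MINUS as the lookahead it reduces, giving left-to-right grouping.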
|
|
|
|
LALR(1) grammars can get into a situation where they require a large
|
|
amount of stack space if you make heavy use of right-associative
|
|
operators. For this reason, it is recommended that you use %left rather
|
|
than %right whenever possible.
|
|
|
|
4.4.11 The %name directive
|
|
|
|
By default, the functions generated by Lemon all begin with the
|
|
five-character string "Parse". You can change this string to something
|
|
different using the %name directive. For instance:
|
|
|
|
%name Abcde
|
|
|
|
Putting this directive in the grammar file will cause Lemon to generate
|
|
functions named
|
|
|
|
* AbcdeAlloc(),
|
|
* AbcdeFree(),
|
|
* AbcdeTrace(), and
|
|
* Abcde().
|
|
The %name directive allows you to generate two or more different parsers
|
|
and link them all into the same executable.
|
|
|
|
4.4.12 The %nonassoc directive
|
|
|
|
This directive is used to assign non-associative precedence to one or more
|
|
terminal symbols. See the section on precedence rules or on the %left
|
|
directive for additional information.
|
|
|
|
4.4.13 The %parse_accept directive
|
|
|
|
The %parse_accept directive specifies a block of C code that is executed
|
|
whenever the parser accepts its input string. To "accept" an input string
|
|
means that the parser was able to process all tokens without error.
|
|
|
|
For example:
|
|
|
|
%parse_accept {
|
|
printf("parsing complete!\n");
|
|
}
|
|
|
|
4.4.14 The %parse_failure directive
|
|
|
|
The %parse_failure directive specifies a block of C code that is executed
|
|
whenever the parser fails completely. This code is not executed until the
|
|
parser has tried and failed to resolve an input error using its usual error
|
|
recovery strategy. The routine is only invoked when parsing is unable to
|
|
continue.
|
|
|
|
%parse_failure {
|
|
fprintf(stderr,"Giving up. Parser is hopelessly lost...\n");
|
|
}
|
|
|
|
4.4.15 The %right directive
|
|
|
|
This directive is used to assign right-associative precedence to one or
|
|
more terminal symbols. See the section on precedence rules or on the %left
|
|
directive for additional information.
|
|
|
|
4.4.16 The %stack_overflow directive
|
|
|
|
The %stack_overflow directive specifies a block of C code that is executed
|
|
if the parser's internal stack ever overflows. Typically this just prints
|
|
an error message. After a stack overflow, the parser will be unable to
|
|
continue and must be reset.
|
|
|
|
%stack_overflow {
|
|
fprintf(stderr,"Giving up. Parser stack overflow\n");
|
|
}
|
|
|
|
You can help prevent parser stack overflows by avoiding the use of right
|
|
recursion and right-precedence operators in your grammar. Use left
|
|
recursion and left-precedence operators instead to encourage rules to
|
|
reduce sooner and keep the stack size down. For example, write rules like
|
|
this:
|
|
|
|
list ::= list element. // left-recursion. Good!
|
|
list ::= .
|
|
|
|
Not like this:
|
|
|
|
list ::= element list. // right-recursion. Bad!
|
|
list ::= .
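The difference between the two forms can be seen by counting stack
entries. The following sketch is a simplified model, not Lemon code: it
tracks only the grammar symbols pushed and popped while parsing n
elements with each pair of rules above, and ignores any bookkeeping
entries a real parser keeps on its stack.

```c
#include <assert.h>

/* Peak number of symbols on the stack for n elements using
**   list ::= list element.
**   list ::= .
*/
int peak_depth_left(int n){
  int depth = 0, peak = 0, i;
  assert( n>=0 );
  depth++;                         /* reduce "list ::= ." pushes list */
  if( depth>peak ) peak = depth;
  for(i=0; i<n; i++){
    depth++;                       /* shift element */
    if( depth>peak ) peak = depth;
    depth -= 2;                    /* reduce: pop list and element... */
    depth++;                       /* ...and push list */
  }
  return peak;
}

/* Peak number of symbols on the stack for n elements using
**   list ::= element list.
**   list ::= .
*/
int peak_depth_right(int n){
  int depth = 0, peak = 0, i;
  assert( n>=0 );
  for(i=0; i<n; i++){
    depth++;                       /* nothing can reduce yet: shift */
    if( depth>peak ) peak = depth;
  }
  depth++;                         /* reduce "list ::= ." pushes list */
  if( depth>peak ) peak = depth;
  for(i=0; i<n; i++){
    depth -= 2;                    /* reduce: pop element and list... */
    depth++;                       /* ...and push list */
  }
  return peak;
}
```

In this model the left-recursive form never holds more than two symbols,
while the right-recursive form holds every element at once, so its depth
grows with the length of the input.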
|
|
|
|
4.4.17 The %stack_size directive
|
|
|
|
If stack overflow is a problem and you can't resolve the trouble by using
|
|
left-recursion, then you might want to increase the size of the parser's
|
|
stack using this directive. Put a positive integer after the %stack_size
|
|
directive and Lemon will generate a parser with a stack of the requested
|
|
size. The default value is 100.
|
|
|
|
%stack_size 2000
|
|
|
|
4.4.18 The %start_symbol directive
|
|
|
|
By default, the start symbol for the grammar that Lemon generates is the
|
|
first non-terminal that appears in the grammar file. But you can choose a
|
|
different start symbol using the %start_symbol directive.
|
|
|
|
%start_symbol prog
|
|
|
|
4.4.19 The %syntax_error directive
|
|
|
|
See Error Processing.
|
|
|
|
4.4.20 The %token_class directive
|
|
|
|
Undocumented. Appears to be related to the MULTITERMINAL concept.
|
|
Implementation.
|
|
|
|
4.4.21 The %token_destructor directive
|
|
|
|
The %destructor directive assigns a destructor to a non-terminal symbol.
|
|
(See the description of the %destructor directive above.) The
|
|
%token_destructor directive does the same thing for all terminal symbols.
|
|
|
|
Unlike non-terminal symbols, which may each have a different data type for
|
|
their values, terminals all use the same data type (defined by the
|
|
%token_type directive) and so they use a common destructor. Other than
|
|
that, the token destructor works just like the non-terminal destructors.
|
|
|
|
4.4.22 The %token_prefix directive
|
|
|
|
Lemon generates #defines that assign small integer constants to each
|
|
terminal symbol in the grammar. If desired, Lemon will add a prefix
|
|
specified by this directive to each of the #defines it generates.
|
|
|
|
So if the default output of Lemon looked like this:
|
|
|
|
#define AND 1
|
|
#define MINUS 2
|
|
#define OR 3
|
|
#define PLUS 4
|
|
|
|
You can insert a statement into the grammar like this:
|
|
|
|
%token_prefix TOKEN_
|
|
|
|
to cause Lemon to produce these symbols instead:
|
|
|
|
#define TOKEN_AND 1
|
|
#define TOKEN_MINUS 2
|
|
#define TOKEN_OR 3
|
|
#define TOKEN_PLUS 4
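A tokenizer that feeds the parser would then refer to the prefixed names.
The sketch below repeats the four #defines from the example so that it
stands alone; in a real project they would come from the header file that
Lemon writes. The function token_code_for() and the characters it
recognizes are hypothetical.

```c
#include <assert.h>

/* Repeated from the example above; normally supplied by the header
** that Lemon generates. */
#define TOKEN_AND   1
#define TOKEN_MINUS 2
#define TOKEN_OR    3
#define TOKEN_PLUS  4

/* Map a single operator character to its token code. Returns -1 for
** characters this toy tokenizer does not recognize. */
int token_code_for(char c){
  int code;
  switch( c ){
    case '&': code = TOKEN_AND;   break;
    case '-': code = TOKEN_MINUS; break;
    case '|': code = TOKEN_OR;    break;
    case '+': code = TOKEN_PLUS;  break;
    default:  code = -1;          break;
  }
  assert( code==-1 || (code>=TOKEN_AND && code<=TOKEN_PLUS) );
  return code;
}
```

The prefix keeps short names such as AND or PLUS from colliding with
other identifiers in the application.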
|
|
|
|
4.4.23 The %token_type and %type directives
|
|
|
|
These directives are used to specify the data types for values on the
|
|
parser's stack associated with terminal and non-terminal symbols. The
|
|
values of all terminal symbols must be of the same type. This turns out to
|
|
be the same data type as the 3rd parameter to the Parse() function
|
|
generated by Lemon. Typically, you will make the value of a terminal
|
|
symbol be a pointer to some kind of token structure. Like this:
|
|
|
|
%token_type {Token*}
|
|
|
|
If the data type of terminals is not specified, the default type is
|
|
"void*".
|
|
|
|
Non-terminal symbols can each have their own data types. Typically the
|
|
data type of a non-terminal is a pointer to the root of a parse tree
|
|
structure that contains all information about that non-terminal. For
|
|
example:
|
|
|
|
%type expr {Expr*}
|
|
|
|
Each entry on the parser's stack is actually a union containing instances
|
|
of all data types for every non-terminal and terminal symbol. Lemon will
|
|
automatically use the correct element of this union depending on what the
|
|
corresponding non-terminal or terminal symbol is. But the grammar designer
|
|
should keep in mind that the size of the union will be the size of its
|
|
largest element. So if you have a single non-terminal whose data type
|
|
requires 1K of storage, then your 100 entry parser stack will require 100K
|
|
of heap space. If you are willing and able to pay that price, fine. You
|
|
just need to know.
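The storage cost can be checked with sizeof. The union and stack entry
below are hypothetical stand-ins for what a generated parser might
contain, given a grammar with "%token_type {Token*}", "%type expr
{Expr*}", and one non-terminal whose value is a 1K structure; the actual
layout is decided by Lemon and by the compiler.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Token Token;      /* hypothetical, only used via pointer */
typedef struct Expr Expr;        /* hypothetical, only used via pointer */
typedef struct BigValue { char payload[1024]; } BigValue;

/* One member per distinct data type used by the grammar's symbols. */
typedef union DemoMinor {
  Token *tok;                    /* value of every terminal */
  Expr *expr;                    /* value of the "expr" non-terminal */
  BigValue big;                  /* value of one oversized non-terminal */
} DemoMinor;

/* Simplified model of one parser stack entry. */
typedef struct DemoStackEntry {
  int stateno;                   /* parser state */
  int major;                     /* symbol code */
  DemoMinor minor;               /* symbol value */
} DemoStackEntry;

/* Bytes needed for a stack of the given depth. */
size_t demo_stack_bytes(size_t depth){
  assert( depth>0 );
  return depth * sizeof(DemoStackEntry);
}
```

Every entry pays for the largest member even when it holds only a
pointer, so storing a pointer to the large structure instead of the
structure itself keeps the stack small.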
|
|
|
|
4.4.24 The %wildcard directive
|
|
|
|
The %wildcard directive is followed by a single token name and a period.
|
|
This directive specifies that the identified token should match any input
|
|
token.
|
|
|
|
When the generated parser has the choice of matching an input against the
|
|
wildcard token and some other token, the other token is always used. The
|
|
wildcard token is only matched if there are no alternatives.
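The matching priority can be modeled as follows. This is a sketch under
assumed names, not Lemon's lookup code: DemoState, the token codes, and
demo_match() are invented for the example, which corresponds to a grammar
containing "%wildcard ANY."

```c
#include <assert.h>

/* Hypothetical token codes; ANY is the wildcard token. */
enum { TK_ANY = 1, TK_COMMA = 2, TK_RP = 3, TK_ID = 4 };
#define NO_MATCH 0

/* Simplified view of one parser state: the tokens it accepts by name,
** and whether it also has an action for the wildcard token. */
typedef struct DemoState {
  const int *explicitTokens;
  int nExplicit;
  int acceptsWildcard;
} DemoState;

/* Return the token the state matches the input against: the input
** token itself if the state names it, else the wildcard if the state
** allows one, else NO_MATCH. */
int match_token(const DemoState *s, int token){
  int i;
  assert( s!=0 && token>0 );
  for(i=0; i<s->nExplicit; i++){
    if( s->explicitTokens[i]==token ) return token;  /* always wins */
  }
  return s->acceptsWildcard ? TK_ANY : NO_MATCH;
}

/* A state that accepts COMMA, RP, or the wildcard. */
int demo_match(int token){
  static const int expected[] = { TK_COMMA, TK_RP };
  DemoState s;
  s.explicitTokens = expected;
  s.nExplicit = 2;
  s.acceptsWildcard = 1;
  return match_token(&s, token);
}
```

Here COMMA matches as COMMA even though the wildcard is available, and ID
matches only through the wildcard.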
|
|
|
|
5.0 Error Processing
|
|
|
|
After extensive experimentation over several years, it has been discovered
|
|
that the error recovery strategy used by yacc is about as good as it gets.
|
|
And so that is what Lemon uses.
|
|
|
|
When a Lemon-generated parser encounters a syntax error, it first invokes
|
|
the code specified by the %syntax_error directive, if any. It then enters
|
|
its error recovery strategy. The error recovery strategy is to begin
|
|
popping the parser's stack until it enters a state where it is permitted to
|
|
shift a special non-terminal symbol named "error". It then shifts this
|
|
non-terminal and continues parsing. The %syntax_error routine will not be
|
|
called again until at least three new tokens have been successfully
|
|
shifted.
|
|
|
|
If the parser pops its stack until the stack is empty, and it still is
|
|
unable to shift the error symbol, then the %parse_failure routine is
|
|
invoked and the parser resets itself to its start state, ready to begin
|
|
parsing a new file. This is what will happen at the very first syntax
|
|
error, of course, if there are no instances of the "error" non-terminal in
|
|
your grammar.
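The stack-popping step described above can be sketched in C. This is a
simplified model with hypothetical names, not the code in lempar.c: the
stack is an array of state numbers, and canShiftError[] says which states
are permitted to shift the "error" symbol. The rule about three
successfully shifted tokens is left out.

```c
#include <assert.h>

typedef enum { RECOVERED, PARSE_FAILED } Recovery;

/* Pop states until the top state may shift "error". On success the
** caller would shift "error" and continue parsing. If the stack
** empties first, recovery has failed and the %parse_failure code would
** run. *depth is updated to the number of entries left. */
Recovery recover_from_error(const int *stack, int *depth,
                            const int *canShiftError){
  assert( stack!=0 && depth!=0 && canShiftError!=0 );
  while( *depth>0 ){
    int state = stack[*depth - 1];
    if( canShiftError[state] ) return RECOVERED;
    (*depth)--;                   /* pop and try the state below */
  }
  return PARSE_FAILED;
}

/* Run recovery on a stack holding states 0, 3, 7, 9 (bottom to top).
** If withErrorRule is nonzero, state 3 can shift "error". Returns the
** remaining depth, or 0 if parsing failed. */
int demo_recover(int withErrorRule){
  int stack[] = { 0, 3, 7, 9 };
  int depth = 4;
  int canShiftError[10] = { 0 };
  if( withErrorRule ) canShiftError[3] = 1;
  if( recover_from_error(stack, &depth, canShiftError)==PARSE_FAILED ){
    return 0;
  }
  return depth;
}
```

With no state able to shift "error", the loop empties the stack on the
first syntax error, which matches the remark above about grammars that
never mention "error".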
|
|
|
|
6.0 History of Lemon
|
|
|
|
Lemon was originally written by Richard Hipp sometime in the late 1980s on
|
|
a Sun4 Workstation using K&R C. There was a companion LL(1) parser
|
|
generator program named "Lime", the source code to which has been lost.
|
|
|
|
The lemon.c source file was originally many separate files that were
|
|
compiled together to generate the "lemon" executable. Sometime in the
|
|
1990s, the individual source code files were combined together into the
|
|
current single large "lemon.c" source file. You can still see traces of
|
|
the original filenames in the code.
|
|
|
|
Since 2001, Lemon has been part of the SQLite project and the source code
|
|
to Lemon has been managed as a part of the SQLite source tree in the
|
|
following files:
|
|
|
|
* tool/lemon.c
|
|
* tool/lempar.c
|
|
* doc/lemon.html
|
|
|
|
7.0 Copyright
|
|
|
|
All of the source code to Lemon, including the template parser file
|
|
"lempar.c" and this documentation file ("lemon.html") are in the public
|
|
domain. You can use the code for any purpose and without attribution.
|
|
|
|
The code comes with no warranty. If it breaks, you get to keep both
|
|
pieces.
|