.\" $OpenBSD: lex.ms,v 1.3 2004/04/06 10:00:32 jmc Exp $
.\"
.\" Copyright (C) Caldera International Inc. 2001-2002.
.\" All rights reserved.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code and documentation must retain the above
.\" copyright notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed or owned by Caldera
.\" International, Inc.
.\" 4. Neither the name of Caldera International, Inc. nor the names of other
.\" contributors may be used to endorse or promote products derived from
.\" this software without specific prior written permission.
.\"
.\" USE OF THE SOFTWARE PROVIDED FOR UNDER THIS LICENSE BY CALDERA
.\" INTERNATIONAL, INC. AND CONTRIBUTORS ``AS IS'' AND ANY EXPRESS OR
.\" IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
.\" OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
.\" IN NO EVENT SHALL CALDERA INTERNATIONAL, INC. BE LIABLE FOR ANY DIRECT,
.\" INDIRECT INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
.\" (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
.\" SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT,
.\" STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
.\" IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
.\" POSSIBILITY OF SUCH DAMAGE.
.\"
.\" @(#)lex.ms 8.2 (Berkeley) 5/24/94
.\"
.if n \{\
|
|
.po 5n
|
|
.ll 70n
|
|
.\}
|
|
.EH 'PSD:16-%''Lex \- A Lexical Analyzer Generator'
|
|
.OH 'Lex \- A Lexical Analyzer Generator''PSD:16-%'
|
|
.hc ~
|
|
.bd I 2
|
|
.de TS
|
|
.br
|
|
.nf
|
|
.SP 1v
|
|
.ul 0
|
|
..
|
|
.de TE
|
|
.SP 1v
|
|
.fi
|
|
..
|
|
.\".de PT
|
|
.\".if \\n%>1 'tl ''\s7LEX\s0\s9\(mi%\s0''
|
|
.\".if \\n%>1 'sp
|
|
.\"..
|
|
.ND July 21, 1975
|
|
.\".RP
|
|
.\".TM 75-1274-15 39199 39199-11
|
|
.TL
|
|
Lex \- A Lexical Analyzer ~Generator~
|
|
.AU ``MH 2C-569'' 6377
|
|
M. E. Lesk and E. Schmidt
|
|
.AI
|
|
.\" .MH
|
|
.AB
|
|
.sp
|
|
.bd I 2
|
|
.\".nr PS 8
|
|
.\".nr VS 9
|
|
.\".ps 8
|
|
.\".vs 9p
|
|
Lex helps write programs whose control flow
|
|
is directed by instances of regular
|
|
expressions in the input stream.
|
|
It is well suited for editor-script type transformations and
|
|
for segmenting input in preparation for
|
|
a parsing routine.
|
|
.PP
|
|
Lex source is a table of regular expressions and corresponding program fragments.
|
|
The table is translated to a program
|
|
which reads an input stream, copying it to an output stream
|
|
and partitioning the input
|
|
into strings which match the given expressions.
|
|
As each such string is recognized the corresponding
|
|
program fragment is executed.
|
|
The recognition of the expressions
|
|
is performed by a deterministic finite automaton
|
|
generated by Lex.
|
|
The program fragments written by the user are executed in the order in which the
|
|
corresponding regular expressions occur in the input stream.
|
|
.\" .if n .if \n(tm .ig
|
|
.PP
|
|
The lexical analysis
|
|
programs written with Lex accept ambiguous specifications
|
|
and choose the longest
|
|
match possible at each input point.
|
|
If necessary, substantial look~ahead
|
|
is performed on the input, but the
|
|
input stream will be backed up to the
|
|
end of the current partition, so that the user
|
|
has general freedom to manipulate it.
|
|
.PP
|
|
Lex can generate analyzers in either C or C++. \**
|
|
.FS
|
|
Some versions of lex were able to produce Ratfor scanners.
|
|
Ratfor is a language which can be translated automatically to portable Fortran.
|
|
This implementation of lex does not support such scanners.
|
|
.FE
|
|
This manual, however, will only discuss generating analyzers
|
|
in C on the UNIX system.
|
|
For details on generating C++ scanners, see the manual page for lex(1).
|
|
Lex is designed to simplify
|
|
interfacing with Yacc, for those
|
|
with access to this compiler-compiler system.
|
|
\&..
|
|
.\".nr PS 9
|
|
.\".nr VS 11
|
|
.AE
|
|
.\" .2C
|
|
.NH
|
|
Introduction.
|
|
.PP
|
|
Lex is a program generator designed for
lexical processing of character input streams.
It accepts a high-level, problem-oriented specification
for character string matching,
and
produces a program in a general-purpose language which recognizes
regular expressions.
The regular expressions are specified by the user in the
source specifications given to Lex.
The Lex-written code recognizes these expressions
in an input stream and partitions the input stream into
strings matching the expressions. At the bound~aries
between strings,
program sections
provided by the user are executed.
The Lex source file associates the regular expressions and the
program fragments.
As each expression appears in the input to the program written by Lex,
the corresponding fragment is executed.
.PP
|
|
.de MH
|
|
Bell Laboratories, Murray Hill, NJ 07974.
|
|
..
|
|
The user supplies the additional code
beyond expression matching
needed to complete the task, possibly
including code written by other generators.
The program that recognizes the expressions is generated in the
general-purpose programming language employed for the
user's program fragments.
Thus, a high-level expression
language is provided to write the string expressions to be
matched while the user's freedom to write actions
is unimpaired.
This avoids forcing the user who wishes to use a string manipulation
language for input analysis to write processing programs in the same
and often inappropriate string handling language.
.PP
|
|
Lex is not a complete language, but rather a generator representing
|
|
a new language feature which can be added to
|
|
different programming languages, called ``host languages.''
|
|
Just as general purpose languages
|
|
can produce code to run on different computer hardware,
|
|
Lex can write code in different host languages.
|
|
The host language is used for the output code generated by Lex
|
|
and also for the program fragments added by the user.
|
|
Compatible run-time libraries for the different host languages
|
|
are also provided.
|
|
This makes Lex adaptable to different environments and
|
|
different users.
|
|
Each application
|
|
may be directed to the combination of hardware and host language appropriate
|
|
to the task, the user's background, and the properties of local
|
|
implementations.
|
|
At present, the only supported host languages are C and C++,
|
|
although Fortran (in the form of Ratfor [2]) has been available
|
|
in the past.
|
|
Lex itself exists on UNIX, GCOS, and OS/370; but the
|
|
code generated by Lex may be taken anywhere the appropriate
|
|
compilers exist.
|
|
.PP
|
|
Lex turns the user's expressions and actions
|
|
(called
|
|
.ul
|
|
source
|
|
in this memo) into the host general-purpose language;
|
|
the generated program is named
|
|
.ul
|
|
yylex.
|
|
The
|
|
.ul
|
|
yylex
|
|
program
|
|
will recognize expressions
|
|
in a stream
|
|
(called
|
|
.ul
|
|
input
|
|
in this memo)
|
|
and perform the specified actions for each expression as it is detected.
|
|
See Figure 1.
|
|
.DS C
Source \(-> Lex \(-> yylex

Input \(-> yylex \(-> Output

An overview of Lex
Figure 1
.DE
.PP
|
|
For a trivial example, consider a program to delete
|
|
from the input
|
|
all blanks or tabs at the ends of lines.
|
|
.DS I
|
|
%%
|
|
[ \et]+$ ;
|
|
.DE
|
|
.LP
|
|
is all that is required.
|
|
The program
|
|
contains a %% delimiter to mark the beginning of the rules, and
|
|
one rule.
|
|
This rule contains a regular expression
|
|
which matches one or more
|
|
instances of the characters blank or tab
|
|
(written \et for visibility, in accordance with the C language convention)
|
|
just prior to the end of a line.
|
|
The brackets indicate the character
|
|
class made of blank and tab; the + indicates ``one or more ...'';
|
|
and the $ indicates ``end of line,'' as in QED.
|
|
No action is specified,
|
|
so the program generated by Lex (yylex) will ignore these characters.
|
|
Everything else will be copied.
|
|
To change any remaining
|
|
string of blanks or tabs to a single blank,
|
|
add another rule:
|
|
.DS I
|
|
%%
|
|
[ \et]+$ ;
|
|
[ \et]+ printf(" ");
|
|
.DE
|
|
.LP
|
|
The finite automaton generated for this
|
|
source will scan for both rules at once,
|
|
observing at
|
|
the termination of the string of blanks or tabs
|
|
whether or not there is a newline character, and executing
|
|
the desired rule action.
|
|
The first rule matches all strings of blanks or tabs
|
|
at the end of lines, and the second
|
|
rule all remaining strings of blanks or tabs.
|
|
.PP
|
|
Lex can be used alone for simple transformations, or
|
|
for analysis and statistics gathering on a lexical level.
|
|
Lex can also be used with a parser generator
|
|
to perform the lexical analysis phase; it is particularly
|
|
easy to interface Lex and Yacc [3].
|
|
Lex programs recognize only regular expressions;
|
|
Yacc writes parsers that accept a large class of context free grammars,
|
|
but require a lower level analyzer to recognize input tokens.
|
|
Thus, a combination of Lex and Yacc is often appropriate.
|
|
When used as a preprocessor for a later parser generator,
|
|
Lex is used to partition the input stream,
|
|
and the parser generator assigns structure to
|
|
the resulting pieces.
|
|
The flow of control
|
|
in such a case (which might be the first half of a compiler,
|
|
for example) is shown in Figure 2.
|
|
Additional programs,
|
|
written by other generators
|
|
or by hand, can
|
|
be added easily to programs written by Lex.
|
|
.\" .BS 2
|
|
.ps 9
|
|
.vs 11
|
|
.DS C
lexical          grammar
rules            rules
\(da              \(da

Lex              Yacc

\(da              \(da

Input \(-> yylex \(-> yyparse \(-> Parsed input

Lex with Yacc
Figure 2
.DE
.ps 10
|
|
.vs 12
|
|
.\" .BE
|
|
.LP
|
|
Yacc users
|
|
will realize that the name
|
|
.ul
|
|
yylex
|
|
is what Yacc expects its lexical analyzer to be named,
|
|
so that the use of this name by Lex simplifies
|
|
interfacing.
|
|
.PP
|
|
Lex generates a deterministic finite automaton from the regular expressions
|
|
in the source [4].
|
|
The automaton is interpreted, rather than compiled, in order
|
|
to save space.
|
|
The result is still a fast analyzer.
|
|
In particular, the time taken by a Lex program
|
|
to recognize and partition an input stream is
|
|
proportional to the length of the input.
|
|
The number of Lex rules or
|
|
the complexity of the rules is
|
|
not important in determining speed,
|
|
unless rules which include
|
|
forward context require a significant amount of re~scanning.
|
|
What does increase with the number and complexity of rules
|
|
is the size of the finite
|
|
automaton, and therefore the size of the program
|
|
generated by Lex.
|
|
.PP
|
|
In the program written by Lex, the user's fragments
|
|
(representing the
|
|
.ul
|
|
actions
|
|
to be performed as each regular expression
|
|
is found)
|
|
are gathered
|
|
as cases of a switch.
|
|
The automaton interpreter directs the control flow.
|
|
Opportunity is provided for the user to insert either
|
|
declarations or additional statements in the routine containing
|
|
the actions, or to
|
|
add subroutines outside this action routine.
|
|
.PP
|
|
Lex is not limited to source which can
|
|
be interpreted on the basis of one character
|
|
look~ahead.
|
|
For example,
|
|
if there are two rules, one looking for
|
|
.I ab
|
|
and another for
|
|
.I abcdefg ,
|
|
and the input stream is
|
|
.I abcdefh ,
|
|
Lex will recognize
|
|
.I ab
|
|
and leave
|
|
the input pointer just before
|
|
.I "cd. . ."
|
|
Such backup is more costly
|
|
than the processing of simpler languages.
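.PP
The cost of such backup can be made concrete with a small sketch in C.
This is not the automaton that Lex generates; it is a hand-written
illustration which assumes that every rule is a literal string, and the names
.I scan_literals
and
.I scan_result
are invented for this example.
.DS I
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Outcome of one scanning step (hypothetical type for this sketch). */
struct scan_result {
    size_t matched;   /* length of the longest rule recognized, 0 if none */
    size_t examined;  /* number of input characters read before deciding */
};

/*
 * Scan forward while at least one literal rule still agrees with the
 * input read so far, remembering the last position at which a whole
 * rule had been seen.  The scanner must then back up to that position.
 */
struct scan_result
scan_literals(const char *input, const char *const rules[], size_t nrules)
{
    struct scan_result r = { 0, 0 };
    size_t pos = 0;

    assert(input != NULL && rules != NULL);
    for (;;) {
        int alive = 0;  /* can any rule still match input[0..pos)? */

        for (size_t i = 0; i < nrules; i++) {
            size_t len = strlen(rules[i]);

            if (len >= pos && strncmp(input, rules[i], pos) == 0) {
                alive = 1;
                if (len == pos)
                    r.matched = pos;  /* last accepting position */
            }
        }
        if (!alive || input[pos] == 0)
            break;
        pos++;
    }
    r.examined = pos;
    return r;
}
```
.DE
.LP
With the rules ab and abcdefg and the input abcdefh,
seven characters are examined but only two are matched;
the remaining five must be read again on later calls.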
|
|
.NH
|
|
Lex Source.
|
|
.PP
|
|
The general format of Lex source is:
|
|
.DS I
|
|
{definitions}
|
|
%%
|
|
{rules}
|
|
%%
|
|
{user subroutines}
|
|
.DE
|
|
.LP
|
|
where the definitions and the user subroutines
|
|
are often omitted.
|
|
The second
|
|
.I %%
|
|
is optional, but the first is required
|
|
to mark the beginning of the rules.
|
|
The absolute minimum Lex program is thus
|
|
.DS I
|
|
%%
|
|
.DE
|
|
.LP
|
|
(no definitions, no rules) which translates into a program
|
|
which copies the input to the output unchanged.
|
|
.PP
|
|
In the outline of Lex programs shown above, the
|
|
.I
|
|
rules
|
|
.R
|
|
represent the user's control
|
|
decisions; they are a table, in which the left column
|
|
contains
|
|
.I
|
|
regular expressions
|
|
.R
|
|
(see section 3)
|
|
and the right column contains
|
|
.I
|
|
actions,
|
|
.R
|
|
program fragments to be executed when the expressions
|
|
are recognized.
|
|
Thus an individual rule might appear
|
|
.DS I
|
|
integer printf("found keyword INT");
|
|
.DE
|
|
.LP
|
|
to look for the string
|
|
.I integer
|
|
in the input stream and
|
|
print the message ``found keyword INT'' whenever it appears.
|
|
In this example the host procedural language is C and
|
|
the C library function
|
|
.I
|
|
printf
|
|
.R
|
|
is used to print the string.
|
|
The end
|
|
of the expression is indicated by the first blank or tab character.
|
|
If the action is merely a single C expression,
|
|
it can just be given on the right side of the line; if it is
|
|
compound, or takes more than a line, it should be enclosed in
|
|
braces.
|
|
As a slightly more useful example, suppose it is desired to
|
|
change a number of words from British to American spelling.
|
|
Lex rules such as
|
|
.DS I
|
|
colour printf("color");
|
|
mechanise printf("mechanize");
|
|
petrol printf("gas");
|
|
.DE
|
|
.LP
|
|
would be a start. These rules are not quite enough,
|
|
since
|
|
the word
|
|
.I petroleum
|
|
would become
|
|
.I gaseum ;
|
|
a way of dealing
|
|
with this will be described later.
|
|
.NH
|
|
Lex Regular Expressions.
|
|
.PP
|
|
The definitions of regular expressions are very similar to those
|
|
in QED [5].
|
|
A regular
|
|
expression specifies a set of strings to be matched.
|
|
It contains text characters (which match the corresponding
|
|
characters in the strings being compared)
|
|
and operator characters (which specify
|
|
repetitions, choices, and other features).
|
|
The letters of the alphabet and the digits are
|
|
always text characters; thus the regular expression
|
|
.DS I
|
|
integer
|
|
.DE
|
|
.LP
|
|
matches the string
|
|
.ul
|
|
integer
|
|
wherever it appears
|
|
and the expression
|
|
.DS I
|
|
a57D
|
|
.DE
|
|
.LP
|
|
looks for the string
|
|
.ul
|
|
a57D.
|
|
.PP
|
|
.I
|
|
Operators.
|
|
.R
|
|
The operator characters are
|
|
.DS I
|
|
" \e [ ] ^ \- ? . \(** + | ( ) $ / { } % < >
|
|
.DE
|
|
.LP
|
|
and if they are to be used as text characters, an escape
|
|
should be used.
|
|
The quotation mark operator (")
|
|
indicates that whatever is contained between a pair of quotes
|
|
is to be taken as text characters.
|
|
Thus
|
|
.DS I
|
|
xyz"++"
|
|
.DE
|
|
.LP
|
|
matches the string
|
|
.I xyz++
|
|
when it appears. Note that a part of a string may be quoted.
|
|
It is harmless but unnecessary to quote an ordinary
|
|
text character; the expression
|
|
.DS I
|
|
"xyz++"
|
|
.DE
|
|
.LP
|
|
is the same as the one above.
|
|
Thus by quoting every non-alphanumeric character
|
|
being used as a text character, the user can avoid remembering
|
|
the list above of current
|
|
operator characters, and is safe should further extensions to Lex
|
|
lengthen the list.
|
|
.PP
|
|
An operator character may also be turned into a text character
|
|
by preceding it with \e as in
|
|
.DS I
|
|
xyz\e+\e+
|
|
.DE
|
|
.LP
|
|
which
|
|
is another, less readable, equivalent of the above expressions.
|
|
Another use of the quoting mechanism is to get a blank into
|
|
an expression; normally, as explained above, blanks or tabs end
|
|
a rule.
|
|
Any blank character not contained within [\|] (see below) must
|
|
be quoted.
|
|
Several normal C escapes with \e
|
|
are recognized: \en is newline, \et is tab, and \eb is backspace.
|
|
To enter \e itself, use \e\e.
|
|
Since newline is illegal in an expression, \en must be used;
|
|
it is not
|
|
required to escape tab and backspace.
|
|
Every character but blank, tab, newline and the list above is always
|
|
a text character.
|
|
.PP
|
|
.I
|
|
Character classes.
|
|
.R
|
|
Classes of characters can be specified using the operator pair [\|].
|
|
The construction
|
|
.I [abc]
|
|
matches a
|
|
single character, which may be
|
|
.I a ,
|
|
.I b ,
|
|
or
|
|
.I c .
|
|
Within square brackets,
|
|
most operator meanings are ignored.
|
|
Only three characters are special:
|
|
these are \e \(mi and ^. The \(mi character
|
|
indicates ranges. For example,
|
|
.DS I
|
|
[a\(miz0\(mi9<>_]
|
|
.DE
|
|
.LP
|
|
indicates the character class containing all the lower case letters,
|
|
the digits,
|
|
the angle brackets, and underline.
|
|
.\" Ranges may be given in either order.
|
|
Using \(mi between any pair of characters which are
|
|
not both upper case letters, both lower case letters, or both digits
|
|
is implementation dependent.
|
|
.\" and will get a warning message.
|
|
(E.g., [0\-z] in ASCII is many more characters
|
|
than it is in EBCDIC).
|
|
If it is desired to include the
|
|
character \(mi in a character class, it should be first or
|
|
last; thus
|
|
.DS I
|
|
[\(mi+0\(mi9]
|
|
.DE
|
|
.LP
|
|
matches all the digits and the two signs.
|
|
.PP
|
|
In character classes,
|
|
the ^ operator must appear as the first character
|
|
after the left bracket; it indicates that the resulting string
|
|
is to be complemented with respect to the computer character set.
|
|
Thus
|
|
.DS I
|
|
[^abc]
|
|
.DE
|
|
.LP
|
|
matches all characters except a, b, or c, including
|
|
all special or control characters; or
|
|
.DS I
|
|
[^a\-zA\-Z]
|
|
.DE
|
|
.LP
|
|
is any character which is not a letter.
|
|
The \e character provides the usual escapes within
|
|
character class brackets.
|
|
.PP
|
|
.I
|
|
Arbitrary character.
|
|
.R
|
|
To match almost any character, use the operator character
.DS I
\&.
.DE
.LP
which denotes the class of all characters except newline.
Escaping into octal is possible although non-portable:
|
|
.DS I
|
|
[\e40\-\e176]
|
|
.DE
|
|
.LP
|
|
matches all printable characters in the ASCII character set, from octal
|
|
40 (blank) to octal 176 (tilde).
|
|
.PP
|
|
.I
|
|
Optional expressions.
|
|
.R
|
|
The operator
|
|
.I ?
|
|
indicates
|
|
an optional element of an expression.
|
|
Thus
|
|
.DS I
|
|
ab?c
|
|
.DE
|
|
.LP
|
|
matches either
|
|
.I ac
|
|
or
|
|
.I abc .
|
|
.PP
|
|
.I
|
|
Repeated expressions.
|
|
.R
|
|
Repetitions of classes are indicated by the operators
|
|
.I \(**
|
|
and
|
|
.I + .
|
|
.DS I
|
|
\f2a\(**\f1
|
|
.DE
|
|
.LP
|
|
is any number of consecutive
|
|
.I a
|
|
characters, including zero; while
|
|
.DS I
|
|
a+
|
|
.DE
|
|
.LP
|
|
is one or more instances of
|
|
.I a.
|
|
For example,
|
|
.DS I
|
|
[a\-z]+
|
|
.DE
|
|
.LP
|
|
is all strings of lower case letters.
|
|
And
|
|
.DS
|
|
[A\(miZa\(miz][A\(miZa\(miz0\(mi9]\(**
|
|
.DE
|
|
.LP
|
|
indicates all alphanumeric strings with a leading
|
|
alphabetic character.
|
|
This is a typical expression for recognizing identifiers in
|
|
computer languages.
|
|
.PP
|
|
.I
|
|
Alternation and Grouping.
|
|
.R
|
|
The operator |
|
|
indicates alternation:
|
|
.DS I
|
|
(ab\||\|cd)
|
|
.DE
|
|
.LP
|
|
matches either
|
|
.ul
|
|
ab
|
|
or
|
|
.ul
|
|
cd.
|
|
Note that parentheses are used for grouping, although
|
|
they are
|
|
not necessary on the outside level;
|
|
.DS I
|
|
ab\||\|cd
|
|
.DE
|
|
.LP
|
|
would have sufficed.
|
|
Parentheses
|
|
can be used for more complex expressions:
|
|
.DS I
|
|
(ab\||\|cd+)?(ef)\(**
|
|
.DE
|
|
.LP
|
|
matches such strings as
|
|
.I abefef ,
|
|
.I efefef ,
|
|
.I cdef ,
|
|
or
|
|
.I cddd\| ;
|
|
but not
|
|
.I abc ,
|
|
.I abcd ,
|
|
or
|
|
.I abcdef .
|
|
.PP
|
|
.I
|
|
Context sensitivity.
|
|
.R
|
|
Lex will recognize a small amount of surrounding
|
|
context. The two simplest operators for this are
|
|
.I ^
|
|
and
|
|
.I $ .
|
|
If the first character of an expression is
|
|
.I ^ ,
|
|
the expression will only be matched at the beginning
|
|
of a line (after a newline character, or at the beginning of
|
|
the input stream).
|
|
This can never conflict with the other meaning of
|
|
.I ^ ,
|
|
complementation
|
|
of character classes, since that only applies within
|
|
the [\|] operators.
|
|
If the very last character is
|
|
.I $ ,
|
|
the expression will only be matched at the end of a line (when
|
|
immediately followed by newline).
|
|
The latter operator is a special case of the
|
|
.I /
|
|
operator character,
|
|
which indicates trailing context.
|
|
The expression
|
|
.DS I
|
|
ab/cd
|
|
.DE
|
|
.LP
|
|
matches the string
|
|
.I ab ,
|
|
but only if followed by
|
|
.ul
|
|
cd.
|
|
Thus
|
|
.DS I
|
|
ab$
|
|
.DE
|
|
.LP
|
|
is the same as
|
|
.DS I
|
|
ab/\en
|
|
.DE
|
|
.LP
|
|
Left context is handled in Lex by
|
|
.I
|
|
start conditions
|
|
.R
|
|
as explained in section 10. If a rule is only to be executed
|
|
when the Lex automaton interpreter is in start condition
|
|
.I
|
|
x,
|
|
.R
|
|
the rule should be prefixed by
|
|
.DS I
|
|
<x>
|
|
.DE
|
|
.LP
|
|
using the angle bracket operator characters.
|
|
If we considered ``being at the beginning of a line'' to be
|
|
start condition
|
|
.I ONE ,
|
|
then the ^ operator
|
|
would be equivalent to
|
|
.DS I
|
|
<ONE>
|
|
.DE
|
|
.LP
|
|
Start conditions are explained more fully later.
|
|
.PP
|
|
.I
|
|
Repetitions and Definitions.
|
|
.R
|
|
The operators {} specify
|
|
either repetitions (if they enclose numbers)
|
|
or
|
|
definition expansion (if they enclose a name). For example
|
|
.DS I
|
|
{digit}
|
|
.DE
|
|
.LP
|
|
looks for a predefined string named
|
|
.I digit
|
|
and inserts it
|
|
at that point in the expression.
|
|
The definitions are given in the first part of the Lex
|
|
input, before the rules.
|
|
In contrast,
|
|
.DS I
|
|
a{1,5}
|
|
.DE
|
|
.LP
|
|
looks for 1 to 5 occurrences of
|
|
.I a .
|
|
.PP
|
|
Finally, initial
|
|
.I %
|
|
is special, being the separator
|
|
for Lex source segments.
|
|
.NH
|
|
Lex Actions.
|
|
.PP
|
|
When an expression written as above is matched, Lex
|
|
executes the corresponding action. This section describes
|
|
some features of Lex which aid in writing actions. Note
|
|
that there is a default action, which
|
|
consists of copying the input to the output. This
|
|
is performed on all strings not otherwise matched. Thus
|
|
the Lex user who wishes to absorb the entire input, without
|
|
producing any output, must provide rules to match everything.
|
|
When Lex is being used with Yacc, this is the normal
|
|
situation.
|
|
One may consider that actions are what is done instead of
|
|
copying the input to the output; thus, in general,
|
|
a rule which merely copies can be omitted.
|
|
Also, a character combination
|
|
which is omitted from the rules
|
|
and which appears as input
|
|
is likely to be printed on the output, thus calling
|
|
attention to the gap in the rules.
|
|
.PP
|
|
One of the simplest things that can be done is to ignore
|
|
the input. Specifying a C null statement (`;') as an action
|
|
causes this result. A frequent rule is
|
|
.DS I
|
|
[ \et\en] ;
|
|
.DE
|
|
.LP
|
|
which causes the three spacing characters (blank, tab, and newline)
|
|
to be ignored.
|
|
.PP
|
|
Another easy way to avoid writing actions is the action character
|
|
|, which indicates that the action for this rule is the action
|
|
for the next rule.
|
|
The previous example could also have been written
|
|
.DS I
|
|
" " |
|
|
"\et" |
|
|
"\en" ;
|
|
.DE
|
|
.LP
|
|
with the same result, although in different style.
|
|
The quotes around \en and \et are not required.
|
|
.PP
|
|
In more complex actions, the user
|
|
will
|
|
often want to know the actual text that matched some expression
|
|
like
|
|
.I [a\(miz]+ .
|
|
Lex leaves this text in an external character
|
|
array named
|
|
.I
|
|
yytext.
|
|
.R
|
|
Thus, to print the name found,
|
|
a rule like
|
|
.DS I
|
|
[a\-z]+ printf("%s", yytext);
|
|
.DE
|
|
.LP
|
|
will print
|
|
the string in
|
|
.I
|
|
yytext.
|
|
.R
|
|
The C function
|
|
.I
|
|
printf
|
|
.R
|
|
accepts a format argument and data to be printed;
|
|
in this case, the format is ``print string'' (% indicating
|
|
data conversion, and
|
|
.I s
|
|
indicating string type),
|
|
and the data are the characters
|
|
in
|
|
.I
|
|
yytext.
|
|
.R
|
|
So this just places
|
|
the matched string
|
|
on the output.
|
|
This action
|
|
is so common that
|
|
it may be written as ECHO:
|
|
.DS I
|
|
[a\-z]+ ECHO;
|
|
.DE
|
|
.LP
|
|
is the same as the above.
|
|
Since the default action is just to
|
|
print the characters found, one might ask why
|
|
give a rule, like this one, which merely specifies
|
|
the default action?
|
|
Such rules are often required
|
|
to avoid matching some other rule
|
|
which is not desired. For example, if there is a rule
|
|
which matches
|
|
.I read
|
|
it will normally match the instances of
|
|
.I read
|
|
contained in
|
|
.I bread
|
|
or
|
|
.I readjust ;
|
|
to avoid
|
|
this,
|
|
a rule
|
|
of the form
|
|
.I [a\(miz]+
|
|
is needed.
|
|
This is explained further below.
|
|
.PP
|
|
Sometimes it is more convenient to know the end of what
|
|
has been found; hence Lex also provides a count
|
|
.I
|
|
yyleng
|
|
.R
|
|
of the number of characters matched.
|
|
To count both the number
|
|
of words and the number of characters in words in the input, the user might write
|
|
.DS I
|
|
[a\-zA\-Z]+ {words++; chars += yyleng;}
|
|
.DE
|
|
.LP
|
|
which accumulates in
|
|
.ul
|
|
chars
|
|
the number
|
|
of characters in the words recognized.
|
|
The last character in the string matched can
|
|
be accessed by
|
|
.DS I
|
|
yytext[yyleng\-1]
|
|
.DE
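.PP
For comparison, the counting rule above can be written out by hand in C.
The sketch below is only an illustration of what the rule computes:
the function
.I count_words
and the structure
.I counts
are invented names, the local variable
.I yyleng
is a stand-in for the Lex variable, and letters are assumed to be those
accepted by
.I isalpha
in the C locale.
.DS I
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Totals kept by the rule in the text; the struct itself is hypothetical. */
struct counts {
    long words;
    long chars;
};

/*
 * Hand-written equivalent of the rule
 *     [a-zA-Z]+  { words++; chars += yyleng; }
 * Each maximal run of letters plays the part of yytext, and its length
 * the part of yyleng.  Other characters are skipped here, whereas a Lex
 * program would copy them to the output by default.
 */
struct counts
count_words(const char *text)
{
    struct counts c = { 0, 0 };
    size_t i = 0;

    assert(text != NULL);
    while (text[i] != 0) {
        size_t yyleng = 0;  /* stand-in for the Lex variable of that name */

        while (isalpha((unsigned char)text[i + yyleng]))
            yyleng++;
        if (yyleng > 0) {
            c.words++;
            c.chars += (long)yyleng;
            i += yyleng;
        } else {
            i++;  /* not a letter: the rule does not apply, move on */
        }
    }
    return c;
}
```
.DE
.LP
The Lex rule is one line; the hand-written loop must find the end of
each word itself, which is the work the generated automaton takes over.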
|
|
.PP
|
|
Occasionally, a Lex
|
|
action may decide that a rule has not recognized the correct
|
|
span of characters.
|
|
Two routines are provided to aid with this situation.
|
|
First,
|
|
.I
|
|
yymore()
|
|
.R
|
|
can be called to indicate that the next input expression recognized is to be
|
|
tacked on to the end of this input. Normally,
|
|
the next input string would overwrite the current
|
|
entry in
|
|
.I
|
|
yytext.
|
|
.R
|
|
Second,
|
|
.I
|
|
yyless (n)
|
|
.R
|
|
may be called to indicate that not all the characters matched
|
|
by the currently successful expression are wanted right now.
|
|
The argument
|
|
.I
|
|
n
|
|
.R
|
|
indicates the number of characters
|
|
in
|
|
.I
|
|
yytext
|
|
.R
|
|
to be retained.
|
|
Further characters previously matched
|
|
are
|
|
returned to the input. This provides the same sort of
|
|
look~ahead offered by the / operator,
|
|
but in a different form.
|
|
.PP
|
|
.I
|
|
Example:
|
|
.R
|
|
Consider a language which defines
|
|
a string as a set of characters between quotation (") marks, and provides that
|
|
to include a " in a string it must be preceded by a \e. The
|
|
regular expression which matches that is somewhat confusing,
|
|
so that it might be preferable to write
|
|
.DS I
|
|
\e"[^"]\(** {
|
|
if (yytext[yyleng\-1] == \(fm\e\e\(fm)
|
|
yymore();
|
|
else
|
|
... normal user processing
|
|
}
|
|
.DE
|
|
.LP
|
|
which will, when faced with a string such as
|
|
.I
|
|
"abc\e"def\|"
|
|
.R
|
|
first match
|
|
the five characters
|
|
\fI"abc\e\|\fR;
|
|
then
|
|
the call to
|
|
.I yymore()
|
|
will
|
|
cause the next part of the string,
|
|
\fI"def\|\fR,
|
|
to be tacked on the end.
|
|
Note that the final quote terminating the string should be picked
up in the code labeled ``normal user processing''.
.PP
|
|
The function
|
|
.I
|
|
yyless()
|
|
.R
|
|
might be used to reprocess
|
|
text in various circumstances. Consider the problem, in C, of resolving
the ambiguity of ``=\(mia''.
Suppose it is desired to treat this as ``=\(mi a''
|
|
but print a message. A rule might be
|
|
.ps 9
|
|
.vs 11
|
|
.DS I
|
|
=\(mi[a\-zA\-Z] {
|
|
printf("Op (=\(mi) ambiguous\en");
|
|
yyless(yyleng\-1);
|
|
... action for =\(mi ...
|
|
}
|
|
.DE
|
|
.ps 10
|
|
.vs 12
|
|
which prints a message, returns the letter after the
|
|
operator to the input stream, and treats the operator as ``=\(mi''.
|
|
Alternatively it might be desired to treat this as ``= \(mia''.
|
|
To do this, just return the minus
|
|
sign as well as the letter to the input:
|
|
.ps 9
|
|
.vs 11
|
|
.DS I
|
|
=\(mi[a\-zA\-Z] {
|
|
printf("Op (=\(mi) ambiguous\en");
|
|
yyless(yyleng\-2);
|
|
... action for = ...
|
|
}
|
|
.DE
|
|
.ps 10
|
|
.vs 12
|
|
will perform the other interpretation.
|
|
Note that the expressions for the two cases might more easily
|
|
be written
|
|
.DS I
|
|
=\(mi/[A\-Za\-z]
|
|
.DE
|
|
.LP
|
|
in the first case and
|
|
.DS I
|
|
=/\-[A\-Za\-z]
|
|
.DE
|
|
.LP
|
|
in the second;
|
|
no backup would be required in the rule action.
|
|
It is not necessary to recognize the whole identifier
|
|
to observe the ambiguity.
|
|
The
|
|
possibility of ``=\(mi3'', however, makes
|
|
.DS I
|
|
=\(mi/[^ \et\en]
|
|
.DE
|
|
a still better rule.
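.PP
The effect of
.I yyless
on the input position can be modelled in a few lines of C.
This is a sketch under simplifying assumptions, not the code Lex uses:
the whole input is taken to be one string in memory, and the names
.I scanner ,
.I accept_match
and
.I model_yyless
are invented for the illustration.
.DS I
```c
#include <assert.h>
#include <stddef.h>

/* Minimal model of a scanner position; not the structure Lex uses. */
struct scanner {
    const char *next;    /* first character not yet consumed */
    const char *yytext;  /* start of the current match */
    size_t yyleng;       /* length of the current match */
};

/*
 * Record that the n characters at the input position have been matched.
 * The caller must ensure that at least n characters remain.
 */
void
accept_match(struct scanner *s, size_t n)
{
    assert(s != NULL);
    s->yytext = s->next;
    s->yyleng = n;
    s->next += n;
}

/* Model of yyless(n): keep the first n characters, return the rest. */
void
model_yyless(struct scanner *s, size_t n)
{
    assert(s != NULL && n <= s->yyleng);
    s->next = s->yytext + n;
    s->yyleng = n;
}
```
.DE
.LP
After a three character match on the operator followed by a letter,
retaining all but one character leaves the letter as the next input
character, and retaining all but two leaves the minus sign.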
|
|
.PP
|
|
In addition to these routines, Lex also permits
|
|
access to the I/O routines
|
|
it uses. \**
|
|
.FS
|
|
Note: The output() routine is not supported in this version of Lex.
|
|
See yyout instead.
|
|
.FE
|
|
They are:
|
|
.IP 1)
|
|
.I
|
|
input()
|
|
.R
|
|
which returns the next input character; and
|
|
.IP 2)
|
|
.I
|
|
unput(c)
|
|
.R
|
|
which pushes the character
|
|
.I
|
|
c
|
|
.R
|
|
back onto the input stream to be read later by
|
|
.I
|
|
input().
|
|
.R
|
|
.LP
|
|
By default these routines are provided as macro definitions,
|
|
but the user can override them and supply private versions.
|
|
These routines
|
|
define the relationship between external files and
|
|
internal characters, and must all be retained
|
|
or modified consistently.
|
|
They may be redefined, to
|
|
cause input or output to be transmitted to or from strange
|
|
places, including other programs or internal memory;
|
|
but the character set used must be consistent in all routines;
|
|
a value of zero returned by
|
|
.I
|
|
input
|
|
.R
|
|
must mean end of file; and
|
|
the relationship between
|
|
.I
|
|
unput
|
|
.R
|
|
and
|
|
.I
|
|
input
|
|
.R
|
|
must be retained
|
|
or the Lex look~ahead will not work.
|
|
Lex does not look ahead at all if it does not have to,
|
|
but every rule ending in
|
|
.ft I
|
|
+ \(** ?
|
|
.ft R
|
|
or
|
|
.ft I
|
|
$
|
|
.ft R
|
|
or containing
|
|
.ft I
|
|
/
|
|
.ft R
|
|
implies look~ahead.
|
|
Look~ahead is also necessary to match an expression that is a prefix
|
|
of another expression.
|
|
See below for a discussion of the character set used by Lex.
|
|
The standard Lex library imposes
|
|
a 100 character limit on backup.
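.PP
The relationship that private versions of these routines must preserve
can be sketched as follows.
The code is an assumption-laden model, not the library source:
input is read from a string in memory, and
.I source ,
.I model_input ,
.I model_unput
and
.I PUSHBACK_MAX
are names invented for the example.
.DS I
```c
#include <assert.h>
#include <stddef.h>

#define PUSHBACK_MAX 100  /* mirrors the backup limit quoted in the text */

/* Hypothetical in-memory source standing in for the scanner's input. */
struct source {
    const char *text;           /* remaining unread characters */
    char pushed[PUSHBACK_MAX];  /* characters returned by model_unput */
    size_t npushed;
};

/* Return the next character, or 0 at end of input. */
int
model_input(struct source *s)
{
    assert(s != NULL && s->text != NULL);
    if (s->npushed > 0)
        return (unsigned char)s->pushed[--s->npushed];
    if (*s->text == 0)
        return 0;
    return (unsigned char)*s->text++;
}

/* Push c back so that the next call to model_input returns it. */
int
model_unput(struct source *s, int c)
{
    assert(s != NULL);
    if (s->npushed == PUSHBACK_MAX)
        return -1;  /* pushback buffer full */
    s->pushed[s->npushed++] = (char)c;
    return 0;
}
```
.DE
.LP
Characters pushed back are read again in the reverse of the order in
which they were pushed, and a return value of zero is reserved for
end of file.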
|
|
.PP
|
|
Another Lex library routine that the user will sometimes want
|
|
to redefine is
|
|
.I
|
|
yywrap()
|
|
.R
|
|
which is called whenever Lex reaches an end-of-file.
|
|
If
|
|
.I
|
|
yywrap
|
|
.R
|
|
returns a 1, Lex continues with the normal wrapup on end of input.
|
|
Sometimes, however, it is convenient to arrange for more
|
|
input to arrive
|
|
from a new source.
|
|
In this case, the user should provide
|
|
a
|
|
.I
|
|
yywrap
|
|
.R
|
|
which
|
|
arranges for new input and
|
|
returns 0. This instructs Lex to continue processing.
|
|
The default
|
|
.I
|
|
yywrap
|
|
.R
|
|
always returns 1.
|
|
.PP
|
|
This routine is also a convenient place
|
|
to print tables, summaries, etc. at the end
|
|
of a program. Note that it is not
|
|
possible to write a normal rule which recognizes
|
|
end-of-file; the only access to this condition is
|
|
through
|
|
.I
|
|
yywrap.
|
|
.R
|
|
In fact, unless a private version of
|
|
.I
|
|
input()
|
|
.R
|
|
is supplied
|
|
a file containing nulls
|
|
cannot be handled,
|
|
since a value of 0 returned by
|
|
.I
|
|
input
|
|
.R
|
|
is taken to be end-of-file.
|
|
.PP
|
|
.NH
|
|
Ambiguous Source Rules.
|
|
.PP
|
|
Lex can handle ambiguous specifications.
|
|
When more than one expression can match the
|
|
current input, Lex chooses as follows:
|
|
.IP 1)
|
|
The longest match is preferred.
|
|
.IP 2)
|
|
Among rules which matched the same number of characters,
|
|
the rule given first is preferred.
|
|
.LP
|
|
Thus, suppose the rules
|
|
.DS I
|
|
integer keyword action ...;
|
|
[a\-z]+ identifier action ...;
|
|
.DE
|
|
.LP
|
|
to be given in that order. If the input is
|
|
.I integers ,
|
|
it is taken as an identifier, because
|
|
.I [a\-z]+
|
|
matches 8 characters while
|
|
.I integer
|
|
matches only 7.
|
|
If the input is
|
|
.I integer ,
|
|
both rules match 7 characters, and
|
|
the keyword rule is selected because it was given first.
|
|
Anything shorter (e.g. \fIint\fR\|) will
|
|
not match the expression
|
|
.I integer
|
|
and so the identifier interpretation is used.
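.PP
The two preference rules can be sketched in C, as an illustration rather than
a description of how Lex is implemented.
The functions \fImatch_keyword\fR, \fImatch_identifier\fR and \fIchoose_rule\fR
are hypothetical names, and the letter test assumes a character set such as
ASCII in which the lower case letters are contiguous.
.DS I
```c
#include <stddef.h>
#include <string.h>

/* Length of the literal keyword "integer" at the start of s, or 0. */
static size_t match_keyword(const char *s)
{
    static const char kw[] = "integer";
    size_t n = sizeof kw - 1;
    return strncmp(s, kw, n) == 0 ? n : 0;
}

/* Length of the longest run of lower case letters at the start of s. */
static size_t match_identifier(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Returns 0 for the keyword rule, 1 for the identifier rule, -1 for no match.
   The matched length is stored in *len. */
int choose_rule(const char *s, size_t *len)
{
    size_t (*rules[])(const char *) = { match_keyword, match_identifier };
    int best = -1;
    size_t i;

    *len = 0;
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = rules[i](s);
        /* Strictly longer wins; on a tie the earlier rule is kept. */
        if (n > *len) {
            *len = n;
            best = (int)i;
        }
    }
    return best;
}
```
.DE
.LP
Under these assumptions \fIintegers\fR selects the identifier rule with a
length of 8, \fIinteger\fR selects the keyword rule with a length of 7,
and \fIint\fR selects the identifier rule with a length of 3.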
|
|
.PP
|
|
The principle of preferring the longest
|
|
match makes rules containing
|
|
expressions like
|
|
.I \&.\(**
|
|
dangerous.
|
|
For example,
|
|
.DS I
|
|
\&\(fm.\(**\(fm
|
|
.DE
|
|
.LP
|
|
might seem a good way of recognizing
|
|
a string in single quotes.
|
|
But it is an invitation for the program to read far
|
|
ahead, looking for a distant
|
|
single quote.
|
|
Presented with the input
|
|
.DS I
|
|
\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm here
|
|
.DE
|
|
.LP
|
|
the above expression will match
|
|
.DS I
|
|
\&\(fmfirst\(fm quoted string here, \(fmsecond\(fm
|
|
.DE
|
|
.LP
|
|
which is probably not what was wanted.
|
|
A better rule is of the form
|
|
.DS I
|
|
\&\(fm[^\(fm\en]\(**\(fm
|
|
.DE
|
|
.LP
|
|
which, on the above input, will stop
|
|
after
|
|
.I \(fmfirst\(fm .
|
|
The consequences
|
|
of errors like this are mitigated by the fact
|
|
that the
|
|
.I \&.
|
|
operator will not match newline.
|
|
Thus expressions like
|
|
.I \&.\(**
|
|
stop on the
|
|
current line.
|
|
Don't try to defeat this with expressions like
|
|
.I (.|\en)+
|
|
or
|
|
equivalents;
|
|
the Lex generated program will try to read
|
|
the entire input file, causing
|
|
internal buffer overflows.
|
|
.PP
|
|
Note that Lex is normally partitioning
|
|
the input stream, not searching for all possible matches
|
|
of each expression.
|
|
This means that each character is accounted for
|
|
once and only once.
|
|
For example, suppose it is desired to
|
|
count occurrences of both \fIshe\fR and \fIhe\fR in an input text.
|
|
Some Lex rules to do this might be
|
|
.DS I
|
|
she s++;
|
|
he h++;
|
|
\en |
|
|
\&. ;
|
|
.DE
|
|
.LP
|
|
where the last two rules ignore everything besides \fIhe\fR and \fIshe\fR.
|
|
Remember that . does not include newline.
|
|
Since \fIshe\fR includes \fIhe\fR, Lex will normally
|
|
.I
|
|
not
|
|
.R
|
|
recognize
|
|
the instances of \fIhe\fR included in \fIshe\fR,
|
|
since once it has passed a \fIshe\fR those characters are gone.
|
|
.PP
|
|
Sometimes the user would like to override this choice. The action
|
|
REJECT
|
|
means ``go do the next alternative.''
|
|
It causes whatever rule was second choice after the current
|
|
rule to be executed.
|
|
The position of the input pointer is adjusted accordingly.
|
|
Suppose the user really wants to count the included instances of \fIhe\fR:
|
|
.DS I
|
|
she {s++; REJECT;}
|
|
he {h++; REJECT;}
|
|
\en |
|
|
\&. ;
|
|
.DE
|
|
.LP
|
|
These rules are one way of changing the previous example
|
|
to do just that.
|
|
After counting each expression, it is rejected; whenever appropriate,
|
|
the other expression will then be counted. In this example, of course,
|
|
the user could note that \fIshe\fR includes \fIhe\fR but not
|
|
vice versa, and omit the REJECT action on \fIhe\fR;
|
|
in other cases, however, it
|
|
would not be possible a priori to tell
|
|
which input characters
|
|
were in both classes.
|
|
.PP
|
|
Consider the two rules
|
|
.DS I
|
|
a[bc]+ { ... ; REJECT;}
|
|
a[cd]+ { ... ; REJECT;}
|
|
.DE
|
|
.LP
|
|
If the input is
|
|
.I ab ,
|
|
only the first rule matches,
|
|
and on
|
|
.I ad
|
|
only the second matches.
|
|
The input string
|
|
.I accb
|
|
matches the first rule for four characters
|
|
and then the second rule for three characters.
|
|
In contrast, the input
|
|
.I accd
|
|
agrees with
|
|
the second rule for four characters and then the first
|
|
rule for three.
|
|
.PP
|
|
In general, REJECT is useful whenever
|
|
the purpose of Lex is not to partition the input
|
|
stream but to detect all examples of some items
|
|
in the input, and the instances of these items
|
|
may overlap or include each other.
|
|
Suppose a digram table of the input is desired;
|
|
normally the digrams overlap, that is the word
|
|
.I the
|
|
is considered to contain
|
|
both
|
|
.I th
|
|
and
|
|
.I he .
|
|
Assuming a two-dimensional array named
|
|
.ul
|
|
digram
|
|
to be incremented, the appropriate
|
|
source is
|
|
.DS I
|
|
%%
|
|
[a\-z][a\-z] {
|
|
digram[yytext[0]][yytext[1]]++;
|
|
REJECT;
|
|
}
|
|
\&. ;
|
|
\en ;
|
|
.DE
|
|
.LP
|
|
where the REJECT is necessary to pick up
|
|
a letter pair beginning at every character, rather than at every
|
|
other character.
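.PP
For comparison, the table that the REJECT rule builds can be computed
directly in plain C.
This is a sketch under stated assumptions: the function \fIcount_digrams\fR
and the table size of 256 are hypothetical, and the letter test assumes a
character set such as ASCII in which the lower case letters are contiguous.
.DS I
```c
#include <string.h>

#define NCHARS 256

static int digram[NCHARS][NCHARS];

/* Count every pair of adjacent lower case letters in text, including
   overlapping pairs, which is what the REJECT rule above achieves. */
void count_digrams(const char *text)
{
    size_t i;

    memset(digram, 0, sizeof digram);
    for (i = 0; text[i] != 0 && text[i + 1] != 0; i++) {
        unsigned char a = (unsigned char)text[i];
        unsigned char b = (unsigned char)text[i + 1];
        if (a >= 'a' && a <= 'z' && b >= 'a' && b <= 'z')
            digram[a][b]++;
    }
}
```
.DE
.LP
The loop advances one character at a time, so the word \fIthe\fR contributes
one count to each of \fIth\fR and \fIhe\fR.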
|
|
.NH
|
|
Lex Source Definitions.
|
|
.PP
|
|
Remember the format of the Lex
|
|
source:
|
|
.DS I
|
|
{definitions}
|
|
%%
|
|
{rules}
|
|
%%
|
|
{user routines}
|
|
.DE
|
|
.LP
|
|
So far only the rules have been described. The user needs
|
|
additional options,
|
|
though, to define variables for use in his program and for use
|
|
by Lex.
|
|
These can go either in the definitions section
|
|
or in the rules section.
|
|
.PP
|
|
Remember that Lex is turning the rules into a program.
|
|
Any source not intercepted by Lex is copied
|
|
into the generated program. There are three classes
|
|
of such things.
|
|
.IP 1)
|
|
Any line which begins with a blank or tab,
and which is not part of a Lex rule or action, is copied into
the Lex generated program.
|
|
Such source input prior to the first %% delimiter will be external
|
|
to any function in the code; if it appears immediately after the first
|
|
%%,
|
|
it appears in an appropriate place for declarations
|
|
in the function written by Lex which contains the actions.
|
|
This material must look like program fragments,
|
|
and should precede the first Lex rule.
|
|
.IP
|
|
As a side effect of the above, lines which begin with a blank
|
|
or tab, and which contain a comment,
|
|
are passed through to the generated program.
|
|
This can be used to include comments in either the Lex source or
|
|
the generated code. The comments should follow the host
|
|
language convention.
|
|
.IP 2)
|
|
Anything included between lines containing
|
|
only
|
|
.I %{
|
|
and
|
|
.I %}
|
|
is
|
|
copied out as above. The delimiters are discarded.
|
|
This format permits entering text like preprocessor statements that
|
|
must begin in column 1,
|
|
or copying lines that do not look like programs.
|
|
.IP 3)
|
|
Anything after the second %% delimiter, regardless of formats, etc.,
|
|
is copied out after the Lex output.
|
|
.PP
|
|
Definitions intended for Lex are given
|
|
before the first %% delimiter. Any line in this section
|
|
not contained between %{ and %}, and beginning
|
|
in column 1, is assumed to define Lex substitution strings.
|
|
The format of such lines is
|
|
.DS I
|
|
name translation
|
|
.DE
|
|
.LP
|
|
and it
|
|
causes the string given as a translation to
|
|
be associated with the name.
|
|
The name and translation
|
|
must be separated by at least one blank or tab, and the name must begin with
|
|
a letter or an underscore (`_').
|
|
The translation can then be called out
|
|
by the {name} syntax in a rule.
|
|
Using {D} for the digits and {E} for an exponent field,
|
|
for example, might abbreviate rules to recognize numbers:
|
|
.DS I
|
|
D [0\-9]
|
|
E [DEde][\-+]?{D}+
|
|
%%
|
|
{D}+ printf("integer");
|
|
{D}+"."{D}\(**({E})? |
|
|
{D}\(**"."{D}+({E})? |
|
|
{D}+{E} printf("real");
|
|
.DE
|
|
.LP
|
|
Note the first two rules for real numbers;
|
|
both require a decimal point and contain
|
|
an optional exponent field,
|
|
but the first requires at least one digit before the
|
|
decimal point and the second requires at least one
|
|
digit after the decimal point.
|
|
To correctly handle the problem
|
|
posed by a Fortran expression such as
|
|
.I 35.EQ.I ,
|
|
which does not contain a real number, a context-sensitive
|
|
rule such as
|
|
.DS I
|
|
[0\-9]+/"."EQ printf("integer");
|
|
.DE
|
|
.LP
|
|
could be used in addition to the normal rule for integers.
|
|
.PP
|
|
The definitions
|
|
section may also contain other commands, including the
|
|
selection of a host language, a character set table,
|
|
a list of start conditions, or adjustments to the default
|
|
size of arrays within Lex itself for larger source programs.
|
|
These possibilities
|
|
are discussed below under ``Summary of Source Format,''
|
|
section 12.
|
|
.NH
|
|
Usage.
|
|
.PP
|
|
There are two steps in
|
|
compiling a Lex source program.
|
|
First, the Lex source must be turned into a generated program
|
|
in the host general purpose language.
|
|
Then this program must be compiled and loaded, usually with
|
|
a library of Lex subroutines.
|
|
The generated program
|
|
is on a file named
|
|
.I lex.yy.c .
|
|
The I/O library is defined in terms of the C standard
|
|
library [6].
|
|
.PP
|
|
The C programs generated by Lex are slightly different
|
|
on OS/370, because the
|
|
OS compiler is less powerful than the UNIX or GCOS compilers,
|
|
and does less at compile time.
|
|
C programs generated on GCOS and UNIX are the same.
|
|
.PP
|
|
.I
|
|
UNIX.
|
|
.R
|
|
The library is accessed by the loader flag
|
|
.I \-ll .
|
|
So an appropriate
|
|
set of commands is
|
|
.DS I
|
|
lex source
|
|
cc lex.yy.c \-ll
|
|
.DE
|
|
.LP
|
|
The resulting program is placed on the usual file
|
|
.I
|
|
a.out
|
|
.R
|
|
for later execution.
|
|
To use Lex with Yacc see below.
|
|
Although the default Lex I/O routines use the C standard library,
|
|
the Lex automata themselves do not do so;
|
|
if private versions of
|
|
.I
|
|
input,
|
|
output
|
|
.R
|
|
and
|
|
.I unput
|
|
are given, the library can be avoided.
|
|
.PP
|
|
.NH
|
|
Lex and Yacc.
|
|
.PP
|
|
If you want to use Lex with Yacc, note that what Lex writes is a program
|
|
named
|
|
.I
|
|
yylex(),
|
|
.R
|
|
the name required by Yacc for its analyzer.
|
|
Normally, the default main program on the Lex library
|
|
calls this routine, but if Yacc is loaded, and its main
|
|
program is used, Yacc will call
|
|
.I
|
|
yylex().
|
|
.R
|
|
In this case each Lex rule should end with
|
|
.DS I
|
|
return(token);
|
|
.DE
|
|
.LP
|
|
where the appropriate token value is returned.
|
|
An easy way to get access
|
|
to Yacc's names for tokens is to
|
|
compile the Lex output file as part of
|
|
the Yacc output file by placing the line
|
|
.DS I
|
|
# include "lex.yy.c"
|
|
.DE
|
|
.LP
|
|
in the last section of Yacc input.
|
|
Supposing the grammar to be
|
|
named ``good'' and the lexical rules to be named ``better''
|
|
the UNIX command sequence can just be:
|
|
.DS I
|
|
yacc good
|
|
lex better
|
|
cc y.tab.c \-ly \-ll
|
|
.DE
|
|
.LP
|
|
The Yacc library (\-ly) should be loaded before the Lex library,
|
|
to obtain a main program which invokes the Yacc parser.
|
|
The Lex and Yacc programs can be generated in
either order.
|
|
.NH
|
|
Examples.
|
|
.PP
|
|
As a trivial problem, consider copying an input file while
|
|
adding 3 to every positive number divisible by 7.
|
|
Here is a suitable Lex source program:
|
|
.DS I
|
|
%%
|
|
    int k;
|
|
[0\-9]+ {
|
|
k = atoi(yytext);
|
|
if (k%7 == 0)
|
|
printf("%d", k+3);
|
|
else
|
|
printf("%d",k);
|
|
}
|
|
.DE
|
|
.LP
|
|
The rule [0\-9]+ recognizes strings of digits;
|
|
.I
|
|
atoi
|
|
.R
|
|
converts the digits to binary
|
|
and stores the result in
|
|
.ul
|
|
k.
|
|
The operator % (remainder) is used to check whether
|
|
.ul
|
|
k
|
|
is divisible by 7; if it is,
|
|
it is incremented by 3 as it is written out.
|
|
It may be objected that this program will alter such
|
|
input items as
|
|
.I 49.63
|
|
or
|
|
.I X7 .
|
|
Furthermore, it increments the absolute value
|
|
of all negative numbers divisible by 7.
|
|
To avoid this, just add a few more rules after the active one,
|
|
as here:
|
|
.DS I
|
|
%%
|
|
    int k;
|
|
\-?[0\-9]+ {
|
|
k = atoi(yytext);
|
|
printf("%d",
|
|
k%7 == 0 ? k+3 : k);
|
|
}
|
|
\-?[0\-9.]+ ECHO;
|
|
[A\-Za\-z][A\-Za\-z0\-9]+ ECHO;
|
|
.DE
|
|
.LP
|
|
Numerical strings containing
|
|
a ``.'' or preceded by a letter will be picked up by
|
|
one of the last two rules, and not changed.
|
|
The
|
|
.I if\-else
|
|
has been replaced by
|
|
a C conditional expression to save space;
|
|
the form
|
|
.ul
|
|
a?b:c
|
|
means ``if
|
|
.I a
|
|
then
|
|
.I b
|
|
else
|
|
.I c ''.
|
|
.PP
|
|
For an example of statistics gathering, here
|
|
is a program which histograms the lengths
|
|
of words, where a word is defined as a string of letters.
|
|
.DS
|
|
    int lengs[100];
|
|
%%
|
|
[a\-z]+ lengs[yyleng]++;
|
|
\&. |
|
|
\en ;
|
|
%%
|
|
int yywrap(void)
|
|
{
|
|
int i;
|
|
printf("Length No. words\en");
|
|
for(i=0; i<100; i++)
|
|
if (lengs[i] > 0)
|
|
printf("%5d%10d\en",i,lengs[i]);
|
|
return(1);
|
|
}
|
|
.DE
|
|
.LP
|
|
This program
|
|
accumulates the histogram, while producing no output. At the end
|
|
of the input it prints the table.
|
|
The final statement
|
|
.I
|
|
return(1);
|
|
.R
|
|
indicates that Lex is to perform wrapup. If
|
|
.I
|
|
yywrap
|
|
.R
|
|
returns zero (false)
|
|
it implies that further input is available
|
|
and the program is
|
|
to continue reading and processing.
|
|
Providing a
|
|
.I
|
|
yywrap
|
|
.R
|
|
that never
|
|
returns true causes an infinite loop.
|
|
.PP
|
|
As a larger example,
|
|
here are some parts of a program written by N. L. Schryer
|
|
to convert double precision Fortran to single precision Fortran.
|
|
Because Fortran does not distinguish upper and lower case letters,
|
|
this routine begins by defining a set of classes including
|
|
both cases of each letter:
|
|
.DS I
|
|
a [aA]
|
|
b [bB]
|
|
c [cC]
|
|
\&...
|
|
z [zZ]
|
|
.DE
|
|
.LP
|
|
An additional class recognizes white space:
|
|
.DS I
|
|
W [ \et]\(**
|
|
.DE
|
|
.LP
|
|
The first rule changes
|
|
``double precision'' to ``real'', or ``DOUBLE PRECISION'' to ``REAL''.
|
|
.DS I
|
|
{d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
|
|
printf(yytext[0]==\(fmd\(fm? "real" : "REAL");
|
|
}
|
|
.DE
|
|
.LP
|
|
Care is taken throughout this program to preserve the case
|
|
(upper or lower)
|
|
of the original program.
|
|
The conditional operator is used to
|
|
select the proper form of the keyword.
|
|
The next rule copies continuation card indications to
|
|
avoid confusing them with constants:
|
|
.DS I
|
|
^"     "[^ 0] ECHO;
|
|
.DE
|
|
.LP
|
|
In the regular expression, the quotes surround the
|
|
blanks.
|
|
It is interpreted as
|
|
``beginning of line, then five blanks, then
|
|
anything but blank or zero.''
|
|
Note the two different meanings of
|
|
.I ^ .
|
|
There follow some rules to change double precision
|
|
constants to ordinary floating constants.
|
|
.DS I
|
|
[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ |
|
|
[0\-9]+{W}"."{W}{d}{W}[+\-]?{W}[0\-9]+ |
|
|
"."{W}[0\-9]+{W}{d}{W}[+\-]?{W}[0\-9]+ {
|
|
/\(** convert constants \(**/
|
|
for(p=yytext; \(**p != 0; p++)
|
|
{
|
|
if (\(**p == \(fmd\(fm || \(**p == \(fmD\(fm)
|
|
\(**p += \(fme\(fm \- \(fmd\(fm;
|
|
}
ECHO;
|
|
}
|
|
.DE
|
|
.LP
|
|
After the floating point constant is recognized, it is
|
|
scanned by the
|
|
.ul
|
|
for
|
|
loop
|
|
to find the letter
|
|
.I d
|
|
or
|
|
.I D .
|
|
The program then adds
|
|
.I \(fme\(fm\-\(fmd\(fm ,
|
|
which converts
|
|
it to the next letter of the alphabet.
|
|
The modified constant, now single-precision,
|
|
is written out again.
|
|
There follows a series of names which must be respelled to remove
|
|
their initial \fId\fR.
|
|
By using the
|
|
array
|
|
.I
|
|
yytext
|
|
.R
|
|
the same action suffices for all the names (only a sample of
|
|
a rather long list is given here).
|
|
.DS I
|
|
{d}{s}{i}{n} |
|
|
{d}{c}{o}{s} |
|
|
{d}{s}{q}{r}{t} |
|
|
{d}{a}{t}{a}{n} |
|
|
\&...
|
|
{d}{f}{l}{o}{a}{t} printf("%s",yytext+1);
|
|
.DE
|
|
.LP
|
|
Another list of names must have initial \fId\fR changed to initial \fIa\fR:
|
|
.DS I
|
|
{d}{l}{o}{g} |
|
|
{d}{l}{o}{g}10 |
|
|
{d}{m}{i}{n}1 |
|
|
{d}{m}{a}{x}1 {
|
|
yytext[0] += \(fma\(fm \- \(fmd\(fm;
|
|
ECHO;
|
|
}
|
|
.DE
|
|
.LP
|
|
And one routine
|
|
must have initial \fId\fR changed to initial \fIr\fR:
|
|
.DS I
|
|
{d}1{m}{a}{c}{h} {yytext[0] += \(fmr\(fm \- \(fmd\(fm;
|
|
ECHO;
|
|
}
|
|
.DE
|
|
.LP
|
|
To avoid such names as \fIdsinx\fR being detected as instances
|
|
of \fIdsin\fR, some final rules pick up longer words as identifiers
|
|
and copy some surviving characters:
|
|
.DS I
|
|
[A\-Za\-z][A\-Za\-z0\-9]\(** |
|
|
[0\-9]+ |
|
|
\en |
|
|
\&. ECHO;
|
|
.DE
|
|
.LP
|
|
Note that this program is not complete; it
|
|
does not deal with the spacing problems in Fortran or
|
|
with the use of keywords as identifiers.
|
|
.br
|
|
.NH
|
|
Left Context Sensitivity.
|
|
.PP
|
|
Sometimes
|
|
it is desirable to have several sets of lexical rules
|
|
to be applied at different times in the input.
|
|
For example, a compiler preprocessor might distinguish
|
|
preprocessor statements and analyze them differently
|
|
from ordinary statements.
|
|
This requires
|
|
sensitivity
|
|
to prior context, and there are several ways of handling
|
|
such problems.
|
|
The \fI^\fR operator, for example, is a prior context operator,
|
|
recognizing immediately preceding left context just as \fI$\fR recognizes
|
|
immediately following right context.
|
|
Adjacent left context could be extended, to produce a facility similar to
|
|
that for adjacent right context, but it is unlikely
|
|
to be as useful, since often the relevant left context
|
|
appeared some time earlier, such as at the beginning of a line.
|
|
.PP
|
|
This section describes three means of dealing
|
|
with different environments: a simple use of flags,
|
|
when only a few rules change from one environment to another,
|
|
the use of
|
|
.I
|
|
start conditions
|
|
.R
|
|
on rules,
|
|
and the possibility of making multiple lexical analyzers all run
|
|
together.
|
|
In each case, there are rules which recognize the need to change the
|
|
environment in which the
|
|
following input text is analyzed, and set some parameter
|
|
to reflect the change. This may be a flag explicitly tested by
|
|
the user's action code; such a flag is the simplest way of dealing
|
|
with the problem, since Lex is not involved at all.
|
|
It may be more convenient,
|
|
however,
|
|
to have Lex remember the flags as initial conditions on the rules.
|
|
Any rule may be associated with a start condition. It will only
|
|
be recognized when Lex is in
|
|
that start condition.
|
|
The current start condition may be changed at any time.
|
|
Finally, if the sets of rules for the different environments
|
|
are very dissimilar,
|
|
clarity may be best achieved by writing several distinct lexical
|
|
analyzers, and switching from one to another as desired.
|
|
.PP
|
|
Consider the following problem: copy the input to the output,
|
|
changing the word \fImagic\fR to \fIfirst\fR on every line which began
|
|
with the letter \fIa\fR, changing \fImagic\fR to \fIsecond\fR on every line
|
|
which began with the letter \fIb\fR, and changing
|
|
\fImagic\fR to \fIthird\fR on every line which began
|
|
with the letter \fIc\fR. All other words and all other lines
|
|
are left unchanged.
|
|
.PP
|
|
These rules are so simple that the easiest way
|
|
to do this job is with a flag:
|
|
.DS
|
|
    int flag;
|
|
%%
|
|
^a {flag = \(fma\(fm; ECHO;}
|
|
^b {flag = \(fmb\(fm; ECHO;}
|
|
^c {flag = \(fmc\(fm; ECHO;}
|
|
\en {flag = 0 ; ECHO;}
|
|
magic {
|
|
switch (flag)
|
|
{
|
|
case \(fma\(fm: printf("first"); break;
|
|
case \(fmb\(fm: printf("second"); break;
|
|
case \(fmc\(fm: printf("third"); break;
|
|
default: ECHO; break;
|
|
}
|
|
}
|
|
.DE
|
|
.LP
|
|
This should be adequate.
|
|
.PP
|
|
To handle the same problem with start conditions, each
|
|
start condition must be introduced to Lex in the definitions section
|
|
with a line reading
|
|
.DS I
|
|
%s name1 name2 ...
|
|
.DE
|
|
.LP
|
|
or
|
|
.DS I
|
|
%x name1 name2 ...
|
|
.DE
|
|
.LP
|
|
where the conditions may be named in any order.
|
|
`%s' denotes \fIinclusive\fR start conditions and `%x' denotes
|
|
\fIexclusive\fR start conditions.
|
|
The conditions may be referenced at the
|
|
head of a rule with the <> brackets:
|
|
.DS I
|
|
<name1>expression
|
|
.DE
|
|
.LP
|
|
is a rule which is only recognized when Lex is in the
|
|
start condition \fIname1\fR.
|
|
To enter a start condition,
|
|
execute the action statement
|
|
.DS I
|
|
BEGIN name1;
|
|
.DE
|
|
.LP
|
|
which changes the start condition to \fIname1\fR.
|
|
Until the next BEGIN action is executed, rules with the given
|
|
start condition will be active and rules with other start conditions
|
|
will be inactive. If the start condition is inclusive, then
|
|
rules with no start conditions at all will also be active. If it is
|
|
exclusive, then only rules qualified with the start condition will be active.
|
|
.PP
|
|
To resume the normal state,
|
|
.DS I
|
|
BEGIN 0;
|
|
.DE
|
|
.LP
|
|
resets the initial condition
|
|
of the Lex automaton interpreter.
|
|
A rule may be active in several
|
|
start conditions:
|
|
.DS I
|
|
<name1,name2,name3>
|
|
.DE
|
|
.LP
|
|
is a legal prefix.
|
|
.PP
|
|
The same example as before can be written:
|
|
.DS I
|
|
%START AA BB CC
|
|
%%
|
|
^a {ECHO; BEGIN AA;}
|
|
^b {ECHO; BEGIN BB;}
|
|
^c {ECHO; BEGIN CC;}
|
|
\en {ECHO; BEGIN 0;}
|
|
<AA>magic printf("first");
|
|
<BB>magic printf("second");
|
|
<CC>magic printf("third");
|
|
.DE
|
|
.LP
|
|
where the logic is exactly the same as in the previous
|
|
method of handling the problem, but Lex does the work
|
|
rather than the user's code.
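.PP
What either version of the example computes can also be sketched as an
ordinary C function, with an integer standing in for the current start
condition.
The names \fItranslate\fR and \fIreplacement\fR are hypothetical, the newline
is written as its ASCII code, and nothing here is generated by Lex;
the sketch only mirrors the logic of the rules above.
.DS I
```c
#include <stddef.h>
#include <string.h>

enum { NEWLINE = 10 };   /* ASCII line feed, an assumption of this sketch */

/* Text that replaces "magic" in each state; state 0 leaves it unchanged. */
static const char *replacement(int state)
{
    switch (state) {
    case 'a': return "first";
    case 'b': return "second";
    case 'c': return "third";
    default:  return "magic";
    }
}

/* Copy in to out (at most outsize - 1 characters plus a terminator),
   rewriting "magic" according to the first letter of the current line. */
void translate(const char *in, char *out, size_t outsize)
{
    int state = 0;            /* plays the role of the current start condition */
    int at_line_start = 1;
    size_t o = 0;

    if (outsize == 0)
        return;
    while (*in != 0) {
        const char *piece = NULL;
        size_t take = 1, n;

        if (strncmp(in, "magic", 5) == 0) {
            piece = replacement(state);
            take = 5;
        } else if (at_line_start && (*in == 'a' || *in == 'b' || *in == 'c')) {
            state = *in;      /* like BEGIN AA, BB or CC */
        } else if (*in == NEWLINE) {
            state = 0;        /* like BEGIN 0 */
        }
        at_line_start = (in[take - 1] == NEWLINE);
        n = piece ? strlen(piece) : 1;
        if (o + n >= outsize)
            break;            /* output buffer full: stop copying */
        memcpy(out + o, piece ? piece : in, n);
        o += n;
        in += take;
    }
    out[o] = 0;
}
```
.DE
.LP
The variable \fIstate\fR is set at the start of a line and cleared at each
newline, exactly as the flag and the BEGIN actions are in the two Lex versions.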
|
|
.\" .NH
|
|
.\" Character Set.
|
|
.\" .PP
|
|
.\" The programs generated by Lex handle
|
|
.\" character I/O only through the routines
|
|
.\" .I
|
|
.\" input,
|
|
.\" output,
|
|
.\" .R
|
|
.\" and
|
|
.\" .I
|
|
.\" unput.
|
|
.\" .R
|
|
.\" Thus the character representation
|
|
.\" provided in these routines
|
|
.\" is accepted by Lex and employed to return
|
|
.\" values in
|
|
.\" .I
|
|
.\" yytext.
|
|
.\" .R
|
|
.\" For internal use
|
|
.\" a character is represented as a small integer
|
|
.\" which, if the standard library is used,
|
|
.\" has a value equal to the integer value of the bit
|
|
.\" pattern representing the character on the host computer.
|
|
.\" Normally, the letter
|
|
.\" .I a
|
|
.\" is represented as the same form as the character constant
|
|
.\" .I \(fma\(fm .
|
|
.\" If this interpretation is changed, by providing I/O
|
|
.\" routines which translate the characters,
|
|
.\" Lex must be told about
|
|
.\" it, by giving a translation table.
|
|
.\" This table must be in the definitions section,
|
|
.\" and must be bracketed by lines containing only
|
|
.\" ``%T''.
|
|
.\" The table contains lines of the form
|
|
.\" .DS I
|
|
.\" {integer} {character string}
|
|
.\" .DE
|
|
.\" .LP
|
|
.\" which indicate the value associated with each character.
|
|
.\" Thus the next example
|
|
.\" .DS I
|
|
.\" %T
|
|
.\" 1 Aa
|
|
.\" 2 Bb
|
|
.\" \&...
|
|
.\" 26 Zz
|
|
.\" 27 \en
|
|
.\" 28 +
|
|
.\" 29 \-
|
|
.\" 30 0
|
|
.\" 31 1
|
|
.\" \&...
|
|
.\" 39 9
|
|
.\" %T
|
|
.\" .DE
|
|
.\" .LP 1
|
|
.\" Sample character table.
|
|
.\" maps the lower and upper case letters together into the integers 1 through 26,
|
|
.\" newline into 27, + and \- into 28 and 29, and the
|
|
.\" digits into 30 through 39.
|
|
.\" Note the escape for newline.
|
|
.\" If a table is supplied, every character that is to appear either
|
|
.\" in the rules or in any valid input must be included
|
|
.\" in the table.
|
|
.\" No character
|
|
.\" may be assigned the number 0, and no character may be
|
|
.\" assigned a bigger number than the size of the hardware character set.
|
|
.NH
|
|
Summary of Source Format.
|
|
.PP
|
|
The general form of a Lex source file is:
|
|
.DS I
|
|
{definitions}
|
|
%%
|
|
{rules}
|
|
%%
|
|
{user subroutines}
|
|
.DE
|
|
.LP
|
|
The definitions section contains
|
|
a combination of
|
|
.IP 1)
|
|
Definitions, in the form ``name space translation''.
|
|
.IP 2)
|
|
Included code, in the form ``space code''.
|
|
.IP 3)
|
|
Included code, in the form
|
|
.DS I
|
|
%{
|
|
code
|
|
%}
|
|
.DE
|
|
.ns
|
|
.IP 4)
|
|
Start conditions, given in the form
|
|
.DS I
|
|
%s name1 name2 ...
|
|
.DE
|
|
.\" .ns
|
|
.\" .IP 5)
|
|
.\" Character set tables, in the form
|
|
.\" .DS I
|
|
.\" %T
|
|
.\" number space character-string
|
|
.\" \&...
|
|
.\" %T
|
|
.\" .DE
|
|
.\" .ns
|
|
.\" .IP 5)
|
|
.\" Changes to internal array sizes, in the form
|
|
.\" .DS I
|
|
.\" %\fIx\fR\0\0\fInnn\fR
|
|
.\" .DE
|
|
.\" .LP
|
|
.\" where \fInnn\fR is a decimal integer representing an array size
|
|
.\" and \fIx\fR selects the parameter as follows:
|
|
.\" .DS I
|
|
.\" Letter Parameter
|
|
.\" p positions
|
|
.\" n states
|
|
.\" e tree nodes
|
|
.\" a transitions
|
|
.\" k packed character classes
|
|
.\" o output array size
|
|
.\" .DE
|
|
.LP
|
|
Lines in the rules section have the form ``expression action''
|
|
where the action may be continued on succeeding
|
|
lines by using braces to delimit it.
|
|
.PP
|
|
Regular expressions in Lex use the following
|
|
operators:
|
|
.br
|
|
.DS I
|
|
x the character "x"
|
|
"x" an "x", even if x is an operator.
|
|
\ex an "x", even if x is an operator.
|
|
[xy] the character x or y.
|
|
[x\-z] the characters x, y or z.
|
|
[^x] any character but x.
|
|
\&. any character but newline.
|
|
^x an x at the beginning of a line.
|
|
<y>x an x when Lex is in start condition y.
|
|
x$ an x at the end of a line.
|
|
x? an optional x.
|
|
x\(** 0,1,2, ... instances of x.
|
|
x+ 1,2,3, ... instances of x.
|
|
x|y an x or a y.
|
|
(x) an x.
|
|
x/y an x but only if followed by y.
|
|
{xx} the translation of xx from the
|
|
definitions section.
|
|
x{m,n} \fIm\fR through \fIn\fR occurrences of x
|
|
.DE
|
|
.\" .NH
|
|
.\" Caveats and Bugs.
|
|
.\" .PP
|
|
.\" There are pathological expressions which
|
|
.\" produce exponential growth of the tables when
|
|
.\" converted to deterministic machines;
|
|
.\" fortunately, they are rare.
|
|
.\" .PP
|
|
.\" REJECT does not rescan the input; instead it remembers the results of the previous
|
|
.\" scan. This means that if a rule with trailing context is found, and
|
|
.\" REJECT executed, the user
|
|
.\" must not have used
|
|
.\" .ul
|
|
.\" unput
|
|
.\" to change the characters forthcoming
|
|
.\" from the input stream.
|
|
.\" This is the only restriction on the user's ability to manipulate
|
|
.\" the not-yet-processed input.
|
|
.\" .PP
|
|
.\" .NH
|
|
.\" Acknowledgments.
|
|
.\" .PP
|
|
.\" As should
|
|
.\" be obvious from the above, the outside of Lex
|
|
.\" is patterned
|
|
.\" on Yacc and the inside on Aho's string matching routines.
|
|
.\" Therefore, both S. C. Johnson and A. V. Aho
|
|
.\" are really originators
|
|
.\" of much of Lex,
|
|
.\" as well as debuggers of it.
|
|
.\" Many thanks are due to both.
|
|
.\" .PP
|
|
.\" The code of the current version of Lex was designed, written,
|
|
.\" and debugged by Eric Schmidt.
|
|
.\" .SG MH-1274-MEL-unix
|
|
.\" .sp 1
|
|
.\" .2C
|
|
.NH
|
|
References.
|
|
.sp 1v
|
|
.IP 1.
|
|
B. W. Kernighan and D. M. Ritchie,
|
|
.I
|
|
The C Programming Language,
|
|
.R
|
|
Prentice-Hall, N. J. (1978).
|
|
.IP 2.
|
|
B. W. Kernighan,
|
|
.I
|
|
Ratfor: A Preprocessor for a Rational Fortran,
|
|
.R
|
|
Software \- Practice and Experience,
|
|
\fB5\fR, pp. 395-406 (1975).
|
|
.IP 3.
|
|
S. C. Johnson,
|
|
.I
|
|
Yacc: Yet Another Compiler Compiler,
|
|
.R
|
|
Computing Science Technical Report No. 32,
|
|
1975,
|
|
.MH
|
|
.\" .if \n(tm (also TM 75-1273-6)
|
|
.IP 4.
|
|
A. V. Aho and M. J. Corasick,
|
|
.I
|
|
Efficient String Matching: An Aid to Bibliographic Search,
|
|
.R
|
|
Comm. ACM
|
|
.B
|
|
18,
|
|
.R
|
|
333-340 (1975).
|
|
.IP 5.
|
|
B. W. Kernighan, D. M. Ritchie and K. L. Thompson,
|
|
.I
|
|
QED Text Editor,
|
|
.R
|
|
Computing Science Technical Report No. 5,
|
|
1972,
|
|
.MH
|
|
.IP 6.
|
|
D. M. Ritchie,
|
|
private communication.
|
|
See also
|
|
M. E. Lesk,
|
|
.I
|
|
The Portable C Library,
|
|
.R
|
|
Computing Science Technical Report No. 31,
|
|
.MH
|
|
.\" .if \n(tm (also TM 75-1274-11)
|