For the last seven years our research group
has been working on tools to aid the maintainer of old code. Not
surprisingly, we have developed a keen interest in ways of making
our own code easier to understand and maintain. Recently we have
found that the technique of grammar based programming seems
to provide good, maintainable solutions to many of our programming
problems.
Programs built around grammars seem to have
several benefits that enhance maintainability. They are readable
since the grammar acts as a working formal specification of program
behavior. They are robust since the grammar eliminates
many logic and " I didn't think of that " errors. They
are flexible, and can be easily grown or extended as experience
accumulates. And finally they are portable, since they
use ASCII text and standard C/C++ programming techniques.
We have used grammar-based programming for
several applications that do not much resemble traditional compiler
construction tasks. One system generated test drivers for unit
testing of C++ object classes. Another was a querying system to
help maintainers extract information about large software systems.
Still others have taken output from existing code analysis tools
and translated it to work with our program understanding tools.
For the last three years we have been using
the
AnaGram grammar-based programming environment and we
have found that it greatly helps us in developing these applications
quickly. AnaGram is a PC multi-window system for writing, analyzing,
and debugging programs that use grammars. We have had several
good experiences with inexperienced students who produced correct,
readable code using AnaGram.
This report, which describes some of the benefits
of grammar-based programming and the AnaGram environment, was
originally prepared at the request of SERC affiliate Siemens Corporate
Research for the Siemens CASELAB system, which distributes information
on CASE tools and methods to Siemens' employees worldwide. The
views presented are exclusively those of the author.
This report may be cited as SERC-TR--76F, Software Engineering Research Center, University of Florida, CSE-301, Gainesville, FL 32611, July/94
When you say the word "parsing" to most programmers
the immediate word-association response is "compiler writing".
Computer Science students are usually introduced to grammars in
a compilers course that focuses on this one narrow application
of parsing tools. Since few programmers write compilers after
leaving university, we tend to think of parsing as a solution
to somebody else's problem.
But parsing technology can be applied in a
wide range of situations in which a program is controlled by a
sequence of inputs. Parser generators can thus be used to provide
clean, maintainable code to solve many programming problems.
A parser generator reads a syntax file containing
a grammar and associated actions specified for both
normal and error inputs. From these it creates a C or C++ source
program (Figure 1). This generated program, when compiled and
executed, will accept sequences of inputs that match the grammar
and carry out the corresponding actions. If the inputs do not
match the grammar, the program will instead perform the specified
error handling actions.
The main benefits of organizing a program
around a grammar are readability, robustness, flexibility
and portability. Together these make for more maintainable
code.
It is generally easy to read and understand a grammar based program
because the grammar explains most of the complicated logic the
reader might need to worry about. For example, the key part of
the AnaGram syntax file in Appendix I, is the following:
grammar
-> line?..., eof = writeUnidentifiedFileList();
line
-> trace line, newline
-> source line, newline
-> other line, newline
trace line
-> "!trace on lines
at ", location, not eol?...
location
-> my file:i, function spec, line spec:n = writeR2tLine(i,n);
-> routine entry point
-> meaningless number
-> unidentified location
This tells us immediately that the input (called grammar)
is expected to be a sequence of lines
and that each line
may be a trace line,
a source line,
or an other line.
Trace lines start with the string "!trace
on lines at " which is followed
by location
information. If the location
is composed of a file name, followed by a function name followed
by a line number then the writeR2tLine
function is called. Other cases, such as a routine
entry point are ignored.
While we may not understand immediately what
an "r2t" line is, the overall behavior of the program
is very clear. This program is scanning the input for trace lines
and writing something when it finds each one. The grammar provides
an excellent top down specification of what it expects as input
and what it will do when it gets it.
In a conventional program it can be very difficult to check that
all contingencies have been handled. What if there are extra blanks
in the input? What if the last input line is not complete? Special
cases such as these are often found during testing and handled
by patching the program's logic. After a few such patches the
program becomes incomprehensible.
In a grammar based program, the grammar defines
behavior for all such cases. The parser generator then
checks for consistency and warns the programmer of possible errors.
We have found that many of our logic errors are caught by the
parser generator and relatively few get through into testing of
the final program.
Grammar based programs can often be extended
easily to handle new cases. Often the addition of a couple of
alternatives to the grammar is sufficient to take care of a new
requirement. Such a change could be very complex to introduce
in a conventional program because so many of the program's states
could be affected. But the parser generator hides this
kind of complication and generates a new, correct, program from
the modified grammar.
This flexibility can be especially useful
in exploratory programming [WILD.94]. For example, the program
in Appendix I takes output from a debugger and translates it for
use in one of our projects. The exact form of the debugger's output
was known only from examples. As a first cut at the problem the
grammar in the appendix was written from one of these examples.
The error handling features of the parser generator can then be
used to systematically develop a complete and correct program.
The grammar contains the following production for error handling:
other line
-> error = wrError(PCB.line,
PCB.column);
The token error
covers "anything else not described by this grammar".
Any input line that does not match the grammar will produce a
call to the wrError
function that writes a message to standard error. The offending
input can then be tracked down and added to the grammar.
We have used this technique several times.
The result is a program that is easy to understand despite the
many revisions. The grammar acts as a formal but readable specification
at all stages of the process.
Grammar-based programs usually are designed
to process simple ASCII text as input and thus tend to be easily
portable. So far, AnaGram generated programs have worked with
all of the C or C++ compilers we have tried. Since it can be difficult
to predict all the environments a program may eventually face,
a conservative grammar-based design can reduce the life-cycle
cost of repeated porting.
In this section we will describe some of the
problems we have solved using grammar-based programming as part
of our tool-building work. There are, of course, many other situations
in which this technique can be useful.
To save development time, we try to make our
tools use existing tools as much as possible. However this frequently
forces us to write data transformation programs. For example the
syntax file in Appendix I generated a program to transform output
from the Digital Equipment Vax debugger for some experiments with
a program understanding tool.
Such transformation programs can be written
quickly using grammar-based programming and, as described, error
handling facilities of the parser generator allow systematic exploratory
programming without making a rat's nest of the program logic.
We have written several such programs as part
of our maintenance tool set, including one to extract information
from a C++ code analyzer and another to normalize Prolog facts
in a C code factbase.
Jon Bentley in "More Programming Pearls"
describes the technique of inventing "little languages"
to solve problems [BEN.88]. Instead of writing a program with
a large number of switches and parameters for the user to set,
the programmer defines a language in which the user can express
what needs to be done. Then a parser generator can be used to
produce the little language program to actually solve the user's
problem (Figure 3).
In one project we applied this method to define
a simple query language that software maintainers could use to
get information about the code they were maintaining [RICH.93].
The query program was written in AnaGram by an undergraduate student
and was one of his first C language projects. Yet the final code
was very easy to inspect and was almost completely error free.
In another project, we developed a tool to
help unit test C++ object classes. A little language was defined
to specify objects and test conditions. The generated testing
tool could read test specifications in the language and generate
a C++ test driver program that would create the objects and step
them through combinations of the specified conditions [WILD.93].
One reason that grammar-based programming
has been little applied is that parser generators were often very
clumsy to use. For the last three years we have been using beta
versions of a new grammar-based programming environment called
AnaGram that has greatly facilitated our work . AnaGram is not
just a parser generator, but as well provides a PC multi-window
environment for analyzing and debugging grammars. Figure 4 shows
the initial AnaGram screen.
You normally use your editor to create an
AnaGram syntax file, perhaps building on one of the sample files
included with the system. Then you enter the AnaGram environment
to analyze and debug your grammar, switching back to your editor
to make any changes. Finally you tell AnaGram to build the parser
and it generates a C/C++ source file for you to run through your
compiler.
One strength of AnaGram is the extensive help system. For example, if there are conflicts in your grammar you can call up a help screen that describes conflicts, explains the information in AnaGramís Conflicts window, and practically provides an on-line textbook on the subject of conflict handling (Figure 5). There are, literally, hundreds of help topics accessible either directly or from the Help Index window. We have found that this help facility greatly reduces the learning time for students starting out with grammar-based programming.
Another strong point is AnaGram's assistance
with grammar conflicts. Grammars can be ambiguous which means
that, in some state, there is an input that the grammar could
interpret in more than one way. Conflicts often represent "I
didn't think of that!" cases that can be dangerous. They
also occur in conventional programs but there they may go undetected.
In a grammar-based program, the parser generator can apply default
rules for handling conflicts, but the programmer is warned to
make sure the default is the right action to take. Good grammar-based
programming practice is to eliminate conflicts whenever possible.
Understanding a conflict is sometimes difficult since several
productions in the grammar may be involved. AnaGram offers several
tools to help, such as the Conflict Trace window. In the
example of Figure 6, the user has moved the cursor to one of the
items in the Conflicts window and has requested a trace.
AnaGram responds with a window showing that the grammar can get
into this conflict if the inputs are "!trace
on lines at", followed by an identifier,
followed by an identifier
start character. If the next input is a digit,
there would be two possible rules that could apply. With this
information you can find the offending sections of the grammar
and, usually, eliminate the conflict.
Perhaps the most useful feature of AnaGram
is its File Trace facility for debugging grammars. With
most conventional parser generators you cannot start debugging
until you build the C language parser and compile it. This takes
time. As well, the parser is often very difficult to debug since
it is machine generated and contains large, anonymous state transition
tables.
With AnaGram you can debug your program logic
by feeding data into your grammar without leaving the multi-window
environment. For example, in Figure 7, the user has asked for
a File Trace on the data file called DEBUG1.LOG.
The bottom window in the figure shows this data file; the highlighting
shows how far the user has progressed reading the data. The right
arrow key is used to read each additional token.
The top window shows the stack that the parser has created from
the data read so far. Thus we can see that we have identified
an initial line, and that on the second line the parser had identified
the keyword token "!trace
on lines at"
It is about to read the "RPN"
token.
As each token is read, the File Trace
shows how the stack grows and how productions are used to reduce
it. Most errors in logic can be very quickly identified. Most
important, there is no need to build the parser, compile, and
then work through a complicated debugging session.
AnaGram has some other advantages that are
illustrated in the example in Appendix I. With most earlier parser
generators, parsing had to be preceded by a lexical analysis step
that often required a separate tool. Lexical analysis scans the
input and breaks it up into tokens. It helps keep down the size
of parsing tables and does some "look-ahead" to avoid
conflicts in the grammar. However it requires the programmer to
use two tools instead of one, and to follow special conventions
to hook the lexical analyzer and the parser together.
AnaGram incorporates several techniques that
usually eliminate the need for a lexical analyzer. First, you
can specify useful sets of characters directly as part of the
grammar instead of leaving that to a lexical analyzer. For example
the syntax file in Appendix I has the following lexical definitions:
/* lexical definitions */
newline = '\n'
eof = -1 + 0 + ^Z
backslash = '\\'
percen = '%'
blank = ' '
digit = '0-9'
identifier start character = 'a-z'+'A-Z'+'_'+'$'+'@'
identifier follow character = identifier start character + digit
not percen = ~(percen + newline + eof)
not eol = ~(newline + eof)
Second, an AnaGram grammar can contain character strings that
are automatically treated as single keyword tokens by the parser.
AnaGram will do the "look ahead" needed to identify
these keywords and distinguish them from the rest of the input.
For example, the production for trace
line contains such a string:
trace line
-> "!trace on lines
at ", location, not eol?...
The main problem that we have encountered
with AnaGram has been that it only runs on PC's. Since a lot of
our work requires Unix, we have often had to develop the grammar
using an emulator or a separate machine. Fortunately the parsers
generated by AnaGram are portable and can compile and run on Unix
machines.
A second inconvenience is that the AnaGram
environment is function key driven. There are a large number of
different windows that can be used for different kinds of analysis
but it takes a little while to learn the function key combinations
to call up each one at the right time. The quick reference window
and the help system are a great aid here.
Finally, we should not give the impression
that grammar-based programming is completely trivial to learn.
While most computer science graduates will have seen all the relevant
theory in a compilers course, it takes some practical experience
to write good, clean grammars and to debug conflicts. But AnaGram
is certainly a better platform than most for getting that experience.
The author would be pleased to discuss grammar-based
programming with SERC affiliates. He may best be reached by electronic
mail at wilde@dcsuwf.dcsnod.uwf.edu.
AnaGram is available directly from:
Parsifal Software, P. O. Box 219
Wayland, Massachusetts 01778, U. S. A.
Tel (voice/fax): ++1-508-358-2564
or from the U.S. 800-879-2577
E-mail: jholland@world.std.com
URL:http://www.parsifalsoft.com
Enquiries about AnaGram can be directed to Jerry Holland at this address. Trial copies are available.
[BEN.88] Bentley, Jon, More Programming
Pearls: Confessions of a Coder, Chapter 9, Addison-Wesley,
Reading, MA, 1988.
[RICH.93] Richardson, Raymond and Wilde, Norman,
Applying Extensible Dependency Analysis: A Case Study of a
Heterogeneous System, report SERC-TR-62-F, Software Engineering
Research Center, CIS Department, University of Florida, Gainesville,
FL 32611, July 1993.
[WILD.93] Wilde, Norman "Testing Your
Objects", C Users Journal, Vol. 11, No. 5, May 1993.
[WILD.94] Wilde, Norman "Dealing With
Uncertain Inputs: Exploratory Software Engineering", C/C++
Users Journal, Vol. 12, No. 7, July 1994, pp. 25 - 39.
{
/* File: r2getvd.syn, By: N. Wilde, Nov. 14/93
Purpose: Get Vax Debugger input for RECON.
This is an AnaGram syntax file that generates r2getvd.c, which
converts a log file containing trace output from the VAX debugger
to a RECON "<xx>.r2t" trace file. To help
check consistency, r2getvd writes to standard output a list of
all the file names found in the log file that it could NOT
identify. If any log file line has an unexpected format, a message
is written to standard error.
Design: The Vax debugger log output contains, along with much else,
lines like the following:
!trace on lines at RPN\getop\%LINE 577+11
that indicate that line 577 of function getop in file RPN.C has
been executed. This input line should produce a r2t file line with:
004 577T
where 004 is the file index of file RPN.C. The mapping between file
indexes and file names is in the "my file" productions in this file.
R2getvd reads the debugger log file. When it finds a line of
the form shown above it looks up the corresponding file index and
writes the corresponding r2t file line. If the line is of the form
!trace on lines at XXXX ...
but XXXX is not one of the known file names, then XXXX is stored in
a list and written out at the end of processing in the list of
unidentified files.
*/
}
grammar
-> line?..., eof = writeUnidentifiedFileList();
line
-> trace line, newline
-> source line, newline
-> other line, newline
trace line
-> "!trace on lines at ", location, not eol?...
location
-> my file:i, function spec, line spec:n = writeR2tLine(i,n);
-> routine entry point
-> meaningless number
-> unidentified location
(int) my file
-> "RPN" = 004;
/* ADD other abbreviated file names here */
function spec
-> not percen...
(int) line spec
->"%LINE ", number:n = n;
routine entry point
-> "routine"
unidentified location
-> identifier = storeUnidentifiedFile(release());
meaningless number
-> number
source line
-> "! ", blank?..., number, ":", not eol?...
other line
-> error = wrError(PCB.line, PCB.column);
(int) number
-> digit:d = d - '0';
-> number:n, digit:d = 10*n + (d -'0');
identifier
-> identifier start character:c = collectFirst(c);
-> identifier, identifier follow character:c = collect(c);
/* lexical definitions */
newline = '\n'
eof = -1 + 0 + ^Z
backslash = '\\'
percen = '%'
blank = ' '
digit = '0-9'
identifier start character = 'a-z'+'A-Z'+'_'+'$'+'@'
identifier follow character = identifier start character + digit
not percen = ~(percen + newline + eof)
not eol = ~(newline + eof)
/* C Support Code */
{
}
/* ----------- main program ------------ */
void main(int argc, char * argv[]) {
int i;
char logFilePath[100];
char r2tFilePath[100];
for (i= 1; i < argc; i++){
if(0 == strncmp( argv[i], "-L", 2)){
strncpy(logFilePath, argv[i] + 2, 100);
if (NULL == (logFile = fopen(logFilePath, "r"))) {
fprintf(stderr, "R2GETVD - couldn't open log file %s\n", logFilePath);
return;
}
}
if(0 == strncmp( argv[i], "-R", 2)) {
strncpy(r2tFilePath, argv[i] + 2, 100);
if (NULL == (r2tFile = fopen(r2tFilePath, "w"))) {
printf(stderr, "R2GETVD - couldn't open r2t file %s\n", r2tFilePath);
return;
}
}
}
if ((NULL == r2tFile) || (NULL == logFile)) {
fprintf(stderr, "R2GETVD ABORTING\n");
return;
}
/* Write comment line of the r2t file */
fprintf(r2tFile, "From Vax Log file %s\n", logFilePath);
/* Call the parser to do the rest */
r2getvd();
/* Close all files and quit */
fclose(logFile);
fclose(r2tFile);
}