More Maintainable Code with Grammars and the AnaGram Programming Environment

Norman Wilde

University of West Florida

Pensacola, Florida 32514, USA

tel. 904-474-2548; fax. 904-474-3023

e-mail: wilde@dcsuwf.dcsnod.uwf.edu

Executive Summary

For the last seven years our research group has been working on tools to aid the maintainer of old code. Not surprisingly, we have developed a keen interest in ways of making our own code easier to understand and maintain. Recently we have found that the technique of grammar based programming seems to provide good, maintainable solutions to many of our programming problems.

Programs built around grammars seem to have several benefits that enhance maintainability. They are readable since the grammar acts as a working formal specification of program behavior. They are robust since the grammar eliminates many logic and " I didn't think of that " errors. They are flexible, and can be easily grown or extended as experience accumulates. And finally they are portable, since they use ASCII text and standard C/C++ programming techniques.

We have used grammar-based programming for several applications that do not much resemble traditional compiler construction tasks. One system generated test drivers for unit testing of C++ object classes. Another was a querying system to help maintainers extract information about large software systems. Still others have taken output from existing code analysis tools and translated it to work with our program understanding tools.

For the last three years we have been using the AnaGram grammar-based programming environment and we have found that it greatly helps us in developing these applications quickly. AnaGram is a PC multi-window system for writing, analyzing, and debugging programs that use grammars. We have had several good experiences with inexperienced students who produced correct, readable code using AnaGram.

This report, which describes some of the benefits of grammar-based programming and the AnaGram environment, was originally prepared at the request of SERC affiliate Siemens Corporate Research for the Siemens CASELAB system, which distributes information on CASE tools and methods to Siemens' employees worldwide. The views presented are exclusively those of the author.

This report may be cited as SERC-TR--76F, Software Engineering Research Center, University of Florida, CSE-301, Gainesville, FL 32611, July/94


1. Grammar-Based Programming

When you say the word "parsing" to most programmers the immediate word-association response is "compiler writing". Computer Science students are usually introduced to grammars in a compilers course that focuses on this one narrow application of parsing tools. Since few programmers write compilers after leaving university, we tend to think of parsing as a solution to somebody else's problem.

But parsing technology can be applied in a wide range of situations in which a program is controlled by a sequence of inputs. Parser generators can thus be used to provide clean, maintainable code to solve many programming problems.

A parser generator reads a syntax file containing a grammar and associated actions specified for both normal and error inputs. From these it creates a C or C++ source program (Figure 1). This generated program, when compiled and executed, will accept sequences of inputs that match the grammar and carry out the corresponding actions. If the inputs do not match the grammar, the program will instead perform the specified error handling actions.


Figure 1 - How a Parser Generator Works

2. Maintainability Benefits of Grammar-Based Programming

The main benefits of organizing a program around a grammar are readability, robustness, flexibility and portability. Together these make for more maintainable code.

Readability:

It is generally easy to read and understand a grammar based program because the grammar explains most of the complicated logic the reader might need to worry about. For example, the key part of the AnaGram syntax file in Appendix I, is the following:


grammar

-> line?..., eof = writeUnidentifiedFileList();

line

-> trace line, newline

-> source line, newline

-> other line, newline

trace line

-> "!trace on lines at ", location, not eol?...

location

-> my file:i, function spec, line spec:n = writeR2tLine(i,n);

-> routine entry point

-> meaningless number

-> unidentified location


This tells us immediately that the input (called grammar) is expected to be a sequence of lines and that each line may be a trace line, a source line, or an other line. Trace lines start with the string "!trace on lines at " which is followed by location information. If the location is composed of a file name, followed by a function name followed by a line number then the writeR2tLine function is called. Other cases, such as a routine entry point are ignored.

While we may not understand immediately what an "r2t" line is, the overall behavior of the program is very clear. This program is scanning the input for trace lines and writing something when it finds each one. The grammar provides an excellent top down specification of what it expects as input and what it will do when it gets it.

Robustness:

In a conventional program it can be very difficult to check that all contingencies have been handled. What if there are extra blanks in the input? What if the last input line is not complete? Special cases such as these are often found during testing and handled by patching the program's logic. After a few such patches the program becomes incomprehensible.

In a grammar based program, the grammar defines behavior for all such cases. The parser generator then checks for consistency and warns the programmer of possible errors. We have found that many of our logic errors are caught by the parser generator and relatively few get through into testing of the final program.

Flexibility:

Grammar based programs can often be extended easily to handle new cases. Often the addition of a couple of alternatives to the grammar is sufficient to take care of a new requirement. Such a change could be very complex to introduce in a conventional program because so many of the program's states could be affected. But the parser generator hides this kind of complication and generates a new, correct, program from the modified grammar.

This flexibility can be especially useful in exploratory programming [WILD.94]. For example, the program in Appendix I takes output from a debugger and translates it for use in one of our projects. The exact form of the debugger's output was known only from examples. As a first cut at the problem the grammar in the appendix was written from one of these examples. The error handling features of the parser generator can then be used to systematically develop a complete and correct program.

The grammar contains the following production for error handling:

other line

-> error = wrError(PCB.line, PCB.column);

The token error covers "anything else not described by this grammar". Any input line that does not match the grammar will produce a call to the wrError function that writes a message to standard error. The offending input can then be tracked down and added to the grammar.

We have used this technique several times. The result is a program that is easy to understand despite the many revisions. The grammar acts as a formal but readable specification at all stages of the process.

Portability:

Grammar-based programs usually are designed to process simple ASCII text as input and thus tend to be easily portable. So far, AnaGram generated programs have worked with all of the C or C++ compilers we have tried. Since it can be difficult to predict all the environments a program may eventually face, a conservative grammar-based design can reduce the life-cycle cost of repeated porting.

3. Some Applications

In this section we will describe some of the problems we have solved using grammar-based programming as part of our tool-building work. There are, of course, many other situations in which this technique can be useful.

3.1 Data Transformation Problems

To save development time, we try to make our tools use existing tools as much as possible. However this frequently forces us to write data transformation programs. For example the syntax file in Appendix I generated a program to transform output from the Digital Equipment Vax debugger for some experiments with a program understanding tool.


Figure 2 - A Simple Data Transformation Problem

Such transformation programs can be written quickly using grammar-based programming and, as described, error handling facilities of the parser generator allow systematic exploratory programming without making a rat's nest of the program logic.

We have written several such programs as part of our maintenance tool set, including one to extract information from a C++ code analyzer and another to normalize Prolog facts in a C code factbase.

3.2 Program Querying and Unit Testing of C++ Objects

Jon Bentley in "More Programming Pearls" describes the technique of inventing "little languages" to solve problems [BEN.88]. Instead of writing a program with a large number of switches and parameters for the user to set, the programmer defines a language in which the user can express what needs to be done. Then a parser generator can be used to produce the little language program to actually solve the user's problem (Figure 3).

In one project we applied this method to define a simple query language that software maintainers could use to get information about the code they were maintaining [RICH.93]. The query program was written in AnaGram by an undergraduate student and was one of his first C language projects. Yet the final code was very easy to inspect and was almost completely error free.


Figure 3 - Programming Using a "Little Language"

In another project, we developed a tool to help unit test C++ object classes. A little language was defined to specify objects and test conditions. The generated testing tool could read test specifications in the language and generate a C++ test driver program that would create the objects and step them through combinations of the specified conditions [WILD.93].

4. The AnaGram Environment for Grammar-Based Programming

4.1 AnaGram Advantages

One reason that grammar-based programming has been little applied is that parser generators were often very clumsy to use. For the last three years we have been using beta versions of a new grammar-based programming environment called AnaGram that has greatly facilitated our work . AnaGram is not just a parser generator, but as well provides a PC multi-window environment for analyzing and debugging grammars. Figure 4 shows the initial AnaGram screen.


Figure 4 - The Initial AnaGram Screen

You normally use your editor to create an AnaGram syntax file, perhaps building on one of the sample files included with the system. Then you enter the AnaGram environment to analyze and debug your grammar, switching back to your editor to make any changes. Finally you tell AnaGram to build the parser and it generates a C/C++ source file for you to run through your compiler.

One strength of AnaGram is the extensive help system. For example, if there are conflicts in your grammar you can call up a help screen that describes conflicts, explains the information in AnaGramís Conflicts window, and practically provides an on-line textbook on the subject of conflict handling (Figure 5). There are, literally, hundreds of help topics accessible either directly or from the Help Index window. We have found that this help facility greatly reduces the learning time for students starting out with grammar-based programming.


Figure 5 - An AnaGram Help Window

Another strong point is AnaGram's assistance with grammar conflicts. Grammars can be ambiguous which means that, in some state, there is an input that the grammar could interpret in more than one way. Conflicts often represent "I didn't think of that!" cases that can be dangerous. They also occur in conventional programs but there they may go undetected. In a grammar-based program, the parser generator can apply default rules for handling conflicts, but the programmer is warned to make sure the default is the right action to take. Good grammar-based programming practice is to eliminate conflicts whenever possible.

Understanding a conflict is sometimes difficult since several productions in the grammar may be involved. AnaGram offers several tools to help, such as the Conflict Trace window. In the example of Figure 6, the user has moved the cursor to one of the items in the Conflicts window and has requested a trace. AnaGram responds with a window showing that the grammar can get into this conflict if the inputs are "!trace on lines at", followed by an identifier, followed by an identifier start character. If the next input is a digit, there would be two possible rules that could apply. With this information you can find the offending sections of the grammar and, usually, eliminate the conflict.


Figure 6 - Pinpointing a Grammar Conflict

Perhaps the most useful feature of AnaGram is its File Trace facility for debugging grammars. With most conventional parser generators you cannot start debugging until you build the C language parser and compile it. This takes time. As well, the parser is often very difficult to debug since it is machine generated and contains large, anonymous state transition tables.

With AnaGram you can debug your program logic by feeding data into your grammar without leaving the multi-window environment. For example, in Figure 7, the user has asked for a File Trace on the data file called DEBUG1.LOG. The bottom window in the figure shows this data file; the highlighting shows how far the user has progressed reading the data. The right arrow key is used to read each additional token.

The top window shows the stack that the parser has created from the data read so far. Thus we can see that we have identified an initial line, and that on the second line the parser had identified the keyword token "!trace on lines at" It is about to read the "RPN" token.

As each token is read, the File Trace shows how the stack grows and how productions are used to reduce it. Most errors in logic can be very quickly identified. Most important, there is no need to build the parser, compile, and then work through a complicated debugging session.


Figure 7 - An AnaGram File Trace

AnaGram has some other advantages that are illustrated in the example in Appendix I. With most earlier parser generators, parsing had to be preceded by a lexical analysis step that often required a separate tool. Lexical analysis scans the input and breaks it up into tokens. It helps keep down the size of parsing tables and does some "look-ahead" to avoid conflicts in the grammar. However it requires the programmer to use two tools instead of one, and to follow special conventions to hook the lexical analyzer and the parser together.

AnaGram incorporates several techniques that usually eliminate the need for a lexical analyzer. First, you can specify useful sets of characters directly as part of the grammar instead of leaving that to a lexical analyzer. For example the syntax file in Appendix I has the following lexical definitions:


/* lexical definitions */

newline = '\n'

eof = -1 + 0 + ^Z

backslash = '\\'

percen = '%'

blank = ' '

digit = '0-9'

identifier start character = 'a-z'+'A-Z'+'_'+'$'+'@'

identifier follow character = identifier start character + digit

not percen = ~(percen + newline + eof)

not eol = ~(newline + eof)


Second, an AnaGram grammar can contain character strings that are automatically treated as single keyword tokens by the parser. AnaGram will do the "look ahead" needed to identify these keywords and distinguish them from the rest of the input. For example, the production for trace line contains such a string:

trace line

-> "!trace on lines at ", location, not eol?...

4.2 AnaGram Disadvantages

The main problem that we have encountered with AnaGram has been that it only runs on PC's. Since a lot of our work requires Unix, we have often had to develop the grammar using an emulator or a separate machine. Fortunately the parsers generated by AnaGram are portable and can compile and run on Unix machines.

A second inconvenience is that the AnaGram environment is function key driven. There are a large number of different windows that can be used for different kinds of analysis but it takes a little while to learn the function key combinations to call up each one at the right time. The quick reference window and the help system are a great aid here.

Finally, we should not give the impression that grammar-based programming is completely trivial to learn. While most computer science graduates will have seen all the relevant theory in a compilers course, it takes some practical experience to write good, clean grammars and to debug conflicts. But AnaGram is certainly a better platform than most for getting that experience.

5. For Further Information

The author would be pleased to discuss grammar-based programming with SERC affiliates. He may best be reached by electronic mail at wilde@dcsuwf.dcsnod.uwf.edu.

AnaGram is available directly from:

Parsifal Software, P. O. Box 219

Wayland, Massachusetts 01778, U. S. A.

Tel (voice/fax): ++1-508-358-2564

or from the U.S. 800-879-2577

E-mail: jholland@world.std.com

URL:http://www.parsifalsoft.com

Enquiries about AnaGram can be directed to Jerry Holland at this address. Trial copies are available.

Bibliography

[BEN.88] Bentley, Jon, More Programming Pearls: Confessions of a Coder, Chapter 9, Addison-Wesley, Reading, MA, 1988.

[RICH.93] Richardson, Raymond and Wilde, Norman, Applying Extensible Dependency Analysis: A Case Study of a Heterogeneous System, report SERC-TR-62-F, Software Engineering Research Center, CIS Department, University of Florida, Gainesville, FL 32611, July 1993.

[WILD.93] Wilde, Norman "Testing Your Objects", C Users Journal, Vol. 11, No. 5, May 1993.

[WILD.94] Wilde, Norman "Dealing With Uncertain Inputs: Exploratory Software Engineering", C/C++ Users Journal, Vol. 12, No. 7, July 1994, pp. 25 - 39.

Appendix I

A Sample AnaGram Syntax File (Simplified)
{
/* File: r2getvd.syn, By: N. Wilde, Nov. 14/93
Purpose: Get Vax Debugger input for RECON.
This is an AnaGram syntax file that generates r2getvd.c, which
converts a log file containing trace output from the VAX debugger
to a RECON "<xx>.r2t" trace file. To help
check consistency, r2getvd writes to standard output a list of
all the file names found in the log file that it could NOT
identify. If any log file line has an unexpected format, a message
is written to standard error.

Design: The Vax debugger log output contains, along with much else,
lines like the following:
!trace on lines at RPN\getop\%LINE 577+11
that indicate that line 577 of function getop in file RPN.C has
been executed. This input line should produce a r2t file line with:
   004 577T
where 004 is the file index of file RPN.C. The mapping between file
indexes and file names is in the "my file" productions in this file. 

R2getvd reads the debugger log file. When it finds a line of
the form shown above it looks up the corresponding file index and
writes the corresponding r2t file line. If the line is of the form
!trace on lines at XXXX ...
but XXXX is not one of the known file names, then XXXX is stored in
a list and written out at the end of processing in the list of
unidentified files.
*/
}
grammar
-> line?..., eof = writeUnidentifiedFileList();

line
-> trace line, newline
-> source line, newline
-> other line, newline

trace line
-> "!trace on lines at ", location, not eol?...


location
-> my file:i, function spec, line spec:n = writeR2tLine(i,n);
-> routine entry point
-> meaningless number
-> unidentified location

(int) my file
-> "RPN" = 004;
/* ADD other abbreviated file names here */

function spec
-> not percen... 

(int) line spec
->"%LINE ", number:n = n;

routine entry point
-> "routine" 

unidentified location
-> identifier = storeUnidentifiedFile(release());

meaningless number
-> number

source line
-> "! ", blank?..., number, ":", not eol?...

other line
-> error = wrError(PCB.line, PCB.column);

(int) number
-> digit:d = d - '0';
-> number:n, digit:d = 10*n + (d -'0');

identifier
-> identifier start character:c = collectFirst(c);
-> identifier, identifier follow character:c = collect(c);

/* lexical definitions */
newline = '\n'
eof = -1 + 0 + ^Z
backslash = '\\'
percen = '%'
blank = ' '
digit = '0-9'
identifier start character = 'a-z'+'A-Z'+'_'+'$'+'@'
identifier follow character = identifier start character + digit
not percen = ~(percen + newline + eof)
not eol = ~(newline + eof)

/* C Support Code */
{
}
/* ----------- main program ------------ */
void main(int argc, char * argv[]) {
  int i;
  char logFilePath[100];
  char r2tFilePath[100];
  for (i= 1; i < argc; i++){
    if(0 == strncmp( argv[i], "-L", 2)){
      strncpy(logFilePath, argv[i] + 2, 100);
      if (NULL == (logFile = fopen(logFilePath, "r"))) {
        fprintf(stderr, "R2GETVD - couldn't open log file %s\n", logFilePath);
        return;
      }
    }
    if(0 == strncmp( argv[i], "-R", 2)) {
      strncpy(r2tFilePath, argv[i] + 2, 100);
      if (NULL == (r2tFile = fopen(r2tFilePath, "w"))) {
        printf(stderr, "R2GETVD - couldn't open r2t file %s\n", r2tFilePath);
        return;
      }
    }
  }
  if ((NULL == r2tFile) || (NULL == logFile)) {
    fprintf(stderr, "R2GETVD ABORTING\n");
    return;
  }
    /* Write comment line of the r2t file */
    fprintf(r2tFile, "From Vax Log file %s\n", logFilePath);
    /* Call the parser to do the rest */
    r2getvd();
    /* Close all files and quit */
    fclose(logFile);
    fclose(r2tFile);
  }