Test case generators
Stéphane Bortzmeyer's
Eusthatius
is written in Haskell.
Parser generators
Lowell D. Thomas of Coast to Coast Research wrote
APG,
the ABNF parser generator. It produces a C++ class,
given a grammar.
Tanaka Akira has a
converter for ABNF
to regexp in Ruby - for those instances where that's
possible.
Verifiers
There's Bill Fenner's
ABNF checker
(for cut-and-pasted grammar) and
Chris Newman's
abnf.c, a widely used validator (here's
a fork on GitHub).
Extractors
Bill Fenner's
ABNF
extractor, a small Perl script that
tries to guess which part of an Internet Draft
or RFC is the ABNF.
[Example boxes: Test case #1, Test case #2, Test case #3, Test case #4, Reality.]
That should leave you with a file
abnfgen-0.21.tar.gz in a directory
you control, for example /tmp.
Decompress, untar, configure, and make:
$ gunzip < abnfgen-0.21.tar.gz | tar xf -
$ cd abnfgen-0.21
$ ./configure
creating cache ./config.cache
checking for a BSD compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking whether make sets ${MAKE}... yes
checking for working aclocal... found
checking for working autoconf... found
checking for working automake... found
checking for working autoheader... found
checking for working makeinfo... missing
checking for gcc... gcc
checking whether the C compiler (gcc ) works... yes
checking whether the C compiler (gcc ) is a cross-compiler... no
checking whether we are using GNU C... yes
checking whether gcc accepts -g... yes
checking for a BSD compatible install... /usr/bin/install -c
updating cache ./config.cache
creating ./config.status
creating Makefile
$ make
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c abnfgen.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c argv.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c check.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c coverage.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c expression.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c hash.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c input.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c mem.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c nonterminal.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c output.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c renderchar.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c scan.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c symbol.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c toutf8.c
gcc -g -O2 -o abnfgen abnfgen.o argv.o check.o coverage.o expression.o hash.o input.o mem.o nonterminal.o output.o renderchar.o scan.o symbol.o toutf8.o
The grammar of ABNF itself is defined
in RFC 4234. (That used to be 2234, but
the authors published a new version in October
2005.) The text of that RFC is available
from many servers, among them http://www.ietf.org/rfc/rfc4234.txt
at the IETF.
In RFC 4234, the ABNF grammar
is in section 4,
"ABNF DEFINITION
OF ABNF". Save that section in
a file (let's call it "section-4.txt")
and remove the page headers
from the text (the three lines
starting with "Crocker & Overell").
That'll get you something that looks
like this: section-4.txt
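For orientation, the first few rules of
that section look like this:

	rulelist       =  1*( rule / (*c-wsp c-nl) )

	rule           =  rulename defined-as elements c-nl
	                       ; continues if next line starts
	                       ;  with white space

	rulename       =  ALPHA *(ALPHA / DIGIT / "-")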
$ ./abnfgen section-4.txt
;
;
;
k =/ <>/ <>
So, as I'm writing this tutorial,
abnfgen generated some empty comments and
a rule adding two prose alternatives to a
symbol named "k".
If you're following
along in real life, your results may vary.
It's fairly common for the documents generated
by abnfgen to be small - the alternatives
that do nothing and the alternatives that
do something in a grammar usually start
out with about even weight.
It's also common for documents generated
by abnfgen to be unfit for display on
a terminal - many grammars allow control
characters that one doesn't see much
in practice.
Now that the basic grammar is working,
let's generate 100 test documents.
$ ./abnfgen -d demo -n 100 section-4.txt
The -d option specifies the name of a directory.
Abnfgen creates it if it doesn't exist yet.
In the directory, there should now exist
files demo/0001.tst, demo/0002.tst, and so on
up to demo/0100.tst, each with a random
ABNF grammar. Deciphering the meaning of
these test files is an exercise in itself;
it takes a while to recognize
as an ABNF grammar rather than line noise.
The grammars are syntactically
correct, but will have lots of semantic
errors - missing definitions, definitions
for nonterminals that aren't used, and so on.
Given such grammars as input, abnfgen
will rarely manage to generate any output.
So, for a simple test,
all we need to do is run abnfgen with these
grammars as input and see
whether it crashes.
$ for i in demo/*
> do
> ./abnfgen $i > /dev/null 2>&1
> done
Segmentation fault
Segmentation fault
Well, that's what version 0.1 looked like -
since version 0.2, abnfgen no longer crashes very much
on its own input - most of the time.
(Two of the randomly generated grammars happened
to have an infinite recursion in them; and abnfgen
overflowed its stack following them.)
This is a general-purpose generator based on context-free grammars.
It can do all kinds of generative things that you'd
nowadays probably use AI for. You can make it
generate HTML pages that suggest random startup ideas,
write themed emoji landscapes, or send people
spam or malware that looks just a little different
each time.
I tend to use it to stress-test parsers I've written;
a day of work getting the grammar just right
usually produces about a week's worth of painful
crashes and bugs to find.
The first set of bugs are found even without running
abnfgen at all, just by trying to write a formal
grammar for a piece of software.
That forces me to think through corner cases
I've never bothered with before, and I'll
invariably find that both my documentation
and my understanding are incomplete.
One difference between what people usually think of
as a grammar and abnfgen's input is the absence of
white space in the grammar -- the teaching grammars
usually take a tokenization process for granted that
abnfgen doesn't know about. So you'll probably define
an optional and non-optional white space nonterminal
and sprinkle those wherever they need to go.
If your language has comments, don't forget to include
those in the white space.
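A minimal sketch of what those rules might
look like - the names ws, ows, and comment
are illustrative, not from any particular grammar:

	ws      = 1*( SP / HTAB / CRLF / comment ) ; required white space
	ows     = [ws]                             ; optional white space
	comment = ';' *(%x20-7E) CRLF              ; adapt to your language's
	                                           ; comment syntax

Sprinkle ows between tokens wherever white
space is merely allowed, and ws where it's
required to separate two tokens.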
Once you have a grammar, you'll try to produce
a large set of files very quickly. If that doesn't
go quickly, your rules probably tend towards
producing bigger and bigger output. It's easy to
write rules like that -- if you have a couple of
expansions like
atom = 'x' / '(' expression ')'
expression = atom / expression '::' expression
you're going to get, on average, more expressions out
than you put in, and that means the grammar expansion
never terminates.
The way to fix this is to bias your rules
towards shorter output:
atom = {10}'x' / '(' expression ')'
expression = {10}atom / expression '::' expression
The second set of bugs are found by running the
parser over a large number of inputs very
quickly. Those are not going to be valid inputs;
any parse that produces a normal error message
is fine.
I usually redirect the parser output or log stream
en masse and then quickly filter for runs that crash
or produce some sort of stacktrace markedly
different from the normal error messages.
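A sketch of that triage as a shell loop,
assuming the test files live in a directory
cases/ and that "parse" stands in for the
parser under test (here just a grep, purely
as a placeholder):

```shell
# "parse" is a hypothetical stand-in for the real parser.
parse() { grep -q '^[A-Za-z]' "$1"; }

mkdir -p cases
printf 'good\n' > cases/0001.tst
printf '!!!\n'  > cases/0002.tst

crashed=0
for f in cases/*.tst; do
    parse "$f" > /dev/null 2>&1
    status=$?
    # Exit codes above 128 conventionally mean the process
    # died on a signal - a crash, not a normal parse error.
    if [ "$status" -gt 128 ]; then
        echo "crashed: $f"
        crashed=$((crashed + 1))
    fi
done
```

A plain nonzero exit is a normal parse error
and stays quiet; only signal deaths get flagged.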
To find the third set of bugs, start biasing the
parser towards reusing the names of symbols.
If you have a rule for producing an identifier name, say:
ident = ALPHA *(ALPHA / DIGIT / '_')
go ahead, throw a couple of familiar names in there
and have them pop up repeatedly:
ident = 'foo' / 'bar' / ALPHA *(ALPHA / DIGIT / '_')
This can catch a lot of very intricate cases where things
need to line up just so.
The final set of bugs are found by occasionally adding
complete noise in place of the syntactically correct grammar.
Take most of your major nonterminals and add a low chance of
noise:
ident = {10}(ALPHA *(ALPHA / DIGIT / '_')) / noise
and then write a rule for noise that just emits whatever you
can stomach -- any byte, or any printable character, or a
symbol-and-punctuation soup particularly geared towards
interfering with what the system normally interprets as signal.
(The white space rules are also great places to add a low
chance of noise.)
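One possible shape for such a noise rule -
the names, weights, and character choices
here are purely illustrative:

	noise    = 1*16( {5}punct / any-byte )
	punct    = '<' / '>' / '"' / '&' / '%' / '=' ; structure characters
	                                             ; of the target format
	any-byte = %x00-FF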
It doesn't have to make sense; all you want is for the
parser to survive. Or to crash resoundingly, leaving you happy
to know that you found another bug through better test tools.
The handling of recursion depth is still broken.
The system knows how to stop going deeper, but once it's shallow enough,
it doesn't know to stay out of trouble -- so in a grammar that
is biased towards going deeper, the expansion hovers
around exactly the allowed recursion depth and just keeps going
forward if the grammar allows it.
The "full coverage" option is too easily satisfied when each
production branch has been visited once. It might be helpful
to strive for balance in context; basically, remember not just
whether this branch has been visited, but what context it was
visited in.
It would be useful to limit the size of the output rather than
just the recursion depth. (Both are needed -- you could recurse
indefinitely and never print anything, and that would be bad,
too.)
The chance annotations probably don't quite work everywhere.
Part of the problem is that in an expression like foo / {12}*bar,
it's ambiguous whether the choice implied by * or by / is being
annotated. I should probably make the space significant, so
that {12} * annotates the / and {12}* annotates
the *, but that hasn't happened yet.