abnfgen

In memoriam Peter Naur
October 25 1928 - January 3 2016
and John Backus
December 3 1924 - March 17 2007

What is it?

Given an Augmented Backus-Naur form (ABNF) grammar (Wikipedia, RFC 4234), abnfgen quickly generates lots of random documents that match that grammar.

$ cat GRAMMAR
ring = 1*12("ding" SP) "dong" CRLF
$ ./abnfgen GRAMMAR
DInG DiNg doNg

That's useful for

  • verifying an ABNF grammar's syntactical correctness
  • illustrating a grammar with non-obvious examples
  • testing software that is supposed to consume documents described by a grammar

The tool is in alpha and, while copyrighted, free for any use.

Download

abnfgen-0.21.tar.gz is the current version. You'll need a C compiler and a Unix-like environment to build it.

Documentation

Unix-style man page for abnfgen(1).
For a quickreference, run abnfgen -h.

Differences to abnf

The input grammar accepted by abnfgen differs from ABNF in three ways:

'abc'   In ABNF, single quotes have no special meaning. With abnfgen input grammars, 'abc' means abc exactly as written, without upper/lower case options. I like case-insensitive protocols as much as the next person, but if I want to generate some spcific text, and I can't write it any other way than as hex, decimal, or binary numbers, the results will be unpleasant to read, write, and maintain.
(Some years after I wrote the preceding paragraph, the standards community has added a way of spelling case-insensitive literals using the syntax of %s"abc". If you want to stay conforming, use that and mention that you require support for RFC 7405.)

{N}   In ABNF, curly braces have no special meaning, either. With abnfgen, a number in curly braces in front of an alternative controls the chance that that branch is taken. The default value is 1. So, with coin = {2} heads / tails, the toss will come out "heads" in two cases out of three. This works for repetition branches, too; the branch to the left of the * is the shorter one, to the right, the longer. (If you're trying to make longer strings, you may also want to increase the maximum recursion depth using the -y runtime option.)

UTF-8   Numeric character codes (e.g. %d13) are interpreted as Unicode and emitted as UTF-8. (ABNF is wisely silent on its character encoding.) In retrospect, that's a bit of a hack. It'll come out the same for 7-bit ASCII, and make things easier if what you're trying to generate is UTF-8; otherwise, complain to me.

The "legal" command-line flag, -l, turns all of this off.

Author & Homepage

Jutta Degener <jutta@pobox.com> wrote abnfgen and maintains it.
The tool's homepage is at http://www.quut.com/abnfgen/.

Recent Changes

The release was last updated June 3rd, 2023

0.21   Stop ignoring {chance} annotations for repetitions.

0.20   Parenthesize macro arguments in expressions. (Yikes!)

0.19   Forgot to increment the version number for 0.18.

0.18   Fix the usage message. Thanks to Tony R. for the improved version.

0.17   Add support for RFC 7405 (%s"foo"); new option -7 turns it off. Thanks to Martin Bjorklund for sending me a patch!

0.16   New option -l ("legal") turns off extensions.

0.15   Handle EOF in '', "", and <> as an error, rather than looping forever.

0.14   Inside [..] or 0*N(..), abnfgen didn't always check resolvability prior to expansion.

0.13   Randomly expand [foo] as either foo or nothing. Up to version 0.12, abnfgen always picked foo if there was room for it.

0.12   Don't let "%d12-%d14" through - the correct syntax is "%d12-14".

0.11   Don't allow "_" in identifier names unless the new "-_" option is specified.

0.10   Don't treat any single count in front of a rule as if it were 1.
Document -u ("don't understand prose") option.

0.9   ABNF doesn't allow spaces in N*Mexpr; I did. Tighten the restrictions to make it easier for ABNF authors that want to check compliance.
Pre-load the core syntax for ALPHA BIT CHAR CR CRLF CTL DIGIT DQUOTE HEXDIG HTAB LF LWSP OCTET SP VCHAR WSP

0.8  Don't go into infinite loops when expanding grammars with recursive rules.

Other ABNF Software

Test case generators
Stéphane Bortzmeyer's Eusthatius is written in Haskell.

Parser generators
Lowell D. Thomas of Coast to Coast Research wrote APG, the ABNF parser generator. It produces a C++ class, given a grammar.
Tanaka Akira has a converter for ABNF to regexp in Ruby - for those instances where that's possible.

Verifiers
There's Bill Fenner's ABNF checker (for cut-and-pasted grammar) and
Chris Newman's abnf.c, a widely used validator (here's a fork on github).

Extractors
Bill Fenner's ABNF extractor, a small perl script that tries to guess which part of an Internet Draft or RFC is the ABNF.

Test case #1.
Test case #2.
Test case #3.
Test case #4.
Reality.

Demo: applying abnfgen to itself.

1. Download abnfgen.

That should leave you with a file abnfgen-0.21.tar.gz in a directory you control, for example /tmp.

2. Compile a local copy.

Decompress, untar, configure, and make:
$ gunzip < abnfgen-0.21.tar.gz | tar xf -
$ cd abnfgen-0.21
$ ./configure
creating cache ./config.cache
checking for a BSD compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking whether make sets ${MAKE}... yes
checking for working aclocal... found
checking for working autoconf... found
checking for working automake... found
checking for working autoheader... found
checking for working makeinfo... missing
checking for gcc... gcc
checking whether the C compiler (gcc ) works... yes
checking whether the C compiler (gcc ) is a cross-compiler... no
checking whether we are using GNU C... yes
checking whether gcc accepts -g... yes
checking for a BSD compatible install... /usr/bin/install -c
updating cache ./config.cache
creating ./config.status
creating Makefile
$ make
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c abnfgen.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c argv.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c check.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c coverage.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c expression.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c hash.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c input.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c mem.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c nonterminal.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c output.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c renderchar.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c scan.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c symbol.c
gcc -DPACKAGE=\"abnfgen\" -DVERSION=\"0.21\" -I. -I. -g -O2 -c toutf8.c
gcc -g -O2 -o abnfgen abnfgen.o argv.o check.o coverage.o expression.o hash.o input.o mem.o nonterminal.o output.o renderchar.o scan.o symbol.o toutf8.o

3. Find the document that defines your grammar.

The grammar of ABNF itself is defined in RFC 4234. (That used to be 2234, but the authors published a new version in October 2005.) The text of that RFC is available from many servers, among them http://www.ietf.org/rfc/rfc4234.txt at the IETF.

4. Extract an ABNF grammar from the document.

In RFC 4234, the ABNF grammar is in section 4, "ABNF DEFINITION OF ABNF". Save that section in a file (let's call it "section-4.txt") and remove the page headers from the text (the three lines starting with "Crocker & Overell").
That'll get you something that looks like this: section-4.txt

5. Run abnfgen on the grammar.

$ ./abnfgen section-4.txt
;
;

    ;
  k =/ <>/ <>

So, as I'm writing this tutorial, abnfgen generated some empty comments and a rule adding two prose alternatives to a symbol named "k". If you're following along in real life, your case may vary. It's fairly common for the documents generated by abnfgen to be small - the alternatives that do nothing and the alternatives that do something in a grammar usually start out with about even weight.

It's also common for documents generated by abnfgen to be unfit for display on a terminal - many grammars allow control characters that one doesn't see much in practice.

5b ... lots of times.

Now that the basic grammar is working, let's generate 100 test documents.

$ ./abnfgen -d demo -n 100 section-4.txt

The -d option specifies the name of a directory. Abnfgen creates it if it doesn't exist yet. In the directory, there should now exist files demo/0001.tst, demo/0002.tst, and so on up to demo/0100.tst, each with a random ABNF grammar. Deciphering the meaning of these test files is an exercise in itself; it takes a while to recognize

=/8*" ";
604%D29-09
;f

as an ABNF grammar rather than line noise.

6. The test: run abnfgen on the grammars it generated.

The grammars are syntactically correct, but will have lots of semantic errors - missing definitions, definitions for nonterminals that aren't used, and so on. Abnfgen will rarely create output for them. So, for a simple test, all we need to do is run abnfgen with these grammars as input and see whether it crashes.

$ for i in demo/*
> do
> ./abnfgen $i > /dev/null 2>&1
> done
Segmentation fault
Segmentation fault

Well, that's what version 0.1 looked like - since version 0.2, abnfgen no longer crashes very much on its own input - most of the time.

(Two of the randomly generated grammars happened to have an infinite recursion in them; and abnfgen overflowed its stack following them.)

7. How to use this

This is a context-free grammar based general-purpose generator. It can do all kinds of generative things that you'd nowadays probably use AI for. You can make it generate HTML pages that suggest random startup ideas, write themed emoji landscapes, or send people spam or malware that looks just a little different each time.

I tend to use it to stress-test parsers I've written, and they ususally produce about a week's worth of painful crashes and bugs to find for about a day of work getting the grammar just right.

The first set of bugs are found even without running abnfgen at all, just by trying to write a formal grammar for a piece of software. That forces me to think through corner cases I've never bothered with before, and I'll invariably find that both my documentation and my understanding are incomplete.

One difference between what people usually think of as a grammar and abnfgen's input is the absence of white space in the grammar -- the teaching grammars usually take a tokenization process for granted that abnfgen doesn't know about. So you'll probably define an optional and non-optional white space nonterminal and sprinkle those wherever they need to go. If your language has comments, don't forget to include those in the white space.

Once you have a grammar, you'll try to produce a large set of files very quickly. If that doesn't go quickly, your rules probably tend towards producing bigger and bigger output. It's easy to write rules like that -- if you have a couple of expansions like

atom = 'x' / '(' expression ')'
expression = atom / expression '::' expression

you're going to on average get more expressions out than you put in, and that means the grammar expansion never terminates.

The way to fix this is to bias your rules towards shorter output:

atom = {10}'x' / '(' expression ')'
expression = {10}atom / expression '::' expression

The second set of bugs are found by running the parser over a large number of inputs very quickly. Those are not going to be valid inputs; any parse that produces a normal error message is fine. I usually redirect the parser output or log stream en masse and then quickly filter for runs that crash or produce some sort of stacktrace markedly different from the normal error messages.

To find the third set of bugs, start biasing the parser towards reusing the names of symbols. If you have a rule for producing an identfier name, say:

ident = ALPHA *(ALPHA / DIGIT / '_')

go ahead, throw a couple of familiar names in there and have them pop up repeatedly

ident = 'foo' / 'bar' / ALPHA *(ALPHA / DIGIT / '_')

This can catch a lot of very intricate cases where things need to line up just so.

The final set of bugs are found by occasionally adding complete noise in place of the syntactically correct grammar. Take most of your major nonterminals and add a low chance of noise:

ident = {10}(ALPHA *(ALPHA / DIGIT / '_')) / noise

and then write a rule for noise that just emits whatever you can stomach -- any byte, or any printable character, or a symbol-and-punctuation soup particularly geared towards interfering with what the system normally interprets as signal. (The white space rules are also great places to add a low chance of noise.)

noise = 1*10%x00-7f

It doesn't have to make sense, all you want it for the parser to survive. Or to resoundingly crash, leaving you happy to know that you found another bug through better test tools.

8. Areas for improvement

The handling of recursion depth is still broken. The system knows how to stop going deeper, but once it's shallow enough, it doesn't know to stay out of trouble -- so in a grammar that is biased towards going deeper, the system expansion hovers around exactly the allowed recursion depth and just keeps going forward if the grammar allows it.

The "full coverage" option is too easily satisfied when each production branch has been visited once. It might be helpful to strive for balance in context; basically, remember not just whether this branch has been visited, but what context it was visited in.

It would be useful to limit the size of the output rather than just the recursion depth. (Both are needed -- you could recurse indefinitely and never print anything, and that would be bad, too.)

The chance annotations probably don't quite work everywhere. Part of the problem is that in

foo = x / {12}*y

it's ambiguous whether the choice implied by * or by / is being annotated. I should probably make the space significant, so that {12} * annotates the / and {12}* annotates the *, but that hasn't happened yet.