HTML is no TML

HTML is unsuitable for marking up plain English text.  This may surprise those who think that most of the documents on the World-Wide Web contain English text and are marked up using HTML.  But the overwhelming majority of these texts, even - especially - when presented with the most sophisticated bit-mapped client program, are full of typesetting errors.  These errors make reading painful for anyone who still occasionally opens a book.

This essay is intended to turn a numb and vague pain of ``this looks bad'' into a sharp and stinging pain of ``this should be a hyphen,'' or ``there should be a little more white space here.''

I have no formal typographical education.  I have a year's worth of experience with attempting to mark up documents without making them worse than the ASCII copy I had received.  I think I'm almost there, now, but it was tricky, and I often find myself forced to specify layout where I would much rather specify structure.

None of the following points apply to or criticize SGML, the much maligned general markup language that HTML instantiates.

Spacing between sentences

Some people like extra space at the end of a sentence.  To them, text without this extra space looks crowded; that's why, even in pure character documents, writers often use two space characters between sentences. (Others don't, and think this is a mannerism.)

Much of this spacing can be inferred (and has been inferred by typesetting systems such as troff and TeX for years).  But if this inference takes place, there needs to be a way of marking exceptions; e.g., places where a dot followed by white space and an upper-case letter does not constitute a full stop.
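
The inference and its exceptions can be sketched in a few lines of Python.  This is a minimal illustration, not how troff or TeX actually do it: the exception list, the function name and the two-space gap are all assumptions made for the example.

```python
import re

# Hypothetical exception list: words whose trailing dot does not end a
# sentence even when white space and an upper-case letter follow.
ABBREVIATIONS = {"Dr.", "Mr.", "Mrs.", "e.g.", "i.e.", "etc.", "vs."}

# Candidate boundary: a word ending in ., ! or ?, then white space,
# then (looking ahead) an upper-case letter.
_BOUNDARY = re.compile(r"(\S+[.!?])\s+(?=[A-Z])")


def space_sentences(text, gap="  "):
    """Put `gap` after each inferred sentence end, one space after exceptions."""
    def fix(match):
        word = match.group(1)
        if word in ABBREVIATIONS:
            return word + " "   # marked exception: ordinary word space
        return word + gap       # inferred full stop: wider gap
    return _BOUNDARY.sub(fix, text)


print(space_sentences("I met Dr. Smith. He left."))
```

The exception list is the part that has to come from the author or the markup; without it, the dot after ``Dr'' would be treated as a full stop.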

HTML has no support for sentences; they don't exist as an object or a concept in the document.  If, at this late point, WWW client programs were to begin inferring sentence boundaries from white space, the HTML text would not contain the necessary exceptions and would still look wrong occasionally.

Unless the author explicitly requests French spacing, I used to emulate proper spacing using sequences of an ISO 8859-1 non-breaking space (&#160;) that I heard about on USENET a few years ago, and a typewritten (and hence, accidentally, wider) space (<tt> </tt>).  They happen to print as two spaces in character-based browsers and when cutting-and-pasting text from my favorite bitmapped client, and they happen to be almost the right size when printed on a bit-mapped terminal.  A normal double-space, achievable as `` &#160; '' on my bit-mapped client, would have been even closer, but prints as three spaces on the character-based frontend.  Using inlined images with an alt="  " text of two spaces is out of the question, since cutting and pasting the displayed text from a bitmapped browser will capture neither the alt text nor (of course) the intended white space.
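
As a rough sketch, the emulation can be applied mechanically to a plain-text manuscript that already uses two spaces between sentences.  The function name is hypothetical, and the exact order of the two pieces (non-breaking space first, typewritten space second) is an assumption.

```python
import html

# Emulated sentence gap: a non-breaking space followed by a space set in
# typewriter type.  The order of the two is assumed for this sketch.
SENTENCE_GAP = "&#160;<tt> </tt>"


def emulate_sentence_gaps(text):
    """Escape plain text for HTML, then turn each double space into the gap."""
    escaped = html.escape(text, quote=False)  # escapes &, < and > only
    return escaped.replace("  ", SENTENCE_GAP)


print(emulate_sentence_gaps("One.  Two & three."))
```

Escaping comes first so that the markup inserted for the gap is not itself escaped.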

The HTML+ specification included an &emsp; entity reference that would have remedied the problem.  There is no &emsp; in the HTML 2.0 specification that supposedly grew out of the HTML+ specification.

The HTML+ specification included an &nbsp; entity reference to a non-breaking space.  The reference was never implemented in the common `Mosaic' browser, and consequently rarely occurs outside of ``lists of entity references in HTML.''  &nbsp; does occur frequently in the Arena reference documents on HTML 3.0.

Hyphens and dashes

The default character you see on screen for a typed ASCII minus sign (-) is an en-dash.  En-dashes are comparatively rare in written English text; they occur in dates and times, in ranges, and in mathematical expressions.  The characters in hyphenated words are, well, hyphens, and are written as the character reference &#173; in HTML.  A character that occurs more often is the em-dash, which has no representation in HTML and cannot be emulated using ``inlined'' images either, since an image doesn't show up when cutting and pasting text.  (I emulate it using an en-dash surrounded by spaces.)

The HTML+ specification included &mdash; and &ndash; entity references that would have freed the conventional hyphen to be rendered as a ``real'' hyphen.  There is no &mdash; or &ndash; in the HTML 2.0 specification that supposedly grew out of the HTML+ specification.
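
With those two entity references, a converter could have been very small.  The sketch below assumes, purely for illustration, a manuscript typed with the TeX convention of ``--'' for an en-dash and ``---'' for an em-dash; the function name is made up.

```python
def dashes_to_entities(text):
    """Map TeX-style dashes to the HTML+ entity references."""
    # Replace the longer sequence first, so that "---" is not read as
    # "--" followed by a stray "-".
    return text.replace("---", "&mdash;").replace("--", "&ndash;")


print(dashes_to_entities("pages 3--5 --- roughly"))
```

A single minus sign is left alone, which is what would let a client render it as a plain hyphen.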

Small spaces

A half space stands after the first dot in ``e.g.'' and ``i.e.'', and between a sum of money and its currency.  A quarter space is occasionally used to separate triplets in long numbers.  The space around type-written example words should set them off from the surrounding text without separating them completely.  The dots in an ellipsis (``...'') require a small amount of space between them as well.  These rules are old and well-known; the conventions for typesetting the spaces in character text (you omit them) are simple.  Yet I cannot specify them in HTML.
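
To make the triplet rule concrete, here is a small sketch.  It assumes a character set that offers a narrow space at all; Unicode's THIN SPACE (U+2009) stands in for the quarter space, which is an approximation chosen for the example, and the function name is hypothetical.

```python
THIN_SPACE = "\u2009"  # assumption: stands in for the quarter space


def group_triplets(number, sep=THIN_SPACE):
    """Format an integer with `sep` between groups of three digits."""
    # Let Python group with commas, then swap in the separator we want.
    return f"{number:,}".replace(",", sep)


print(group_triplets(1234567, sep=" "))
```

In character text the convention is to omit the space, which corresponds to calling the function with an empty separator.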

Quotes

I cannot specify proper (distinct) leading and closing double quotes in HTML.  The approximations, `` and '', print as two single quotes both on bitmapped clients and in ASCII, where they should be mapped to less obtrusive double-quote characters (").

Unicode

If all you want to do is specify a layout, the problems so far could have been remedied by using Unicode as a basic character set, rather than ISO 8859-1.  Unicode has all the spaces and characters you'll ever need for linear text.  I don't know why it didn't get used, or whether it ever was seriously discussed.
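
For reference, the following sketch lists some Unicode characters that correspond to the spaces, dashes and quotes discussed above.  The selection and the names `TYPOGRAPHIC` and `describe` are mine; the character names are looked up with Python's standard unicodedata module rather than typed in by hand.

```python
import unicodedata

# A hand-picked selection, keyed by the lower-cased Unicode character name.
TYPOGRAPHIC = {
    "em space": "\u2003",
    "thin space": "\u2009",
    "no-break space": "\u00a0",
    "hyphen": "\u2010",
    "en dash": "\u2013",
    "em dash": "\u2014",
    "left double quotation mark": "\u201c",
    "right double quotation mark": "\u201d",
}


def describe(chars=TYPOGRAPHIC):
    """Return one 'U+XXXX NAME' line per character in the table."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in chars.values()]


for line in describe():
    print(line)
```

Whether a given client and font can display all of these is a separate question that the sketch does not address.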


Last edit by jutta@cs.tu-berlin.de, February 13, 1995.