"Character references" explained

There is quite a lot of confusion around "character references" such as &#228; or &auml; in HTML. In particular, the topic is presented in a misleading way in the HTML 4 specification. This document tries to clear these things up, by explaining them "as simply as possible, but not simpler". I'm afraid I have partly violated the latter rule.

There are two different things: character references and entity references. They are often called "numeric character references" and "symbolic character references", but this is confusing: it lumps different things together just because they are used for somewhat similar purposes.

&#228; is a character reference

A notation such as &#228; is a character reference. It denotes the character which occupies code position 228 (decimal) in the character code established as a reference. In this context, character code means a fixed mapping from a set of integers to a set of characters. In HTML, the code is fixed to Unicode. In Unicode, positions 0 through 255 are the same as in ISO-8859-1 alias ISO Latin 1, which in turn contains Ascii as a subset in positions 0 through 127.
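
For example, the following line of markup (a constructed example) uses a character reference, and a conforming browser renders it as "Käse is the German word for cheese":
<p>K&#228;se is the German word for cheese.</p>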

Originally, in HTML 2.0, the "document character set" (which is what determines how character references are interpreted) was defined as ISO 8859-1 (or ECMA 94, to be very pedantic), with the clearly expressed aim that when the "document character set" was expanded in future versions, it would agree with ISO 10646. And this is what has happened. In HTML 4.01, the "document character set" consists of the first 17 planes of ISO 10646. This standard has been completely harmonized with the Unicode standard, with the agreement that only the first 17 planes will ever be used. Unicode is the more commonly known name.

In addition to decimal (base 10) notation, hexadecimal (base 16) notation can be used, but then the letter x is needed before the number. For example, &#xe4; is equivalent to &#228;, since the number 228 is e4 in hexadecimal. It would often be natural to use hexadecimal notation, since such notation is commonly used in character code standards, such as ISO 10646. However, character references with hexadecimal notation are less widely supported than character references with decimal notation.
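
For example, the following constructed line of markup should display the same character, ä, twice:
<p>decimal &#228; and hexadecimal &#xe4;</p>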

For example, consulting the Unicode standard, we find that &#228; denotes the character "latin small letter a with diaeresis", or, informally, "a with two dots". This does not depend on the encoding of the document (such as Ascii, ISO-8859-1, KOI8-R, or Mac encoding), or on anything else. However, some Web browsers may get these things wrong. Most probably not for &#228; but for some other character references. Usually &#n; works well for n less than 256, but there are problems for larger values of n. Those problems can be very important, but they are outside the scope of this description.

Note: In Unicode, positions 128 through 159 have been assigned to control codes. Technically, something like &#128; is not invalid (so a validator will not report it as an error) but it is undefined (so browsers may do whatever they like with it, and there is no definition for what is correct behavior). The so-called Windows characters (like smart quotes and em and en dashes) have been assigned other positions in Unicode, not those corresponding to their positions in the Windows character code.
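
For example, the em dash occupies position 151 in the Windows character code but position 8212 (2014 in hexadecimal) in Unicode; thus, in the following constructed line, the first reference is correct and the second is undefined:
<p>an em dash: &#8212; not this: &#151;</p>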

A character reference could be called a "numeric character reference", if desired. But when that term is used to suggest that there are some other kinds of character references in HTML as well, it is misleading. (In SGML, there are "named character references" such as &#SPACE;, but they are a different issue, and they cannot be used in HTML.)

&auml; is an entity reference

An entity reference like &auml; has a meaning that has specifically been assigned to it in HTML. Generally, such a reference expands to an expression that has been defined for the entity. This mechanism is part of SGML and very powerful, but HTML happens to use it just in very simple ways. Each of the entities defined in the HTML specifications has an expansion that is a character reference, such as &#228;. So in practical terms, you might view &auml; as a "predefined symbolic name" for &#228;. That's a bit similar to defining symbolic names for constants in programming.
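
Thus, the following two constructed lines of markup are equivalent in practice; a browser renders both as the letter ä:
<p>&auml;</p>
<p>&#228;</p>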

The fact that in HTML both character references and entity references begin with the ampersand symbol is coincidental, or rather a pragmatic choice, and does not reflect any deep connection. Arjun Ray has explained, in a Usenet posting, the confusion between character references and entity references as follows, after an explanation of a confusion around the symbol "<":

A similar jump to a wrong conclusion is with entity references and character references, thanks to ERO ("Entity Reference Open") and CRO ("Character Reference Open") sharing the same first character in the RCS bindings (respectively, "&" and "&#").

The reason for the RCS having such overlaps is to minimize the number of ordinary characters that would need to be escaped in some fashion to prevent their interpretation as the start of some markup sequence. It was a reasonable economization. Only it presumed that users would have a clue to what they were doing.

It is a common misconception to think that entity references are more universal than character references. A notation like &auml; looks more system-independent than &#228;. But it isn't. It is just a symbolic name for the same thing, so to speak.

Terminating semicolon recommended

It is almost always a good idea to terminate a character reference and an entity reference with a semicolon. According to certain rules, the semicolon can be omitted in some cases, but it is never incorrect to use it. And in quite a few cases, it is required. Basically, if the next character is a letter, the semicolon is needed. If you wrote &aumlh then a correctly behaving browser would treat it as a single (and currently undefined) entity reference, not as the reference &auml followed by the letter h.

In XHTML (and in XML in general), the semicolon is always required. Reference: XML specification, section 4.1 Character and Entity References.
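
For example, of the two constructed lines below, the first is well-formed XHTML and the second is not, since the semicolon is missing:
<p>&#169; 2003</p>
<p>&#169 2003</p>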

Moreover, Internet Explorer 7 has a bug in parsing entity and character references: without a semicolon, some of them are not recognized but displayed literally, even in contexts where the semicolon is optional. The bug is possibly an intentional "feature": XHTML restrictions are imposed, even on non-XHTML documents and even though IE 7 does not really understand XHTML! The following table illustrates the scope of the bug.

Handling of semicolon-less references on IE 7
Type of reference   Example    Character denoted   IE 7 behavior
Latin 1 entity      &eacute    é                   Correct
Other entity        &mdash     —                   Wrong, shown literally
Decimal             &#8212     —                   Correct
Hexadecimal         &#x2014    —                   Wrong, shown literally

On the other hand, especially when URLs with query parts, such as
http://www.server.example/cgi-bin/x.pl?foo=bar&copy=42
occur in HTML documents, you need to take into account that an HTML parser (when behaving by the specifications) will recognize anything that starts with & and a letter as an entity reference, even if it is not terminated by a semicolon. In the example case, &copy is taken as an entity reference that denotes &#169;, which in turn denotes the copyright sign ©, which is not allowed in a URL, but I digress. To prevent this, replace that & by &amp;, which is an entity reference that denotes &#38;, which in turn denotes the & character as such (i.e., not to be taken as a constituent of a character reference or an entity reference).
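
For example, a link to the URL mentioned above could be written as follows (the link text here is arbitrary):
<a href="http://www.server.example/cgi-bin/x.pl?foo=bar&amp;copy=42">run the script</a>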


Clarification: notations for characters in CSS

The "character references" discussed above should not be confused with the ways in which characters can be denoted in Cascading Style Sheets, CSS. The rules for specifying characters in CSS2 include the possibility of referring to any Unicode character using the notation
"\n "
(note the space!), where n is the hexadecimal code number. Thus, CSS uses a notation that is quite different from the HTML (and SGML and XML) notation (&#n;). This could be used, for example, in a CSS rule like
a:link:before {content: "\2192 ";}
which suggests that any unvisited link be preceded by a rightwards arrow. The character denoted by "\2192 " in CSS is the same as the character denoted by &#8594; in HTML; here's that character between quotation marks, if your browser can display it: "→". (Hexadecimal 2192 is 8594 in decimal.)
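
As a side note on the trailing space: the space merely terminates the escape. Writing the code number with six digits, padding with leading zeros, works as well, so the rule above could alternatively be written:
a:link:before {content: "\002192";}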

The document at hand is actually served with a style sheet containing such a rule. This part of CSS is ignored by IE 6, correctly supported by Mozilla 1.0, and partly incorrectly supported by Opera 6.03, which applies the rule to visited links, too. (This applies to the Windows versions of those browsers at least.)


For a different presentation of this topic, in a more general framework, see the section "Escape" notations ("meta notations") for characters of my character code tutorial. For a list of entities, see Tables of character entity references in HTML 4.0.