There is quite some confusion around "character references"
such as &#228; or &auml;
in HTML. In particular, the topic
is presented in a misleading way in the
HTML 4 specification.
This document tries to clear these things up, by explaining
them "as simply as possible, but not simpler". I'm afraid
I have partly violated the latter rule.
There are two different things: character references and entity references. They are often called "numeric character references" and "symbolic character references", but this is confusing: it lumps different things together just because they are used for somewhat similar purposes.
&#228; is a character reference
A notation such as &#228; is a
character reference. It denotes
the character which occupies code position 228 (decimal)
in the character code established as a reference. In this context,
character code means a fixed mapping from a set of integers to
a set of characters. In HTML, the code is
fixed to
Unicode.
In Unicode, positions 0 through 255 are the same as in ISO-8859-1
alias ISO Latin 1, which in turn contains Ascii as a subset in
positions 0 through 127.
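The relationship between the three codes can be checked mechanically. The following is a minimal Python sketch, assuming only the standard codecs named `latin-1` and `ascii`; the variable names are just illustrative.

```python
# Decode every single-byte value as ISO-8859-1 and compare the result
# with the Unicode character at the same code position.
latin1_matches = all(bytes([n]).decode("latin-1") == chr(n) for n in range(256))

# Ascii covers positions 0 through 127 only.
ascii_matches = all(bytes([n]).decode("ascii") == chr(n) for n in range(128))

print(latin1_matches, ascii_matches)
```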
Originally, in HTML 2.0, the "document character set" (which is what determines how character references are interpreted) was defined as ISO 8859-1 (or ECMA 94, to be very pedantic), with a clearly expressed aim that when the "document character set" is expanded in future versions, it will agree with ISO 10646. And this is what has happened. In HTML 4.01, the "document character set" consists of the first 17 planes of ISO 10646. This standard has been completely harmonized with the Unicode standard, with the agreement that only the first 17 planes will ever be used. Unicode is the more commonly known name.
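As a small arithmetic aside, the size of 17 planes can be worked out as follows. This is only a sketch; it assumes that each plane holds 65,536 code positions and uses Python's `sys.maxunicode` as an independent check of the result.

```python
import sys

PLANES = 17
POSITIONS_PER_PLANE = 0x10000  # 65,536 code positions in each plane

total_positions = PLANES * POSITIONS_PER_PLANE
highest_position = total_positions - 1

print(total_positions, hex(highest_position))
print(highest_position == sys.maxunicode)
```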
In addition to decimal (base 10) notation, hexadecimal (base 16)
notation can be used, but then the letter x is needed
before the number. For example, &#xe4; is equivalent
to &#228;, since the number 228 is e4 in hexadecimal.
It would often be natural to use hexadecimal notation, since such notation
is commonly used in character code standards, such as ISO 10646.
However, character references with hexadecimal notation are less widely
supported than character references with decimal notation.
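The equivalence of the two notations can be illustrated with a short Python sketch. It uses the standard function `html.unescape`, which follows the later HTML5 parsing rules; for well-formed references like these, those rules agree with what is described here.

```python
import html

decimal_ref = "&#228;"
hex_ref = "&#xe4;"

# 228 in base 10 and e4 in base 16 are the same number.
same_number = int("228", 10) == int("e4", 16)

# Both references therefore expand to the same character.
decimal_char = html.unescape(decimal_ref)
hex_char = html.unescape(hex_ref)

print(same_number, decimal_char == hex_char, decimal_char)
```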
For example, consulting the Unicode standard, we find that &#228;
denotes the character "latin small letter a with diaeresis", or,
informally,
"a with two dots". This does not depend on the encoding of the document
(such as Ascii, ISO-8859-1, KOI8-R, or Mac encoding), or on anything else.
However, some Web browsers may get these things wrong, most probably not
for &#228; but for some other character references. Usually &#n; works well
for n less than 256, but there are problems for larger values of n.
Those problems can be very important, but they are outside the scope
of this description.
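The independence from the document's encoding can be sketched in Python. The reference consists of Ascii characters only, so its bytes are the same in the encodings mentioned above. The codec names (`koi8_r`, `mac_roman`) are Python's, and `html.unescape` stands in for a correctly behaving browser; both are assumptions of this sketch.

```python
import html
import unicodedata

# The reference itself consists of Ascii characters only, so its bytes
# are the same in all of these encodings.
raw = b"&#228;"

results = {}
for encoding in ("ascii", "iso-8859-1", "koi8_r", "mac_roman"):
    text = raw.decode(encoding)
    results[encoding] = html.unescape(text)

for encoding, char in results.items():
    print(encoding, unicodedata.name(char))
```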
Note: In Unicode, positions 128 through 159 have been assigned to
control codes. Technically, something like &#128; is not invalid
(so a validator will not report it as an error) but it is undefined
(so browsers may do whatever they like with it, and there is no
definition for what is correct behavior). The so-called Windows
characters (like smart quotes and em and en dashes) have been
assigned other positions in Unicode, not those corresponding to
their positions in the Windows character code.
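The difference between the Windows positions and the Unicode positions can be shown with a short Python sketch. It assumes that Python's `cp1252` codec represents the Windows character code; the four byte values are just a sample picked for illustration.

```python
import unicodedata

# Position 128 in Unicode is a control code (general category "Cc").
category_128 = unicodedata.category(chr(128))

# The same byte values in the Windows character code (code page 1252)
# map to characters that live elsewhere in Unicode.
windows_bytes = {
    0x80: "euro sign",
    0x93: "left smart quote",
    0x96: "en dash",
    0x97: "em dash",
}
unicode_positions = {
    byte: ord(bytes([byte]).decode("cp1252")) for byte in windows_bytes
}

print("category of position 128:", category_128)
for byte, label in windows_bytes.items():
    print(f"{label}: Windows {byte}, Unicode {unicode_positions[byte]}")
```

So a smart quote or a dash needs a reference to its Unicode position, not to its position in the Windows code.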
A character reference could be called "numeric character reference", if desired. But when used to suggest that there are some other character references in HTML as well, it's misleading. (In SGML, there are "named character references" such as &#SPACE;, but they are a different issue, and they cannot be used in HTML.)
&auml; is an entity reference
An entity reference like &auml; has a meaning that has specifically
been assigned to it in HTML. In general, an entity reference can expand to
various expressions. This mechanism is part of SGML and very
powerful, but HTML happens to use it only in very simple ways.
Each of the entities defined in HTML specifications has an expansion
that is a character reference, such as &#228;. So in practical terms,
you might view &auml; as a "predefined symbolic name" for &#228;.
That's a bit similar to defining symbolic names for constants
in programming.
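The "symbolic name for a constant" view can be made concrete with Python's standard table `html.entities.name2codepoint`, which maps the entity names defined in HTML 4 to code positions. The variable names in this sketch are only illustrative.

```python
from html.entities import name2codepoint

# Look up the code position that the entity name stands for.
position = name2codepoint["auml"]

# Build the character reference that the entity reference is a name for.
entity_ref = "&auml;"
character_ref = f"&#{position};"

print(entity_ref, "is a name for", character_ref)
```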
The fact that in HTML, both character references and entity references begin with the ampersand symbol is just coincidental, or a pragmatic choice, rather than any deep connection. Arjun Ray has explained, in a Usenet posting, the confusion between character references and entity references as follows, after an explanation of a confusion around the symbol "<":
A similar jump to a wrong conclusion is with entity references and character references, thanks to ERO ("Entity Reference Open") and CRO ("Character Reference Open") sharing the same first character in the RCS bindings (respectively, "&" and "&#").
The reason for the RCS having such overlaps is to minimize the number of ordinary characters that would need to be escaped in some fashion to prevent their interpretation as the start of some markup sequence. It was a reasonable economization. Only it presumed that users would have a clue to what they were doing.
It is a common misconception to think that entity references are
more universal than character references. A notation like &auml;
looks more system-independent than &#228;. But it isn't. It's just
a name for it, so to say.
It is almost always a good idea to terminate a character reference
and an entity reference with a semicolon. According to certain rules,
the semicolon can be omitted in some cases, but it is never incorrect
to use it. And in quite a few cases, it is required. Basically,
if the next character is a letter, the semicolon is needed. If you
wrote &aumlh then a correctly behaving browser would treat it as
a single (and currently undefined) entity reference, not as
the reference &auml; followed by the letter h.
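The rule can be sketched in Python as follows. This is a simplified illustration, not a real HTML parser: the function `expand_entities` and its regular expression are hypothetical, it handles entity references only, and it takes the entity name to be the longest run of name characters after the ampersand. Python's own `html.unescape` follows the later HTML5 rules, which differ on this point, so it is not used here.

```python
import re
from html.entities import name2codepoint

# An entity name is taken to be the longest run of name characters after "&".
# The trailing semicolon is optional in this sketch.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);?")

def expand_entities(text):
    """Expand known entity references; leave undefined ones untouched."""
    def replace(match):
        name = match.group(1)
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return match.group(0)
    return ENTITY_REF.sub(replace, text)

print(expand_entities("&auml;h"))   # reference followed by the letter h
print(expand_entities("&auml h"))   # semicolon omitted before a space
print(expand_entities("&aumlh"))    # read as the undefined entity "aumlh"
```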
In XHTML (and in XML in general), the semicolon is always required. Reference: XML specification, section 4.1 Character and Entity References.
Moreover, Internet Explorer 7 has a bug in parsing entity and character references: without a semicolon, some of them are not recognized but displayed literally, even in contexts where the semicolon is optional. The bug is possibly an intentional "feature": XHTML restrictions are imposed, even on non-XHTML documents and even though IE 7 does not really understand XHTML! The following table illustrates the scope of the bug.
Type of reference | Example | Display | IE 7 behavior |
---|---|---|---|
Latin 1 entity | &eacute | é | Correct |
Other entity | &mdash | &mdash | Wrong, shown literally |
Decimal | &#8212 | — | Correct |
Hexadecimal | &#x2014 | — | Wrong, shown literally |
On the other hand, especially when URLs with query parts, such as
http://www.server.example/cgi-bin/x.pl?foo=bar&copy=42
occur in HTML documents, you need to take into account that
an HTML parser (when behaving according to the specifications) will recognize anything
that starts with & and a letter as an entity reference, even if
it is not terminated by a semicolon. In the example case,
&copy is taken as an entity reference that denotes &#169;, which
in turn denotes the copyright sign ©, which is not allowed in a URL,
but I digress. To prevent this, replace that & by &amp;, which is
an entity reference that denotes &#38;, which denotes the & character
as such (i.e. as not to be taken as a constituent of a character
reference or an entity reference).
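When HTML is generated by a program, the replacement can be done mechanically. The following Python sketch uses the standard functions `html.escape` and `html.unescape`. Note that `html.unescape` follows the later HTML5 rules and knows nothing about attribute context; for this particular URL it happens to behave the way described above.

```python
import html

url = "http://www.server.example/cgi-bin/x.pl?foo=bar&copy=42"

# html.escape replaces "&" by "&amp;" (and also escapes <, > and quotes).
escaped_url = html.escape(url)

# Decoding the escaped form gives back the intended URL.
round_trip = html.unescape(escaped_url)

# Decoding the unescaped form turns "&copy" into the copyright sign.
damaged = html.unescape(url)

print(escaped_url)
print(round_trip == url)
print(damaged)
```

The escaped form is what should appear in the HTML source, for example as the value of an href attribute.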
The "character references" discussed above should not be
confused with the ways in which characters can be denoted in
Cascading Style Sheets, CSS. The
rules for specifying characters in CSS2 include the possibility
of referring to any Unicode character using the notation
"\n "
(note the space!), where n is the hexadecimal code number.
Thus, CSS uses a notation that is quite different from the
HTML (and SGML and XML) notation (&#n;).
This could be used, for example, in a CSS rule like
a:link:before {content: "\2192 ";}
which suggests that any unvisited link be preceded by
a rightwards arrow. The character denoted by "\2192 " in
CSS is the same as the character denoted by &#8594;
in HTML; here's that character between
quotation marks, if your browser can display it:
"→". (Hexadecimal 2192 is 8594 in decimal.)
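The correspondence between the two notations can be checked with a short Python sketch. It starts from the HTML reference, takes the code position of the resulting character, and builds the CSS notation from it; the variable names are only illustrative.

```python
import html

arrow = html.unescape("&#8594;")     # the HTML notation, decimal

# The CSS notation is a backslash, the hexadecimal code number, and a space.
css_escape = "\\" + format(ord(arrow), "x") + " "
css_rule = 'a:link:before {content: "' + css_escape + '";}'

print(int("2192", 16))   # the decimal value of hexadecimal 2192
print(css_rule)
```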
The document at hand is actually served with a style sheet containing such a rule. This part of CSS is ignored by IE 6, correctly supported by Mozilla 1.0, and partly incorrectly supported by Opera 6.03, which applies the rule to visited links, too. (This applies to the Windows versions of those browsers at least.)
For a different presentation of this topic, in a more general
framework, see section
"Escape" notations ("meta notations") for characters
of my character code tutorial.
For a list of enti