“HTML validation” is a good tool, but just a tool

This is a rant that both promotes validation and puts it down. The point is that if you don’t know what validation really is, it won’t be of much use to you, and could even be a waste of time. Validation is simply a way of getting reports about complying with some formal rules. What would you do with the results if you don’t know those rules?

This document comments on “HTML validation” as performed by software such as the online services W3C Validator and WDG Validator, and A Real Validator, a product for offline use on Windows. “CSS validation” will be briefly discussed too.

Validation is formal

To be exact, there is no such thing as “HTML validation”. The concept of validation that matters here is SGML or XML validation. SGML is a meta-notation used to define the formal “syntax” rules of various markup notations, or markup “languages”, to use the common misnomer. All that an SGML validator does is that it reads a formalized syntax definition (called DTD, Document Type Definition) and a document that purports to comply with that definition, then reports any mismatches.

Consider the following example of a DTD:

<!ELEMENT list - - (item)+>
<!ELEMENT item - - (person, phone)>
<!ELEMENT person - - CDATA>
<!ELEMENT phone - - CDATA>

It defines the formal structure of a very simple telephone catalog, consisting of items which contain a name and a number. The internal structure of the names and numbers is not defined here; from the viewpoint of this DTD, they are just character data (CDATA). But the DTD defines that a document (here, a phone catalog) consists of an element called list, which consists of the tags <list> and </list> and a sequence of item elements between them. These elements in turn have a defined structure, and so on. All the end tags are defined as obligatory. Now, suppose we feed the following data to an SGML validator:

<!DOCTYPE list [
<!ELEMENT list - - (item)+>
<!ELEMENT item - - (person, phone)>
<!ELEMENT person - - CDATA>
<!ELEMENT phone - - CDATA> ]>
<list>
<item><person>Liisa</person><phone>313</phone></item>
<item><person>Jukka</person><phone>333</phone>
</list>

You would then get something like the following (the format of reporting the error will vary):

What the error message says is simply that the document does not comply with the formal rules that it claims to conform to, in the particular detail that the end tag </item> has been omitted but the DTD does not allow such omission. The validator gives this message relating to the </list> tag, which is potentially misleading, since there is nothing wrong with that tag. The validator just expected (on the basis of the DTD) another tag before it.

The message also refers to the possibility that the problem was caused by improper nesting. As so often when programs try to be helpful, this is a wrong hint here.

Note that there is absolutely no HTML in the preceding example. Validation is not an HTML matter, except in the simple sense that HTML syntax can be defined in SGML, implying that an HTML document together with a DTD can be validated.

To confuse things a bit, some validators actually think they are always processing HTML documents. This mainly means that the wordings of their error messages might reflect this and could mislead you if you used them for other purposes.

To confuse things more, some validators issue various warnings based on more or less heuristic checks, not DTDs. Useful as such warnings can be, they tend to cause problems in understanding the differences and impacts of validation errors and other messages.
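The check performed on the phone catalog above can be imitated in a few lines. The following is only a rough sketch, not how a real SGML validator works: the names CONTENT_MODELS and validate are made up for this illustration, the content models of the example DTD are hand-translated into regular expressions instead of being read from the DTD, and character data, attributes and case-insensitive names are ignored.

```python
import re

# Content models of the phone catalog DTD, hand-translated into regular
# expressions over the space-terminated names of child elements.
# None stands for CDATA content, which is not checked here.
CONTENT_MODELS = {
    "list": r"(item )+",
    "item": r"person phone ",
    "person": None,
    "phone": None,
}

TAG = re.compile(r"<(/?)([A-Za-z]+)>")


def validate(document):
    """Return a list of error messages; an empty list means no mismatches."""
    errors = []
    stack = []  # open elements, each as [name, names of children seen so far]

    def close_top():
        # Close the innermost open element and compare its children
        # with the declared content model.
        name, children = stack.pop()
        model = CONTENT_MODELS[name]
        sequence = "".join(child + " " for child in children)
        if model is not None and not re.fullmatch(model, sequence):
            errors.append(f"content of {name} does not match its declaration")

    for match in TAG.finditer(document):
        is_end_tag = match.group(1) == "/"
        name = match.group(2)
        if name not in CONTENT_MODELS:
            errors.append(f"element {name} is not declared")
        elif not is_end_tag:
            if stack:
                stack[-1][1].append(name)
            stack.append([name, []])
        elif name not in [entry[0] for entry in stack]:
            errors.append(f"end tag for {name} without a start tag")
        else:
            # Everything opened after the matching start tag lacks its own
            # end tag, and this DTD declares all end tags as obligatory.
            while stack[-1][0] != name:
                errors.append(f"end tag for {stack[-1][0]} omitted, noticed at </{name}>")
                close_top()
            close_top()
    while stack:
        errors.append(f"end tag for {stack[-1][0]} omitted, noticed at end of document")
        close_top()
    return errors


catalog = """<list>
<item><person>Liisa</person><phone>313</phone></item>
<item><person>Jukka</person><phone>333</phone>
</list>"""
print(validate(catalog))
```

Like a real validator, the sketch notices the problem only when it reaches </list>, although that tag itself is fine. And like a real validator, it knows nothing about what the elements mean.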

The DOCTYPE confusion

In the example above, the DTD was part of the document being validated. This is usually not practical, so normally the document contains just a reference to a DTD, in the form of a so-called DOCTYPE declaration.

For each official version of HTML, including e.g. the three different sub-versions of HTML 4.01, you will find a corresponding DOCTYPE declaration that you can use. You can just put it into one or two lines at the start of your document, and use a validator to find all the violations of the DTD that there might be in your document.

This would be fine, but great confusion has been caused by “DOCTYPE sniffing”: browsers have started drawing conclusions from the presence or absence of DOCTYPE declarations, even paying great attention to their spelling. This is really bad, though it does not directly affect validation. But for an explanation of the confusion as well as some practical conclusions, see e.g. What happens in Quirks Mode?

So what do I get from validation?

It is generally useful to write HTML in a manner that complies with published HTML specifications. This gives you much better chances of being cross-browser compatible and making your pages work on future browsers, too. Browsers don’t always implement all that HTML specifications say, but anything beyond the specifications tends to be either very browser-specific or at least inconsistently handled by browsers. And part of complying with a specification is complying with the syntax defined by it, the DTD.

Validation helps

Validation is about catching some syntax errors, nothing more. It is useful only if you know the syntax rules so that you can see what’s wrong. Validation does not do anything that you would not be able to do on your own. Validation is useful because you, being presumably human, are not particularly good at following formal rules or checking that they are obeyed.

Thus, a validator does not teach you how to write HTML. It just points out mistakes. You are supposed to learn HTML from other sources.

Validation tells you that the syntax rules are violated. This is mostly just confusing if you don’t know what those rules are. Thus, to make use of validation, you need to learn the syntax of HTML, in the particular variant you use.

Validation is like taking a paper to a secretary and asking him to check in each and every detail that all the formal rules of the Company Regulations are strictly obeyed, marking all violations, and nothing more. Would you complain when he does exactly that? Yet many people use a validator (since they have been told to), then start accusing the validator of nitpicking or using wrong rules or whatever.

If you really want to understand the formal syntax of HTML and to know how to read validator reports, you need to learn SGML or XML and to read the applicable DTDs. But there is a shortcut that lets you get most of the benefits of validation without too much work: Use the excellent HTML 4.0 Reference by the Web Design Group.

That’s not the ultimate, authoritative reference, and 4.0 slightly differs from 4.01, but for the most part, this reference tells you rather understandably what the syntax is. The alphabetic list of elements contains links to element descriptions with attribute descriptions as well as sections “Contents” and “Contained in”, where you can find readable explanations of the relevant nesting rules.

Valid documents can be wrong

Validation doesn’t mean that your pages are good in any other sense but the technical sense of not containing violations of the DTD. It is quite possible to have a document that passes validation but is still in definitive violation of HTML specifications. The DTD covers just some aspects, some of the formal aspects. For example, an href attribute value is basically just a string, as far as the DTD is concerned. Any nonsense like href="anything I put here!!¤¤#?" will pass and must pass validation, since the DTD is not violated.

Although there is really not much to be gained from using XHTML on web pages, many people have started using it. Moreover, XHTML has its uses, in conjunction with software that can operate on XHTML but not HTML. Then it becomes relevant that validation means different things for XHTML. The reason is that the metalanguage, XML, is considerably less powerful than SGML. For example, the XML DTD for XHTML 1.0 declares the tabindex attribute as CDATA, which allows virtually anything. In the SGML DTDs of “old” HTML, the attribute is declared as NUMBER. This means that in validating against “old” HTML, tabindex="-1" is reported as an error (as it is), whereas in XHTML validation it passes. On the other hand, XML imposes restrictions that forbid constructs that are formally correct in SGML-based HTML but not actually supported by browsers, such as the shorthand <em/text/ for <em>text</em>, and this means that XHTML validation is pragmatically more useful in some ways.
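The difference between the two declarations of tabindex can be made concrete with a small sketch. The function name attribute_value_ok is hypothetical, and the check is simplified: it treats an SGML NUMBER as one or more digits and ignores the length limit that SGML also imposes.

```python
import re


def attribute_value_ok(declared_value, value):
    """Tell whether a value conforms to the declared value of an attribute.

    Only the two declared values discussed in the text are covered.
    """
    if declared_value == "CDATA":
        # Character data: virtually any string is accepted.
        return True
    if declared_value == "NUMBER":
        # Simplified SGML NUMBER: digits only, so no sign is allowed.
        return re.fullmatch(r"[0-9]+", value) is not None
    raise ValueError(f"declared value {declared_value} not covered by this sketch")


# tabindex="-1" against the two kinds of DTD
print(attribute_value_ok("NUMBER", "-1"))  # SGML DTD of "old" HTML
print(attribute_value_ok("CDATA", "-1"))   # XML DTD of XHTML 1.0
```

Neither check says anything about whether the value makes sense as a tabbing position; that is defined in the prose of the specifications, not in the DTD.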

Thus, a valid HTML document can be incorrect even in a very technical sense of violating the specifications. Naturally it may also trigger browser bugs, lack any useful content, contain absurd (though valid) markup, and so on. And it will be rendered differently by different browsers, of course; that’s how things are meant to work.

Besides, there can be reasons to deviate from published DTDs. Some people think they need to use <embed> markup; I think <nobr> is sometimes useful; and so on. If you decide to use markup that does not belong to the DTD of any published HTML version, you can still make your documents valid in all other aspects. You could even write a DTD of your own (modified from a public one), though this will require some fluency in SGML (or XML). For a simple example, see my notes on validating documents with nobr markup. For more information, see the document Creating your own DTD for HTML validation.

Say no to “Valid HTML” icons

The useful purpose of validation is to get error messages and warnings, not a “clean report”, still less to show such a report to others. Unless, of course, those others really pay you for it, or have the power to demand such a report.

The icons add nothing to validity

It’s useful to write valid markup, in most cases. But it’s hardly useful to make a noise about it. By stating or claiming that a document is valid, you do not make it any more valid.

Analogously, it’s useful to use proper punctuation when you write in English. This makes texts somewhat easier to read and understand, and it adds to the literary quality a bit. There are slightly different styles of punctuation, and you should choose one and stick to it. But it’s hardly useful to make a noise about it.

The icons are cryptic

The majority of web users don’t create web pages. Even if they do, they often do not know much about HTML (or CSS), since they use wysiwyg tools or publishing systems that do not involve “manual” HTML coding.

Thus, they might even fail to understand the abbreviation “HTML,” still less the various version numbers, or they might have just a faint idea of HTML. More importantly, the word “valid” is misunderstood by most people.

Thus, the icon may look like an enigma to a user. Would you like to include an icon like “Checked SGUFDFY 42.5!” onto your pages and expect users to decipher that SGUFDFY 42.5 means some particular convention on punctuation?

No wonder the Help and FAQ for the Markup Validator page contains, as the first entry, an answer to the question “Help me! I clicked on an icon and ended up on this strange site!” Actually, the question is more informative than the answer.

The icons are bad propaganda

People who put “Valid HTML” icons on their pages presumably have a sincere desire to promote validity and the goals behind that. But they forget that for the great majority of users, all those validity stamps are just something between line noise and obscurity. And even if the Web surfer who sees the icon is herself or himself an author, or intends to become one, why would she or he be the least inclined to consider validation just because of the presence of some validity stamps?

Using the icons for propaganda is a bad move. The icons just aren’t understood at all by most visitors, and they are misunderstood by the rest. Validation isn’t really among the most important things that authors should learn. Most Web pages are invalid, but they have much more serious problems too.

Concentrating on validity means concentrating on formal aspects. If authors try to do so too early, they will get misdirected and confused. The validators’ error messages can be very confusing, and novice authors might find themselves trying to make a poorly designed page valid, instead of rewriting it.

Other “correctness” icons: say no to them, too

It makes more sense to do propaganda for automated checking of CSS style sheet syntax, “CSS validation”, which is much more relevant due to the differences of HTML and CSS and their processing in browsers. However, “Valid CSS!” icons are a poor method for that, even if we ignore the term confusion. (CSS is not SGML or XML based, so “valid” is a misleading term here.) The abbreviation “CSS” is more cryptic than “HTML,” because it has several interpretations, and people actually misunderstand it. A casual visitor may well take “CSS” as standing for Content Scrambling System or Cross-Site Scripting and not Cascading Style Sheets.

As regards declaring compliance with WAI guidelines, it’s yet another thing, and quite different. Such compliance cannot be checked automatically. Moreover, practically all pages that claim conformance to WAI guidelines make a false claim thereby. There’s more on this in my Notes on some tools for checking and improving Web page accessibility.

The icons lie to prospective clients

It has been said that prospective clients will look at the validity and other icons. That is, you, as a person who creates Web pages for a living, would miss opportunities since potential customers ignore you if they don’t see the icons. I think this grossly overestimates the emphasis that clients put on validity. Some clients may care about the icons, but far more will see them as any normal visitor does, as distracting enigmas.

So the prospective client who gets impressed by the icons on your pages actually gets impressed because he was seriously misled.

Educated people know that validity claims are often false

Moreover, people who know enough about validation are also learning that many validity claims are just false. Either the author did not actually check validity, or he thought that almost passing validation is enough, or—perhaps most often—the page validated but was edited later and isn’t valid any more. Such things decrease the value of the icons, and pages carrying them, even if the page itself happens to validate.

Making the author’s life easier

People often argue that the icon is useful to the author or maintainer of the page. After making some changes, he can conveniently click on the icon and get a validation report. Note that this means that the icon will be there, proclaiming validity, even though the author does not know yet whether the page is actually valid.

This argument is understandable, but it’s still all wrong. It is wrong to pollute one’s web page with something that distracts from the content and may confuse the user, just for some assumed convenience to the author.

Moreover, it is easier to use other tools, since they can be installed so that they work for any page, without any inserted code on the pages. For example, if you use Internet Explorer, you can easily install “favelets,” which are used via the Favorites menu of the browser. You can find them on Tantek Çelik’s Favelets page, or you can just click on the following pseudo-link with the right-hand button of the mouse and select “Add to Favorites”:
javascript:void(document.location='http://validator.w3.org/check?uri='+escape(document.location))
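The same idea works outside a browser, too. The following sketch builds the address of a validation report for a given page; the function name validation_link is made up for this illustration. It percent-encodes every reserved character of the page address, which is a choice made here and stricter than what the favelet’s escape() call does.

```python
from urllib.parse import quote

# The check address used by the favelet above
VALIDATOR = "http://validator.w3.org/check?uri="


def validation_link(page_url):
    """Return a validator URL that asks for a report on page_url."""
    # safe="" makes quote() encode ":", "/", "?" and "=" as well, so that
    # the page address cannot be mistaken for part of the validator's query.
    return VALIDATOR + quote(page_url, safe="")


print(validation_link("http://example.com/a b?x=1"))
```

A link built this way can live in the author’s own notes or tools instead of on the page itself.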

The W3C is just wrong here

The reputation, or assumed reputation, of the validity icons is actually based on the common assumption that the icons are seals of approval by the W3C. Of course, the W3C would be the first to say that they aren’t. But the W3C is factually contributing to the common misconception, since the W3C promotes the use of the icons, asks (or actually requires) authors to make them links to W3C pages, and includes its logo in the icons.

The W3C has promoted the use of the icons so long that we cannot expect a change of policy. People at the W3C work with web-related specifications and technologies, so it is understandable that they overestimate their meaning as well as users’ ability and willingness to understand validity (or “validity”) issues. But you should know better.

There is no excuse now

If you do not know what the icons really mean and imply, you should not even think about using them. And if you do know that, as you probably know now, having read this far, you have no excuse for having the icons on your pages. Using them, you would be sending a message that will be either misunderstood or not understood at all by the vast majority of people. And you would know this.

Sorry, now that you have been explicitly told this, you have lost your innocence. You cannot plead ignorance any more. If you keep using the icons, you now know that you are intentionally abusing other people’s misconceptions for the purpose of your personal gain, or for your company’s profits, or for the purpose of making some propaganda (maybe for a noble cause, but the propaganda will be ineffective or it will work against the purpose).

The “check mark” is a failure mark to some people

The icons contain a check mark, a v-like symbol. It is widely understood as indicating that something has been checked and approved (or, in other contexts like in checkboxes, that something has been selected). But it is not culturally neutral. The CEN Workshop Agreement European Culturally Specific ICT Requirements explicitly mentions: “Special symbols are often used to mark both correctness and an error (and a check mark in one culture can have the opposite meaning in another culture).” In Finland, the “check mark” traditionally indicates disapproval, e.g. in exam papers as a mark that an answer has been checked and found wrong. (The correctness mark resembles a percent sign, and could be identified with the Unicode character commercial minus sign.) According to the Unicode Standard 4.0 (p. 160), the situation is the same in “a number of other European countries”. People are getting used to the meaning of a check mark as an approval symbol, due to the effect of information technology, but this means instability and confusion rather than completed cultural assimilation.

The icon link may fail miserably

The “Valid HTML” icon is required to be a link to the W3C validator (see License and Guidelines for usage of the "valid" icons on the Help and FAQ for the Markup Validator page). When a document passes validation, the validator says:

To show your readers that you have taken the care to create an interoperable Web page, you may display this icon on any page that validates. Here is the HTML you could use to add this icon to your Web page:

(The statement about “interoperable Web page” is very, very wrong. Validity does not guarantee interoperability, of course.)

This is followed by versions of the icon and the HTML markup (in small print in grey text, setting a very bad example) inserting the icon. The markup makes the icon a link, though a style sheet on the W3C pages prevents it from looking like a link, by removing the common default blue border (against the sound principle that Links Want To Be Links). The link has the URL
http://validator.w3.org/check?uri=referer
which means that when the link is followed, the browser is expected to send the address of the page in the Referer header, among the HTTP headers.

And this may simply fail. A browser might be configured not to send the Referer header, or to send a fake header, for privacy reasons. This is not common, but when it happens, the results are confusing. The W3C validator sends back an error message page that is technically correct but understandable to a small fraction of people only.

Sorry! This document can not be checked.

No Referer header found!

You have requested we check the referring page, but your browser did not send the HTTP "Referer" header field. This can be for several reasons, but most commonly it is because your browser does not know about this header, has been configured not to send one, or is behind a proxy or firewall that strips it out of the request before it reaches us.

This is not an error in the referring page!

Please use the form interface on the Validator Home Page to check the page by URL.

Web users do not usually prevent their browser from sending the Referer header. If they do, they can probably understand the error message and take some corrective action. However, some experienced user may have configured the browser of a relative or a friend “to protect privacy” without explaining the implications, or a user might be using a public computer where such “paranoid” settings are in effect. Most web pages do not need the Referer header, so people don’t usually see problems, but when they click on a “Valid HTML” icon, confusion is created.
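The dependence on the Referer header can be illustrated with a sketch of what a service has to do when it receives uri=referer. This is a hypothetical reconstruction for illustration only, not the W3C validator’s actual code; the function name page_to_check and its return convention are assumptions.

```python
def page_to_check(uri_parameter, request_headers):
    """Return (url, None) on success or (None, error message) on failure."""
    if uri_parameter != "referer":
        # The address to check was given explicitly in the link.
        return uri_parameter, None
    # HTTP header names are case-insensitive, so normalize them first.
    headers = {name.lower(): value for name, value in request_headers.items()}
    referer = headers.get("referer", "").strip()
    if not referer:
        # Nothing the service can do: it has no idea where the user came from.
        return None, "No Referer header found!"
    return referer, None


# A browser with default settings, then one configured "to protect privacy"
print(page_to_check("referer", {"Referer": "http://example.com/page.html"}))
print(page_to_check("referer", {"User-Agent": "SomeBrowser"}))
```

A link that names the page explicitly in the uri parameter takes the first branch and does not depend on browser settings at all.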

Moreover, the error message is misleading, since common reasons for the problem include the following (see the document Internet Explorer Does Not Send Referer Header in Unsecured Situations by Microsoft):

Validity icons are unprofessional

Jens Meiert has written the document “Valid CSS” and Similar Claims Are Unprofessional. His argument is that validity is a self-evident requirement. This is basically correct, even though most web sites contain syntax errors (validity errors). It is pointless to claim that your page lacks syntax errors, just as it would be pointless for a surgeon to explicitly claim that his hands are clean.

Some final notes: why do I nitpick about “validation”?

The word “validation” and its relatives are used for a large set of different meanings, even in the Web context. There’s (officially!) a “CSS validator”, and there’s a commercial HTML checker that keeps calling itself “CSE validator” (despite being a checker that performs mixed operations, partly on a rather subjective basis), and “form validation” is commonly used about checking form data by some criteria, etc. So am I defending a lost cause when I say that “validation” and “validator” should be used as relating to SGML or XML validation only?

Maybe it is a lost cause. But we all lose then. Either we fall into confusion, or we use some other verbal expressions that serve the purpose of saying what we mean. That would mostly mean the alternative “SGML or XML validation”. This would be inferior to just using the word “validation”, so that e.g. “checking” would convey the general and vague idea. In particular, you would then choose carefully whether you say “SGML validation” or “XML validation” or perhaps “SGML or XML validation”, with an explanation of the difference between the two, which is not difficult except if you want to tell the truth.

Besides, in the XML context, it is essential to distinguish between two kinds of formal correctness, “well-formed” and “valid”. This makes it increasingly important to understand and to remember that in the markup context, the word “valid” and its relatives are technical terms, not to be confused with correctness in general.
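The distinction can be tried out with Python’s standard xml.etree.ElementTree module, which checks well-formedness only and does not validate. The sketch below reuses the phone catalog from the beginning of this document, written as XML; the helper name is_well_formed is made up for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Tell whether text parses as XML at all; no DTD is consulted."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# An end tag is missing: not even well-formed, so validity is out of the question.
missing_end_tag = "<list><item><person>Jukka</person><phone>333</phone></list>"

# All tags are balanced, so this is well-formed. It would still be invalid
# against a DTD that requires (person, phone) inside item.
missing_person = "<list><item><phone>333</phone></item></list>"

print(is_well_formed(missing_end_tag))
print(is_well_formed(missing_person))
```

Catching the second problem requires a validating parser and the DTD; a well-formedness check alone will never report it.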

But what about HTML5?

HTML5 is now a Candidate Recommendation from the W3C. HTML5 CR differs from the WHATWG HTML Living Standard. The latter is often called HTML5, too, and originally the two documents were close to each other, but they are diverging, and there is no public document on the differences now.

HTML5 is neither SGML-based nor XML-based, though it allows a syntactic form that conforms to XML rules and is sometimes called XHTML5. HTML5 has its own syntax rules described in prose, which is rather technical and even formal, but not DTD-based. It is also rather impractical to try to define a DTD for it, as my attempt at an HTML5 DTD shows: too many limitations, too tricky, and the DTD would need to be so permissive that many syntax errors would go uncaught.

So no validation for HTML5. Yet, both validator.nu and validator.w3.org perform “HTML5 validation” (in addition to validating HTML). The point is that these programs, which are closely interconnected but differ in a manner that has not been publicly documented, are really polymorphic: when a document is submitted to them as HTML5, they switch to a completely different mode of operation. This means checking the syntax by HTML5 rules, quite independently of any DTD.

There is no published document of the specific variant(s) of HTML5 that they check against. But it seems that they try to use the HTML 5.1 Nightly as the basis but are not really synchronized with it.

This means that a document may get the congratulations “This document was successfully checked as HTML5!” today and the message “Errors found while checking this document as HTML5!” tomorrow, just because someone changed the rules.