This document proposes a multipurpose universal format for text documents. The format allows the document structure to be described using natural-looking markup, which has defined semantics in terms of logical meanings, not visual appearance. UTD is intended for a wide range of applications, including online and offline publishing, archival, and content combination and selection.
This document proposes a general-purpose text data format called UTD. It might be seen as an idea of how HTML should have been designed, or how a successor of HTML could be designed, but it is intended to be useful quite independently of the WWW, too. The design has been inspired by the fundamental ideas of SGML, and in particular by the "general" document type presented as an example in the SGML standard, but also by many other considerations, some of which have been summarized in my old review of the HTML 4.0 draft and in my newer "HTML in retrospect - what can we learn from the great success, and the great failure?"
The design tries to avoid the theoretical burden of the abstractness of SGML as well as the tag soup tradition coupled with HTML. Briefly, UTD is not a proposal for a new version of HTML but a proposal for a completely distinct, yet conceptually related, data format, which could coexist with other formats and gradually gain importance. In the first phase, it could be used as the internal format of documents that are then programmatically converted to HTML or XML documents to be served with some style sheets.
See also the afterword.
Below, we have a simple document in UTD format. It is a short note on an observation, perhaps to be sent just as an E-mail message to a few friends, but perhaps also to be sent to a Usenet-like system (where UTD format is accepted) or to be put onto a Web page as part of a longer list, or even included in a database. In the last case, we would probably use some additional markup that corresponds to the special needs of the database and its management.
Content-Type: text/utd
Expires: Sun, 18 Mar 2022 08:30:00 GMT

{Doc(language:en)
{Title Albino eagle found dead in {translate(language:fi,sv:Esbo) Espoo}, Finland, {time 2022-03-18}}
{Meta
{Audience Ornithologists}
{Master(http://info.foo.example/path/albin.utd)}}
I found a dead eagle ({taxon Aquila chrysaëtos}) on my backyard on {time(std:2022-03-18T08+02) March 18th 8 AM}, in {translate(language:fi,sv:Olars) Olari}, Espoo, Finland.

It looks {Emphatic completely} albinistic. It has no visible injuries. I'll contact the local museum of natural history as soon as possible.
{Author {name(language:fi) Jukka K. Korpela}, {email jukkakk@gmail.com}, {phone(kind:home) +358 50 5500 168}}
}
The first three lines illustrate that a UTD document may contain Internet message headers at the start of it. This could be much more flexible than e.g. trying to find out how to make an HTTP server (Web server) send the headers you want, when you need them. Since the author expects to update the document very soon and it is very short, it is rational to try to prevent caching.
The document proper is enclosed between {Doc and }, making it a Doc element. That element could be used as a building block of other elements, e.g. for a list of observations. Inside the element we have a Title element, a Meta element, some textual content (which contains some "text-level" markup), and an Author element.
The Title element specifies an "external" title for the document. In this case, since it is not inside a Meta element, it is also displayed as a main heading for the document.
The Meta element has two subelements and it specifies "metadata only", i.e. information about the document; such information need not and normally should not be presented as normal content but rather as optionally accessible information, or at least in a way which makes it obvious that it's "documentation about the document" rather than part of the document itself. There the Audience element tells the intended audience, in normal prose, and it might be very useful in helping the reader see at a glance that the document is probably not interesting to him. The Master element specifies the address (URL) where the master copy of the document resides; if, for example, the message were sent as an E-mail message, this element would allow convenient access to the current version - something to be appreciated if we can expect the content to be updated soon!
The content proper is just two paragraphs of text. The empty
line implicitly defines that there are two paragraph elements present.
You could use explicit markup for this, if you like.
Inside the content, some words and phrases are specially marked up. The taxon markup indicates that its content is a scientific name of a species. This should have an effect on presentation, but it could also be used in various other ways e.g. in automatic indexing and searching. It's better to avoid having to guess whether "Aquila" is part of such a name, or a person's name, or a ship's name, or something, especially if you are interested in (say) automatic processing of biological information.
The time markup indicates its content as a designation of time; here it also specifies the standardized (ISO 8601) designation for that time, giving unambiguous information which is well-suited to automatic processing, still letting the author use plain English in the visible content.
The translate markup anticipates the possibility of automatic translation of the text into Swedish. Simple notes like this are fairly suitable to modern automatic translation, but a translating program might need some help with some less common proper names, so the Swedish equivalents are given explicitly. And, of course, a human translator might benefit from such information, too; he could peek at the UTD markup if needed.
There's also a word marked up as Emphatic. It indicates that it is important in its local context, to be displayed or uttered prominently to emphasize this. UTD has various elements for indicating emphasis. If you wanted to emphasize the complete albinism itself, in a headline-like manner, you would use something like {Important It looks {Emphatic completely} albinistic.} and a browser could show it e.g. in a colored box.
Then there's the Author element. Its meaning should be pretty obvious, but note what possibilities we have if documents generally use such markup. For example, for a set of documents, you could easily produce a list of documents by author, if you just have a program that parses UTD and recognizes Author elements. (Admittedly, there are risks, like making the collection of E-mail addresses for spamming a little easier, but spammers already get the addresses without any markup.)
There's also some language markup. The overall language of the document is indicated in an attribute of the Doc element. Standardized ISO 639 language codes are used, and en stands for English, as you guessed. Some names are indicated as being in Finnish. This implies, for example, that if an automatic translation program shows you the message in German, it should keep those strings intact (even if they happened to look like English words!), as a rule. It also means that if you are listening to the document using a speech-generation program, that program should (if it is good) either use its information about the Finnish language or at least give some kind of a warning that what you hear might not be the intended pronunciation. Or it could spell the names for you, letter by letter, if you asked for it, as you well might.
If you often wrote messages like this, you would probably have part of their common structure as a "boilerplate", or a program that generates a suitable format, so that you would mostly type in the textual content. But when casually writing a message like this, it would not be too difficult to type it "by hand".
A browser could display the document in a multitude of ways, but one possibility is something like the following:
I found a dead eagle (Aquila chrysaëtos) on my backyard on March 18th 8 AM, in Olari, Espoo, Finland.
It looks completely albinistic. It has no visible injuries. I'll contact the local museum of natural history as soon as possible.
Jukka K. Korpela, jukkakk@gmail.com, +358 9 888 2675
If you use a graphic browser (such as IE 4 or newer) that supports the title attribute in HTML, try moving the mouse around to see some information in "tooltips". This illustrates some simple things that a UTD browser might do with information specified in UTD.
On the other hand, in a simple pure-text presentation, with omissible information omitted, such as something that you might want to get into a portable phone or an Internet-connected wristwatch, it might look like the following:
Albino eagle found dead in Espoo, Finland, 2022-03-18
I found a dead eagle (/Aquila chrysaëtos/) on my backyard on March 18th 8 AM, Olari, Espoo, Finland.
It looks *completely* albinistic. It has no visible injuries. I'll contact the local museum of natural history as soon as possible.
Author: Jukka K. Korpela, jukkakk@gmail.com, +358 9 888 2675
But maybe the software that produces the textual presentation for a phone could automatically pick up the phone number, as indicated in the markup, and make it possible for the recipient to use that number directly. There's a wide world of possibilities, as soon as you have nice markup to base some actions on, just because the markup itself does not specify actions but logical meanings and structures.
UTD is intended to become a general-purpose text data storage and transfer format which carries basic structural information and can be converted to various presentation (and other) formats and processed using automatic tools, such as indexing software, automatic translation, and other transformations. It should be easy, though perhaps sometimes tedious, to write UTD "by hand", but it could also be generated by automatic conversion tools from other formats or by UTD capable text editing software. In any case, the markup should be simple and intuitive.
The design of UTD tries to cover multiple processing of a document, in the sense that a document can be programmatically processed in very different ways during preparation, distribution, maintenance, and use. Consider, for example, that you are writing a textbook for a programming language. You would want to typeset it nicely, but you might also want to have it readable on screen, perhaps as embedded into some "standard" help system on a computer, preferably with some nice search facilities. You might know how important a good index is to a serious reader of a book, and you might also know how hard it is to produce one, so you might appreciate any markup system that lets you just say "put this word into an index (in this form)", leaving the actual generation of an index (with correct page numbers) to a program. You would also want to use a spelling checker, and one that does not complain about language keywords and function names if they have been taken from English and your textbook is in German. If the book becomes a success, you might appreciate the possibility of automatically translating it into different languages, and technical prose is fairly translatable that way, if we can assist it a bit using markup that e.g. tells that in the phrase "the log function" the word "log" is a name that shall not be translated at all. You might also want to have your program examples syntactically checked, perhaps executed in a test environment, and saved as separate files. An editor could do such things, if the examples have been properly marked up. This does not mean that all such functionality should be built into a single, huge program; instead, data could be passed between applications so that e.g. an editor just recognizes what parts are program code and passes them to some software that can handle them.
Although UTD would be used in systems like the World Wide Web, it is intended to be much more (though not less) than a publication language for the WWW. It could be used as an archiving format, intended to save the essential structure and content of a document for the future, as opposed to various program-specific, presentation-oriented formats that might very soon become obsolete and unreadable even on the next version of the program in which they are written. It could also be used in simple (or complex) data storage and retrieval systems, like a personal mailbox, to the extent that E-mail messages are in UTD format. And, for example, a company could specify that UTD be used in its internal E-mail system and data sharing, with additional local and application-oriented restrictions imposed (e.g., requirements that some specific markup be used), to facilitate more efficient processing of incoming E-mail. Wouldn't it be nice for a busy boss if she could have her E-mail automagically processed so that she can quickly see which messages contain a proposal on something, and have the proposals extracted first?
The repertoire of inline (text-level) markup should be rich enough to allow documents to be written in a manner which can be useful in at least some forms of automatic processing. Most inline elements would not necessarily require any particular effect on presentation. For example, it should be possible to indicate a word or phrase as a person's name or other proper name, as some sort of code (e.g. an E-mail address), or a "literal" which means the word itself (e.g. as in "the plural of 'ox' is 'oxen'"). Such markup may greatly improve the quality of automatic translations and the efficiency of searches when the keywords may appear either as proper names or as normal words. (Consider trying to use Web search engines for finding information about a person with surname "English" nowadays.)
Contrary to the XML trend, the design of UTD aims at generic and general-purpose markup, with semantically defined concepts such as "paragraph", "heading", "definition", and "record". The format is however open-ended in the sense that document-specific element types can be used, with semantic descriptions given in prose, and collections (packages) of such types can be formed. They would, however, be something to be used in addition to general markup, often so that a general markup element's content is further refined using specialized markup. For example, in general markup one can write {money USD 100} or {money 100} when desired, to indicate that the string USD 100 or the string 100 is an expression of a sum of money. More specialized markup systems could be developed, and perhaps standardized, to specify exactly the internal structure and meaning of money denotations, e.g. formally specifying the default currency, so that {money 100} becomes semantically unambiguous. This might result in the possibility of using an automatic currency converter integrated into a browser: it could recognize the markup and e.g. display the corresponding sum in euros in parentheses right after the dollar-valued sum.
Our sample markup, {money USD 100} instead of just USD 100, is not something that an author is required to use. The UTD format allows fairly rich and detailed markup, but it is up to the author to decide how much markup is used. On the other hand, in automatic conversions from, say, some database format to UTD, there is generally no reason not to "tag" (mark) e.g. a string as a date denotation when it is extracted from a date field in a database. Rich markup will probably turn out to be useful for various purposes, like maintenance of documents, automatic and computer-assisted translation, etc. For example, if your document contains an intentionally misspelled word, perhaps because the document discusses orthography, you would probably want to use markup like {misspelled definishon} to avoid getting error messages whenever you process the document with an editor with automatic spelling checking.
Although UTD has a rich repertoire of available markup, the author can thus make use of just a small part thereof. You could first write very basic markup into a document, then add markup, as you find some practical reasons to do so. Some day you might learn how useful definition markup can be to users, and you would slap suitable markup around your definitions, so to say. Note that this is contrary to the SGML principles. The tutorial annex A of The SGML Handbook says (on p. 7):
Generalized markup is based on two novel postulates:
- Markup should describe a document's structure and other attributes rather than specify processing to be performed on it, as descriptive markup need be done only once and will suffice for all future processing.
- Markup should be rigorous so that the techniques available for processing rigorously-defined objects like programs and data bases can be used for processing documents as well.
UTD largely shares these "postulates" but is specifically not oriented towards the idea that "markup need be done only once and will suffice for all future processing". On the contrary, markup can be cumulated as needed.
This also makes incremental learning possible. In fact, you could even start from using no markup and learn just one element at a time, though in practice you'd probably benefit from learning a small "starter kit" first, then some additional sets of elements as "packages".
Moreover, even if rich markup is used, browsers and other software for processing UTD can be very simple. In fact, we might even call a trivial program a UTD browser if it just displays a UTD file as such, as if it were plain text, and more advanced browsers could be built piecewise or to perform fairly specialized jobs where most markup is either ignored or displayed as such.
The generality of UTD does not exclude the possibility of defining specialized usages of UTD for particular document types and purposes. On the contrary, UTD is aimed at making "customized markup systems" easy to define, yet useful outside their specialized usage, too. For example, an organization that needs to produce large amounts of documents that need to comply with strict structural rules could decide that UTD markup be used for it, with specific rules on which parts of UTD must be used (or can be used) and how. Such documents, although perhaps very specific in structure, would nevertheless be processable using general UTD software.
The UTD format is strongly hierarchic and nestable. There are several markup elements that are primarily intended for use at the topmost level of nesting, such as {Author} and {Abstract} for specifying the author or an abstract of the entire document, but they could be used for parts of a document as well. For example, a section could have its own abstract, and for specifying the original author of some quoted text you would use {Author} inside markup that indicates a quotation. The rules have been designed so that you can take a UTD document and make it part of another UTD document with no editing.
When elements are nested, the meaning of an inner element is interpreted as relative to the meaning of the outer element. An emphasized word in a heading is something that is even more important than the heading as a whole. This does not mean that rendering methods would necessarily cumulate. For example, a browser that uses italics both for headings and emphasized words is not expected to use "doubly italic" text but some other method of indicating the emphasized word as emphatic in its environment. This might even mean not using italics for it at all! (Normal upright style looks emphatic inside italic text.)
A UTD document can be presented visually on screen, on paper, or other media. A system that is capable of doing so is called a browser. Note that UTD documents can be presented aurally too, and some considerations on visual presentation apply there as well.
Web browsers typically operate so that only a part of a document is visible at any given time, and the user can select which part is shown, e.g. by keyboard commands, or by scrolling, by following internal links, or by invoking a "Search" function. Browsers capable of handling UTD are expected to have such functionality, and often more. A browser that recognizes a large table inside the document could make just the header line and a few first data lines visible, in a separate scrollable area inside the window. An author could make suggestions on such presentation in a style sheet. Moreover, thanks to UTD being structured, a browser could have some basic tools like "move to next section", which can be significantly more comfortable than primitive scrolling.
In general, no specific presentation is required, and the presentation can vary greatly. However, a conforming browser is required to apply some general principles. Some UTD markup is ignorable in presentation: a browser may act as if the markup were not there, presenting content only. For example, a browser may present {money 100} as if the document contained just 100, and this is what most browsers will probably do. But a browser may pay attention to such markup. In particular, a browser could be configured, either permanently or for a particular browsing situation, to display some elements in a specific way, e.g. the content of all {money ...} highlighted in some color.
One of the ideas of generic semantic markup is that users can tune the presentation to suit their needs; if they need to find prices, they can set the browser to highlight money denotations. Of course, this works only if the relevant markup is actually used. The user should know if that is not the case, so that he knows that the document may contain prices even if they are not highlighted. Thus, an author should use markup consistently, e.g. either use no {money ...} markup in a document, or use it for all money denotations. (This needs elaboration! Should there be declarations on this, or should this be implicit? What about documents created by combining documents?)
On the other hand, some markup is forcing in the sense that a browser shall present the document differently depending on the presence of the markup and do that in a manner that corresponds to the general meaning of the markup. For example, the construct {Heading Hello} must be displayed as somehow different from plain Hello, and the difference should convey the idea that the text is a heading. The actual presentation may vary greatly and might (and normally should) reflect the level of nesting, i.e. headings of different levels should be presented differently as far as feasible.
Even forcing markup can be handled by the browser simply by presenting the document content with the markup displayed as such. That is, literally displaying e.g. {Heading Hello} is acceptable; a browser could use a colon in this context to make things look slightly better, e.g. displaying {Warning Do not open attachments!} as {Warning: Do not open attachments!}. Browsers should however make their best effort to use more natural presentations.
In UTD, markup is forcing if and only if the element name begins with a capital letter. Thus, a browser is allowed to ignore any markup with an element name beginning with a lower case letter, rendering its content only. When a capital initial is used, a browser is required to process the markup according to its definition, the minimalistic behavior being (for most elements) the literal display of the markup itself.
This means that when the format is extended, we just select a lower case initial for markup that is an "optional extra" and an upper case initial when older browsers should display the markup literally.
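The capital-initial rule is simple enough to state as code. The following is only a minimal sketch in Python; the function names is_forcing and may_ignore are invented for illustration and are not part of any UTD specification.

```python
def is_forcing(element_name: str) -> bool:
    """Return True if the element name begins with a capital letter."""
    # Only A-Z is checked, since element names use the English alphabet.
    return bool(element_name) and "A" <= element_name[0] <= "Z"


def may_ignore(element_name: str) -> bool:
    """A browser may drop non-forcing markup and render its content only."""
    return not is_forcing(element_name)
```

With these definitions, is_forcing("Heading") is true, whereas may_ignore("money") and may_ignore("toc") are true, which matches the examples given in the text.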
For several elements, we might almost as well have defined them as non-forcing, but making them forcing helps authors in avoiding verbal explanations that would be unnecessary in several browsing situations. Since e.g. {Note} is forcing, an author need not, and should not, start its content with "Note:" or "Bemerkung" or anything like that but rely on browsers inserting something like that if they can't do any better. And since {toc} is not forcing, an author should write a suitable heading if needed or desired.
Browsers are required to honor nesting of forcing markup. Thus, for example, an {em} element inside a heading must be shown as different from other text inside a heading. In practice, browsers will probably have separate handling for just a few common ways to nest forcing markup, and anything exotic will presumably be treated using the "literal fallback" mentioned above.
The half-humorous choice of the term "forcing" reflects the frequency of questions of the type "How do I force (a browser to...)", to which the correct answer is "you don't", in discussions about HTML authoring. There were a few "forcing" elements in HTML 2.0, such as em ('emphasis') and strong ('strong emphasis'), about which the specification said that they shall be presented as different from normal text and from each other. This has been violated by a few browsers, and the statement has been omitted from newer specifications. Contrary to such trends, UTD defines quite a lot of markup as "forcing", though only for good reasons. Of course, "forcing" does not mean forcing a browser to use any particular presentation, such as a particular font, color, or indentation. Nor does it prevent the user of a browser from configuring it to ignore "forcing" markup; the user is expected to be able to do such things, and to know what he is doing.
UTD has a reference representation format which consists of lines of text containing markup somewhat similar to HTML. That representation uses one of the alternative formats of explicit markup, namely a format with unabridged element and attribute names, e.g. {person(language:fi) Jukka Korpela} where the braces { and } delimit an element containing an element name (here person), an optional attribute list in parentheses, and element content, which is just a plain string in this case. The attributes are specified as name:value pairs. (For some elements, there is an attribute with a default name, so the value could be used alone then.) The example would correspond to the following XML markup syntactically:
<person language="fi">Jukka Korpela</person>
The difference between the syntactic formats is of course relatively small in a sense, but the UTD format, in addition to being slightly more compact and natural-looking, avoids the somewhat unnatural "end tags" and the widespread confusion between tags and elements.
But the implied "real" UTD format is an abstract document tree. For example, a document may consist of a sequence of sections, each of which consists of a heading and one or more paragraphs, each of which has plain text as the "leaves" of a tree. The order of the leaves is significant, in the general case. Although the format is specified in terms of the reference representation here, this is just a convenience, and software that performs non-trivial processing of UTD documents is expected to operate on the abstract document tree. (This should be evident for any processing that resembles the display of an HTML document under the influence of a CSS style sheet, or the manipulation of a Web page with JavaScript code.)
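To make the step from the reference representation to a document tree more concrete, here is a rough sketch of a parser in Python. It is only an illustration under several assumptions: the names Element, parse, and to_xml are invented here; only the brace syntax with name:value attributes and the \{ escape is handled; the colon-terminated form, the optional end(name) suffix, escaped characters inside attribute values, default-named attributes, and implicit paragraphs are all left out; and mapping a "dummy" element to span in the XML output is an arbitrary choice.

```python
import re
from dataclasses import dataclass, field
from xml.sax.saxutils import escape, quoteattr

_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


@dataclass
class Element:
    name: str                                     # "" for a "dummy" element
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)  # str or Element items


def parse(text: str) -> list:
    """Parse explicit markup into a list of str and Element nodes."""
    nodes, _ = _parse_content(text, 0, top_level=True)
    return nodes


def _parse_content(text, pos, top_level=False):
    nodes, buf = [], []

    def flush():
        if buf:
            nodes.append("".join(buf))
            buf.clear()

    while pos < len(text):
        if text.startswith("\\{", pos):    # \{ is a literal left brace
            buf.append("{")
            pos += 2
        elif text[pos] == "{":
            flush()
            element, pos = _parse_element(text, pos + 1)
            nodes.append(element)
        elif text[pos] == "}" and not top_level:
            flush()
            return nodes, pos + 1          # consume the matching right brace
        else:
            buf.append(text[pos])
            pos += 1
    if not top_level:
        raise ValueError("unclosed element")
    flush()
    return nodes, pos


def _parse_element(text, pos):
    match = _NAME.match(text, pos)
    name = match.group() if match else ""
    pos += len(name)
    attrs = {}
    if name and text.startswith("(", pos):
        close = text.index(")", pos)       # sketch: no escaped ")" in values
        for pair in text[pos + 1:close].split(","):
            if pair.strip():
                key, _, value = pair.partition(":")
                attrs[key.strip()] = value.strip()
        pos = close + 1
    if pos < len(text) and text[pos].isspace():
        pos += 1                           # skip one separating whitespace
    children, pos = _parse_content(text, pos)
    return Element(name, attrs, children), pos


def to_xml(nodes) -> str:
    """Serialize a parsed tree using XML syntax."""
    out = []
    for node in nodes:
        if isinstance(node, str):
            out.append(escape(node))
        else:
            tag = node.name or "span"      # assumption: dummy maps to span
            attrs = "".join(f" {k}={quoteattr(v)}" for k, v in node.attrs.items())
            out.append(f"<{tag}{attrs}>{to_xml(node.children)}</{tag}>")
    return "".join(out)
```

For example, to_xml(parse("{person(language:fi) Jukka Korpela}")) gives the XML form of the person element shown earlier. The point is that later processing (display, indexing, conversion) works on the list of Element nodes and text chunks, not on the character string.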
A "dummy" element with no element name or attributes is allowed, but the opening brace must be followed by a space then, e.g. { Hello world}, unless the element is completely empty: {}. Effectively it acts just as a syntactic grouper, usually to be used to avoid having words interpreted as separate elements. For example, the list {List Hello world stop} contains the three items Hello, world, and stop, whereas the list {List { Hello world} stop} contains the two items Hello world and stop. A "dummy" element can also be used to associate some attributes with some data, e.g. { (base:ox) oxen}. This corresponds to span and div in HTML but makes the semantic emptiness more obvious.
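The grouping effect of a "dummy" element in a list can be sketched as follows. This is a hypothetical helper in Python, not a defined part of UTD: list_items takes the content of a {List ...} element as a string, splits it at whitespace that is outside any braces, and unwraps items of the form { ...}. Escaped braces and attributes on the dummy element are not handled in this sketch.

```python
def list_items(content: str) -> list[str]:
    """Split list content into items, keeping braced groups together."""
    items, current, depth = [], [], 0
    for ch in content:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch.isspace() and depth == 0:
            # Whitespace outside braces ends the current item.
            if current:
                items.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        items.append("".join(current))
    # Unwrap dummy groupers of the form "{ ...}".
    return [
        item[2:-1] if item.startswith("{ ") and item.endswith("}") else item
        for item in items
    ]
```

Applied to the two lists above, list_items("Hello world stop") yields three items and list_items("{ Hello world} stop") yields two.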
The UTD format has plain Ascii text as a special case. That is, any file containing just US-Ascii text, with no headers or markup, is technically a UTD document. In that case, the implied document structure is trivial: it is a tree with one leaf only, a (possibly very bulky) string of characters. (It could contain substrings that are markup in the UTD format when enabled, but they are treated as plain characters in such a case.)
There's just one technical restriction: such a plain text must not begin with a line beginning with a string of alphanumeric characters followed by a colon. The reason for the restriction is that this way we can distinguish plain text from Internet messages such as E-mail messages, some of which are also a special case of the UTD format! The restriction is fairly strong, to make sure that any data that begins with an Internet message header is taken as such a message (with some recognizable internal format), not as plain text. Thus, if your plain text file would accidentally start with such a line, you could e.g. add an empty line before it.
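The check that separates plain text from data beginning with an Internet message header could look like the sketch below. The function name starts_with_message_header is invented for illustration. Note one assumption: hyphens are accepted in the header name in addition to alphanumeric characters, because the sample document above starts with Content-Type, which a strictly alphanumeric test would not recognize.

```python
import re

# First line starts with letters, digits, or hyphens followed by a colon.
# Allowing the hyphen is an assumption made so that "Content-Type:" matches.
_HEADER_LINE = re.compile(r"[A-Za-z0-9-]+:")


def starts_with_message_header(data: str) -> bool:
    """Return True if the first line looks like a message header line."""
    first_line = data.split("\n", 1)[0]
    return _HEADER_LINE.match(first_line) is not None
```

A plain text file starting with a line like "Note: remember the milk" would be taken as a message by this test, which is exactly why the restriction exists; putting an empty line first makes the function return False.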
Consequently, you could, assuming that some minimal support for UTD will exist on the WWW, start even with a plain text file, declare it (in HTTP headers, typically to be generated by a server on the basis of some simple rules that map file name extensions to Internet media types) to be text/utd, and add UTD markup to it as desired. The practical point is that the address (URL) of the document could remain the same, so that other documents, bookmarks etc. that refer to it need not be changed when you decide to move to UTD format proper.
The interpretation of a UTD document, as a mapping from a string of characters to an abstract structure, depends on the parsing mode.
The default parsing mode is the following:
- [...] {Paragraph} element.
- [...] {Doc} and a heading for it. More specifically, the first of such constructs constitutes a {Heading} element for the entire document; otherwise such a construct closes any open 2nd level {Doc} element and opens a new one and constitutes a {Heading} element for it. This means that the first "isolated line" is the overall heading for an element, and other "isolated lines" are 2nd level headings, implicitly defining a "section" structure.
- \{ is taken as indicating the left brace { itself as a data character.
- { immediately followed by a letter of the English alphabet is taken as constituting the start of explicit markup, processed according to separate rules.
There are several formats of explicit markup, intended for use in different practical situations. They are best illustrated by examples:
some text {person Jukka Korpela} some text
some text {p Jukka Korpela} some text
some text {person(language:fi) Jukka Korpela} some text
some text {person(
  language: fi,
  reference: http://jkorpela.fi/personal.html,
  reference-content-language: en-US,
  identifier: signature)
  Jukka Korpela} some text
some text {p(
  lang: fi,
  ref: http://jkorpela.fi/personal.html,
  ref-lang: en-US,
  id: signature)
  Jukka Korpela} some text
Note that the format where each attribute appears on a line of its own is suitable for elements with several longish attributes, and it deviates from the more concise format only in its use of whitespace.
An attribute value may contain any characters except a comma, a left parenthesis, or a backslash. If such characters should appear in a value, they must be immediately preceded by a backslash \.
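The escaping rule for attribute values can be sketched in a few lines of Python. The helper names escape_attr_value and unescape_attr_value are invented for this illustration; the set of special characters is taken directly from the rule above.

```python
# Characters that must be preceded by a backslash inside an attribute value.
_SPECIAL = ",(\\"


def escape_attr_value(value: str) -> str:
    """Put a backslash before each comma, left parenthesis, or backslash."""
    return "".join("\\" + ch if ch in _SPECIAL else ch for ch in value)


def unescape_attr_value(escaped: str) -> str:
    """Reverse escape_attr_value: drop each escaping backslash."""
    out, i = [], 0
    while i < len(escaped):
        if escaped[i] == "\\" and i + 1 < len(escaped):
            i += 1                  # skip the backslash, keep the next char
        out.append(escaped[i])
        i += 1
    return "".join(out)
```

For instance, a value such as "Espoo, Finland" would be written with a backslash before the comma, and unescaping the result gives back the original string.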
All of the markup examples above fundamentally map to the same abstract structure, containing a {person} element (between chunks of text), though partly with different attributes attached to it. Note that the last two examples are completely equivalent; their difference is just that the latter uses more compact names for the element and the attributes.
This is intended to be practical to read and write as well as simple enough for quick parsing. Although UTD is expected to be generated using various tools, it should always be possible to write, read, and edit it "by hand". Simple, quick parsing is relevant, since it is expected that bulky amounts of UTD format data will be processed by various programs that extract just the textual content (e.g., for indexing or searching purposes) or pay attention just to some markup.
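As an illustration of how little is needed to extract just the textual content, consider the following sketch in Python. The function name text_content is invented here, and the code makes simplifying assumptions: it drops every element opener (name, optional attribute list, and one following whitespace character) and every right brace, it does not handle escaped characters inside attribute values, and it would also drop a stray right brace appearing in plain text.

```python
import re

# Opening part of an element: "{", an optional name with an optional
# "(attributes)" list, and one optional whitespace character.
_OPEN = re.compile(r"\{(?:[A-Za-z][A-Za-z0-9_-]*(?:\([^)]*\))?)?\s?")


def text_content(utd: str) -> str:
    """Return the text of a UTD fragment with explicit markup removed."""
    out, pos = [], 0
    while pos < len(utd):
        if utd.startswith("\\{", pos):
            out.append("{")                 # \{ is a literal left brace
            pos += 2
        elif utd[pos] == "{":
            pos = _OPEN.match(utd, pos).end()
        elif utd[pos] == "}":
            pos += 1                        # closing brace of an element
        else:
            out.append(utd[pos])
            pos += 1
    return "".join(out)
```

An indexing program could feed the result to its ordinary word splitter; a program interested in some markup only would instead test the element name at each opening brace.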
Explicit markup always begins with a left brace
followed by an element name.
The name may contain letters of the English alphabet, digits,
hyphens, and underlines, and it is case sensitive, e.g.
{super}
is distinct from {Super}
(as needed for distinguishing "forcing markup").
If the name is immediately followed by a colon, then the markup structure
extends to the end of the line. Otherwise it extends to the next matching
right brace, which can optionally be followed by the word end followed
by the element name in parentheses.
("Matching" is hopefully an intuitively clear
concept here, relating to the possibility of nesting markup structures.
In a formal specification, it will have to be explained of course.
Question: do we really need the optional, comment-like repetition
of the tag name?)
The attributes can be given in a few different ways which are hopefully
obvious from the examples.
The use of colon rather than equals sign is intended to make the markup look more natural. But should the punctuation be entirely omitted?
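The rules above are simple enough to sketch as a small parser. The following is only an illustration, not a reference implementation: the names `parse_element`, `_find_unescaped`, and `_parse_attributes` are made up for this example, every unescaped left brace inside the content is assumed to open a nested structure, and the colon form (markup extending to the end of the line) and the optional `end(name)` suffix are not handled.

```python
import re

# Element names: letters of the English alphabet, digits, hyphens, underlines.
NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def _find_unescaped(text, start, stop):
    """Return the index of the first `stop` not protected by a backslash."""
    i = start
    while i < len(text):
        if text[i] == "\\":
            i += 2  # skip the backslash and the character it protects
        elif text[i] == stop:
            return i
        else:
            i += 1
    raise ValueError(f"missing {stop!r}")


def _parse_attributes(raw):
    """Split 'name: value, name: value' on unescaped commas."""
    items, current, i = [], "", 0
    while i < len(raw):
        if raw[i] == "\\" and i + 1 < len(raw):
            current += raw[i + 1]  # an escaped character is plain data
            i += 2
        elif raw[i] == ",":
            items.append(current)
            current = ""
            i += 1
        else:
            current += raw[i]
            i += 1
    items.append(current)
    attrs = {}
    for item in items:
        key, _, value = item.partition(":")  # only the first colon separates
        if key.strip():
            attrs[key.strip()] = value.strip()
    return attrs


def parse_element(text):
    """Parse one '{name(attributes) content}' element at the start of text."""
    match = NAME.match(text, 1) if text.startswith("{") else None
    if match is None:
        raise ValueError("expected a left brace followed by an element name")
    name, pos, attrs = match.group(), match.end(), {}
    if text[pos:pos + 1] == "(":
        close = _find_unescaped(text, pos + 1, ")")
        attrs = _parse_attributes(text[pos + 1:close])
        pos = close + 1
    depth, i = 1, pos  # depth 1: the opening brace has been consumed
    while i < len(text):
        if text[i] == "\\":
            i += 2
            continue
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return name, attrs, text[pos:i].strip()
        i += 1
    raise ValueError("no matching right brace")
```

With this sketch, `parse_element("{person(language:fi) Jukka Korpela}")` yields the name `person`, the attributes `{"language": "fi"}`, and the content `Jukka Korpela`. A value such as a URL keeps its own colons, since only the first colon of each attribute is treated as the separator.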
Somewhat analogously with the principle "use as much markup as you like
or expect to be useful", browsers are allowed to have genuine
implementation for just some parts of UTD markup, showing "raw UTD"
for some forcing markup when needed. The most typical example is
the use of subscripts and superscripts: a browser that cannot
display them properly, e.g. because it operates on a simple display
device ("character cell browser"), is allowed to display e.g.
{Super 2}
literally. A browser could alternatively
display similar notation with UTD element names replaced by expressions
in a suitable language, provided that this is made consistently
and without causing ambiguity.
When a browser displays "raw UTD" that way, it should make a reasonable effort to indicate that the displayed data is to be taken in some special way. If possible, monospace font should be used, reflecting the code-like nature of the text. Moreover, that text should otherwise appear special too, e.g. using a slightly different background color or a thin border around it.
In the extreme, a browser could use such a method for all forcing markup, or even for all markup since the method is allowed for all markup (though recommended only as the last resort). This means that ultimately even a simple program that displays the content literally is a "UTD browser"!
There are, however, two exceptions to the
rule: the {Revealable}
markup
must be recognized and honored,
and so must {Seq}
.
This implies that a program needs to
have at least a simple UTD parser.
Despite the richness of markup that you can use in UTD as defined here, it is to be expected that new markup will be added, either in new versions or (often as a preliminary to that) into variants (dialects) of UTD for special usage.
There is a lot of talk about extensibility in the XML framework. It's mostly misleading, since real extensibility is much more than just a syntax issue. Being SGML based, XML isn't really flexibly extensible. UTD, on the other hand, can be designed differently.
The old de facto principle in HTML is that browsers should
ignore tags and attributes that they do not recognize. This means that
upon seeing <foo>
zap</foo>
an HTML processor ignores the tags
<foo>
and </foo>
,
processing just the content between them. This has worked to some
extent when HTML has been extended.
Technically, UTD has no tags in the same sense as HTML, but the
same principle can easily be formulated in terms of elements,
of course.
But the principle does not always apply; we might wish to add
forcing markup too.
We have chosen the principle of making the case of the first letter in an element name significant. This may sound arbitrary and primitive, but it is not counter-intuitive (capital initial often indicates emphasis or importance), and it's good for simple extensibility. Normally new elements should be added so that they are not forcing, but if needed, they can be made forcing, relying on older browsers using the simple fallback of displaying the markup. Naturally, especially the names of forcing markup elements should be chosen carefully, to make them understandable to human readers too.
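The fallback for unrecognized elements can be sketched in a few lines. This is an illustration under the assumption that a capital initial letter marks forcing markup (as with {Super} versus {super}); the function names `is_forcing` and `render_unknown` are hypothetical.

```python
def is_forcing(element_name):
    """Assumed convention: a capital first letter marks forcing markup."""
    return element_name[:1].isupper()


def render_unknown(element_name, content):
    """Fallback for an element the program does not recognize."""
    if is_forcing(element_name):
        # Forcing markup must not be dropped silently: show the raw UTD.
        return "{" + element_name + " " + content + "}"
    # Non-forcing markup is ignored, and only the content is processed.
    return content
```

Here `render_unknown("Super", "2")` gives the literal `{Super 2}`, whereas `render_unknown("foo", "zap")` gives just `zap`, mirroring the old HTML practice of ignoring unknown tags.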
There is still a problem: markup extensions could be defined
independently of each other, and there might be clashes, so that
an element name has different meanings in different extensions.
A program that recognizes extended markup would get things all
wrong in such cases.
Therefore, any document that uses extended markup must
identify the extensions used.
For this purpose, the {extensions}
element
is defined.
Its value is taken as a string that identifies a set
of extensions. A centralized registry of such strings should be
established. The element may additionally contain Ref
elements that will be taken as pointing to documents that describe
the set of extensions (syntactically and semantically).
In addition to the simple rule that any US-ASCII plain text can be
regarded as being in UTD format, a text document in any encoding
can be made to comply formally with UTD by inserting
Content-Type: text/plain; charset=...
at its very beginning, optionally followed by other Internet message
headers and UTD specific headers. If such headers are omitted,
the separating empty lines are still required. Thus, you minimally
need to insert the Content-Type
header and two empty lines.
A UTD document proper (i.e., if it is not a plain text file) may
start with a block
of Internet message headers, followed by an empty line before the
actual content. Some of such headers
(e.g., Content-Type
)
specify characteristics of
the data representation, such as character encoding; some of them
(e.g., Keywords
)
serve purposes similar to UTD markup in a {meta}
element
and can in fact be mapped to such elements;
and some of them
(e.g., Cache-control
)
are specific to the use of UTD documents in a Web
environment.
This approach allows very simple processing of UTD documents by HTTP servers (such as Web servers) in the following sense: a server, when requested to send a document that happens to be a UTD document, could just read the first lines up to the first empty line, merge them with any default HTTP headers it has been configured to send and then send the headers and the rest of the document as body. This would let authors make servers send document-specific HTTP headers (such as headers related to cache control, content negotiation, etc.) in a very natural way, with no separate configuration files. (Servers could still filter out some of the headers, if desired.)
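The server-side processing described above might look roughly like the following sketch. The names `split_headers` and `build_response` are invented for illustration, folded (multi-line) headers are not handled, and the choice that a document header replaces a server default of the same name is an assumption, not something this proposal prescribes.

```python
def split_headers(document):
    """Split a UTD document at the first empty line into headers and body."""
    head, _, body = document.partition("\n\n")
    headers = []
    for line in head.splitlines():
        name, _, value = line.partition(":")
        headers.append((name.strip(), value.strip()))
    return headers, body


def build_response(document, server_defaults):
    """Merge the document's own headers with the server's default headers."""
    doc_headers, body = split_headers(document)
    # Header names are compared case-insensitively.
    merged = {name.lower(): (name, value) for name, value in server_defaults}
    for name, value in doc_headers:
        merged[name.lower()] = (name, value)  # assumption: the document wins
    return list(merged.values()), body
```

A server that wants to filter out some author-supplied headers could do so by dropping entries from `doc_headers` before the merge.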
The question arises what happens when there is a conflict
between such headers and other information, namely the HTTP headers
sent by a server or the UTD markup in the document. This is resolved
by giving absolute preference to HTTP headers and by giving
preference to UTD markup over the header lines in the document itself.
For example, a UTD document could contain a Subject
header line, and it would define the default
"citation title" for the document, but this default can be overridden
by an explicit title
element. The reason for this is
that it lets you take an Internet message, such as an E-mail message,
and declare it as a UTD document, then start adding UTD markup into it
if desired, without bothering about issues like the exact meanings of
Internet message headers or editing or removing them; instead, you would
just write suitable UTD markup and expect it to override any message
headers that would conflict with it.
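The order of preference can be stated compactly in code. In this minimal sketch the function name `effective_value` and its parameters are hypothetical, and a missing source is represented by `None`.

```python
def effective_value(http_header=None, utd_markup=None, document_header=None):
    """Pick a value by precedence: HTTP header, then UTD markup, then the
    header line inside the document itself."""
    for candidate in (http_header, utd_markup, document_header):
        if candidate is not None:
            return candidate
    return None
```

For the citation title, for instance, `effective_value(utd_markup="UTD proposal", document_header="Re: my note")` selects the explicit title element over the Subject header line.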
A program that receives a UTD document
via Internet protocols such as the E-mail protocol or HTTP
and stores it locally
is expected to store it
with the headers.
If it is a Web page, the program should
add a Content-Location
header if not already present. This would solve
the problem that a locally saved document cannot be properly viewed
since relative URLs don't work, the character encoding is unknown, etc.
Note that an E-mail message or a Usenet article as such could be treated
as a UTD document, if its content is plain text (or text/utd
).
Thus, to publish such material on the Web, you
could first simply put it onto a Web server and indicate it as
being text/utd (in a server-specific manner, typically via an
association of a file name suffix with a media type); later, if
desired, edit it by adding markup (after changing the content type
of the content
to text/utd
if needed).
The headers, if present, must be in the general format defined in RFC 2822 (the successor of RFC 822).
text/utd
The proposed media type
text/utd
has optional parameters
version
and charset
. Thus, it could be used
in Internet message headers as follows:
Content-Type: text/utd; version=...; charset=...
with version
defaulted to 1
and charset
to ISO-8859-1
. General-purpose
software for processing UTD shall be able to process US-ASCII,
ISO-8859-1, UTF-8, and ISO-10646-UCS-2 at least. This does not mean that it
would need to be able to display all the characters representable in
those encodings; basically it just needs to process the encoding itself.
(Note: Special rules are needed to allow the entire data set to be
Unicode encoded.)
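As a rough illustration of how the parameters and their defaults could be read from a header value, consider the sketch below. The function name `utd_parameters` is made up, and the parsing is deliberately naive: quoted parameter values that contain a semicolon are not supported.

```python
def utd_parameters(content_type):
    """Return the parameters of a text/utd Content-Type value, with the
    defaults version=1 and charset=ISO-8859-1 filled in."""
    media_type, *params = [part.strip() for part in content_type.split(";")]
    if media_type.lower() != "text/utd":
        raise ValueError("not a text/utd document")
    result = {"version": "1", "charset": "ISO-8859-1"}
    for param in params:
        key, _, value = param.partition("=")
        if key.strip():
            result[key.strip().lower()] = value.strip().strip('"')
    return result
```

A bare `text/utd` thus resolves to version 1 in ISO-8859-1, and any parameter given explicitly replaces the corresponding default.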
The following is just a tentative idea, and perhaps
an unnecessary complication:
There is a special rule to facilitate the use of rich character
repertoires within the framework of 8-bit encodings
(such as ISO-8859-1): The markup
{Offset(base) content}
,
where base is a hexadecimal number,
specifies that within content,
any octet (8-bit byte) n except 7D hexadecimal
(the code for }
in Ascii)
is to be interpreted as referring to the Unicode character
base+n.
This is remotely analogous with "code page switching" but means
switching between various 256 character chunks in Unicode.
For example,
{Offset(0350) abc}
means the sequence of the
Greek letters alpha, beta, and gamma. This way, you could write
text in Greek, Cyrillic, and other non-Latin alphabets using
a "normal" keyboard and looking at the equivalents from a table.
This is somewhat awkward, but it would make it easier to insert
small pieces of texts in non-Latin alphabets.
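The arithmetic of this tentative rule is easy to check with a short sketch. The function name `decode_offset` is hypothetical, and the content octets are assumed to have been delimited already (so the terminating right brace is not among them). Taking the rule base+n literally, the octets for a, b, c (61, 62, 63 hexadecimal) reach U+03B1, U+03B2, U+03B3 when the base is 0350 hexadecimal.

```python
def decode_offset(base_hex, octets):
    """Interpret each octet n as the Unicode character at base + n."""
    base = int(base_hex, 16)
    return "".join(chr(base + n) for n in octets)


# 0x0350 + 0x61 = 0x03B1, Greek small letter alpha, and so on.
greek = decode_offset("0350", b"abc")
```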
Some Internet message headers, when present in a UTD document, are treated as equivalent to some UTD markup. This feature is present mainly because it allows smoother transition from Internet messages to UTD format proper.
Header | Markup |
---|---|
Content-Class | Document-class |
Content-Language | language |
Content-Location | Master |
Keywords | Keywords |
Subject | Title and Heading |
Summary | Abstract |
Note: A Subject
header specifies the default content for
both Title
and
Heading
. It can be overridden by explicit
Title
and
Heading
elements.
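The table and the overriding rule could be applied along the following lines. This is a sketch only: the names `HEADER_TO_MARKUP` and `default_markup` are invented, and the set of explicitly marked-up elements is assumed to have been collected by a parser beforehand.

```python
# Header name (lowercased) -> markup that it provides a default for.
HEADER_TO_MARKUP = {
    "content-class": ["Document-class"],
    "content-language": ["language"],
    "content-location": ["Master"],
    "keywords": ["Keywords"],
    "subject": ["Title", "Heading"],
    "summary": ["Abstract"],
}


def default_markup(headers, explicit):
    """Derive default markup values from message headers.

    headers:  list of (name, value) pairs from the document's header block
    explicit: set of names already given by explicit UTD markup
    """
    defaults = {}
    for name, value in headers:
        for element in HEADER_TO_MARKUP.get(name.lower(), []):
            if element not in explicit:  # explicit markup overrides headers
                defaults[element] = value
    return defaults
```

With a Subject header and an explicit Title element in the document, only the Heading default is taken from the header.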
Motto: Hypertext should be more than text, not less.
If you start with a plain ASCII text document and wish to convert
it to UTD,
you could add the line
Content-Type: text/utd; markup=0
and an empty line at the start. The parameter
markup=0
specifies that
no explicit markup is used and the left brace character is to be taken
as such. You should then check that the intended structure of the
document corresponds to the default parsing rules.
To add more structure, you would remove the markup=0
parameter and
precede any occurrence of the left brace character { by
a backslash \. Then you can add explicit markup as desired.
Note that implicit markup rules still apply.
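The two steps can be sketched as a small conversion routine. The name `plain_to_utd` and its `markup` flag are hypothetical, and the sketch assumes US-ASCII input; how backslashes already present in the text should be treated is left open here.

```python
def plain_to_utd(text, markup=False):
    """Wrap plain ASCII text as a UTD document.

    markup=False: declare markup=0, so that left braces are taken as such.
    markup=True:  escape every left brace, so explicit markup can be added.
    """
    if not markup:
        return "Content-Type: text/utd; markup=0\n\n" + text
    return "Content-Type: text/utd\n\n" + text.replace("{", "\\{")
```

The first form is enough to declare an existing text as UTD; the second prepares it for adding explicit markup by hand.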
If there is something in the document that should be
kept "as is",
without applying the default parsing rules, as plain text, use
{Text}
markup.
Typically, you would put
{Text
before a sequence of lines and
}
after it. Note that explicit markup is recognized inside such markup.
To prevent that, use the plain
attribute, which takes no value:
{Text(plain) ...}
.
The block of text inside a Text
element is treated
as a sequence of characters and line breaks (effectively treating
the latter as "control characters") to be displayed as such with
respect to the use of spaces and line breaks. Thus, it roughly
corresponds to
the pre
element in HTML.
(The {Text}
content need not and normally
should not be presented in monospace font, and the effect of
using horizontal tab characters is strictly undefined. For C source
code, for example,
you would thus probably want to use both {Text}
and {code(notation:c)}
, nested either way.)
Otherwise in UTD, any sequence of characters
in the textual content of a document is treated as reformattable, in
order to fit different presentation needs such as varying canvas width.
(The exact rules for reformatting as regards line breaks are to be
defined, hopefully in a manner which is simpler and more practical
than the Unicode line breaking rules. Note the possibility of
using explicit no-break space and other special characters.)
The {Message}
element has an Internet message
as its content. Logically, this implies that the content is quoted
in some sense, and thus no explicit {quote}
markup is needed.
The content is expected to contain
message headers followed by an empty line
followed by a body which will normally be treated as plain text and
as if there were {Text(plain) ...}
markup around it.
Irrespective of the character encoding of the document,
any ISO 10646 character can be entered (as a data character
or in an attribute value) as
{char(xxxx)}
where xxxx is the code position in ISO 10646
(and in Unicode), in
hexadecimal. The element name char
can be abbreviated
as c
.
The element may have content, which then
specifies the surrogate to be used
in place of the character when
the character itself cannot be presented,
due to font restrictions or other reasons. Example:
{char(2014)-}
Typically, the surrogate is a string of US-Ascii characters,
but it could be something more complicated, even involving
character references, perhaps with their own "fallback" contents, etc.
In particular, the surrogate could be a small image (with a
textual alternative as fallback).
Note that it generally depends on the context what surrogate
is best, so an author should have a say. For example,
it might be adequate to use plain "o" as a fallback to "ö",
or "oe", or it might be so that neither of these is acceptable.
Moreover, there is special markup for specifying surrogates
for unpresentable characters in general, irrespective of the
way in which the characters have been entered into a UTD document.
The markup has the form
{fallback character surrogate}
.
Examples:
{fallback {char(e4)} ae}
{fallback ö o"}
Separate "libraries" of such rules can be written, to be used
via the {include}
mechanism. However, authors should
avoid specifying fallbacks without a particular reason. For example,
if a document is in German, one normally shouldn't include
fallback rules for the non-Ascii characters used in German, partly
because they normally shouldn't be a problem to UTD browsers,
partly because the choice of best fallbacks (when needed) should
normally be done client-side. But there might be a special reason
to suggest a particular fallback in a specific context.
For example, if you use capital omega as a symbol for the unit ohm,
you should normally specify a fallback that uses the unit name,
rather than let browsers use a fallback for capital omega.
When a browser uses such surrogates, it should give some hint of doing so, if this is possible without disturbing normal reading too much, using e.g. some light coloring. A browser that has such a feature must allow the user to switch it off. On the other hand, such a browser could, for example, provide the option of getting explicit information like the Unicode character name and code displayed as a "tooltip"-like note on mouseover.
For a large number of widely used characters, symbolic alternatives
to the use of numeric codes
exist. They have been defined formally as possible values for
the attribute of {char}
, e.g.
{char(omega)}
. For them as well, the optional content of the
element can be used to specify a surrogate; the empty element
{}
can be used as the surrogate, to indicate that
if the character itself cannot be shown, nothing should be shown
instead.
Note that the surrogate should be selected by the author according
to the intended meaning and context of the character.
Most SGML entity names are available, but for many of them,
more verbose and explanatory longer synonyms will be defined.
So as an alternative to a half-cryptic {char(agr)}
,
{char(alpha)}
could be used.
In the absence of a specified surrogate (i.e., content in a
character reference element), if a browser cannot display the
character, it should either
display the markup literally,
preferably
using equivalent, more mnemonic markup (like
{char(alpha)}
instead of
{char(agr)}
or
{char(03B1)}
).
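The display rules for numeric character references described above can be summarized in a short sketch. The function name `render_char` and the `can_display` callback are assumptions made for this example; a surrogate of `None` stands for a reference without content, and an empty string stands for the empty element used as surrogate.

```python
def render_char(code_hex, surrogate, can_display):
    """Render {char(code_hex)surrogate} for a given display capability.

    can_display: function telling whether the device can show a character
    surrogate:   the element content, "" for the empty element, or None
    """
    character = chr(int(code_hex, 16))
    if can_display(character):
        return character
    if surrogate is not None:
        return surrogate  # author-specified replacement, possibly empty
    return "{char(" + code_hex + ")}"  # no surrogate: show the markup literally


def ascii_only(ch):
    """A hypothetical character cell device limited to US-ASCII."""
    return ord(ch) < 128
```

On such a device, `render_char("2014", "-", ascii_only)` produces a plain hyphen, while a reference without content falls back to the literal markup.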
The initial design of UTD allowed the use of
{alpha}
etc., making the mnemonic names element names.
The principle chosen for distinguishing markup as
forcing made this impossible in practice, though.
Doc
The top-level element - the root of the document tree - is always
a Doc
element. (Obviously, the name comes from "document";
there are particular reasons for naming the element that way, as
a truncated form of a common word.)
A Doc
element consists of the following,
in order:
- Meta element
- Heading element (which may optionally contain Subheading elements that indicate parts of the heading that are less important than the rest of it)
- {Introduction} element and/or any number of optional {Motto} elements
- Doc elements, corresponding to major structural parts of the document
- {Backmatter} element, which may contain {Credit}, {Appendix}, {Glossary}, {Bibliography}, and {Index} elements in addition to other content such as headings and blocks.
The parts excluding the
Meta
element
constitute the
body of a Doc
element; but note that no
specific markup delimiting the body is used, since UTD tries to
avoid redundant markup.
The Meta
element should contain all the material
that is "metainformation", i.e. information about the document rather
than document content proper. Metainformation includes authorship
information, copyright notice, description of intended audience,
keyword list, date of creation and last modification, etc.
An {Introduction}
element contains material
that prepares for the main presentation.
For example, a formal presentation could have an introduction
that describes the concepts in informal terms, trying to make
it easier to understand the logic behind the formal things.
An {Introduction}
element could also present some
historical background, some general motivation, and
instructions on how to read the material.
Its content is in principle
as for a {Doc}
element, though in practice it's typically
just a sequence of blocks. The difference between an introduction
and a foreword
is sometimes vague, but the basic idea is that a foreword
tells about the creation of the work, whereas an introduction
is part of the work and introduces its topic.
Note: The word "preface" might mean either a foreword or an introduction
or a mixture of the two. It might be in some sense best to allow
markup that classifies some material vaguely as a foreword or as
an introduction. However, in this case, the distinction between
metadata and data proper seems so important that it is "forced" upon
such material: you're supposed to put information about
the document into a {Foreword}
element inside a
{Meta}
element and use the {Introduction}
element for information for actually reading the document.
If you cannot make such a distinction, e.g. if you are marking up
a document that must not be changed contentwise, put the mixed
material into an {Introduction}
.
(The practical reason is that forewords are normally skipped
or read very cursorily by most readers if they ever see them, whereas
introductions should be read first. It is a smaller problem
to have to read some uninteresting material than to miss some
essential introduction.)
Any number of Motto
elements may appear before
or after an Introduction
element. Such an element
specifies a motto, which is intended to be a compact presentation
of some key idea in the content or related to the content.
Quite often a motto is a quotation, but it need not be, and
quotations are to be separately marked up using
Quote
markup.
The {Meta}
element content can be just normal prose if
desired, but it may also contain further structuring that divides its
content into separate parts. Such structuring will be very useful
in different methods of automatic processing, but it is not formally
required. The idea is that e.g. authorship information be preferably
put into {Author}
elements placed inside a
{Meta}
element, but you could include it as normal
body content (and then there's no simple automatic way to recognize
that it is authorship information), or put it inside a {Meta}
element e.g. as a normal paragraph (and then it would be automatically
recognizable as metadata but not specifically as authorship metadata),
or put it inside an
{Author}
element, in which case it
is automatically recognizable as authorship metadata but with an implied
request to display it in a place that corresponds to its
position in the body.
When metadata is inside a {Meta}
element, a browser is expected to display it optionally
(i.e., by user request) in a manner that is part of the browser's
user interface, hence uniform across documents. The idea is that
users should have control over the repertoire and presentation
of metadata. On the other hand, an author can specify
whether metadata is to be also treated as document
content proper, e.g. displayed as part of it in normal presentation.
An element like {Author}
may contain the information
in "free format" as desired, such as
{Author Jukka K. Korpela, jukkakk@gmail.com}
or as a more structured format, such as
{Author {name Jukka K. Korpela}, {email jukkakk@gmail.com}}
or even in a more refined format (with markup that explicitly indicates
which string is the surname, for example).
Moreover, various
formalized information can be added.
For example, the {Audience}
element is metadata that
specifies the set of people for which the document has been created.
You're supposed to say that in everyday language, like
{Audience This document is for people whose age is 18 years or more.}
but at some later point, one might define some standardized,
language-independent formal way of specifying such things, and then
you might enhance the markup to be
{Audience(age: 18-) This document is for people whose age is 18 years or more.}
Note that a browser that does not support the formal way would still
present the prose description to the user, and the user would always
have the option of seeing the prose description too, in addition to
some browser-wide way of presenting the information coded in the
formalized notation (or just filtering out the entire document,
if so instructed).
The {Audience}
markup could probably be used
for indicating e.g. the presence of sexually explicit material
since the main purpose of such metainformation would be related
to the audience.
The Created
and Modified
elements
provide information about the original creation and later
modifications of the document, respectively.
They may contain Author
elements (especially if
the original document author and the editor of the changes are
different persons), time
elements, and other content,
such as a free-format description of the technical creation process
(e.g., programs used). Example:
{Created This document was originally written in
{time August 2001}.}
A Document-class
element specifies the general
type (class) of the document, such as poetry, novel, picture gallery,
or log file.
This is expected to become
very useful when standardized and widely used
categorization has been defined.
For example, search engines
might then look for particular classes of content only, or filter
out some classes. In the meantime, free text descriptions can be
given, e.g.
{Document-class elementary tutorial}
,
and they might be shown to users. When standard
categorization comes into use in the future,
it can be added (in parentheses), and
the free text descriptions can be left there, as giving further
information, or the same information in a different form (and probably
in a natural language which is the same as the document language).
A Keywords
element lists key words and phrases
characterizing the essential content of the document,
separated with commas. It should not be displayed by browsers
except on explicit user command. It can be used by indexing robots,
but it is expected that this element will be of limited usefulness
due to the sad history of "keyword spamming".
The robots
element is included for
compatibility with the Robots Exclusion Standard.
Its content is a list of options to be recognized by
indexing robots.
It corresponds to
<meta name="robots" content="options">
in HTML. Note that "UTD aware" indexing robots could pay attention to
{Master}
elements: if the master URL is different from
the current one and has been visited by the robot, the robot might
quite naturally skip the current document as if it contained
{robots noindex,nofollow}
.
A Source
element specifies the information source
of the content of the document. For example, a Web page containing
a digitized version of a book could, and normally should, contain
a {Source}
element containing a statement of this,
normally with exact bibliographic information.
A {Title}
element
specifies the recommended basic title for citing the
document in other documents, in bibliographic data bases,
in browsers' hotlists and history lists,
etc. When it is inside a {Meta}
element,
it is not intended to be shown
as part of the document in normal situations,
but it could be displayed as a label
for a window displaying the document or as (part of) a page
header or footer when a document is printed.
Multiple
{Title}
elements are allowed primarily in order to let authors specify the
title in several languages.
If several such elements are present in the same language, they should
be considered as alternatives, e.g. so that in a context where
the physical length of citation titles is limited, a program can select
the longest of the alternatives that fits.
Notes: The content of {Title}
is not
limited to plain text, but authors should take into account that
in most usages, the content will be presented in a simple manner,
often using one font only, and often truncated (e.g. to 60 characters).
Truncation, if applied,
must be indicated by the browser e.g. by trailing "...".
In the future, an attribute like use
could be defined, so that an author could specify different
titles for different purposes.
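Selecting among alternative titles could be sketched as follows. The function name `pick_title` is hypothetical; truncating the shortest alternative when none fits is an assumption of this example, and the limit is assumed to be larger than three characters.

```python
def pick_title(alternatives, max_len):
    """Choose the longest same-language title that fits in max_len characters.

    If none fits, truncate the shortest one and mark the truncation
    with trailing dots.
    """
    fitting = [title for title in alternatives if len(title) <= max_len]
    if fitting:
        return max(fitting, key=len)
    shortest = min(alternatives, key=len)
    return shortest[:max_len - 3] + "..."
```

A hotlist with room for 20 characters would thus get the longest alternative of at most 20 characters, and a truncated title is always visibly marked as such.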
{Abstract}
presents the basic content of the document in a compact form. It should
be useful as a standalone description, even without access to the
document as a whole. On the other hand, it should not contain anything
that is not present in the rest of the document.
Browsers should
give users the option of viewing documents with abstracts or without
abstracts, or viewing just abstracts (e.g. when browsing through
search results). An abstract should typically have a length corresponding
to a printed page or less. A typical example is the
abstract of a scientific research paper. Multiple
{Abstract}
elements are allowed primarily in order to let authors specify the
abstract in several languages, but the abstracts could also be written
for different audiences and purposes, in which case an
{Audience}
element should be used
inside the {Abstract}
element,
and browsers should let users select an abstract
from a menu constructed from
{Audience}
elements.
A {Legend}
element contains explanations of notations
used inside an element (typically, the entire document).
A {requirements}
element describes what
information the reader is expected to have in order to understand
the content of the element. It may contain specific suggestions on
sources for such information.
Example:
{requirements {Paragraph The reader is assumed to be familiar
with the basics of HTML and CSS.}}
A
{Topic}
element
specifies the general topic area of the
document. Various formal systems for this are to be defined,
presumably making use of existing classification systems,
so that attributes to the element specify the classification
scheme and classification data.
But in addition to this, the topic can be described in a natural
language in the content of the element.
In addition to assisting classification, relevance of hits in searches,
etc., such markup might help automatic translation and other processing,
especially for small documents, since the translation program could
use topic information in deciding translations for words and phrases.
Cf. the {Context}
markup at text level.
A {Foreword}
element
contains material which
is traditionally written into a foreword in books, such as
explanations of the reasons for writing the document, its creation
history, and acknowledgements.
It is about the document rather than part of the document
content proper.
It need not be the first element in
a document, and its physical presentation may vary. A browser might
display the foreword last, or suppress it, or just indicate to the user
that a foreword is available and provide some mechanism for viewing it.
An {Acknowledgements}
element can be used inside
a {Foreword}
element for saying thanks to different people
and organizations that contributed to the creation of the document.
A {copyright}
element contains a copyright statement
for the document. Since the markup is not forcing,
the content would typically begin with something
like "Copyright © ...".
A {toc}
element contains a table of contents.
Note that multiple {toc}
elements inside an element
are allowed; one could e.g. have a compact table of contents that
lists just the top-level headings, and a more detailed table of contents
with two or more levels of headings, perhaps even with short annotations
on them. Moreover, e.g. a list of figures or list of tables
should be marked up using toc
, since they are
logically (partial) tables of contents.
A {Backmatter}
element
contains material which is not normal document content proper
but not naturally classifiable as metadata either.
Appendixes are a typical example.
One possibility is to simply include such material into
a {Backmatter}
element (which is to appear last in a document),
using normal markup inside it, as in content proper.
But an author can, and normally should, additionally use
{Appendix}
,
{Credit}
,
{Glossary}
,
{Bibliography}
, and {Index}
elements
to indicate the logical role of each part of the back matter.
The meanings of the subelements are probably obvious.
(The {Credit}
element contains expressions of gratitude
for contributions; it should not be used to specify e.g.
the photographer of a picture or the author of some quoted text,
for which the {Author}
markup should be used.)
Note that
{Index}
refers to data that helps in accessing the
content proper of the document. It could typically
be an alphabetic list of
words that are references to occurrences of the words in the content.
It can be visually presented as an index with page numbers
(on print media), or as an index of links, or as a query form.
Doc
elements
As the explanation above says, a
Doc
element
may contain Doc
elements. This is the UTD approach to
sectioning. To divide a document into sections, you put each section
into an inner Doc
element. Thus, each section can
have a heading and other structural parts. And this can be continued
as far as needed, to subsections, subsubsections, etc.,
though it is seldom useful to nest Doc
elements very deep, unless the outermost document happens to be
book-like.
This uniform system implies that we can insert a UTD document
as such into another UTD document,
no matter whether such embedding is done using editors,
authoring tools, servers, or user agents,
provided that some relatively simple transformations are made.
Sections could be written so that they
are not dependent on the level of nesting into which they will be embedded.
This means that there is no need to change h2
elements to
h3
elements as in HTML
when making a document a hierarchic part of another document
which already uses h2
elements for higher-level headings.
This would make it simpler to maintain the same information
in various forms and contexts.
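The following Python sketch illustrates the point about level-independent sections. It models nested Doc elements as plain objects and derives a heading level from nesting depth at presentation time. The class `Doc`, its fields, and the function `heading_levels` are hypothetical names invented for this illustration; none of this is UTD syntax or part of the proposal.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Doc:
    """Hypothetical in-memory model of a Doc element (not UTD syntax)."""
    heading: str
    children: List["Doc"] = field(default_factory=list)


def heading_levels(doc: Doc, depth: int = 1) -> List[Tuple[int, str]]:
    """Return (level, heading) pairs; the level comes from nesting only."""
    result = [(depth, doc.heading)]
    for child in doc.children:
        result.extend(heading_levels(child, depth + 1))
    return result


# A section written on its own, with no level stored anywhere in it.
section = Doc("Lists", [Doc("Tables")])

# Embedding it into a larger document requires no change to the section.
book = Doc("UTD", [Doc("Blocks", [section])])

print(heading_levels(section))  # levels start at 1 when viewed alone
print(heading_levels(book))     # the same section now sits two levels deeper
```

The section object is identical in both calls; only its position in the tree decides how emphatic its heading should be.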
- The transformations needed when physically inserting a document
into another as a subdocument deal with possible differences in
character encoding (to be solved with transcoding),
a possible need to rename elements to make id
values unique,
and the need for inserting or modifying
a {Base}
element, so that relative URLs preserve
their meanings.
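One of these transformations, making id values unique, could be sketched in Python as follows. The function name `rename_ids` and the numeric-suffix scheme are assumptions made for this illustration only; the proposal does not prescribe any particular renaming rule.

```python
from typing import Dict, Iterable, List


def rename_ids(host_ids: Iterable[str], sub_ids: List[str]) -> Dict[str, str]:
    """Map each id of the inserted document to a value unique in the host.

    The "-2", "-3", ... suffix scheme is an assumption of this sketch.
    """
    taken = set(host_ids)
    mapping: Dict[str, str] = {}
    for old in sub_ids:
        new = old
        n = 2
        while new in taken:
            new = f"{old}-{n}"
            n += 1
        taken.add(new)
        mapping[old] = new
    return mapping


print(rename_ids(["intro", "notes"], ["intro", "summary"]))
```

A real tool would also have to rewrite references inside the inserted document that point to a renamed id.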
Note
that despite lack of explicit level indicators in heading markup,
a browser can start rendering a document without parsing it
as a whole, if it uses heading styles "top down". This might not be
optimal behavior, though, since for small documents, unnecessarily
emphatic headings would appear. This can be handled using style sheets,
perhaps in conjunction with LaTeX-like documentclass
indications. (Tools for such indications have not been included
into this proposal. We might e.g. have an attribute for
Doc
, with values like
note
(a short note, typically one paragraph),
letter
(a few paragraphs, usually with one heading only),
article
(typically a few printed pages, often divided
into sections with headings),
report
(longer, often with two levels of headings),
booklet
(e.g., a detailed manual),
book
(roughly corresponding to a printed book), and
library
(a collection of interrelated books).)
The name Doc
was selected since names like
Part
or Chapter
or Section
would have been unnecessarily "loaded" with meanings and
connotations from everyday language. A Doc
can correspond to various levels of structuring, according to nesting.
The name Document
would be fairly neutral, as long as the
general idea of nesting has been understood: a document may contain
documents. But the word "document" typically suggests something
relatively prosaic, mainly verbal, and even documentary, whereas
the Doc
concept is much more general.
The truncation is an attempt at some alienation:
a Doc
element is a document, but in a very broad sense,
which covers poems, tales, advertisements, letters, etc.
Although Doc
elements can be nested as needed,
not all fundamental division into parts is based on them. There's
a basic level of structuring text that is better done in a different
way, using blocks of different kinds. The block concept
is a generalization of a paragraph concept. Although paragraphs could
be viewed just as one level of nesting
Doc
elements, it is more natural to consider them
as basic building blocks above the phrase level.
We don't usually consider an isolated paragraph as something that
could be seen as a "document", whereas
a section, or even a subsubsection, is a different matter.
Well, this is somewhat debatable of course; but my view is that
this approach corresponds to a long tradition of written communication.
The {Paragraph}
element, or
{Par}
element for short, is a basic structuring block
of a neutral kind. It is typically a construct that consists of a few
sentences that belong closely together, perhaps expressing a single
idea, perhaps describing a particular situation or event.
It's basically what normal paragraphs are in books.
Paragraphs as such cannot be nested.
Paragraph markup should be used for pieces of text that logically correspond to something like a paragraph in a book: text that presents some idea, course of events, or some other thing and constitutes the basic structural level above the sentence level. It should not be used just to enclose some text into a block-level container. Contrary to "Strict" versions of HTML, UTD does not require (and does not encourage) that all text be enclosed into such containers. You can, and should, use text as such when no natural container markup applies.
Quite often, a paragraph has some special role that makes it logically different from other paragraphs. For example, a paragraph might discuss some less important subtopic (so that it would be typeset in a smaller font, for example), or it might be the last paragraph in a long sequence and summarize the conclusions drawn from the discussion. Instead of defining an optional attribute that specifies a particular role of a paragraph, UTD separates the basic paragraph markup from markup for such semantic issues (see the section on "general" markup). One reason for that is that although a conclusion, for example, is typically presented as a paragraph of its own, it could well be just a sentence inside a paragraph, or it could span several paragraphs, or be presented as a table.
There are, however, four special paragraph-like elements,
to be used especially for letters and other documents with
identifiable sender and recipient:
{Recipient}
,
{Opening}
,
{Closing}
, and
{Sender}
.
The {Sender}
element specifies the person or
organization that is to be regarded as issuing the document;
it may deviate from the author. (E.g., a document might have been
written by someone, then approved by an organization and sent as
the organization's document.)
The {Recipient}
markup has obvious meaning. For
documents with multiple recipients, a separate
{Recipient}
element should be specified for each of them.
Note: A paragraph is a relatively low-level structure in
a document.
Any point of major discontinuity in time, place, or subject
matter should be indicated by sectioning
(basically, {Doc}
markup)
rather than just paragraph breaks. For example,
in a printed novel, it is customary to divide the text into
"literary paragraphs", with first-line indents and no extra
spacing between paragraphs. Such a paragraph would correspond to
a {Paragraph}
element in UTD, whereas empty vertical
spacing (roughly equivalent to one line, typically) would correspond to
higher-level structure (a {Doc}
element, typically
part of an enclosing {Doc}
element that constitutes
a chapter in the novel).
The {Lines}
element specifies a structure
consisting of logical lines, such as a stanza of a poem,
or a typical log file created by a computer program.
Each line is presented as a
{Line}
element, but the markup for such elements is
implied: within a {Lines}
element, newlines count
as terminating a {Line}
element,
unless they are themselves within
such an element. The {Postal}
element
is similar to {Lines}
but has defined semantics:
it indicates a postal ("snailmail") address.
The {Lines}
element is "text level" in the sense
that it may appear inside a paragraph, for example. However,
consecutive {Lines}
elements are to be treated
and displayed as separate.
A {Line}
element may appear on its own too,
in explicit markup.
In any case, the content of a {Line}
element
is to be treated
as an indivisible
line of text, not split across several physical lines in visual
presentation, and preceded and followed by a line break.
Browsers should provide some mechanism for dealing with excessively long
lines, such as horizontal scrolling or some special notation (to be
explained in the browser documentation, but hopefully intuitive too)
for indicating that a long line has been broken visually.
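The implied Line markup can be illustrated with a small Python sketch. It assumes that the content of a Lines element is available as a string and that nested markup uses braces as in the examples of this document. As a simplification it leaves newlines inside any nested element alone, whereas the rule above only exempts newlines inside an explicit Line element. The function name `split_lines` is hypothetical.

```python
from typing import List


def split_lines(content: str) -> List[str]:
    """Split the content of a Lines element into logical lines.

    Newlines at brace depth 0 end a line; newlines inside nested
    {...} markup are kept (a simplification, see the text above).
    """
    lines: List[str] = []
    current: List[str] = []
    depth = 0
    for ch in content:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth = max(depth - 1, 0)
        if ch == "\n" and depth == 0:
            lines.append("".join(current))
            current = []
        else:
            current.append(ch)
    if current:
        lines.append("".join(current))
    return lines


stanza = "Roses are red\nViolets are blue\n{Line Sugar\nis sweet}"
print(split_lines(stanza))
```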
The {List}
element is a basic constructor with
several variations, selected by the use of the
kind
attribute. The default is kind:ordered
,
which means that the list is an ordered collection of items,
ordered in the sense that the particular order in which the items
appear is significant and not just casual.
There is no explicit markup for the items: the content of
a {List}
element consists just of the items themselves,
in succession, separated by spaces if needed.
Note that
{List A B C}
is a three-item list; to specify a list item that contains a space,
you can use a "dummy" element,
e.g.
{List { A B} C}
A list can contain an optional {Caption}
element that
specifies a heading-like caption for the list
("list header"),
as well as
an optional {Head}
element that contains some
explanations about the list.
Two or more {List}
elements may have name
attributes with the same value. In that case, they are considered
as parts of the same list, especially for the purposes of
eventual item numbering.
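How a browser might continue numbering across lists that share a name can be sketched in Python. The data model (a name plus a list of item strings) and the function `number_items` are assumptions for illustration; a browser is of course free not to number items at all.

```python
from typing import Dict, List, Optional, Tuple


def number_items(
    lists: List[Tuple[Optional[str], List[str]]]
) -> List[List[Tuple[int, str]]]:
    """Number items; lists with the same name continue one sequence."""
    next_number: Dict[str, int] = {}
    numbered = []
    for name, items in lists:
        start = next_number.get(name, 1) if name is not None else 1
        numbered.append([(start + i, item) for i, item in enumerate(items)])
        if name is not None:
            next_number[name] = start + len(items)
    return numbered


parts = [("steps", ["A", "B"]), (None, ["x"]), ("steps", ["C"])]
print(number_items(parts))
```

The unnamed list in the middle starts from 1 again, while the second list named `steps` continues from 3.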
There is no requirement that the presentation of an ordered list be numbered, though it often is. Authors should not include explicit numbers into list elements.
For a {List}
element,
the attribute kind:unordered
specifies that
the list items are in no particular order, i.e. the order in
which they happen to appear in a UTD document is coincidental,
caused by the fact that text needs to be written as linear.
A browser may display the items in any order.
The attribute kind:lateral
specifies that
the list items are "collateral" or logically parallel, to be
viewed simultaneously when possible.
This would apply to
some text and its translation(s), or an image and its caption,
or synchronized multimedia. (Multimedia itself is not part of UTD,
which is a text data format. But UTD may contain references to
non-text media and can be used to "embed" it in some manner,
with some indication of the logical relationships.)
In particular, kind:lateral
could be used
for purposes for which frames are sometimes
used in HTML. The use of frames is almost always a wrong choice,
but UTD lets you make such mistakes, in some sense, and
even use a somewhat frames-like setting in those rare cases
where it makes sense. Note that the typical use of frames
at present is just for showing a table of contents of a site along
with each page of content, and the UTD approach is that
authors should specify references suitably (e.g.
by a reference with relation:maintoc
onto each page),
and browsers and users can take advantage of that, in
different ways. (A browser could provide, as a basic part
of its user interface, a button that acts as a link to
the main page of the "site" of the current page, or it
could automatically display the toc of that page,
if specified via markup, on the left of the current page,
if the user so wishes.)
Question: Should we also define references that "update two frames at once", or references to particular combinations of "frames"?
A browser should make some reasonable effort to select the default presentation style for a list according to its size and the nature of its items (in terms of markup used there). For example, a list with a large number of very small items should typically be presented in multi-column style.
Should we introduce some semi-presentational attribute for the purpose? It can be inconvenient if a browser has to read and process a large list before it can start displaying any part of it.
For any list, a browser that runs in an interactive mode may display just part of the list, provided that it lets the user access any part of the list using some mechanism, such as scrolling or queries.
In addition to appearing as a block by itself, a list may appear as a constituent of a paragraph. Such a list might be displayed in different ways, not necessarily involving line breaks; for a short list, the items could be shown inside the paragraph as numbered or otherwise indicated as items of a list.
Lists can be nested, too.
There's nothing extraordinary about this,
except for an important special case: a table is
a list of isomorphic items,
explicitly designated as such.
This means that the items of the (outer)
list are themselves lists that share some common internal structure;
moreover, this fact is indicated by the use of different markup.
Thus,
{List {List Finland Helsinki} {List Sweden Stockholm} {List Norway Oslo}}
would be just a list with lists as its items, but
{Table {Row Finland Helsinki} {Row Sweden Stockholm} {Row Norway Oslo}}
would be a corresponding table.
The attribute header
can be used in a
{Row}
element to indicate it as a header row that
contains explanations of the meanings of the columns rather than
actual data, e.g.
{Row(header) Country Capital}
. Technically,
both {Table}
and {Row}
elements are
lists, i.e. they have the general properties of
{List}
elements.
Basically, a table is a two-dimensional structure, where it makes
sense to refer to the nth column.
As regards presentation, browsers are expected to do their best
to display tables
in a manner that corresponds to their columnar structure
(even without any suggestions by the author in a style sheet).
For example,
if all items in a column
(except perhaps in the header row)
are whole numbers, a browser should
present them as right-aligned. Similarly, numbers containing a decimal
point or comma should be aligned on the separator.
(Browsers might analyze the markup used in the cells, noting that
header cells typically differ from data cells. At the simplest,
a browser could detect that all cells in a column
except possibly the header row cell contain just an
{integer}
element, and right-adjust them.
This however is
a quality of implementation issue; browsers might also
apply some heuristics based on the data content of the cells only,
e.g. recognizing that some column contains numbers only.)
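The content-based heuristic could look roughly like the following Python sketch. It inspects only the text of the data cells of one column (header cell excluded) and suggests an alignment. The function name, the returned labels, and the regular expressions are assumptions of this sketch; a browser that sees {integer} markup would not need to guess.

```python
import re
from typing import List

WHOLE = re.compile(r"[+-]?\d+")
DECIMAL = re.compile(r"[+-]?\d+[.,]\d+")


def column_alignment(cells: List[str]) -> str:
    """Suggest an alignment for the data cells of one column."""
    stripped = [c.strip() for c in cells]
    if stripped and all(WHOLE.fullmatch(c) for c in stripped):
        return "right"  # whole numbers only: right-align
    if stripped and all(WHOLE.fullmatch(c) or DECIMAL.fullmatch(c) for c in stripped):
        return "separator"  # align on the decimal point or comma
    return "default"


print(column_alignment(["12", "7", "1300"]))
print(column_alignment(["3.14", "2,5", "10"]))
print(column_alignment(["Helsinki", "5"]))
```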
However, there are situations where a visually tabular presentation
is not possible, most obviously all non-visual situations.
For this and other purposes, a table should have its rows and columns
(except the header row)
named. For a row,
the default name is the content of its first cell. For a column,
the default name is the content of the cell in that column in the
header row of the table, if present, or an integer corresponding to
the column number otherwise. A name different from these defaults
can be assigned by using
a name
attribute in the cell whose content would otherwise
be used. Often one would do this to use a more concise name,
e.g. as in
{ (name:Finland) Republic of Finland}
.
The names of the columns of a table must be distinct, and so
must the names of rows. Typically the names could be used by a user agent
that accepts a name pair as input and responds by reading the content of
the cell determined by that pair, interpreted as row and column name.
Such a query-based presentation could be used by visual browsers too,
especially for large tables. (A browser could e.g. show a small portion
of a table in a scrollable window and provide, as its own user interface
widget, different ways of selecting a row, and perhaps a column too.)
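A query-based presentation of this kind might be sketched in Python as follows. The `Cell` class, with an optional `name` standing in for an explicit name attribute, and the `lookup` function are hypothetical. Numbering columns from 1 when there is no header row is also an assumption, since the text only says "an integer corresponding to the column number".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cell:
    content: str
    name: Optional[str] = None  # stands in for an explicit name attribute


def lookup(header: Optional[List[Cell]], rows: List[List[Cell]],
           row_name: str, column_name: str) -> str:
    """Return the content of the cell selected by a (row, column) name pair."""
    if header is not None:
        columns = [c.name or c.content for c in header]
    else:
        # No header row: column numbers (from 1, by assumption) act as names.
        columns = [str(i + 1) for i in range(len(rows[0]))]
    names = [r[0].name or r[0].content for r in rows]
    if len(set(columns)) != len(columns) or len(set(names)) != len(names):
        raise ValueError("row names and column names must each be distinct")
    return rows[names.index(row_name)][columns.index(column_name)].content


header = [Cell("Country"), Cell("Capital")]
rows = [
    [Cell("Republic of Finland", name="Finland"), Cell("Helsinki")],
    [Cell("Sweden"), Cell("Stockholm")],
]
print(lookup(header, rows, "Finland", "Capital"))
```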
Q: Should the header row be obligatory?
Thus, UTD tables are purely structural. They are not to be used for mere layout purposes, for which tools external to UTD should be used.
Element name | Meaning | Notes |
---|---|---|
abbr | The content is an abbreviation of some kind. | See notes below. |
biblio | The content is bibliographic information about a book or other publication. | Attributes and special inner elements to be defined, in order to create the possibility of using a uniform, automatically processable format. |
clause | Explicitly indicates that the content is to be taken as one clause, i.e. the largest grammatical unit below the sentence level. | |
context | Indicates the context of interpreting the words and phrases in the content, in a manner similar to {Topic}. | Especially useful in translation and otherwise when some words are used in meanings that differ from their meanings in the overall topic area (as explicitly specified in markup or deducible from content). |
Creation | The content is the name of a book, serial publication, TV program, article, song, symphony, or similar creation. Notes on name apply. | Forcing, but browsers may use the same presentation as for Quote, though they should make a distinction if feasible. Typical simple presentations are the use of italics and the use of quotation marks. Cf. cite in HTML. |
ellipsis | The content is not document content proper but indicates that some content has been omitted e.g. from a quotation. | In practice, the content of the element specifies the suggested textual presentation of the ellipsis. That is, a browser may use either that presentation or some other indication of ellipsis. Examples: {ellipsis ...} and {ellipsis {emdash -}{nbsp }{emdash - }}. |
Em | Same as Emphatic. | |
emoticon | The content is to be taken as an expression of emotions, based on its visual appearance. | Example: {emoticon(explanation:smiling face) :-)}. |
Emphatic | The content is to be emphasized locally with respect to its immediate environment. | See notes on the kind attribute below. |
figurative | Indicates that the textual content is figurative speech. | Mainly intended for assisting automatic (or other) translation. |
firstname | The content is the first given name of a person. | |
High | The content has high importance globally, namely at the level of the smallest enclosing {doc} element. Style sheets can be used to suggest that such an element be duplicated in visual presentation so that it appears as such (in normal font) in running text and additionally appears in a separate small "caption" block, as often done e.g. in long printed articles. | Must be presented in some very emphatic manner by a browser. Typically used to emphasize key words and phrases. |
idiom | The content is an idiomatic phrase of the language. | Translators are warned against translating it literally. |
indexable | The content, typically a word or short phrase, is suggested for inclusion into an index. The attribute weight, with a numeric value, can be used to indicate the relative importance, weight:1 being the highest (main discussion of the topic). | A good-quality index generator would pay attention to definition markup, showing references to defining occurrences emphatically. It could also note whether the indexable word is inside {example} markup, etc. |
integer | The content is an integer (a whole number). | E.g. {integer fe} suggests that "fe" be interpreted as a number (in hexadecimal notation). This may help automatic translation, indexing robots, etc., to avoid trying to recognize it as a word or abbreviation. |
ix | Same as indexable. | |
j | Same as join. | |
join | The content is a grammatically composed construct. This markup can be used to indicate grammatical grouping, especially to resolve ambiguities in order to assist automatic translation and analysis. | Example: different {join audiences and purposes}. A browser could pay attention to this markup in formatting the document for presentation, avoiding line breaks inside the construct, if feasible. |
lex | Same as lexical. | |
lexical | Lexical information. Attributes to be defined in a manner that makes it possible to associate a word or phrase with lexicographic information, e.g. to indicate that a word is used as a technical term or otherwise in a special meaning, or just to disambiguate between meanings, or to indicate pronunciation of a word. | We might first define attributes that indicate classes of words, in languages where such distinctions matter, using fairly simple markup like {lexical(class:verb) record}. |
middle | The content is the middle name of a person, or its abbreviation. | Example: {person {firstname Jukka} {middle K.} {surname Korpela}} |
misspelled | The content is an intentionally misspelled word or phrase, e.g. because the document discusses misspellings or a quotation contains a misspelling. An optional text attribute can be used to give the correct spelling. | Useful e.g. when discussing orthography or when quoting a text that contains a misspelling that is not to be fixed when quoting. |
money | The content indicates a sum of money. | The std attribute can be used to specify the sum using standardized notation in fixed format, e.g. {money(std:USD 0.50) 50 cents}. |
n | Same as name. | |
name | The content is a proper name, instead of e.g. a common noun. | See notes below. |
negation | The content consists of words describing what the document does not discuss. | Could be used by indexing robots and search engines. |
neologism | The content is an invented word or expression. | A spelling checker should not report an error, and might recognize further occurrences of the word the same way, even if no explicit markup for them is used. |
number | The content is a number (integer, fraction, or real number). | For whole numbers, integer is more specific markup. |
p | Same as person. | |
person | The content is a name of a person. | Inner markup like surname may indicate the structure of the name. |
price | The content indicates the price of something. Possible internal structure to be defined. | |
pronounce | The content has exceptional pronunciation, or its pronunciation could be ambiguous without a hint. | The ipa attribute can be used to specify the intended pronunciation as a string of Unicode characters that represent IPA phonetic symbols. Otherwise, the explanation attribute, if present, should be taken as specifying the pronunciation. This is especially useful for abbreviations, e.g. {abbr(explanation:World Wide Web, pronounce) WWW}. |
q | Same as quantity. | |
Rem | An author's remark about some technicality of the document itself. Not to be regarded as part of the content in any way. | Allows "commenting" UTD markup, among other things, and also pure reminders like {rem Ask Phil to check this paragraph!} |
s | Same as string. | |
sarcasm | Indicates that the content is sarcastic in the sense of saying the opposite of what was actually meant, e.g. {sarcasm What a brilliant idea!} | Mainly intended for assisting automatic (or other) translation. |
sentence | Explicitly indicates that the content is to be taken as one sentence grammatically. | |
Sound | The content is to be taken as an attempt to describe a sound, rather than as actual words used by a speaker. | Forcing markup. A speech generator should try to produce an actual sound corresponding to the description. In visual presentation, parentheses will typically be used; e.g. {sound Sigh.} would get displayed as (Sigh.) |
string | Semantically empty markup, except that it explicitly indicates inability (or unwillingness) to use any semantically significant markup. | Typically used in "record" descriptions, indicating lack of specific information about the structure and meaning of the content. |
Sub | Subscripting with semantic significance. | Forcing markup, i.e. a browser that cannot display subscripts proper must e.g. explicitly display {Sub ...}. |
sub | Subscript style. | Markup that a browser may ignore if it cannot display subscripts. E.g. H{sub 2}O (here subscripting is not counted as semantically significant, since H2O is tolerable presentation). |
Super | Superscripting with semantic significance, such as exponentiation. | Forcing markup, i.e. a browser that cannot display superscripts proper must e.g. explicitly display {Super ...}. |
super | Superscript style. | Markup that a browser may ignore if it cannot display superscripts. E.g. M{super lle}, 1{super st}. |
surname | The content is a surname of a person. | |
t | Same as time. | |
Taxon | The scientific name of a biological taxon. | To be displayed mostly in italics, and in any case differently from normal text. See notes below. |
time | The content is an indication of time (moment or period). | The attribute std can be used to specify the ISO 8601 equivalent. |
uncertain | The content might not be an entirely correct presentation of the material it tries to present. | See notes below. |
var | The content is a "variable" or "placeholder" (as in non-mathematical usage; mathematics uses other markup). | Not forcing. Browsers should try to indicate "variables" as something special, but authors should formulate their texts without relying on this. |
wdiv | Indicates the content as a syllable or other part of a word, for the purposes of hyphenation and word division. | The attribute pref can be used to specify how preferred the hyphenation point after the syllable or part is in word division; the value is an integer between 0 (disallowed) and 7 (most preferred), the default being 4. The attribute separate can be used to specify the form of the content to be used if the word is divided. For example, in Swedish, "tillämpa" should become "till-lämpa" if divided; this can be indicated using {wdiv(separate:till)til}{wdiv lämpa}. Note: hyphenation should normally be based on language information and hyphenation algorithms for different languages; explicit markup is to be used for special cases only. |
The Emphatic
(or Em
)
markup should be used only when there is no adequate
emphasizing element which is
more specific in its meaning. The kind
attribute
can be used to specify which kind of emphasis is expressed:
kind:opp
indicates opposition or contrast with
the content of a nearby element, e.g. as in
use {Em(kind:opp) logical} markup rather than
{Em(kind:opp) physical} markup
kind:att
emphasizes a word or phrase, suggesting that the reader (or listener) should pay special
attention
to it when reading a statement; this comes closest to "plain emphasis" that one can imagine but it has the specific feature of making
even a very common word (even a word like "a") important in a context
kind:key
indicates a key word or phrase, typically one which shows what a paragraph discusses;
notice that key
is intended just to indicate
visually or audibly that something is important whereas
att
can make it important, often changing the tone or even the meaning of a statement; an indexer could pay a lot of attention to
words inside {Em(kind:key)}
but
should probably treat anything inside
{Em(kind:att)}
just as normal text.
The uncertain
markup would
typically be used for texts taken from manuscripts, to indicate
that they might have been read wrongly, or for translated texts that
might be wrongly translated. The p
attribute can be
used to specify the degree of uncertainty, as a number between 0
(totally uncertain) and 1 (totally certain).
The name p
comes from "probability", though its value
is usually not a probability in the statistical sense; rather,
it is a measure comparable to probability, and could be purely
subjective, or could be based on some calculations.
The markup is not forcing, but browsers should by default show
uncertain text differently from normal text.
To specify different alternatives, such as alternative ways to
read a word in a manuscript, use the {Alternative}
markup with {uncertain}
markup inside it. Example:
{Alternative {uncertain(p:0.9) quid} {uncertain(p:0.1) quod}}
A good-quality browser could display the first alternative, with the
largest p
value, indicating it as uncertain e.g.
by using a gray background, perhaps
with the intensity indicating the uncertainty value,
and show a small popup window explaining the
other alternatives, when the user clicks on the word.
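The selection step that such a browser would perform is simple enough to sketch in Python. The representation of alternatives as (text, p) pairs and the function name `pick_alternative` are assumptions made for this illustration.

```python
from typing import List, Tuple


def pick_alternative(alternatives: List[Tuple[str, float]]) -> Tuple[str, List[str]]:
    """Return the reading with the largest p and the remaining readings."""
    ranked = sorted(alternatives, key=lambda a: a[1], reverse=True)
    shown = ranked[0][0]
    others = [text for text, _ in ranked[1:]]
    return shown, others


# The two readings from the example above, given in arbitrary order.
shown, others = pick_alternative([("quod", 0.1), ("quid", 0.9)])
print(shown, others)
```

The first value is what would be displayed in running text; the second is what a popup or similar mechanism could list.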
A translator that finds a word that it does not know and that does
not appear to be a proper name should probably use the word as such
but flagged as very uncertain, with a p
value like 0.1.
In text-level markup
the attribute base
can be used to specify the
base form of a word or phrase
or other expression; for abbreviations, it is interpreted as
specifying the unabbreviated expression.
This can be utilized in
indexing, translation, etc. It is expected that this attribute
will mostly be used in special situations, when the base form
is not easily constructible using normal mechanisms of language
analysis.
The basic purposes of
{abbr}
markup are:
The attribute explanation
can be used to specify
what the abbreviation comes from.
This does not imply that the abbreviation should be spelled
out that way in reading. An abbreviation may have become an expression
of its own. For example, you cannot substitute the expansion of
"HTML" for the abbreviation. Note that pronunciation is to
be specified separately, if desired.
The agent
element indicates
a person or organization as an "acting subject",
such as an author, an owner, or a party of a contract.
Markup for an expression denoting an agent can be simply
{agent ...}
with no internal structure indicated, but the content ... may have
markup too, to indicate the structure (which may help indexing robots,
searching, and even suitable display):
{name}
(several alternate, synonymous
names can be specified)
{abbr}
(abbreviation)
{email}
(E-mail address, in Internet format,
unless otherwise specified in an attribute)
{Postal}
(postal, or "snailmail", address)
{url}
(main page URL of Web pages describing the
agent)
{phone}
(with optional kind:fax
or kind:mobile
etc.).
Example:
{agent {name World Wide Web Consortium} ({abbr W3C})}
The {name}
element can be
used elsewhere too, as text level markup.
In practice, this element is probably most useful for words which might
otherwise be taken as normal words in the natural language in
which the document is written.
It can be useful e.g. in helping automatic (or human)
translation so that
a sentence-initial common noun is not mistakenly regarded as a name.
More generally, it might
assist translation so that names are
treated differently from normal words; this does not necessarily
mean that a name is kept untranslated, but it suggests that it
should be translated only if it is known what is the corresponding
expression for it as a name in the target language.
The {name}
markup could also
help search engines to improve their functions, e.g. searching
in a manner which distinguishes between a normal word and an
identically spelled name.
The markup may have, but usually does
not have, an effect on the way the content
is presented by a browser. If the content consists of two
or more words, they are designated as forming a composite name.
Thus, {name xxx yyy}
is logically different from
{name xxx}
{name yyy}
.
In practice, software such as a translator should primarily
try to find a phrase-level equivalent to a name, e.g. for
{name European Union}
, instead of translating
its parts separately.
When a name has several occurrences in a document, should
each of them be marked up explicitly? Although it could be
done, programs processing UTD documents are encouraged to treat
{name}
markup as document-wide in the sense that
further occurrences of the same content are to be recognized.
Generally, this requires language-dependent morphological analysis
(to see that e.g. "Jukan" is a form of "Jukka"), so in some cases
authors might wish to use explicit markup for each occurrence.
There is also the attribute scope:this
to explicitly
say that {name}
markup should be taken as applying
to one occurrence only.
For names of persons, real or fictive, the more specialized
{person}
element should be used instead.
Notes on {name}
apply to it, too.
Browsers may present the content of
{person}
elements in some specific way, corresponding to such widespread typographic conventions as using bold face or small caps for people's names.
A browser might do this for the
first occurrence of a person name only;
identity of names in this respect should be based on the
base
attribute, if present.
Similarly, the Creation
element should be used e.g. for names of books.
The {Taxon}
element
would normally be used simply as
e.g. {taxon Homo sapiens}
, and italics (or equivalent,
such as underlining when italics is not available) must be used then,
according to normal rules for biological texts.
An optional attribute
level
specifies the taxonomic level, defaulting to
level:species
. The other standardized values for
level
are subspecies
below the species level and
genus
, family
, order
,
class
, phylum
, and kingdom
above it; other values can be used for special purposes.
For level:species
, the scientific name consists of
a genus and species name; for level:subspecies
,
there are three parts;
for other values of level
, the scientific
name proper is a single word. Any extra parts are to be interpreted
as additional information and displayed in normal (upright) font.
Any word that ends with a period is to be taken as an extra part;
this will result in the normal display of markup like
{Taxon Homo sapiens subsp. sapiens}
or
{Taxon Homo sapiens {abbr(base:Linnaeus) L.}}
.
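The rules for separating the scientific name proper from extra parts can be sketched in Python. The sketch works on plain text only and ignores nested markup such as {abbr}; the function name `classify_taxon` and the labels "name" and "extra" are hypothetical.

```python
from typing import List, Tuple

# Number of words in the scientific name proper; every other level: one word.
PARTS = {"species": 2, "subspecies": 3}


def classify_taxon(words: str, level: str = "species") -> List[Tuple[str, str]]:
    """Tag each word as 'name' (italic) or 'extra' (normal upright font)."""
    wanted = PARTS.get(level, 1)
    tagged = []
    used = 0
    for word in words.split():
        # A word ending with a period is always an extra part.
        if word.endswith(".") or used >= wanted:
            tagged.append((word, "extra"))
        else:
            tagged.append((word, "name"))
            used += 1
    return tagged


print(classify_taxon("Homo sapiens L."))
print(classify_taxon("Homo sapiens subsp. sapiens", level="subspecies"))
```

With the default level, "L." is tagged as extra; with level subspecies, "subsp." is tagged as extra while the third name part is kept as part of the scientific name.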
A Taxon
element has an implied language:la
attribute
and it is implicitly enclosed into a notranslate
element.
An explicit language
attribute for it or inside it
is to be interpreted as specifying the language according to which
the name or a part thereof should be pronounced.
Should we regard the scientific taxon name system as a code
and use {code}
markup for it?
Probably not, since the taxon names are closer to natural
language expressions than codes generally are. For example,
in Finnish, a taxon name could be declined (though this should
usually be avoided), e.g. "Homo sapiensiksen"
'of Homo sapiens', so it's treated more like a loan word than
like a code. (Our example is hard for very detailed markup, since
the suffix is in Finnish but the stem varies according to Finnish rules, so
that -s becomes -kse-.)
In any case, one should use the base
attribute to specify
the base form of the word, e.g.
{Taxon(base:Homo sapiens) Homo sapiensiksen}
or
{Taxon Homo { (base:Homo sapiens) sapiensiksen}}
.
General elements can appear at different levels of structure. (The exact rules for this are to be specified.) When such an element occurs e.g. inside a paragraph, it may contain only what a paragraph may contain.
UTD uses general elements rather than general attributes like those in HTML, but general attributes can be used as shorthand for elements in a sense, as explained below. Moreover, there are a few general attributes without an element counterpart.
An id
attribute has the same meaning as in SGML
and in HTML: it provides a unique identifier for an element.
Its value shall not begin with a digit.
When a URL with a fragment identifier is used to refer to the document,
the fragment identifier is interpreted as referring to the element
with an id
value that matches (case sensitively) the
identifier.
It is recommended that
this attribute be used
at least for all major parts of a document,
such as inner {Doc}
elements, so that they can be
conveniently referenced from outside the document.
On the other hand, "unnamed" (in this sense) elements can be referred to, too:
a fragment identifier which is a digit sequence n is
taken as referring to the nth element at the document's
first hierarchy level; n.m refers to
the mth subelement (at the second hierarchy level) of that element,
and so on. Such fragment identifiers may of course easily become obsolete
as the document is changed. The id
attributes should not
be changed after making the document public, even if the value turns
out to be poorly selected or becomes obsolete.
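The two kinds of fragment identifiers described above can be sketched as follows. This is only an illustration, not part of the proposal: the `Element` class and the function names are hypothetical, and the sketch assumes that only elements (not text) are counted when resolving positional identifiers such as 1.2.

```python
class Element:
    """Hypothetical in-memory element: a name, an optional id, child elements."""

    def __init__(self, name, id=None, children=None):
        self.name = name
        self.id = id
        self.children = children or []


def find_by_id(root, ident):
    """Depth-first search for an element whose id matches case-sensitively."""
    if root.id == ident:
        return root
    for child in root.children:
        found = find_by_id(child, ident)
        if found is not None:
            return found
    return None


def resolve_fragment(root, fragment):
    """Resolve 'n', 'n.m', ... by position; treat anything else as an id value."""
    parts = fragment.split(".")
    if all(p.isascii() and p.isdigit() for p in parts):
        node = root
        for p in parts:
            index = int(p)  # positions are 1-based
            if index < 1 or index > len(node.children):
                return None
            node = node.children[index - 1]
        return node
    return find_by_id(root, fragment)
```

Since an id value shall not begin with a digit, the two forms cannot be confused with each other.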
Should we also consider defining "search URLs", e.g. URLs referring to the first occurrence of a given string? This might be more practical in many cases. But it's actually a URL issue.
A class
attribute can be used in conjunction
with style sheets, as in HTML.
An explanation
attribute can be used to specify
a short explanation of an element as part of the document. This
roughly corresponds to the title
attribute in HTML,
and could be implemented as a "tooltip".
See the comparison with HTML for other general attributes in HTML.
{Ref(url:address ...}
indicates that the
immediately enclosing element
(thereby designated as the referencing element of a
reference)
in some sense refers (points) to the entity identified
by address, which can be a URL or of the form
#
element-id, where
element-id is the id
attribute value for
some element, or a combination thereof (a URL followed by
#
element-id, i.e. a "URL reference" or
"URI reference", to use the official terms).
The word entity will be used to denote the target of a reference: an element in the same document, an element in another document, another document as a whole, or some part of a non-UTD document.
The meaning of
#
element-id (called "fragment" in some
specifications)
can be defined for external documents
of types other than UTD, too. For
HTML, SGML and XML documents, the construct referred to is the
element specified in the target document via an id
attribute
if present or, for HTML documents, via an
<a name="...">...</a>
construct.
For plain text documents, the value of id
is
interpreted as a line number reference (either a single line number,
meaning some unspecified amount of lines from that line onwards,
or of the form start-
end
specifying a range of lines).
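As a rough sketch, the line number interpretation for plain text documents could be handled like this. The function name is hypothetical, and the treatment of malformed values (raising an error) is an assumption, since the text does not say what should happen then.

```python
def parse_line_fragment(fragment):
    """Return (start, end); end is None when only a starting line is given."""
    start, sep, end = fragment.partition("-")
    if not (start.isascii() and start.isdigit()):
        raise ValueError("not a line number reference: %r" % fragment)
    if not sep:
        # Single line number: some unspecified amount of lines from there on.
        return int(start), None
    if not (end.isascii() and end.isdigit()):
        raise ValueError("not a line range reference: %r" % fragment)
    return int(start), int(end)
```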
Multiple url
attributes are allowed.
They are then taken as indicating alternative addresses
for essentially the same content,
e.g. mirrored copies of a document.
They should, by default, be taken in order of preference.
(Thus, when using a reference to access an entity, a browser
should first try to access the first url
value,
moving to the next if the access fails, e.g. when an error response
is received or there is no response within some reasonable time.)
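The order-of-preference rule could be implemented along the following lines. This is a minimal sketch: `fetch` stands for any hypothetical function that retrieves a resource, and it is assumed here that a failed or timed-out access is signalled by raising `OSError`.

```python
def fetch_first_available(urls, fetch):
    """Try the addresses in order; return (url, content) for the first success."""
    for url in urls:
        try:
            return url, fetch(url)
        except OSError:
            # Access failed or timed out: move on to the next address.
            continue
    return None  # none of the alternative addresses could be accessed
```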
The {Ref}
element can be implemented as a "hyperlink",
or such elements can be used to traverse documents e.g. for
indexing or document tree construction purposes.
By itself, a reference does not imply any particular endorsement
or any particular opinion or view on the referenced entity.
It might be seen as corresponding to a statement like
"See ..." in plain text, where "..." is a title or other identification
of a document. Or you could say that it just brings the referenced entity
into the reader's attention.
But authors can, and indeed should, indicate the reason
for setting up the reference, in a formalized manner; this naturally
does not exclude the use of prose descriptions of the same
(somewhere around the reference).
The relation
attribute, or rel
for short,
is used for the purpose: it indicates the author's view on how the
referenced entity relates to the referencing element.
Additional attributes, to be defined, will let authors specify
more information about the referenced resource, the nature of
the reference, etc.
When a reference, implemented as a hyperlink, is followed on
a browser and the referenced entity is a part of a document,
the browser should somehow
highlight the part of the target that the url
attribute refers to. Such presentation, if used,
should reflect the idea of being selected rather than
emphasized, and
must be distinguishable from
all default presentations of emphasis in the document itself.
For example, a browser could use a background color that is suitably
lighter than the overall background color.
A browser must allow the user to request information
about a reference without actually following the link.
The information should consist of a suitable presentation of the
information in the {Ref}
element itself. Moreover,
a browser must support a method for requesting
header information only for references with
http:
URLs,
and the browser should display a relevant part of HTTP headers then,
preferably in a human-readable form.
The idea is to let the user check for
information on the resource before making a potentially time-consuming
or useless request for the resource itself; it also allows the user
to check just the accessibility of a resource or to get information
like the last modification date.
The possible values
of the relation
attribute
include
strings corresponding to UTD element names. For example,
rel:toc
means that the referenced entity is
a table of contents for the referencing element, and
rel:example
means that the referenced entity
provides one or more examples
of something discussed in the referencing element.
Moreover, for these and other values, a value expressing
the reverse relationship can be constructed by prefixing
the value by the string rev-
.
For example, rel:rev-example
indicates that
the referencing document contains an example for the
referenced document (which could contain a reverse link with
rel:example
).
(Question: Should we find another name, or should we
adopt the rel
and rev
attribute
duality from HTML specs?).
Other possible values for the relation
attribute initially consist of the following:
relation value | meaning |
---|---|
alternate | The referenced entity contains material that is essentially an alternative presentation of the content of the referencing entity, e.g. a translation, so that they could appear as components of an {alternative} element if they appeared in the same document. |
confer | The reference suggests that the reader compare the content of the referenced entity with the content of the referencing entity. This value should not be used if another, more specific value in this set applies. |
content | The sole purpose of the presence of the content of the referencing element is to appear as the reference to the referenced entity. This is typical e.g. for a table of contents where each item is a reference (link) to some section. |
etymology | The referenced entity contains an explanation of the origin of the word or phrase that constitutes the content of the referencing entity. |
main | The referenced entity is the main, or top-level, document ("home page") of the logical set of documents ("site") that the referencing entity belongs to. |
maintoc | The referenced entity is the {toc} element of the document of the "main page" of the "site" (cf. relation:main). |
more | The referenced entity contains information that somehow complements the information content of the referencing element. |
next | The referenced entity is the next entity to be visited in a "guided tour" through documents. At the simplest, some material could be divided into separate documents so that the "guided tour" is just the order in which the parts should normally be read. |
previous | Same as rev-next. |
support | The content of the referenced entity supports some claims, theories, or statements made in the referencing entity. |
up | The referenced entity is the "parent" of the referencing entity in a logical hierarchy of documents. |
xref | The reference is a cross-reference inside some material (often inside one document), pointing to the main discussion of a topic in that material. |
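The rev- prefix and the previous value could be handled by software roughly as shown below. The function names are hypothetical, and the sketch assumes that reversing a value that already has the rev- prefix simply removes the prefix, which the text above does not state explicitly.

```python
# From the table above: "previous" is defined as the same as "rev-next".
ALIASES = {"previous": "rev-next"}


def normalize_relation(value):
    """Replace an alias by the value it stands for."""
    return ALIASES.get(value, value)


def reverse_relation(value):
    """Return the relation value expressing the reverse relationship."""
    value = normalize_relation(value)
    if value.startswith("rev-"):
        return value[len("rev-"):]  # assumption: rev- of rev-x is x
    return "rev-" + value
```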
Questions:
The relation
attribute does not distinguish between
internal and external links. Should we introduce
a different attribute for this? Or even two different elements?
What about backward vs. forward references?
The visual presentation of links is important,
and it is important to learn from the problems of the methods that
current Web browsers use.
Consider a simple link, as in the following:
This will be discussed in the
{Ref(url:#hypermystics) section on hypermystics}.
On a typical graphic browser, this might be displayed so that
the content of the {Ref}
element is underlined and
in a specific color,
as in current Web browsers, though some other, less striking
presentation (e.g. with a dotted colored line below) might be more suitable.
But on paper, some different presentation is typically desirable.
For internal links, it could consist of underlined
(preferably thinly underlined)
text followed by
a browser-generated section number and/or page number reference, such as
"This will be discussed in
the section on hypermystics (p. 42)."
The alternative of only including a page number is not good,
since it does not indicate what the "link text" (the referring element) is.
One possibility would be to insert some marker, like a small
forward-pointing arrow, in front of the "link text".
Such things are left to browsers and to
specialized UTD printing software.
Authors should not try to include any
section or page numbers into their text.
It is often difficult to decide how explicitly one should
indicate in the textual content that a reference is present.
The naïve style "Click here" is certainly not something you're
supposed to use in UTD! And it really doesn't look good on paper,
or sound good in speech synthesis. But should authors have some way of
asking that browsers emphasize the presence of a reference?
(A browser running in "novice mode" could even show "Click here!"
on its own for such links, or a speech-based user agent could
explicitly ask the user whether he wants to follow the link.
In both environments, the user agent could present the
content of the {Ref}
element
as soon as it gets the slightest excuse for that from a user's
reaction that might get interpreted as "I might...".)
This is of course logically distinct from emphasizing the
referencing element.
A high-quality browser would present links with different
relation
values differently, with user options to
turn off the distinction between links and normal text for some
relation
values, i.e. for some types of links.
Apparently this wouldn't normally mean different presentation
for each link type but just for those that need to
be distinguished for some reason, in a particular environment.
A reference with, say, relation:glossary
does not imply that the referenced entity would have been written
especially for use as a glossary for the referencing document.
It could even refer to a general dictionary which is in no
particular way associated with the referencing entity. That would not
be particularly useful as a rule, but it could be very useful to
refer to a topical dictionary that contains explanations
and/or translations for the terms used in the referencing entity.
The attribute reference-content-language
, or
ref-lang
for short, specifies the language(s) of the
referenced entity, in a sense corresponding to the meaning of
the Content-Language
header in HTTP. If its
content is available in different languages e.g. via HTTP
language negotiation, a list of languages should be specified.
The attribute info
provides additional information
about the reference as a relationship between the referring element
and the referenced entity. (It partly corresponds to the title
attribute in HTML, except that the UTD attribute has more
specific semantics.)
The content, if any, of a
{Ref}
element
specifies a short summary of the reference, which might be seen
as a summary of the referenced entity
at least from the viewpoint of the referencing entity. Typically
it could be used to provide a summary of a few lines, explaining
the most essential content of the referenced entity, or explaining
why it is referenced.
It can be
shown to users by browsers if desired, typically after a user's explicit
request, or when the user tries to follow the link but this fails for
some reason or another.
Thus, the content could also work as a surrogate for
the referenced entity.
A browser could also provide an option for
printing the contents as footnotes.
A browser could indicate the presence of non-empty content
(e.g. with some icon), so that the user can request it
before deciding to follow the link.
There are several reasons to
make {Ref}
establish a reference
from its enclosing element.
One of them is the above-mentioned principle of using the content
of a {Ref}
element as a potential "fallback".
Another reason is that multilinks are possible in a natural
way: you just put several {Ref}
elements inside an element.
This would mean that the element is a link to several entities;
it should not be confused with several url
attributes in a {Ref}
element, for accessing the
same content in alternate ways.
Using a multilink, you could make a phrase e.g. a link to
a primer on the topic it refers to and a link to
a page that contains the etymology of the phrase and
a link to a technical reference on the topic.
(A browser could implement them via a popup menu that lets the user
select one of the links, or, trivially, as a list of links with
suitable annotations.)
Moreover, for the entire document, such references
can do what the HTML link
element was once
supposed to do!
In addition to a separate {Ref}
element, you
can use the ref:
addr
attribute in any element.
It corresponds to putting a
{Ref}
element with url:
addr
(and empty content)
inside the element.
Any rel
and other link-related attributes
that are used along with the ref
attribute
then implicitly apply to that
implicit
{Ref}
element.
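The correspondence between the ref attribute and an implicit {Ref} element might be sketched as a simple transformation. The dictionary representation of elements is hypothetical, as is the exact set of attributes treated as "link-related" and the choice of placing the implicit {Ref} element first inside the element; the text above leaves these open.

```python
# Assumed set of link-related attributes that move to the implicit Ref element.
LINK_ATTRS = ("rel", "relation", "ref-lang", "reference-content-language", "info")


def expand_ref_shorthand(element):
    """Rewrite an element with a ref attribute into the explicit Ref form.

    An element is a dict with the keys 'name', 'attrs' and 'children'.
    The input is not modified; a new dict is returned when expansion applies.
    """
    attrs = dict(element["attrs"])
    if "ref" not in attrs:
        return element
    ref_attrs = {"url": attrs.pop("ref")}
    for key in LINK_ATTRS:
        if key in attrs:
            ref_attrs[key] = attrs.pop(key)
    ref = {"name": "Ref", "attrs": ref_attrs, "children": []}  # empty content
    return {
        "name": element["name"],
        "attrs": attrs,
        "children": [ref] + list(element["children"]),
    }
```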
One possible value of relation
needs special attention:
original
. It indicates that the referenced entity is
the original version upon which the current content is based.
Typically it refers to the original of a translation.
Naturally, a translator program, when asked to translate a document to language X, should check whether the document itself refers to its original which is written in X. This could be important when following links in a manner which goes through translations; it might prevent the situation where the user gets a translation of a document from language Y to X instead of getting an existing original in X!
{Alternative a1 a2 ...}
specifies
a collection of content alternatives, such as
two differently worded explanations of the same thing,
alternative drawings,
or an image in different data formats,
so that all alternatives have essentially the same information content.
"Essentially" the same means that the alternatives might have
different amounts of details.
If, for example, the alternatives are written for people with
different background, the {Audience}
markup should be
used inside the alternatives.
This markup should not be used e.g. when the content is a manual
describing alternative ways of doing something, but it would be
appropriate when the document contains several alternative
explanations of one way of doing something.
A browser may display only the
first (presentable) alternative,
or all alternatives in some suitable
way which indicates them as alternative presentations. In the first
case it should, if possible, indicate the presence of other alternatives
and give optional access to them.
Note: This needs to be elaborated. We need to distinguish between different degrees of containing the "same" content.
In particular, the alternatives could be presentations for
different media, such as screen and paper. In that
case, the If
markup should be used for them.
The following example specifies that maps.png
be shown
on screen, mapp.png
on paper, and otherwise
a link to a document (presumably containing some corresponding
content) is given:
{Alternative {If(screen) {Include(maps.png)}}
{If(paper) {Include(mapp.png)}}
{{Ref(url:loc.html)} Description of where we are located}
}
Note that in the case above, a browser could still make all the alternatives accessible somehow. But minimally it shows the first alternative in the list that it can present.
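The minimal behavior, showing the first alternative that can be presented, could look like this in a program. The representation of alternatives as (condition, content) pairs is a hypothetical simplification in which a condition is just a medium name, and None marks an alternative without any {If} condition.

```python
def choose_alternative(alternatives, medium):
    """Return the content of the first alternative presentable on the medium."""
    for condition, content in alternatives:
        if condition is None or condition == medium:
            return content
    return None  # nothing presentable


# The map example above, in the simplified representation.
MAP_ALTERNATIVES = [
    ("screen", "maps.png"),
    ("paper", "mapp.png"),
    (None, "loc.html"),
]
```

With these data, a speech-based user agent would fall through to the last alternative, the reference to loc.html.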
Alternatives could also consist of the same information in
two or more languages, in bilingual or multilingual documents or
parts of documents. For example, an image gallery with little
textual content could contain that content in a few languages.
Example:
{Alternative {language(en) Introduction}
{language(de) Einführung}}
.
A browser may select the alternative according to
the language preferences set by the user.
Question: Should the {Alternative}
element have an attribute that specifies what variation is
involved, such as media
or language
?
Question: What about content alternatives, like two essentially different explanations of the same thing?
A {changed}
element indicates that its content
has been changed from a previous version of the document; this
implies that the content replaces some previous content.
The previous content can be specified in a {Deleted}
element which is an immediate constituent of
the {changed}
element.
If a {Deleted}
element is used without
a corresponding {changed}
element, it
indicates deleted content for which there is no replacement,
i.e. pure deletion.
A browser may entirely omit the presentation of a
{Deleted}
element. Thus, this markup should not be
used when it is essential that the old content be shown, e.g.
when comparing different versions of something. On the other hand,
a browser should do its best to give an indication that there is
some deleted content, and make it accessible on user request.
An {inserted}
element indicates that the content is
completely new, inserted in the midst of older content.
These elements should normally have an explanation
attribute that tells the reason of the change, e.g.
{changed(explanation: Some typographic corrections have been
made to this paragraph) ...}
.
Moreover, attributes could be defined for specifying some basic information about changes in a standardized form, e.g. indicating formally whether the change fixes typos, or rewords things, or fixes factual errors, or provides new information, or reflects a change in the state of affairs. For example, in manuals it is often practically very important to distinguish between changes which try to improve the description of a program and changes which reflect changes in the program itself. Those keyword values might indicate document changes due to changes in the reality described (such as changes in a manual of a program due to changes in the program itself), improvement of the wording or structure of the document (by author), error correction (by author), editorial changes (for journalistic reasons), and changes in quoted text (such as omitting irrelevant parts or adding explanatory words) when including quotations. This would allow each user to define suitable presentation styles for those kinds of changes that are important for him to distinguish (in general or in some particular situation). Note that properly used, such markup would greatly facilitate the creation of useful "change logs" or annotated "diff files".
These elements may take the time
attribute,
indicating the moment of time of making the change.
Its value shall be in ISO 8601 conformant notation.
The allowed variation is to be specified, in a manner that
fixes the basic format but allows varying precision.
E.g., time:2001-08
would refer broadly to August 2001.
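Since the allowed variation is still to be specified, the following is only a sketch under the assumption that the time value is a calendar date given with the precision of a year, a month, or a day. The function name is hypothetical.

```python
import re

# Assumed formats: YYYY, YYYY-MM, YYYY-MM-DD (a small subset of ISO 8601).
_TIME_PATTERN = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def parse_change_time(value):
    """Return (year, month, day); parts beyond the given precision are None."""
    match = _TIME_PATTERN.fullmatch(value)
    if match is None:
        raise ValueError("unsupported time value: %r" % value)
    return tuple(int(part) if part else None for part in match.groups())
```

A browser that highlights changes made after some moment of time would then need to decide how to compare values of different precision, e.g. whether a change marked with time:2001-08 counts as made after 15 August 2001.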
A browser could indicate changed content with change bars in the margin. A browser could also provide an option of highlighting changes according to the user's instructions, e.g. all content that has been changed or added after a user-specified moment of time. Moreover, a browser could keep track of the user's visits to a document and automatically highlight changes since the last visit.
{language(lcode) ...}
indicates
that its content is in the human language indicated by the
language code lcode (unless overridden by inner markup).
This element is intended for use when there is no other suitable
element to which a
language
attribute can be attached.
The keyword language
can be abbreviated as
lang
.
The language code normally consists of a two-letter
ISO
639 code optionally followed by a hyphen
and a subcode.
Any two-letter
subcode is interpreted as a country code according to
ISO 3166.
[But see a note on language codes.]
If you use a language
which is actually or fictionally used by human or human-like beings
but to which no language code has been assigned,
you can use a code that begins with x-
, such as
x-klingon
.
Naturally, you cannot expect programs
processing your files to recognize such codes, in general,
other than in the negative sense (i.e., the content is not in
any language for which there is a code).
Moreover, the special code unknown
optionally
followed by a hyphen and a subcode can be used to indicate that
the language of the content is unknown to the author.
The special code none
says that the data is not
in any human language but e.g. some special formalism.
This code is, however, included for completeness only, so
that language codes as defined here
can be used outside UTD too.
Normally the code
markup
is used to prevent interpretation of the content according
to the rules for any human language, yet allow a human language
to be implied for pronunciation purposes.
By default, the language attribute
of a document has the value unspecified
(which is not the same as unknown
) and
language attributes are inherited by subelements.
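A program processing language codes might classify them roughly as follows. This is a sketch of the rules described above only; the function name and the returned dictionary keys are hypothetical, and codes that fit none of the rules are simply labeled "other" here.

```python
def classify_language_code(code):
    """Classify a language code according to the rules described above."""
    if code in ("none", "unspecified"):
        return {"kind": code}
    main, _, sub = code.partition("-")
    sub = sub or None
    if main == "unknown":
        return {"kind": "unknown", "subcode": sub}
    if main == "x" and sub:
        # No assigned language code, e.g. x-klingon.
        return {"kind": "private", "subcode": sub}
    if len(main) == 2 and main.isascii() and main.isalpha():
        # Any two-letter subcode is interpreted as a country code.
        is_country = sub is not None and len(sub) == 2 and sub.isalpha()
        return {
            "kind": "iso639",
            "language": main.lower(),
            "subcode": sub,
            "country": sub.upper() if is_country else None,
        }
    return {"kind": "other"}
```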
When deciding whether to use language markup or not, an author should consider whether he is willing and able to use it for the entire content, so that all text (including names) in languages other than the main language is marked up to indicate its language. In some cases, it might be better not to use any language markup (letting programs use their heuristics to "guess" the languages) than to use explicit markup which is not correct for the entire content.
{Quote}
,
or {Q}
for short,
indicates its content as quoted from an
external source. Note: this markup, and not quotation marks, should
be used for all proper quotations.
On the other hand, words and phrases which have been
adopted from other languages but are not quotations from any
specific source,
like "status quo" and
"force majeure" when used in English, should not be
marked up as quotations. (The language
markup
can be used for them.) What markup you use for phrases such as
"Veni, vidi, vici" should depend on whether you present them,
in a particular context, as actually referring to their specific
origin and source, or in a generalized sense, as "flying words".
So what would quotation marks be used for, then, in the textual content of UTD documents? Apart from special usage, such as text presenting code in a programming language (where you should use characters according to the language definition, which by the way usually means that the quotes must not be "smart" but Ascii quotes), they would typically be used to indicate a word as being used in a figurative or otherwise special meaning, basically to say "don't take this too literally", as in referring to paired English quotation marks as "smart". (Maybe we should have markup for this too?)
Browsers are required to indicate quotes as different from normal text in some suitable manner. A browser should analyze the content and context of the element structurally, and typically use (language-specific) quotation marks for inline quotations and some special "quotation block" presentation for quotations containing paragraphs and other blocks. When quotation marks are used, they should be used as dictated by the rules of the language used (in the text containing the quotation!), but in some situations browsers will need to use Ascii quotes (and Ascii apostrophes) as fallback.
Inside a {Quote}
element, any
metadata elements relate to the quoted content;
e.g., an {Author}
element specifies the author of the
quoted text, and a {Source}
element indicates the
source of the quotation.
They are to be shown as part of the document unless enclosed into
a {Meta}
element.
The {Origin}
markup can be used as a container for
origin information. It should be used if that information is presented
in prose containing other text than just the author
and source, to avoid the possibility of having that text
interpreted (in automatic analysis or otherwise) as
part of the quoted text.
{Master(URL)}
indicates the
URL of the master copy of the element (typically, the entire document).
This is especially useful in mirrored copies
and in stored copies of Usenet messages etc.
When a Web browser saves a UTD document it has got from an HTTP
server, it should check for the presence of a
{Master}
element in it, and if there is none,
the browser should insert one, referring to the address from
which the document was got according to HTTP headers (noting
any permanent redirections).
{Base(URL)}
indicates the base address for relative URLs in the document.
By default, the base address is the same as the master address,
if specified, otherwise the address of the document itself (if it has
a URL). In the absence of a base URL,
relative URLs will be interpreted in
a system-dependent manner (e.g., as referring to files in the
same folder as the document, when the document is on local disk).
The {Base}
element can be used
e.g. in mirrored copies of a document. Such a copy could e.g.
refer to images and other documents using the same base address
as the master copy, if it is not feasible to create local copies of them.
When asked to save just a UTD document as such, a browser should
insert a {Base}
element
containing the base address of the original.
Question: How about lists of URLs here? Would it be too complicated to specify alternate base addresses to deal with possible server problems?
Draft, Niye, and Metaquestion
{Draft}
indicates its content as being at a draft
level as compared with the rest of the document. (This might be
classified as metadata, but it obviously needs to be merged into
a document body.)
{Niye}
indicates that some content
intended to appear in the final version of the document has not
been written yet. The content of this element should be interpreted as
an explanation of the real content that is to be written,
rather than real content itself.
E.g., {Niye Some examples will be added here.}
.
(The name of the element comes from computer programming jargon and
is short for Not Implemented Yet.)
{Metaquestion}
indicates that the content is not intended
to be final but a question about what the content should be, in
a document in preparation. Example:
{Metaquestion Should we add an example here?}
.
A {definition}
presents a definition for a term or other word
or symbol or abbreviation.
The element shall contain one or
more {definiendum}
elements specifying the term
or other expression being defined (perhaps in different languages
or using synonyms - this is why multiple definienda are allowed).
It may contain one {definiens}
element, in which case
the definition is regarded as "separable": the definiens is the
definition proper for the definiendum, and the rest, if any,
of the content of the
{definition}
element is just "syntactic sugar".
The attributes kind
and status
can
be used to specify the type of definition
and its status (e.g., as tentative).
See Definition: a definition
and an analysis.
Although definition lists are important - e.g., a glossary is basically such a list - there is no separate markup for them. Instead, basic markup for lists and definitions is used. A good-quality browser could present a list where all subelements are separable definitions in a particular, somewhat table-like manner.
For a general discussion on translatability issues, see the document Translation-friendly authoring.
The {poetry}
markup indicates that the content
is in poetic language. This would be a warning to translators:
special problems are to be expected.
A translator which has been primarily coded to handle
prose could even flag all its poetry translations as
uncertain with some relatively low
p
value. The
{poetry}
markup
could also be used by search
engines, e.g. letting people search for poetry in particular,
or perhaps ignoring words marked up as poetry, since typically
you are not interested in "poetry matches" for your search
keywords.
The {translate}
markup indicates that its content
is somehow special in translation. In the simplest case, the
markup would have no attributes and it wouldn't thus say anything
more specific but it could be used by systems for computer-aided
translation: the program could just highlight its translation
of the content, warning the human translator about the special need
for human checking.
Exceptionally, inside a {notranslate}
element, a
{translate}
element simply says that its content
should be translated (normally). (This is technically "somehow
special" in the sense of being different from the parent element's
properties.)
But the markup can also have attributes that
specify the suggested translation in one or more languages.
For example,
{translate(en:she,sv:hon) hän}
indicates that the content, a particular occurrence of the
word "hän", is to be translated as "she"
in English and as "hon" in Swedish. (The Finnish word "hän" is
a gender-neutral pronoun, so such a hint might really be needed!)
In a {translate}
element, a translation can
be followed by an asterisk *
to indicate the "default"
translation to be used. This is typically the original form of a
name that is to be used in languages that have no special form for it.
For example, in a document in Swedish, the Finnish city Vaasa could
be mentioned using
{translate(fi:Vaasa*) Vasa}
, since most languages
other than Swedish probably use the Finnish form of the name.
Similarly, in a document in English,
{translate(it:Venezia*) Venice}
says that
a translator should use "Venezia"
in any language other than English
unless it knows, from other sources of
information, that a different translation is to be used in some
language.
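As a rough illustration of how a translating program might consume such hints, here is a minimal Python sketch. The function names and the handling of the attribute string are assumptions made for this example only; the proposal does not define a processing API, and the sketch assumes that a suggested translation contains no comma or colon.

```python
from typing import Dict, Optional, Tuple


def parse_translate_hints(attr: str) -> Tuple[Dict[str, str], Optional[str]]:
    """Parse an attribute string such as 'en:she,sv:hon' or 'fi:Vaasa*'.

    Returns (hints, default): a language -> translation map, plus the
    translation flagged with '*' (None if no default is flagged).
    """
    hints: Dict[str, str] = {}
    default = None
    for item in attr.split(","):
        lang, _, text = item.strip().partition(":")
        if text.endswith("*"):
            text = text[:-1]      # drop the asterisk, remember the default
            default = text
        hints[lang] = text
    return hints, default


def choose_translation(attr: str, content: str,
                       source_lang: str, target_lang: str) -> Optional[str]:
    """Pick the hinted translation for target_lang, if there is one.

    Returns None when the markup gives no usable hint, so that the caller
    can fall back to its ordinary translation of the content.
    """
    if target_lang == source_lang:
        return content            # nothing to translate
    hints, default = parse_translate_hints(attr)
    if target_lang in hints:
        return hints[target_lang]
    return default
```

With this sketch, `choose_translation("fi:Vaasa*", "Vasa", "sv", "de")` gives "Vaasa", whereas a request for German on the "hän" example gives `None`, leaving the decision to the translator.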
It would of course be impractical to try to specify translations of difficult words and phrases into all possible languages. Some translation software might use information like the above in a generalizing manner, e.g. using the English and Swedish equivalents as semantic hints to be used when translating into other languages as well. But mostly you would give translation hints for specific purposes. For example, if you author documents in an environment where you know that they will be automatically translated into another language, you could specify suitable hints for that, perhaps after making a test translation and looking at the translator's reports about ambiguities and other problems.
The notranslate
markup indicates that its content
is to remain invariant in any translation.
Normally other, more informative markup, namely code
or glossa
, is used for such purposes.
The Glossa
markup indicates that the content
is presented as an expression of a language, such as a word,
instead of being used normally.
Example:
The plural of {Glossa ox} is {Glossa oxen}
.
Typically the markup is used for individual words and phrases,
but it could be used for larger parts of a document too, e.g.
for examples in a grammar book.
For obvious reasons, the content of Glossa
should
be invariant in translations; if a grammar of the English language is
translated, the expressions of the "object language" must not be
translated!
Browsers are required to present Glossa
elements in
a manner different from normal text; italics or quotation marks could
be used, for example, but a distinctive font face or background color
are interesting possibilities, too. Authors should not use e.g.
explicit quotation marks for such purposes when they use
Glossa
markup.
Note that {name}
and
{person}
markup can be very useful in translation
but should not be taken as forbidding translation in the
same way as
Glossa
does.
For example, when translating from English to another language,
"Homer" usually needs to be replaced by "Homeros" or "Homerus"
when it refers to the ancient Greek poet - but not when it means
Homer Simpson. It is possible to indicate this by using
{person {notranslate Homer Simpson}}
.
Authors should note that commonly known names
often have different forms in different languages, and might
consider marking the cases that deviate from this.
Thus, an author aiming at maximal translatability should write
e.g. {name {notranslate Paris}, {name Texas}}
,
since otherwise "Paris" could be taken as referring to the
capital of France, the name of which has different forms in
different languages.
The {code}
markup, to be
discussed next, is essential
for preventing translation of expressions that might
look as if they were in a natural language but aren't, such as
names of built-in functions in a text that discusses a "programming
language".
The {code}
markup implies that its content is
invariant in translation, i.e. explicit notranslate
is not needed inside it. Note that there is no such rule
for e.g. person
or name
;
names, even personal names, are sometimes
changed in translation. But naturally a translating program
should translate a name only if it knows a translation for it
as a name.
The code
markup
indicates that the content is not
to be taken as belonging to any natural language, even if it
contains strings that look like words in some language.
The purpose is to let programs process documents so that
e.g. spelling checks are suppressed for data which is actually
a computer program or some other code. The language
attribute still applies in the sense that speech generation is
to be performed according to the pronunciation of some language
when applicable. Thus, for example, a document that explains the HTTP
protocol could mention the {code Referer} field
without fear of having some software fix "Referer" to "Referrer" but
still letting speech synthesizers try to pronounce "Referer"
according to English rules.
The notation
attribute
can be used to specify the
coding system, or "language" (programming language, command
language, data format etc.) of the content.
This might affect rendering and
might be used by automatic checkers, so that a document could
even be processed
so that natural language texts are checked for spelling, grammar, etc.,
and e.g. program samples are checked for syntactic correctness according
to the rules of the programming language.
The values of the notation
attribute are to be defined
along the following principles.
For programming languages, their normal names are used, in lower case
letters; versions can be specified by using a hyphen followed by
version information, e.g. fortran-90
.
The generic value indicating mathematical notation is math
,
and it too can be followed by more detailed information.
For chemistry, the value is chem
.
The value binary
indicates textual presentation of
binary data, not to be assumed to be meaningful (e.g., contain
words) even if it might by accident look sensible.
The notation
attribute value
quantity
indicates that the content is a
value of a physical quantity expressed using conventional
notation that is language-independent, except possibly
for the use of decimal point vs. comma.
The attribute value si
indicates the same but
additionally says that the expression uses the
SI system.
Examples: {code(notation:quantity) 10 in}
,
{code(notation:si) 254 mm}
.
The markup may help e.g. a speech synthesizer spell out the
expression correctly.
A browser could have a conversion function: clicking on
a quantity denotation could invoke some code (in the browser,
or in a plugin, or on a server)
for interactively converting the quantity to some other units.
Moreover, in visual presentation, such an
expression should not be broken across lines (but an author
could still use a no-break space instead of a normal space
to make this clear even to browsers that do not make use of
the higher-level markup).
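The conversion function mentioned above could, under simple assumptions, look like the following Python sketch. The unit table and the function name are purely illustrative and not part of the proposal; only lengths are handled, the input is assumed to have the form "number unit" with a decimal point, and the result uses a no-break space so that it is not broken across lines.

```python
from fractions import Fraction

# Exact conversion factors to metres; an illustrative table, not a
# complete or normative list of units.
TO_METRES = {
    "mm": Fraction(1, 1000),
    "cm": Fraction(1, 100),
    "m": Fraction(1),
    "in": Fraction(254, 10000),
    "ft": Fraction(3048, 10000),
}

NBSP = "\u00a0"  # no-break space, keeps value and unit on one line


def convert_length(expression: str, target_unit: str) -> str:
    """Convert the content of a quantity element, e.g. '10 in', to target_unit."""
    value_text, unit = expression.split()   # also splits on a no-break space
    metres = Fraction(value_text) * TO_METRES[unit]
    converted = metres / TO_METRES[target_unit]
    return f"{float(converted):g}{NBSP}{target_unit}"
```

Applied to the content of the first example above, `convert_length("10 in", "mm")` yields the content of the second one, "254 mm" (with a no-break space).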
The code
markup
typically affects the presentation, too, if the
notation
attribute is specified:
browsers should try to
present the content in a manner which is suitable for
the notation system used. For example, for source programs that
would normally mean using a monospace (non-proportional) font,
perhaps one that gives a specifically "computerish" look.
Note that the typical presentation rules are somewhat different
for mathematics and for chemistry.
The markup {math}
, {chem}
,
{quantity}
, {si}
,
and
{binary}
are shorthand notations for
{code(notation:math)}
etc.
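A processing program could normalize these shorthands before doing anything else. The Python sketch below shows one way to do that, together with a helper that splits a notation value such as fortran-90 into its name and version part. Both function names are assumptions of this example, and the sketch simply takes the first hyphen as the separator.

```python
from typing import Optional, Tuple

# Element names defined in the text as shorthands for code(notation:...).
SHORTHANDS = {"math", "chem", "quantity", "si", "binary"}


def expand_shorthand(element_name: str) -> str:
    """Rewrite e.g. 'math' as 'code(notation:math)'; leave other names alone."""
    if element_name in SHORTHANDS:
        return f"code(notation:{element_name})"
    return element_name


def split_notation(value: str) -> Tuple[str, Optional[str]]:
    """Split a notation value at its first hyphen into (name, version)."""
    name, _, version = value.partition("-")
    return name, (version or None)
```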
The details of basic mathematical markup are to
be defined. The basics should be as simple as, or preferably
simpler than, in the
HTML 3.0 draft.
For example, that draft said that
the integral from a to b of f(x) over 1+x
would be written as
<MATH>∫_a_^b^{f(x)<over>1+x} dx</MATH>
whereas the UTD notation would be
{math {Integral {Division f(x) 1+x} x a b}}
where the {math}
markup could be omitted (since
its content is a single element which by definition is a math
mode element).
The general idea with math expressions is, as usual in UTD, that a wide range of presentation formats can be used, according to technical constraints and quality of implementation. For example, a graphic browser could support conventional mathematical presentations of exponents, integrals, etc. in simple cases and fall back to simplified or even simplistic presentation e.g. at deeper levels of nesting exponents.
Of course, a browser is not expected to
evaluate expressions, or even to simplify them,
except sometimes notationally (e.g. removing unnecessary
parentheses). A browser is not even allowed to
"do math" on the expressions that it is expected to display.
But specialized software for processing data
in {math}
elements in UTD documents could
operate on them. For example, an advanced browser with special
math support could, upon user command, take an expression selected
by the user and use it as input to formula manipulation, numeric
calculations, graph plotting, or other purposes.
An initial small set of mathematical markup:
Division
indicates division; it is expected to be presented two-dimensionally
when possible, as a rule, but
{Division a b}
can be displayed as (a)/(b) if needed
(maybe omitting parentheses if they can be judged unneeded)
Op
indicates that the content is the name of a standard function
such as "sin" or "ln",
or name used as an operator like "det" or "grad",
or equivalent, or generally something that
would
typically be displayed in upright style, whereas otherwise
words inside mathematical markup are assumed to be variables and
displayed in italics-like style
Binomial
indicates a binomial coefficient notation; example:
{Binomial n k}
Sum
indicates an n-ary sum, to be presented as "sigma expression"
when possible; the content consists of four elements:
expression for the general term of the sum,
index variable, lower limit, upper limit; however, if only two
elements are specified, the second one is taken as indicating the
summation range (and to be put below the sigma, in the "two-dimensional"
notation); this explains the order of the elements
Product
indicates an n-ary product (cf. the above)
Iterated
is a generalization of the two elements above, with the first
sub-element specifying the operator to appear instead of the
sum or product operator
Integral
similar to Sum
, but denotes a definite integral
Integral-function
denotes an indefinite integral; similar to Integral
but has just two sub-elements, the second one specifying the variable
Lim
indicates a limit expression: {Lim a b c}
is the limit of a
as b approaches c
Root
denotes a general root expression, e.g.
{Root 3 1000}
Sqrt
denotes a square root expression, e.g.
{Sqrt 2}
Abs
denotes an absolute value (norm) expression; {Abs a}
could be rendered as abs(a) or |a|
vector
denotes the content as a symbol of a vector;
could be presented using bold face, or with an arrow above
(should this be forcing markup?)
Det
denotes a determinant expression, typically to
be presented so that {Det A}
is displayed as det(A),
but a good-quality browser could use the conventional mathematical
notation if the content is an explicit matrix.
Is the repertoire too rich? For example,
if the {Lim}
markup were not available, one could write
{Lim a b c}
using more primitive and more presentational
notations:
{Below {op lim} { b {char(2192) ->} c}}
but in addition to being simpler and more natural to write,
the {Lim}
markup can be handled better by software
that does not display math two-dimensionally. It's easier
to design a fallback when you know the specific meaning
of a construct. In particular, even the
"simplistic fallback" works for
{Lim}
relatively well.
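To make the point about fallbacks concrete, here is a small Python sketch of a linear (one-dimensional) renderer for a few of the elements above. Only the (a)/(b) form for Division comes from the text; the other output strings, the function names, and the simplified parsing (well-formed input, no attributes, whitespace-separated content) are assumptions of this example. Unknown elements are shown literally, in the spirit of the simplistic fallback.

```python
import re


def parse(text):
    """Parse '{Name arg ...}' into nested lists; bare words stay strings."""
    tokens = re.findall(r"[{}]|[^\s{}]+", text)
    pos = 0

    def element():
        nonlocal pos
        pos += 1                      # skip the opening brace
        items = []
        while tokens[pos] != "}":
            if tokens[pos] == "{":
                items.append(element())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1                      # skip the closing brace
        return items

    return element()


def render(node):
    """Render a parsed element as a single line of plain text."""
    if isinstance(node, str):
        return node
    name, args = node[0], [render(a) for a in node[1:]]
    if name == "Division":
        return f"({args[0]})/({args[1]})"
    if name == "Sqrt":
        return f"sqrt({args[0]})"
    if name == "Abs":
        return f"|{args[0]}|"
    if name == "Lim":                 # limit of args[0] as args[1] approaches args[2]
        return f"lim[{args[1]} -> {args[2]}] {args[0]}"
    if name == "Integral":            # expression, variable, lower limit, upper limit
        return f"integral[{args[1]} = {args[2]}..{args[3]}] {args[0]}"
    if name == "math":
        return " ".join(args)
    # Unknown element: display the markup literally.
    return "{" + " ".join([name] + args) + "}"
```

Because the renderer knows what Lim means, `render(parse("{Lim a b c}"))` can produce a readable "lim[b -> c] a" without any two-dimensional layout.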
An explicit matrix, i.e. with elements presented
in a tabular format, is a special case of a table.
A good-quality browser will display a Table
element in
the conventional matrix style if it appears inside a
{math}
element.
See also phrase level markup, which contains some elements that are useful in math, too. (Question: Should they be moved here?)
Note that in mathematical notations in UTD, preference is given
to the use of ISO 10646 characters for operators, special constants, etc.
Thus, we would denote e.g. the
intersection of sets A and B using the normal infix notation
and the ISO 10646 character for the intersection symbol, i.e.
the notation
{math A {char(2229)} B}
rather than e.g. invent a prefix notation like
{Intersection A B}
. However, for constructs that are
normally displayed "two-dimensionally" rather than as a linear
sequence of characters, e.g. for integrals,
special markup is defined to allow
programs to recognize the structure easily and use a quality
presentation when possible (degrading to some linear notation
otherwise, of course).
The principle "use characters, Luke!" is not applied
to things like overlining, though. The main reason is not the
current lack of support for "nonspacing modifiers" (like nonspacing macron)
but the view that overlining is not essentially a character-level
issue: it applies to an expression as a whole, rather than
the individual characters with which it has been written.
Therefore, we have markup like
{Underline}
and
{Overline}
, where the content is the element to
be underlined or overlined. This is to be interpreted as
logical markup, not just decoration or emphasis:
the underline or overline is expected to have some
specific mathematical meaning (defined by some convention).
Similarly,
{Above x y}
indicates that y is to appear (straight) above x,
for some structural reason defined by some convention.
Note that this is distinct from superscripting and
typically used to place an arrow or special
symbol above an expression.
And {Below x y}
is similarly defined.
For expressions like sin x (the sine of x),
the parenthesized notation should be used:
{math {Op sin}(x)}
.
A browser can omit the parentheses if it can decide that they
are not needed in a particular case and presentation environment.
You could use {code}
to specify that
a string like 3.2
is not to be taken in its
natural meaning in the language of the document but as "pure code",
such as the version number of a program (to be used as such in
any language).
This means that it will not be taken as a number with a decimal point
in English, and consequently it will not be converted to
3,2 e.g. when translating into French. Note that you could explicitly
indicate 3.2 as a number using {number 3.2}
, but in the
absence of any explicit markup, a translating program would
probably treat 3.2
as a number.
The following three elements do not indicate their content as code, but they are related to code markup.
The input
markup indicates that the content
is user's input to a computer program or equivalent. It would
typically be used when describing man-machine interaction.
The markup is not forcing, but it will probably be used mostly
to make it easier to visually distinguish user input from computer
output.
Similarly, output
indicates output from
a computer program or equivalent. Both this and the {input}
element may contain any data, and its logical structure is
to be expressed separately.
The comment
markup is used inside code
markup only, and it indicates that its content is in some
natural language. Note that this element lets authors use simple
markup for program samples containing comments. The
comment
markup
is not for "commenting" UTD markup; UTD is not expected to
be "commented", but if needed, the Rem
markup can
be used.
{example}
indicates that the content presents an
example of something discussed elsewhere in the text. Typically
useful in a textbook,
a technical specification or a theoretical discussion, in
order to distinguish between the various parts of the text.
Often to be accompanied with a style sheet that makes the
examples visually different from normal text.
{Note}
contains a note that relates to
the topic of discussion somehow but breaks the
main flow of thought.
It could be a historical note, remarks on some details,
or a reference to a document that supports the ideas presented.
Can be presented as a footnote,
or in reduced font size and/or line spacing,
or e.g. just in special parentheses.
The markup is forcing
and should be used only when it is adequate, and perhaps necessary,
to explicitly indicate the note as a note, as e.g. in informal
non-normative comments to normative rules in a specification.
This element can be regarded as general markup for "de-emphasis", or
indicating something as less important
(to an assumed average reader in the intended audience).
The kind
attribute could be used to indicate, in
a formalized way, why the content is less important;
the values of that attribute are to be defined, but they might contain
e.g. detail
, history
,
etymology
, and
cite
.
{omissible}
indicates that the content can,
if needed, be omitted
or made available as secondary content only (e.g., via a link).
This would typically be used for documents that are to be published
in newspapers etc., making it easier to fit an article into
some pre-allocated amount of space. An optional
level
attribute
can be used to indicate the order of omissibility;
{omissible(level:1) ...}
is to be omitted before
{omissible(level:2) ...}
. Note that the markup does
not say any specific reason why the content is omissible.
Other markup, such as {note}
and
{example}
, should be used for that.
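The following Python sketch illustrates how a layout program might use the level attribute to fit an article into a given space. The representation of the document as (text, level) pairs, the function name, and the character-count measure of space are all assumptions made for the example; parts with level None stand for content that is not marked up as omissible, and a whole level is dropped at a time.

```python
def fit_to_space(parts, max_chars):
    """Drop omissible parts, lowest level first, until the text fits.

    parts is a list of (text, level) pairs in document order; level is
    None for ordinary content and an integer for omissible content.
    Ordinary content is never dropped, so the result may still be too long.
    """
    kept = list(parts)
    levels = sorted({level for _, level in kept if level is not None})
    for level in levels:
        if sum(len(text) for text, _ in kept) <= max_chars:
            break
        kept = [(text, lvl) for text, lvl in kept if lvl != level]
    return "".join(text for text, _ in kept)
```

A real implementation would presumably measure rendered size rather than characters and might drop individual elements instead of whole levels.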
{Revealable}
indicates that its content must not be
presented by default as part of the document; instead, an indication
of the availability of the content is to be presented.
That is, the content must be accessible to the user but not visible
or audible without an explicit user action.
This is intended e.g. for movie reviews containing "spoilers",
bridge puzzles containing the solution and all hands revealed,
and for material that might not be suitable for all audiences
although the document otherwise is.
An optional kind
attribute can be used to specify
the reason for hiding the content.
The markup is forcing.
It should not be used just to indicate some content
as being of secondary importance.
A graphic browser could implement {Revealable}
by
showing a generic button (e.g. with text like "show hidden content")
which, when clicked on, would cause a page with the hidden content (only)
to be displayed. This would effectively turn the element to a link
to its content. Other implementations are possible too, such as
"toggling" between "show no hidden content" and "show all hidden content"
(as part of the document itself).
{Seq}
indicates sequentiality in
a particular sense: The subelements should be presented to the user
in a sequence so that the first one is presented, and then after an
explicit user action for continuation,
the next subelement is presented, and so on.
This could be achieved using {Revealable}
too, but the
{Seq}
markup is more concise and more natural e.g. for
teaching material that has been designed to be read in portions, or
for jokes intentionally divided into parts.
Non-interactive presentation of a document, e.g. printing a
document on paper, must indicate the intent of hiding or sequencing in
some way, and should simulate it as far as feasible. For example,
on printing, a page eject should normally be generated between the
subelements of {Seq}
.
{Conclusion}
contains
a conclusion drawn from some previous or subsequent discussion.
It can be very useful to present the main conclusions of
a document in a paragraph or section marked up as a conclusion,
either at the beginning or at the end, or sometimes elsewhere.
(In scientific presentations, it is common to put the conclusions at
the end. But it is generally a good idea to put them first,
since they are typically what the reader is most interested in.)
This is not the same thing as an abstract, since an abstract may also include
e.g. a short presentation of the basic line of the reasoning;
it is of course possible to use {Abstract}
markup for
a general summary and {Conclusion}
markup inside it.
The {Conclusion}
element can also be used for
"lower-level" conclusions, such as indicating where some intermediate
conclusions are drawn in a discussion document.
The markup is forcing, since it implies a special kind of strong
emphasis.
{proposal}
indicates that its content presents a suggested or proposed action,
often presented after a lengthy introduction and discussion.
The markup should be used sparingly, especially to distinguish the
concrete proposals from general ideas, motivation, and arguments.
In one bureaucratic style of presentation, such things are indicated
by a ./. mark in a margin, but browsers could use much more advanced
methods of making proposals look prominent and might have a built-in
function for extracting and displaying the {proposal}
element(s) from a document.
{Warning}
has the obvious meaning. The markup is
forcing, and the content is to be presented
prominently in a manner that suggests a warning, e.g. using
a red border around the text, or inserting a road traffic warning
sign.
{Important}
indicates relative emphasis on the content,
relative to the emphasis level of the enclosing document.
Thus, the meaning of nested
{Important}
markup
is cumulative; such nesting
should normally be avoided. It should
only be used when there is no adequate other markup that implies
some information on why the content is important,
such as {Conclusion}
or {Warning}
.
For {Important}
as well as for other general
elements that involve some kind of emphasis, a browser should
use a presentation style that is suitable for the content as judged
from its inner structure and length. When they contain just small
amounts of text and text-level markup, bolding might be adequate.
But such text is hard to read in large amounts, so especially
if there is a paragraph (or more) in the content, a browser should
use some other method that catches the reader's attention and
makes the content prominent.
It could be distinctive background color, slightly increased font size,
a vertical line in the margin, or a short musical prelude.
When such methods are not available, a browser could use
the "simplistic fallback" or, in quite
a few special cases, the simple method of starting e.g.
a paragraph with some text like "Important: " (or the equivalent
in the document's language).
The {Utterance}
markup indicates that the content
is a statement made by a person (perhaps a role person).
It is typically used in novels and plays, and typically with an
{agent}
element inside it expressing
the subject that made the utterance. Example (abridged):
{Utterance {agent Hamlet} To be or not to be, that
is the question.}
The markup is forcing, but a browser
(and especially a speech generator) might just use different presentation
(e.g., different voices)
for the utterances of different agents, just summarizing the correspondence
between agents and presentations somewhere.
The {Editorial}
markup indicates its content
as editor's notes rather than normal document content. In particular,
when a document has been converted from another format into UTD
format, any notes that explain the decisions and modifications
made in the process should be marked up as editorial.
Similarly, when a document is prepared for publication by a person
other than its author, the editor's own notes should be clearly marked up.
An {Author}
element can be used inside an
{Editorial}
element to specify the editor.
Sometimes there is no adequate generic markup for dividing the document content into different categories but there is some need for making such a distinction observable in document presentation. It would not be suitable to rely on presentational suggestions like style sheets when it is necessary that some distinction is made in the visual or aural presentation. "Abstract colors" (or "generic colors") come to the rescue then.
The idea is that an author often wishes to mark up some parts of the document as having some special role that cannot be expressed in normal UTD markup. (It might be a role that is common enough to be included into UTD in some later version.) For instance, an author might wish to mark up some parts as normative and some others as explanatory, or, in a description of the features of some computer language, flag some of them as deprecated. In the UTD format, he has seven abstract colors at his disposal. A browser is required to map all the abstract colors, or categories, to some presentation that distinguishes them from each other, from normal text, and from the presentation of other elements. They could be physically presented as background colors, text colors, text fonts, tones of voice, in some uniform way.
Visual browsers are expected to do the mapping usually so that
for each abstract color, a separate, pale but observable background
color is used. But they might use other methods instead of or
in addition to this. As usual, the
"simplistic principle" implies that this
might be as trivial as literally displaying the markup,
as the last resort. (In this case, it would not be sufficient to
display just the {Category}
element; the class
attributes should be shown too.)
To start using an abstract
color, the author uses an element like
{Category(class:classname,title:explanation)}
such as
{Category(class:normative,title:A prescriptive rule)}
After this, any element with the corresponding class
attribute belongs to a category to be presented along the lines
explained above. Style sheets could be used to suggest
particular presentations for categories;
it might make sense to change the colors or other presentation
from the user's defaults to reflect the particular purpose for which the
markup is used.
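One simple way for a visual browser to satisfy the mapping requirement is sketched below in Python. The palette values, the function name, and the choice to assign colors in declaration order are assumptions of this example; the only constraints taken from the text are that there are seven abstract colors and that their presentations must differ from each other.

```python
# Seven pale background colors; the actual values are arbitrary examples
# and would normally be user-configurable.
PALETTE = ["#fff3c4", "#d9f2d9", "#d9e8ff", "#f9d9e6",
           "#e6e0f8", "#d9f2f2", "#f2e6d9"]


def assign_presentations(categories):
    """Map declared categories to distinct backgrounds.

    categories is a list of (classname, title) pairs, in the order in
    which the Category elements appear in the document.
    """
    if len(categories) > len(PALETTE):
        raise ValueError("only seven abstract colors are available")
    return {
        classname: {"background": color, "title": title}
        for (classname, title), color in zip(categories, PALETTE)
    }
```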
Note that class
attributes themselves do not
introduce categories; they can be used just for optional styling too.
Binding abstract colors to physical presentations
should be user-configurable, preferably dynamically so that the values
of class
attributes (from {Category}
elements)
and associated title
attributes
are displayed to give a hint about the author's
intentions. Such a possibility is not expected to be
used very often, but it could be very useful in special cases,
like customizing the presentation of some extensive material
that the user will view often.
But how can we explain to the user how to interpret the physical presentations? We might be inclined to write something like "The formal parts are shown with a gray background." This, however, would wire the particular presentation into the document content. Hence, it is expected that the system responsible for selecting the presentation (typically, either browser defaults or an author's style sheet) will give any explanations on it if needed. For example, a style sheet could contain generated content, like "The formal parts are shown with a gray background" or "The formal parts will be read in this voice". This would make the issuing of such information dependent on whether the particular presentation style is actually used, and this of course is how things should be.
To compensate for the lack of an arbitrary markup extension mechanism, a generic constructor for "records" is defined. This will probably handle most of the extension needs that normal authors would use XML for, with much less theoretical and notational burden than XML has.
In a {List}
or {Row}
element,
the attribute pattern
can be used to denote that
the element is not to be taken as actual list or row but as
describing a type, or general content pattern, of lists or
rows. This attribute may have a value that becomes then the
name of the pattern. The values in the list or row are
names of elements or patterns optionally
preceded by a name and a colon.
Typically element names like
string
, integer
,
number
and time
are used.
The main purpose of this construct is to let authors specify
the data types of cells in tables, so that columns can be suitably
formatted (aligned) even without explicit stylistic
suggestions.
For example, in a simple table where the first column contains
names of countries and the second column contains
the population numbers, a row like
{Row(pattern) country:string population:integer}
would in practice make a browser present the first column
left aligned, the second column right aligned.
Browsers and other programs that process UTD documents might also
make simple syntax checks, or "type checking",
based on such information, detecting
typos and other problems.
Moreover,
the names country
and population
could be used in style sheets, script codes, etc., for referring
to the cells in natural ways, rather than e.g. via column numbers.
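A Python sketch of what a program could do with such a pattern row follows. It parses the content of the pattern row from the example, derives a default alignment for each column, and performs a very simple type check on a data row. The function names, the rule that integer and number columns are right aligned, and the integer test are assumptions of this example rather than rules of the proposal.

```python
import re

RIGHT_ALIGNED_TYPES = {"integer", "number"}


def parse_row_pattern(content):
    """Parse 'country:string population:integer' into (name, type) pairs.

    An item without a 'name:' prefix gets the name None.
    """
    columns = []
    for item in content.split():
        name, sep, type_name = item.partition(":")
        if not sep:
            name, type_name = None, item
        columns.append((name, type_name))
    return columns


def alignment(type_name):
    """Default alignment of a column, judged from its declared type."""
    return "right" if type_name in RIGHT_ALIGNED_TYPES else "left"


def check_row(columns, cells):
    """Return a list of problems found when checking cells against types."""
    problems = []
    for (name, type_name), cell in zip(columns, cells):
        if type_name == "integer" and not re.fullmatch(r"[+-]?\d+", cell.strip()):
            problems.append(f"{name}: {cell!r} is not an integer")
    return problems
```

The column names returned by `parse_row_pattern` are also what a style sheet or script would use to refer to cells by name.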
{Include(URL) fallback}
specifies that the content at the specified URL is to replace
the Include
element. If that content is of type
text/utd
, it is inserted literally as such.
If the type is text/plain
, it is inserted as such
but with surrounding {Text(plain) ...}
markup implied, i.e.
the content is taken as plain text to be displayed as such.
For other content types, inclusion logically means that the
document is to contain the content of the referred document
as embedded, as smoothly as possible.
If the inclusion fails, for one reason or another, the
fallback content is used instead. If that content is
omitted, a system-dependent default error message is inserted
instead; the author could override this by explicitly specifying
the empty element {}
as the content.
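The inclusion rules above can be summarized in a short Python sketch. The fetch callback, the error message text, and the placeholder returned for embedded resources are hypothetical; the sketch only mirrors the decision logic described in the text and assumes that a failed inclusion shows up as an OSError.

```python
DEFAULT_ERROR = "[inclusion failed]"   # stands for the system-dependent message


def resolve_include(url, fallback, fetch):
    """Return the UTD text that replaces an Include element.

    fetch(url) must return (content_type, body) or raise OSError.
    fallback is None when the element has no content at all, and ""
    when the author wrote the empty element {} as the content.
    """
    try:
        content_type, body = fetch(url)
    except OSError:
        return DEFAULT_ERROR if fallback is None else fallback
    if content_type == "text/utd":
        return body                               # inserted literally
    if content_type == "text/plain":
        return "{Text(plain) " + body + "}"       # implied plain-text markup
    # Any other type is embedded; a placeholder stands for the embedded
    # box in this sketch.
    return "[embedded " + content_type + " from " + url + "]"
```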
Question: Should we have a general "transclusion" element
(instead of Include
and Ref
)
which can be implemented as inclusion, as embedding, or
as a link? This would raise serious legal issues.
For example, assuming that the relative URL
photo.png
refers to an image file, the element
{Include(photo.png) Finland is a country with
thousands of lakes.}
is preferably displayed by
presenting the image in its place. A browser should still make the
fallback content accessible to the user upon special request.
If such presentation is impossible for one reason or another,
the browser should display the fallback content instead, using
some presentation style that suggests that it is a replacement for
something, and giving the user the option of accessing the image
file somehow (minimally, by downloading it for processing in a user-specified
manner). Note that if the fallback content would be long, e.g.
because the verbal explanation of the information content of an image
would be verbose, you could put it into a separate file and link to it,
or use {Include}
for it.
Generally, an included (embedded) resource other than
text/utd
or text/plain
will be presented
in a separate box allocated for the purpose. The box could contain
an image, or a movie, an animation with some controls, a spreadsheet
(with or without controls), etc. It could also be audio data, in
which case it should be regarded as parallel to the smallest
enclosing element but starting at the point where the
Include
element appears.
The Include
element can contain a
Caption
element, which specifies content to be closely
associated with the embedded data, typically a textual caption,
to be shown below it or in some other suitable way.
If actual embedding does not take place, the display of the
Caption
element should be suppressed, but the content
of that element should be made available to the user when he
exercises the access option described above. Thus, for example,
on a text-only browser, image captions should be suppressed, but
if the user asks the browser to launch an external image viewer,
the caption content should be shown too, in some suitable manner.
{Define id expansion}
is
not to be presented or otherwise
treated as data but as a simple constant definition:
any subsequent occurrence of {id}
is to be (literally) replaced by the expansion.
This may result in redefining an element name. Thus, authors are
recommended to use id names that begin with a capital
letter. (They might still clash with some predefined names, but
this usually shouldn't be a problem.)
Typical use for constant definitions is for URLs or parts of URLs
that occur frequently in a document.
The mechanism could be used for repeated textual content as well.
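The literal replacement described above can be sketched in Python as follows. The function name is an assumption, the URL in the usage note is a made-up example, and the sketch assumes that an expansion contains no braces. Since matches are processed from left to right, only occurrences after a definition are replaced, and a later definition of the same id overrides an earlier one.

```python
import re

# Either a definition, {Define id expansion}, or a bare use, {id}.
TOKEN = re.compile(r"\{Define\s+(\S+)\s+([^{}]*)\}|\{([^\s{}()]+)\}")


def expand_defines(text):
    """Remove Define elements and expand subsequent uses of their ids."""
    definitions = {}

    def substitute(match):
        if match.group(1) is not None:            # a Define element
            definitions[match.group(1)] = match.group(2)
            return ""                             # not presented as data
        # A bare {id}: expand if defined so far, otherwise keep it as is.
        return definitions.get(match.group(3), match.group(0))

    return TOKEN.sub(substitute, text)
```

For instance, `expand_defines("{Define Base http://example.org/docs/}See {Base}intro.")` gives "See http://example.org/docs/intro."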
The {If}
markup
is (basically) not for
preprocessing but for specifying logical conditionality of content.
It has a required attribute that specifies a condition, typically
relating to possible presentations of the document. For example,
{If(screen) ...}
means that the content ...
is relevant only when the document is viewed on a screen and should
be omitted e.g. when presenting it on paper or in speech.
The condition syntax and semantics are to be defined, perhaps
similarly to CSS2 "media" concepts.
It would be possible to let a UTD document contain e.g. image data, using something like
{data(foo.gif) Content-Type: image/gif ... GIF data here ... }
{data(my.css) Content-Type: text/css heading { font-size: 1em; } }
This might be convenient
(and might contribute to efficiency of data transfer)
especially when such data sets are
small.
However, it is probably better to solve the problem of
merging data types at a different level, e.g. by using
a multipart
data type.
Comparing the above proposal with various HTML specifications and implementations, you will probably notice that something is missing.
First, UTD lacks presentational markup,
like a counterpart to the HTML element
<font>
and the HTML attribute
align="center"
. This is intentional.
The basic reason is separation of structure from presentation.
You are expected to make presentational suggestions (if desired)
in an attached style sheet, or otherwise outside UTD.
Moreover, UTD has better logical markup, from which suitable default
presentations can be selected by a browser. For example, a program
processing a table in UTD can know that some column contains
integers and could thus select right alignment without any explicit
presentational hint.
However, it could be argued that in special cases some presentational markup might be desirable, especially when quoting material from a printed source or otherwise in fixed format. For example, consider the problem of quoting some text containing words in italics; if you cannot know what the italics mean, you can't really use suitable logical markup for those words. So you might like to say "this is in italics, for some reason I don't know (though I have some guesses, perhaps)". Some font-level markup wouldn't really harm anyone if it were used only when appropriate; but that's a big "if". It's probably better to let authors make even such presentational suggestions outside UTD.
UTD has no counterpart to HTML forms. Forms are
important, but they are best handled outside a general text data format.
This becomes obvious if you think about what happens when you print
an HTML document containing a form.
How do you click on a submit button on paper?
But the general idea of submitting
some data to some processing could be formulated so that it
makes sense to print a form.
This would be especially relevant when it makes sense to fill out
the form either on screen or on paper.
This would mean that the printed version contains information that
is not present in the version displayed on screen, and vice versa;
the printed version should minimally be accompanied with instructions
on where to send it after filling it out.
However, this requires special consideration
and is probably best done as a separate effort. From the UTD perspective,
a form could be a separate document type, perhaps embedded into a UTD
document via {Include}
.
But we might also consider
adding an element like {submission}
, for the
logical specification of the format of data to be entered
and submitted to processing.
The UTD format itself has no provisions for image maps. At the logical level, a client-side image map should be viewed as a list of (textual) links, with an optional graphic presentation alternative where the links appear as areas of an image. A server-side image map relates to interactive user interface issues and should probably be considered in conjunction with forms. However, such interface issues are beyond the scope of a data format like UTD.
Client-side scripting is to be handled outside UTD.
There is no particular reason to embed scripts into UTD or have
"event attributes" as in HTML. Instead, a UTD document can be
accompanied with a separate script file or a combination of script
code and special code for interfacing it with the UTD document.
For example, instead of an onmouseover
event attribute
attached to a particular element, that element should be referred to
in script code, either using its id
value or
some "selector" which identifies the element by its position in
the logical document tree.
Similar considerations apply to style sheets.
Mixing style sheets and scripts with HTML markup is a frequent
cause of confusion in HTML authoring. An author's presentational
suggestions for a document shouldn't even be treated as something
to be referred to in the document, still less embedded in it.
This does not exclude the possibility that a Web server, when
asked to send a UTD document, would automatically send the author's
recommended style sheet (or a reference to it) along with the document.
But this should be handled outside UTD, e.g. (to take a trivial example)
so that when foo.utd
is requested, the server
checks for the existence of foo.css
and sends
an HTTP header (to be defined in an extension to the HTTP protocol)
that suggests the use of that resource as a style sheet.
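The trivial example above could be sketched as follows. The header name X-Suggested-Style-Sheet is purely hypothetical, since the text leaves the actual header to be defined in an HTTP extension; the function name and the flat directory layout are assumptions as well.

```python
from pathlib import Path

# Hypothetical header name; the actual header is left to be defined.
STYLE_HEADER = "X-Suggested-Style-Sheet"

def style_sheet_headers(document_root: Path, requested: str) -> dict:
    """Build response headers for a UTD document, suggesting a sibling .css file if one exists."""
    doc = document_root / requested
    headers = {"Content-Type": "text/utd"}
    css = doc.with_suffix(".css")
    # For foo.utd, check for the existence of foo.css in the same directory.
    if doc.suffix == ".utd" and css.is_file():
        headers[STYLE_HEADER] = css.name
    return headers
```

The point of the design is that the UTD document itself contains no reference to the style sheet; the association exists only at the server and protocol level.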
The dir
attribute is missing too. The reason is that
directionality is to be handled at the character level, as
specified in Unicode, without adding extra complexity to the markup.
The difficulty of implementation shouldn't essentially depend on this,
especially since software processing UTD is required to be minimally
"Unicode capable".
What's the relationship between UTD and HTML otherwise?
Briefly, they are different formats, containing different types of
markup that cannot be mapped to each other in all cases. But a limited,
disciplined subset of HTML documents could be automatically converted
to UTD. For example, an HTML document that uses heading elements
consistently has an implied division into parts that can be mapped to UTD.
In the opposite direction, complete conversion is possible if we
accept the fact that most UTD markup has no direct HTML counterpart
but would be simulated using classes and CSS or expressed in textual
content with explicit explanations that somehow correspond to
simplistic visual presentations of UTD.
For example, a {Note}
element could be mapped to
<div class="note">...</div>
markup
and a style sheet for .note
, but to make sure that
the information about this being a note is really transmitted,
the conversion might prefix the content of the note with something like
"Note:" or "Bemerkung:" or "Huomautus:", depending, of course, on the
language of the document.
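A converter might handle this case roughly as in the sketch below. It assumes that the note content is plain text and that the document language is known as a two-letter code; the function name, the prefix table, and the fallback to English are illustrative choices only.

```python
from html import escape

# Prefixes taken from the examples in the text; other languages would need entries of their own.
NOTE_PREFIX = {"en": "Note:", "de": "Bemerkung:", "fi": "Huomautus:"}

def note_to_html(content: str, lang: str = "en") -> str:
    """Map the content of a {Note} element to a div with class "note" and a textual prefix."""
    # Assumption: fall back to the English prefix for languages not in the table.
    prefix = NOTE_PREFIX.get(lang, NOTE_PREFIX["en"])
    return f'<div class="note">{escape(prefix)} {escape(content)}</div>'
```

The explicit prefix keeps the information that this is a note even when the style sheet for .note is not applied.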
This proposal is far from a rigorous specification. Almost all of it needs some clarification and better formulations. In addition to this, some topics would need to be added, topics that are important but relatively straightforward and/or orthogonal to the design as a whole. This applies to things like
As a practical question, if UTD were adopted in the near future, we should define the mappings between UTD and HTML, i.e. how HTML could be converted to UTD and vice versa, to the extent possible. And a UTD to HTML(+CSS) converter, or principles of such software, might anyway work as an illustration of how UTD is radically different.
This afterword (or should we call it postface?) briefly explains the real reasons for writing this document.
I have been thinking about these issues for years. When I left for the first World Wide Web conference (in 1994), I had written myself a list of questions to which I would look for answers. Well, I really didn't get answers there, or even later, mostly. I have collected various notes on how HTML should be changed, or should have been designed. Sometimes I have submitted some of them to public discussion, e.g. in my review of the HTML 4.0 draft in 1997, sometimes I just wrote incomplete drafts like HTML Redesigned: HMM (HyperMedia Markup language), informing some people about them. And I have been disappointed by the general lack of understanding of what I am really trying to say and why it is important. I know some of the reasons behind that, including my inability to make my points briefly and to "sell" ideas, but I think the main reason is still that the world is not yet ready for the idea of universal semantic markup. I don't expect that the situation is much better now.
But in spring and summer 2001, during my sort-of sabbatical half-year, I started composing this somewhat unified presentation of my ideas, trying to put things together. This inevitably means making some decisions on details that really don't matter that much, just to be able to present things uniformly and somewhat understandably. It would still need quite some polishing, clarifications, restructuring and associated formalized definitions and code samples. But I think that now, 2001-08-31, it's as complete as it will ever be, by me. I hope others will continue the work, some day.
Having virtually finalized this document, I became aware of the work done in the Electronic Manuscript Project by the Association of American Publishers (AAP) and especially of the international standard based on that work, ANSI/NISO/ISO 12083, Electronic Manuscript Preparation and Markup (a successor to ANSI/NISO Z39.59-1988 and available for free download at NISO). I haven't studied it in much detail, but there's much in common (and there are quite a lot of interesting ideas and details worked out in the standard), and a fundamental difference: UTD is not limited to, or even especially oriented towards, publication on paper, though it could certainly be used in publication processes too. It needs to be studied in detail which aspects of paper publishing should be reflected in logical markup and which of them should be delegated to other "protocol levels", such as style sheets or equivalents.
I don't expect that the idea presented here will be widely understood, accepted and implemented in the near future. My most optimistic estimate is that it might happen around 2020, but it could well take a much longer time. Anyway, it is nice to think that in 2020, or 2200, or later, some people might read this and find it useful, or entertaining, or historically interesting against the background that something that more or less resembles UTD has been or is being adopted as a universal format for structured text data. I wonder what hype words will be used about it, or for it.
So in the present world, was it futile to write this, and to read it? After all, who cares about what happens in 2020, especially if you can't be quite certain about it? Well, now I understand better the needs, the problems, and potential solutions of text document structuring, and if you are reading this, so do you. And we might find some applications for the principles and ideas in various contexts, e.g. when designing "XML based" markup for something, or the use of classes and style sheets to achieve some quasi-structurality, or designing various other document formats that can be used in present-day systems. If you work on a browser, or with a well-customizable browser, you might have found some ideas on how browsers could present some document structures. There's quite a lot one could do even with the primitive markup used these days.
Note 2002-04-14:
Markup for validity constraints
could be added, e.g. {Valid(time:2020/,country:FI) ...}
would indicate that the content is to be regarded as valid (true,
applicable) only in Finland from year 2020 onwards. A browser should
present the content with an explanation of the validity constraint,
unless it can evaluate whether the constraint applies, in which
case it would present the content normally (if the constraint is true)
or omit it (if the constraint is false).
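The three possible outcomes can be sketched with a three-valued evaluation. The sketch assumes that the constraint has already been parsed into a starting year and a country code, and that the browser may or may not know the current year and the user's country; all names are hypothetical.

```python
from typing import Optional

def constraint_holds(from_year: int, country: str,
                     known_year: Optional[int] = None,
                     known_country: Optional[str] = None) -> Optional[bool]:
    """Return True or False if the constraint can be evaluated, None if it cannot."""
    checks = [
        None if known_year is None else known_year >= from_year,
        None if known_country is None else known_country == country,
    ]
    if any(c is False for c in checks):
        return False   # one failed part is enough to make the constraint false
    if any(c is None for c in checks):
        return None    # some part could not be evaluated
    return True

def present(content: str, from_year: int, country: str, **known) -> str:
    """Present content normally, omit it, or show it with an explanation."""
    verdict = constraint_holds(from_year, country, **known)
    if verdict is True:
        return content
    if verdict is False:
        return ""
    return f"[Valid only in {country} from {from_year} onwards] {content}"
```

A browser that knows nothing about the user's country would thus fall back to showing the content together with the explanation.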
Note 2002-08-01: Elements that describe the intended or recorded style of presentation should be added, for use when such information is an essential part of the content, as e.g. in play manuscripts. In fact, in traditional play manuscript style, notes like "(angrily)" are comparable to markup.
Note 2002-08-05: The language code system discussed in the text reflects a naive approach, which does not take into account the fact that ISO 639 covers only a small subset of the languages of the world and does not address the issue of indicating dialect, sociolinguistic forms, style, etc. For a discussion of such issues, see the extensive document Language Identifiers in the Markup Context or my document in Finnish: Kielimerkkaus.
Note 2002-08-08: We might add "character markup" in the sense that we recognize conventions that are often used in plain text data, such as interpreting underline characters or asterisks as "start tags" and "end tags" for emphasis. For example, _foo bar_ might get interpreted as {Emphatic foo bar}. This would help in a gradual move from plain text to a richer format, or could even be used by authors who prefer such shorthand. Conventions would be needed for enabling and disabling this feature.
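A minimal sketch of such an interpretation follows. It assumes that emphasis delimiters must be paired, must not be adjacent to word characters on the outside, and must not have whitespace directly inside them; the enabled parameter merely stands in for the enabling and disabling convention, which is not defined here.

```python
import re

# Delimiter (_ or *), content that starts and ends with a non-space, same delimiter again.
# The lookarounds keep identifiers like my_var_name and expressions like 2 * 3 untouched.
EMPH_RE = re.compile(r"(?<!\w)([_*])(?=\S)(.+?)(?<=\S)\1(?!\w)")

def plain_to_utd(text: str, enabled: bool = True) -> str:
    """Interpret _text_ and *text* as {Emphatic text} when the feature is enabled."""
    if not enabled:
        return text
    return EMPH_RE.sub(r"{Emphatic \2}", text)
```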
Note 2002-12-30:
Sometimes it is the actual spelling of some text
that is essential. For example, the statement
"The term Web site is often written as
web site or website" makes little sense if read
aloud the simple way. It should be spoken in a manner that gives
the exact spelling (for example, by naming the letters and their case and the
use of spaces), or at least the user should be informed that it is
the specific written form that is essential here.
Markup (e.g., {spell
text}
)
could be introduced to express this.
Note 2004-09-26: Sam Hughes suggested, in an E-mail message:
If a UTD browser were ever made, perhaps it should randomize many of its default rendering settings, so that using markup to achieve presentational goals is not possible.
I think that would be desirable indeed, especially if combined with a dialogue that a browser initiates upon installation. A browser should prompt for user choices in a few fundamental issues, such as font size and face, offering a simple menu, with instant preview. Many of the great controversies of Web authoring stem from the fact that Web (HTML) browsers do no such things, so that we cannot realistically expect that people are using settings that are optimal for them.
Note 2005-12-10: Changed the name
Hidden
to Revealable
, which is more generic
and suggests the dual semantics: hidden by default but available to
the user by special request.