A Layman's View of XML, Part 6

by Richard G. Baldwin
baldwin@austin.cc.tx.us
Baldwin's Home Page

Dateline: 12/17/99

Editorial Comments

Before getting into the details of this article, I would like to make a couple of preliminary remarks that have very little to do with the topic of the article.

WYSIWYG HTML Editor: I always like to give credit where credit is due. I don't enjoy suffering through the pains of generating raw HTML using a raw HTML editor. So, I prefer to use a WYSIWYG editor. In addition, I am one of those people who benefit greatly from a good spelling and grammar checker.

Word 97 editor has problems: For a long time, I have suffered through the problems exhibited by the HTML editing capability of Microsoft Word 97. (Among other problems, Word 97 likes to cause angle brackets to disappear. That is not a good thing for someone who writes articles on XML and also writes articles containing Java code.) I have continued to use it, not because it is a good HTML editor, but because it has an excellent spelling checker and a good grammar checker.

Office 2000 to the rescue: I recently purchased Microsoft Office 2000, and have begun using Word 2000 as my HTML editor. I just can’t say enough good things about it. Although I have encountered a few problems, Word 2000 makes it almost as easy to produce HTML documents as to produce ordinary Word documents.

Embedded XML? Actually, there is more to this discussion than a blatant plug for Microsoft. This document is being produced using Word 2000. When I view the HTML source for the document, I see the following near the beginning of the document.

<html xmlns:v="urn:schemas-microsoft-com:vml"
      xmlns:o="urn:schemas-microsoft-com:office:office"
      xmlns:w="urn:schemas-microsoft-com:office:word"
      xmlns="http://www.w3.org/TR/REC-html40">

As you can see, the letters XML appear four times in this excerpt. In addition, when I view the source, I find XML appearing at various other places in the document, having nothing to do with the fact that the subject of the document is XML.

What does this mean? Has Microsoft embedded some XML in this document, or is this just an unfortunate use of the letters XML? Honestly, at the moment, I don’t know. However, I will take it as a sign that in some way, Microsoft is making use of XML in this new version of their WYSIWYG HTML editor.

And if that’s what makes it work so well, I’m all for it.

Now, I will get back to the main topic of this article.

Prolog

This is the sixth in a series of articles explaining XML in layman's language, being particularly careful to avoid the use of technical jargon.

The first article in the series provided the following brief definition of XML:

XML gives us a way to create and maintain structured documents in plain text that can be rendered in a variety of different ways.

Then the article proceeded to break down the jargon into plain English and provided some examples of structured documents.

In the previous articles in this series, I have discussed tags, elements, content, and attributes in detail. Last week, I promised to take up well-formed documents, valid documents, and the DTD in this article.

What is a DTD?

According to the FAQ:

"A DTD is usually a file (or several files to be used together) which contains a formal definition of a particular type of document. This sets out what names can be used for elements, where they may occur, and how they all fit together. For example, if you want a document type to describe <LIST>s which contain <ITEM>s, part of your DTD would contain something like

<!ELEMENT item (#pcdata)>

<!ELEMENT list (item)+>

This defines items containing text, and lists containing items.

It's a formal language which lets processors automatically parse a document and identify where every element comes and how they relate to each other, so that stylesheets, navigators, browsers, search engines, databases, printing routines, and other applications can be used."

But, I thought you said no technical jargon!

Sorry about that.

I decided to stick the above material in to emphasize one very important point – DTD’s are complicated. In fact, in my opinion, DTD’s are perhaps the most complicated aspect of XML.

The bad news!

The bad news is that the creation of a DTD of any significance is a very complex task.

The good news!

The good news is that many of you will never need to worry very much about DTD’s for two reasons:

In the most fundamental sense, XML does not require the use of a DTD.
Even when it is advisable to use a DTD with XML, someone else may already have created the DTD on your behalf.

Three Parts

It is fairly reasonable to think of an XML document as consisting of three parts, some of which are optional. I’m going to refer to the parts as files just so I will have something to call them (but they don't have to be separate physical files).

One file contains the content of the document (words, pictures, etc.). This is the part that the author wants to expose to the client. I have discussed this in previous articles.

A second file is the DTD, which meets the definition given above.

A third file is a stylesheet that establishes how the content that conforms to the DTD is to be rendered on the output device. This is how the author wants the material to be presented to the client.

Rendering

For example, a tag with an attribute of "red" might cause something to be presented bright red according to one stylesheet and dull red according to another stylesheet. (It might even be presented as some shade of green according to still another stylesheet.)

With XML, the DTD is optional but the stylesheet (or some processing mechanism that substitutes for a stylesheet) is required. Something has to be able to render the content in the manner that the author intended it to be rendered.
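The idea that the same content can be rendered differently can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the element name word, the attribute name shade, and the two dictionaries standing in for stylesheets are all made up here, and a real stylesheet language does far more than this.

```python
import xml.etree.ElementTree as ET

# Hypothetical content: the element and attribute names are invented for this sketch.
CONTENT = '<note><word shade="red">Stop</word></note>'

# Two stand-in "stylesheets": each maps the attribute value to a display color.
BRIGHT_SHEET = {"red": "#FF0000"}
DULL_SHEET = {"red": "#8B0000"}

def render(xml_text, sheet):
    """Turn each <word> element into an HTML-like span using the given sheet."""
    root = ET.fromstring(xml_text)
    parts = []
    for word in root.iter("word"):
        color = sheet[word.get("shade")]
        parts.append(f'<span style="color:{color}">{word.text}</span>')
    return "".join(parts)

# Same content, two different presentations.
bright = render(CONTENT, BRIGHT_SHEET)
dull = render(CONTENT, DULL_SHEET)
print(bright)
print(dull)
```

The content never changes. Only the mapping applied to it does, which is the whole point of keeping the stylesheet separate.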

A DTD can be very complex

Again, according to the FAQ:

"... the design and construction of a DTD can be a complex and non-trivial task, so XML has been designed so it can be used either with or without a DTD. DTDless operation means you can invent markup without having to define it formally.

To make this work, a DTDless file in effect `defines' its own markup, informally, by the existence and location of elements where you create them. But when an XML application such as a browser encounters a DTDless file, it needs to be able to understand the document structure as it reads it, because it has no DTD to tell it what to expect, so some changes have been made to the rules."

Without the technical jargon please

In other words, it is entirely possible to create an XML document without the requirement for a DTD.

However, in some situations, you may find yourself facing the requirement to create an XML document that meets specifications that someone else has devised.

Hopefully, in those cases, the person who devised the specifications has also created a DTD and can provide it to you for your use.
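To make the DTDless case concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The recipe markup is invented purely for illustration. No DTD exists for it anywhere, yet the parser can still work out the structure from the tags alone.

```python
import xml.etree.ElementTree as ET

# Invented markup with no DTD; the tag names are hypothetical.
RECIPE = """<recipe>
  <title>Toast</title>
  <step>Slice the bread.</step>
  <step>Toast it.</step>
</recipe>"""

root = ET.fromstring(RECIPE)

# The parser discovers the structure from the tags themselves.
child_tags = [child.tag for child in root]
print(root.tag, child_tags)
```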

What is a valid document?

In the normal sense of the word, if something is invalid, that usually means that it is not any good. However, that is not the case for XML. An invalid XML document can be a perfectly good and useful document.

A valid XML document is one that conforms to an existing DTD in every respect.

In other words, unless the DTD allows a tag with a name of "color", an XML document containing a tag with that name is not valid.

However, because XML does not require a DTD, an XML processor cannot require validation of the document. Many very useful XML documents are not valid, simply because they were not constructed according to an existing DTD.
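Python's standard library does not include a validating parser, so the following is only a hand-coded stand-in for what a validating processor would do with the two list and item declarations quoted earlier from the FAQ. The function name follows_list_rules is my own invention. Treat it as a sketch of the idea of validity, not as a real DTD validator.

```python
import xml.etree.ElementTree as ET

def follows_list_rules(xml_text):
    """Hand-coded check: a list holds one or more items, and an item holds only text."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    if root.tag != "list" or len(root) == 0:
        return False  # the root must be a list with at least one child
    if (root.text or "").strip():
        return False  # no stray text directly inside the list
    for child in root:
        if child.tag != "item" or len(child) != 0:
            return False  # only items are allowed, with no elements inside them
        if (child.tail or "").strip():
            return False  # no stray text between items
    return True

good = "<list><item>milk</item><item>eggs</item></list>"
bad = "<list><item>milk</item><color>red</color></list>"

print(follows_list_rules(good))  # True
print(follows_list_rules(bad))   # False: "color" is not allowed by the rules
```

Note that the second document is still perfectly well-formed. It simply uses a tag that the rules do not allow.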

What about well-formed?

The concept of being well-formed was introduced as a requirement of XML, apparently to deal with the situation where a DTD is not available.

Again, according to the FAQ:

"For example, HTML's <IMG> element is defined as `EMPTY': it doesn't have an end-tag. Without a DTD, an XML application would have no way to know whether or not to expect an end-tag for an element, so the concept of `well-formed' has been introduced.

This makes the start and end of every element, and the occurrence of EMPTY elements completely unambiguous."

All XML documents must be well-formed

Let me say it again. XML documents need not be valid, but all XML documents must be well-formed.

To be well-formed...

To be well-formed, all elements that can contain character data must have both start and end tags. What is character data? For purposes of this explanation, let’s just say that the content that we discussed in an earlier lesson comprises character data.
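A quick way to see this rule in action is to hand a parser a document with a missing end tag. The helper name is_well_formed below is my own. It simply reports whether Python's standard (non-validating) parser accepts the text.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it raises a ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

complete = is_well_formed("<p>Hello</p>")           # start and end tag present
missing_end = is_well_formed("<p>Hello")            # end tag never appears
html_style = is_well_formed("<p>one<br>two</p>")    # HTML-style <br> with no end tag
print(complete, missing_end, html_style)
```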

All attribute values must be in quotes (apostrophes or double quotes). You can surround the value with apostrophes (single quotes) if the attribute value contains a double quote. An attribute value that is surrounded by double quotes can contain apostrophes.
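The quoting rules can be tried out with a short sketch. The quote element and its text attribute are hypothetical names chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Apostrophes around a value that contains double quotes.
a = ET.fromstring("<quote text='She said \"hi\"'/>")

# Double quotes around a value that contains an apostrophe.
b = ET.fromstring('<quote text="It\'s fine"/>')

print(a.get("text"))
print(b.get("text"))

# An attribute value with no quotes at all is rejected by the parser.
try:
    ET.fromstring("<quote text=hi/>")
    unquoted_ok = True
except ET.ParseError:
    unquoted_ok = False
print(unquoted_ok)
```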

Dealing with empty elements

EMPTY elements (those that contain no character data) must be written in one of the following two ways:

<foo/>

<foo></foo>

(Note that an EMPTY element can contain one or more attributes inside the start tag.)
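As far as a parser is concerned, the short form ending in /> and the long form (a start tag followed immediately by its end tag) describe the same empty element. Here is a small sketch that compares the two. The size attribute is made up for the example.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring('<foo size="3"/>')
long_form = ET.fromstring('<foo size="3"></foo>')

# Compare name, attributes, text content, and number of child elements.
same = (
    short_form.tag == long_form.tag
    and short_form.attrib == long_form.attrib
    and (short_form.text or "") == (long_form.text or "")
    and len(short_form) == len(long_form) == 0
)
print(same)
```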

No markup characters are allowed

For a document to be well-formed, it must not have markup characters (< or &) in the text data. If such characters are needed, you can represent them using &lt; and &amp; instead.
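If you are producing XML from a program, you rarely need to do this replacement by hand. The following sketch uses the escape function from Python's standard xml.sax.saxutils module. The fact element is a hypothetical name used only for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "AT&T says 1 < 2"

# escape() replaces & and < (and also >) with their entity references.
safe = escape(raw)
print(safe)

# The escaped text parses cleanly, and the parser hands back the original characters.
round_trip = ET.fromstring(f"<fact>{safe}</fact>").text
print(round_trip)

# The raw text dropped straight into the document is not well-formed.
try:
    ET.fromstring(f"<fact>{raw}</fact>")
    raw_ok = True
except ET.ParseError:
    raw_ok = False
print(raw_ok)
```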

Nesting

Elements must nest properly. If one element contains another, the entire second element must be defined inside the start and end tags of the first element.
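The nesting rule is easy to break if you are used to forgiving HTML browsers. In this sketch, the b and i element names are just examples. The second document closes the outer element before the inner one, and the parser refuses it.

```python
import xml.etree.ElementTree as ET

nested_ok = "<b><i>text</i></b>"    # the i element sits entirely inside b
nested_bad = "<b><i>text</b></i>"   # b is closed while i is still open

# The properly nested document parses without complaint.
inner = ET.fromstring(nested_ok).find("i").text

# The improperly nested document raises a ParseError.
try:
    ET.fromstring(nested_bad)
    problem = None
except ET.ParseError as err:
    problem = str(err)  # the parser's own description of what went wrong

print(inner)
print(problem)
```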

Validity and well-formed requirements, recap

Valid XML files are those which have a DTD and which conform to the DTD.

XML files must be well-formed, but there is no requirement for them to be valid. A DTD is not required, and without one, validity cannot be established. However, if an XML document has a DTD, it must conform to that DTD in order to be valid.

Why use a DTD if it is not required?

There are probably many reasons to use a DTD, in spite of the fact that XML doesn’t require one.

Enforcing format specifications

Suppose, for example, that you have been charged with publishing a weekly newsletter, and you intend to produce the newsletter as an XML file. Suppose also that you occasionally have a guest editor who produces the newsletter on your behalf.

You will probably establish a set of format specifications for your newsletter and you will need to publish them for the benefit of the guest editors. However, simply publishing a document containing format specifications does not ensure that the guest editors will comply with the specifications.

You can enforce the specifications by also establishing a DTD that matches the specifications. Then, if either you or one of your guest editors produces an XML document that doesn’t meet the specifications, the XML processor that you use to render your newsletter into its final form will notify you that the document is not valid.

Improved parser diagnostic data

Another reason that I have found a DTD to be useful goes as follows. I am occasionally called upon to write a Java program that will parse and process an XML document in some fashion. My experience is that the parsers that I have used are much more effective in identifying XML structural problems when the XML document has a DTD than when it doesn’t. This tends to make it easier to repair the document.

Coming attractions...

I believe that just about wraps it up insofar as this series of articles is concerned. I hope that some of you have found this layman’s view of XML to be useful.

I haven’t decided yet where to go from here. There is no shortage of material. I just need to decide which material would be of most interest.

The XML octopus

Trying to wrap your brain around XML is sort of like trying to put an octopus in a bottle. Every time you think you have it under control, a new tentacle shows up. XML has many tentacles, reaching out in all directions. But, that's what makes it fun. As your XML host, I will do my best to lead you to the information that you need to keep the XML octopus under control.

Credits

This HTML page was partially produced using the WYSIWYG features of Microsoft Word 2000. The images on this page were used with permission from the Microsoft Word 97 Clipart Gallery.

About the author

Richard Baldwin is a college professor and private consultant whose primary focus is a combination of Java and XML. In addition to the many platform-independent benefits of Java applications, he believes that a combination of Java and XML will become the primary driving force in the delivery of structured information on the Web.

Richard has participated in numerous consulting projects involving Java, XML, or a combination of the two. He frequently provides onsite Java and/or XML training at the high-tech companies located in and around Austin, Texas. He is the author of Baldwin's Java Programming Tutorials, which has gained a worldwide following among experienced and aspiring Java programmers. He has also published articles on Java Programming in Java Pro magazine.

Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.


-end-

rev9912231125