What is XML, Part 3

by Richard G. Baldwin
baldwin@austin.cc.tx.us
Baldwin's Home Page

Dateline: 05/16/99

prolog
In the previous article in this series, I promised you that this article would jump right into the hard-core technical material, and provide some definitive answers to the question.

so, what is XML?
My answer to the question will be based heavily on information obtained from The XML FAQ. You are strongly encouraged to visit that site for additional and more detailed information. This article consists primarily of excerpts from that document with a few of my own comments inserted between the excerpts.

the alphabet soup
XML is an acronym for "Extensible Markup Language".  It is considered to be extensible because it does not use a fixed format like HTML. Neither the DTD nor the stylesheet for an XML document is fixed.

from whence does XML derive?
XML derives from SGML.  Like SGML, XML is a metalanguage designed to make it possible for you to design your own customized markup language.  In fact, XML is a subset of SGML, with a couple of differences.  XML is designed such that an XML document with a DTD can be processed using an SGML processor.

what is the purpose of XML?
The purpose is to provide many of the capabilities of SGML in such a way that XML documents can be used on the Web in much the same way that HTML documents are now used.

According to the FAQ mentioned above:

"XML is an abbreviated version of SGML, to make it easier for you to define your own document types, and to make it easier for programmers to write programs to handle them. It omits the more complex and less-used parts of SGML in return for the benefits of being easier to write applications, easier to understand, and more suited to delivery and interoperability over the Web. But it is still SGML, and XML files may still be parsed and validated the same as any other SGML file ... "

should I care?
Yes, if you are interested in Web development.

XML is very important to Web development because it will allow the development of user-defined document types on the Web.  It breaks the mold of fixed-format document types imposed by HTML while eliminating much of the complexity associated with SGML.

but, you say, creating a DTD is difficult
True! Creating a DTD is fairly difficult. However, a DTD may not be necessary.

While XML makes SGML simpler, it retains the abilities of SGML that let you define your own document types.  In addition, unlike SGML, XML introduces a new feature that eliminates the requirement for a specific DTD.  While XML makes it possible for a document to have a DTD, this feature also makes it possible for a document to define its own type "on the fly" without a DTD.

This leads us to introduce two important terms (valid documents and well-formed documents) which I will discuss later.

how universal is XML?
According to the FAQ:

"XML will allow groups of people or organizations to create their own customized markup languages for exchanging information in their domain (music, chemistry, electronics, hill-walking, finance, surfing, linguistics, mathematics, knitting, history, engineering, rabbit-keeping etc). 

HTML is at the limit of its usefulness as a way of describing information, and while it will continue to play an important role for the content it currently represents, many new applications require a more robust and flexible infrastructure." 

caution: XML is case sensitive
A very important point for those of you who may be interested in migrating from HTML to XML is that although XML syntax is very similar to HTML syntax, XML is case sensitive.

Case sensitivity has historically been a stumbling block for many aspiring C and C++ programmers and it could be just as troublesome for you.  (It is only a stumbling block because the language that those programmers were taught originally was not case sensitive. Of course, HTML is also not case sensitive.) In short, you need to be concerned about the case of everything that you enter into an XML document.

can I convert an HTML document to an XML document?
You should be able to convert an existing HTML document to an XML document (with no DTD) by making it well-formed and by making a few other changes. This isn't difficult. You can view an example of converting an HTML document to an XML document in my online Java tutorials.

the old angle bracket problem
You will need to use &lt; and &amp; in place of the left angle bracket "<" and the ampersand "&", but many of you are already doing that anyway.
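
As a minimal sketch of that substitution, assuming Python and its standard library (chosen here only for illustration; the FAQ does not use it), the escape helper in xml.sax.saxutils makes the replacement, and a parser turns the references back into the original characters. The note element is a made-up wrapper.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw_text = "if a < b & b < c then a < c"

# escape() replaces "&", "<", and ">" with &amp;, &lt;, and &gt;.
safe_text = escape(raw_text)
print(safe_text)  # if a &lt; b &amp; b &lt; c then a &lt; c

# The hypothetical <note> element is only a wrapper for the demonstration.
document = "<note>" + safe_text + "</note>"

# Parsing converts the references back into the original characters.
parsed_text = ET.fromstring(document).text
print(parsed_text == raw_text)  # True
```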

attributes must be in quotes
You need to make certain that all attribute values are in quotes.  You may or may not be doing that now.  Browsers allow HTML authors to be very sloppy in this regard.

remember case sensitivity
Perhaps most difficult of all, you must make certain that element names in start tags and end tags match with respect to upper and lower case.

help is available
Fortunately, parser software is available to help you find problems of this sort in your XML documents.  Once you learn to recognize the diagnostic messages produced by parser software, fixing the problem is usually relatively easy.
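
To give a feel for such diagnostics, here is a small sketch that assumes Python's standard xml.etree.ElementTree parser (one parser among many, and not the one used in my tutorials). The document and the helper name find_problem are made up for illustration. The end tag on the second line differs from its start tag only in case, and the parser reports the line on which it noticed the mismatch.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the end tag on line 2 uses a different case
# than its start tag, so the two names do not match.
bad_document = "<doc>\n<Item>first</item>\n</doc>"
good_document = "<doc>\n<Item>first</Item>\n</doc>"

def find_problem(xml_text):
    """Return None if xml_text parses, else (message, line number)."""
    try:
        ET.fromstring(xml_text)
        return None
    except ET.ParseError as err:
        line, _column = err.position
        return str(err), line

print(find_problem(bad_document))   # a "mismatched tag" message and line 2
print(find_problem(good_document))  # None
```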

one more requirement
Also, according to the FAQ:

"If you have created your HTML files conforming to one of the several HTML Document Type Definitions (DTDs), and they validate OK, then they can be converted as follows: 

Replace the DOCTYPE declaration and any internal subset (basically everything within the first set  of angled brackets 

<!DOCTYPE HTML...>) 

with the XML Declaration 

<?XML version="1.0" standalone="yes"?>"

compatibility with SGML?
One question that may be of interest is: Once you have an XML document, can you use it directly with an SGML processor?  The writers of the FAQ give a qualified yes to this question.  According to the FAQ, the answer is yes, but:

"At the moment there are few tools which handle XML files unchanged because of the format of these EMPTY elements, but this is changing. The nsgmls parser has an experimental XML conformance switch, and the first XML-specific editors and parsers are appearing (see the question on software)." 

back to the Document Type Definition
Throughout this and the previous article, I have been mentioning a Document Type Definition (DTD).  It's time to provide a credible definition for a DTD.  According to the FAQ:

"A DTD is usually a file (or several files to be used together) which contains a formal definition of a particular type of document. This sets out what names can be used for elements, where they may occur, and how they all fit together. For example, if you want a document type to describe <LIST>s which contain <ITEM>s, part of your DTD would contain something like 

    <!ELEMENT item (#pcdata)> 
    <!ELEMENT list (item)+> 

This defines items containing text, and lists containing items. 

It's a formal language which lets processors automatically parse a document and identify where every element comes and how they relate to each other, so that stylesheets, navigators, browsers, search engines, databases, printing routines, and other applications can be used." 
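
To make the two declarations in the excerpt concrete, here is a rough sketch in Python. Python's standard library has no DTD validator, so the function below is a hand-written stand-in that checks only the two rules quoted above (a list holds one or more items and nothing else; an item holds text only). The function name and the sample documents are made up for illustration.

```python
import xml.etree.ElementTree as ET

def conforms_to_list_rules(xml_text):
    """Hand-coded check of the two rules above. This is not a DTD validator."""
    root = ET.fromstring(xml_text)
    if root.tag != "list":
        return False
    items = list(root)
    if not items:
        return False  # (item)+ requires at least one item
    for child in items:
        if child.tag != "item" or len(child) > 0:
            return False  # only item is allowed, and it holds text only
    # A list may hold nothing but items, apart from whitespace between them.
    stray_text = [root.text] + [child.tail for child in items]
    return not any(text and text.strip() for text in stray_text)

print(conforms_to_list_rules("<list><item>milk</item><item>eggs</item></list>"))  # True
print(conforms_to_list_rules("<list></list>"))                                    # False
print(conforms_to_list_rules("<list><entry>milk</entry></list>"))                 # False
```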

Now let's take a closer look at the terms DTD, valid, and well-formed.

the three parts
As mentioned in the previous article, an SGML document is really the combination of three parts. Let's refer to the parts as files just so I will have something to call them (but they don't have to be separate physical files).

One file contains the content of the document (words, pictures, etc.).  This is the part that the author wants to expose to the client.

A second file is the DTD, which meets the definition given above.

A third file is a stylesheet that establishes how the content that conforms to the DTD is to be rendered on the output device.  This is how the author wants the material to be presented to the client. 

For example, a tag with an attribute of "red" might cause something to be presented bright red according to one stylesheet and dull red according to another stylesheet.  (It might even be presented as some shade of green according to still another stylesheet.)

the two extremes
With HTML, the DTD and the stylesheet are essentially hard-coded into the browser.  With SGML, the processor requires both a DTD and a stylesheet.

XML, the middle ground
With XML, the DTD is optional but the stylesheet (or some processing mechanism that substitutes for a stylesheet) is required.

According to the FAQ, XML does not require a DTD:

"Full SGML uses a Document Type Definition (DTD) to describe the markup (elements) available in any specific type of document. However, the design and construction of a DTD can be a complex and non-trivial task, so XML has been designed so it can be used either with or without a DTD.  DTDless operation means you can invent markup without having to define it formally. 

To make this work, a DTDless file in effect `defines' its own markup, informally, by the existence and location of elements where you create them. But when an XML application such as a browser encounters a DTDless file, it needs to be able to understand the document structure as it reads it, because it has no DTD to tell it what to expect, so some changes have been made to the rules." 

now for valid
A valid document is one that conforms to the DTD in every respect.  In other words, unless the DTD allows a tag with a name of "color", an XML document containing a tag with that name is not valid.  SGML processors require the document to be valid.  Because XML does not require a DTD, an XML processor cannot require validation of the document.

what about well-formed? 
As I understand it, being well-formed is not a requirement of SGML.  Rather it was introduced as a requirement of XML, apparently to deal with the situation where a DTD is not available.

According to the FAQ:  

"For example, HTML's <IMG> element is defined as `EMPTY': it doesn't have an end-tag. Without a DTD, an XML application would have no way to know whether or not to expect an end-tag for an element, so the concept of `well-formed' has been introduced. This makes the start and end of every element, and the occurrence of EMPTY elements completely unambiguous." 

all XML documents must be well-formed
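This is worth repeating. All XML documents must be well-formed. According to the FAQ, if there is no DTD in use, the document must start with a Standalone Document Declaration (SDD) that looks like the following:

<?xml version="1.0" standalone="yes"?> 

Because XML is case sensitive, the name xml in this declaration must be written in lowercase, as the XML 1.0 Recommendation requires.

(However, a couple of XML parsers that I have used don't seem to enforce the requirement for a declaration. That is consistent with the XML 1.0 Recommendation, which treats the declaration as optional.)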
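
The behavior described above can be observed with a short sketch, assuming Python's standard xml.dom.minidom module (which sits on top of the expat parser). The memo element and the helper name parses are made up for illustration. The sketch reads back the values from a lowercase declaration, shows that a document without any declaration is still accepted, and shows that this particular parser rejects a declaration written with an uppercase name.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical one-element document with a lowercase XML declaration.
with_declaration = '<?xml version="1.0" standalone="yes"?><memo>hello</memo>'
doc = minidom.parseString(with_declaration)
print(doc.version, doc.standalone)  # 1.0 True

# The same document with no declaration at all also parses.
no_declaration = minidom.parseString("<memo>hello</memo>")
print(no_declaration.documentElement.tagName)  # memo

def parses(xml_text):
    """Return True if minidom accepts xml_text, False otherwise."""
    try:
        minidom.parseString(xml_text)
        return True
    except ExpatError:
        return False

# The declaration name is case sensitive, so the uppercase form is refused.
print(parses('<?XML version="1.0" standalone="yes"?><memo>hello</memo>'))  # False
```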

to be well-formed...
To be well-formed, all elements that can contain character data must have both start and end tags.
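
A common HTML habit that breaks this rule is leaving off the end tag of a list item. The following sketch, which assumes Python's standard xml.etree.ElementTree parser and a made-up helper named is_well_formed, shows the HTML-style markup being rejected and the fully tagged version being accepted.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# HTML authors often omit </li>; browsers tolerate it, an XML parser does not.
html_style = "<ul><li>one<li>two</ul>"
xml_style = "<ul><li>one</li><li>two</li></ul>"

print(is_well_formed(html_style))  # False
print(is_well_formed(xml_style))   # True
```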

All attribute values must be in quotes (apostrophes or double quotes).  You can surround the value with apostrophes (single quotes) if the attribute value contains a double quote.  Likewise, an attribute value that is surrounded by double quotes can contain apostrophes.
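
If you generate XML from a program, you don't have to pick the quote character by hand. As a sketch, assuming Python's standard library, the quoteattr helper in xml.sax.saxutils chooses a quote character that does not clash with the value. The part element and its attribute names are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import quoteattr

plain = quoteattr("red")                # "red"
with_double = quoteattr('6" nail')      # '6" nail'  (wrapped in apostrophes)
with_apostrophe = quoteattr("O'Neil")   # "O'Neil"   (wrapped in double quotes)

# Build a hypothetical EMPTY element from the quoted values and parse it back.
element_text = (
    "<part color=" + plain
    + " size=" + with_double
    + " maker=" + with_apostrophe + "/>"
)
part = ET.fromstring(element_text)
print(part.get("size"))   # 6" nail
print(part.get("maker"))  # O'Neil
```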

dealing with empty elements
EMPTY elements (those that contain no character data) must be written in one of the following two ways:

<foo/> 
<foo></foo>

(Note that an EMPTY element can contain one or more attributes inside the start tag.)

a subtle problem
The first example shown above is probably preferable because parsing problems can arise for an empty element when the start tag is on one line and the end tag ends up on the next line as shown below:  

<foo> 
</foo>

The problem here is that this element isn't really empty.  It contains a newline character.
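
The difference is easy to see from a program. In this sketch, which assumes Python's standard xml.etree.ElementTree parser, the first two forms report no character data at all, while the third reports the newline as the element's content.

```python
import xml.etree.ElementTree as ET

short_form = ET.fromstring("<foo/>").text
paired_form = ET.fromstring("<foo></foo>").text
split_form = ET.fromstring("<foo>\n</foo>").text

print(repr(short_form))   # None
print(repr(paired_form))  # None
print(repr(split_form))   # '\n'
```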

markup characters, keep out
For a document to be well-formed, it must not have markup characters (< or &) in the text data.  If such characters are needed, you can represent them using &lt; and &amp; instead.

The sequence ]]> must be written as ]]&gt; if it does not occur as the end of a section marked as CDATA.
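
Here is a brief sketch of both cases, assuming Python's standard library; the code element is a made-up wrapper. Inside a CDATA section the markup characters can appear as they are. Outside of one, the escape helper in xml.sax.saxutils rewrites the ">" in the sequence, and the parser restores it.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Inside a CDATA section, "<" and "&" need no special treatment.
cdata_document = "<code><![CDATA[a < b && b < c]]></code>"
cdata_text = ET.fromstring(cdata_document).text
print(cdata_text)  # a < b && b < c

# Outside a CDATA section, the ">" that ends "]]>" must be written as &gt;.
raw_text = "x[y[0]]>z"
escaped_text = escape(raw_text)
print(escaped_text)  # x[y[0]]&gt;z

round_trip = ET.fromstring("<code>" + escaped_text + "</code>").text
print(round_trip == raw_text)  # True
```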

According to the FAQ:

"Well-formed files with no DTD may use attributes on any element, but the attributes must all be of  type CDATA by default. 

Well-formed XML files with no DTD are considered to have &lt;, &gt;, &apos;, &quot;, and &amp; predefined and thus available for use even without a DTD. 

Valid XML files must declare them explicitly if they use them." 

Again, some of the parsers that I have used don't seem to require that these entities be declared explicitly, even with a DTD.
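
As a quick check of the five predefined items, the following sketch (assuming Python's standard xml.etree.ElementTree parser and a made-up text element) parses a DTDless document that uses all five. It also shows that a name outside that set, such as the nbsp that HTML authors are used to, is rejected when no DTD declares it.

```python
import xml.etree.ElementTree as ET

# A DTDless document that uses all five predefined references.
document = "<text>&lt;tag&gt; &amp; &apos;single&apos; &quot;double&quot;</text>"
text = ET.fromstring(document).text
print(text)  # <tag> & 'single' "double"

# nbsp is not one of the five, so without a DTD it is undefined.
try:
    ET.fromstring("<text>one&nbsp;two</text>")
    nbsp_accepted = True
except ET.ParseError:
    nbsp_accepted = False
print(nbsp_accepted)  # False
```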

nesting
Elements must nest properly. If one element contains another, the entire second element (start tag, content, and end tag) must lie between the start and end tags of the first element.
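
Overlapping tags of the kind that browsers often forgive in HTML are a typical violation. This sketch assumes Python's standard xml.etree.ElementTree parser; the helper name nests_properly is made up for illustration.

```python
import xml.etree.ElementTree as ET

def nests_properly(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

overlapping = "<b><i>text</b></i>"  # i starts inside b but ends outside it
nested = "<b><i>text</i></b>"       # i lies entirely inside b

print(nests_properly(overlapping))  # False
print(nests_properly(nested))       # True
```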

validity and well-formed requirements, recap
Valid XML files are those which have a DTD and which conform to the DTD.

XML files must be well-formed, but there is no requirement for them to be valid. A DTD is not required, in which case validity is impossible to establish. However, if they have a DTD, they must conform to it, which makes them valid.

where to find the DTD if used
A valid XML file has a statement similar to the second statement in the following example as the first or second statement in the file (excluding comments):

<?xml version="1.0"?> 

<!DOCTYPE myDocument SYSTEM "http://HostName/FileName.dtd"> 

This statement is known as a Document Type Declaration (as distinguished from a Document Type Definition). This particular Document Type Declaration indicates that the outermost containing element in the document begins with a tag named myDocument.

The format of the statement tells the processor where to find a file containing the DTD.  The keyword SYSTEM indicates that the DTD is in a separate or external file.  There are other formats as well.

An XML document can have an external DTD as above, an internal DTD (the DTD can simply be prepended onto the document), some combination of the two, or no DTD at all. Regardless, an XML document must always be well-formed.
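
A program can read the Document Type Declaration without validating against the DTD that it names. The sketch below assumes Python's standard xml.dom.minidom module, which is a non-validating parser and does not retrieve the external file. It uses the placeholder address from the example above, with a lowercase XML declaration and a made-up one-line body.

```python
from xml.dom import minidom

document = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE myDocument SYSTEM "http://HostName/FileName.dtd">\n'
    '<myDocument>content</myDocument>'
)

doc = minidom.parseString(document)
doctype = doc.doctype

print(doctype.name)      # myDocument
print(doctype.systemId)  # http://HostName/FileName.dtd

# The name in the declaration matches the outermost element of the document.
print(doc.documentElement.tagName == doctype.name)  # True
```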

java and XML
XML by itself really isn't very useful. The bottom line is that XML is nothing more than a specification for how to create structured documents and data. To be useful, the XML document must be combined with a program designed to do something useful with that document.

Java is a strong contender for the writing of programs that do useful things with XML documents. An example is available showing how to convert an HTML document to an XML document, and then to use a Java parser from Microsoft to parse and analyze that document.

Richard G. Baldwin

coming attractions...

Trying to get your arms around XML is sort of like trying to put an octopus in a bottle. Every time you think you have it under control, a new tentacle shows up. In other words, XML has many tentacles, reaching out in all directions. But, that's what makes it fun. I will discuss many of the tentacles in upcoming articles.

Credits: These HTML pages were produced using the WYSIWYG features of Microsoft Word 97. The computer image used on this page was used with permission from the Microsoft Word 97 Clipart Gallery.


Copyright 2000, Richard G. Baldwin

About the author

Richard Baldwin is a college professor and private consultant whose primary focus is a combination of Java and XML. In addition to the many platform-independent benefits of Java applications, he believes that a combination of Java and XML will become the primary driving force in the delivery of structured information on the Web.

Richard has participated in numerous consulting projects involving Java, XML, or a combination of the two.  He frequently provides onsite Java and/or XML training at the high-tech companies located in and around Austin, Texas.  He is the author of Baldwin's Java Programming Tutorials, which has gained a worldwide following among experienced and aspiring Java programmers. He has also published articles on Java Programming in Java Pro magazine.

Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.


-end-