What is XML, Part 2

by Richard G. Baldwin
baldwin@austin.cc.tx.us
Baldwin's Home Page

Dateline: 05/06/99

prolog
In the previous article in this series, we had a little fun pretending that it was 1981 and we had just invented our own markup language. I promised that this article would introduce some of the jargon that surrounds that invention, and would also get a little more technical.

we devised a scheme
We devised a scheme for causing selected material in a document to be printed in bold and italics. The scheme went something like this. By enclosing the word bold in left and right angle brackets "<>" and embedding the result in our text, we caused our program to print the selected text in a bold font.

what did it look like?
When we viewed some of our text on the screen inside our text editor, it might have looked like this. (But without the color, because we didn't have many color monitors in 1981.)

The <bold>big fat gray fox </bold> jumped over the little fence.

Note the use of <bold> and </bold> in the above text.

When we viewed the same text as printed on our printer, it might have looked like this (again, without color).

The big fat gray fox jumped over the little fence.

Note that big fat gray fox is rendered in boldface, and the material in the angle brackets doesn't appear.

now for a little jargon
We're going to call those things enclosed in angle brackets tags. We will call the first one in the pair that causes the printing to switch to bold the start tag, and will call the other one the end tag.

We will call the following set of characters an element:

<bold>big fat gray fox </bold>

We will call the characters in between the tags (big fat gray fox) the content.

So, according to our definition, an element is the combination of the start tag, the end tag, and the content between them.
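To make the jargon concrete, here is a minimal sketch in Python that splits an element into its start tag, content, and end tag. The function name find_elements is hypothetical, and the pattern assumes simple tags with no nesting of the same tag inside itself.

```python
import re

# One element: a start tag, the content, and the matching end tag.
ELEMENT = re.compile(r"<(?P<name>\w+)>(?P<content>.*?)</(?P=name)>", re.DOTALL)

def find_elements(text):
    """Return (start_tag, content, end_tag) for each element found in text."""
    parts = []
    for match in ELEMENT.finditer(text):
        name = match.group("name")
        parts.append((f"<{name}>", match.group("content"), f"</{name}>"))
    return parts

memo = "The <bold>big fat gray fox </bold> jumped over the little fence."
for start_tag, content, end_tag in find_elements(memo):
    print("start tag:", start_tag)
    print("content:  ", repr(content))
    print("end tag:  ", end_tag)
```

Everything outside the element ("The " and " jumped over the little fence.") is ordinary text that carries no markup.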

three bodies of information
Possibly without realizing it, we have incorporated three separate bodies of information into a process that has produced the desired result.

The three bodies of information are:

1. The raw text for the memorandum. (The information that we want the boss to see.)
2. A definition of the types of tags allowed in the document and the relationship among those tags. (Information that our program uses to know how and when to send control codes to the printer.)
3. The manner in which the raw text will be rendered on the printer as a result of the tags in the document. (A specification as to which printer codes are triggered by which tags.)

the raw text
The raw text item is pretty obvious and doesn't require much of an explanation. It is simply the text that makes up the words that we want to express in the memorandum.

definition of the types
Item 2 is a little more abstract. Since we were the designers of the tag set, we hard-coded our program to be on the lookout for the <bold> and <italic> tags and to take some specific action when they were encountered. In other words, we hard-coded the tag definition into the program.

what will it look like?
Item 3 basically has to do with the action to take whenever our program encounters one of the tags.
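The hard-coded approach to items 2 and 3 might be sketched like this in Python. It is only an illustration: the bracketed strings such as [BOLD ON] are hypothetical stand-ins for real printer control codes, and the function name to_printer_stream is an assumption.

```python
# Items 2 and 3 baked into the program: the only tags it knows about,
# and the (hypothetical) printer control code each one triggers.
PRINTER_CODES = {
    "<bold>": "[BOLD ON]",
    "</bold>": "[BOLD OFF]",
    "<italic>": "[ITALIC ON]",
    "</italic>": "[ITALIC OFF]",
}

def to_printer_stream(text):
    """Replace each hard-coded tag with its hard-coded control code."""
    for tag, code in PRINTER_CODES.items():
        text = text.replace(tag, code)
    return text

memo = "The <bold>big fat gray fox </bold> jumped over the little fence."
print(to_printer_stream(memo))
```

Adding a new tag or changing what a tag does means editing the program itself, which is exactly the limitation of hard-coding.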

An important point is that if we had wanted to make our program more general, we could have avoided hard-coding the information into the program. We could have read three separate files of information and could have processed them together in order to produce the desired result.

a little more jargon
One file would contain the raw text that makes up the document along with the embedded tags (often called embedded markup or markup for short).

A second file would contain a definition of allowable tags and the relationships among them (such as nesting, for example). We will refer to this file as a Document Type Definition or DTD to be consistent with modern terminology in this area.

The third file would specify the action to be taken in rendering the data as a result of encountering the allowable tags in the raw text file. We will refer to this as a stylesheet to be consistent with modern terminology.

two separate actions
Then our program would need to perform two separate actions. The first action would be to validate the raw text file against the DTD to confirm that all the tags contained in the raw text file are in conformance with the definitions in the DTD file. If our program found tags that were not in conformance, processing would probably have ended at that point with some well-chosen diagnostic messages.

If the raw text file did validate against the DTD, then our program would have used the stylesheet file to determine how to render the text onto the printer.
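The two actions might be sketched as follows in Python. This is a simplified illustration, not a real validator: the DTD is reduced to a set of allowed tag names (a real DTD also describes relationships such as which tags may nest inside which), the stylesheet is a plain dictionary assumed to have an entry for every allowed tag, and the bracketed control codes and the function names validate, render, and process are all hypothetical.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")  # matches a start tag or an end tag

def validate(document, allowed_tags):
    """Return a list of diagnostics; an empty list means the document is valid."""
    errors = []
    open_tags = []  # stack of start tags still waiting for their end tags
    for match in TAG.finditer(document):
        closing, name = match.group(1) == "/", match.group(2)
        if name not in allowed_tags:
            errors.append(f"tag '{name}' is not defined")
        elif not closing:
            open_tags.append(name)
        elif not open_tags or open_tags.pop() != name:
            errors.append(f"unexpected end tag '</{name}>'")
    errors.extend(f"missing end tag for '<{name}>'" for name in open_tags)
    return errors

def render(document, stylesheet):
    """Replace each tag with the control code the stylesheet assigns to it."""
    def code_for(match):
        on_code, off_code = stylesheet[match.group(2)]
        return off_code if match.group(1) else on_code
    return TAG.sub(code_for, document)

def process(document, allowed_tags, stylesheet):
    """First action: validate. Second action: render, only if valid."""
    errors = validate(document, allowed_tags)
    if errors:
        raise ValueError("; ".join(errors))
    return render(document, stylesheet)

# The three bodies of information, kept separate from the program.
document = "The <bold>big fat gray fox </bold> jumped over the little fence."
allowed_tags = {"bold", "italic"}
stylesheet = {
    "bold": ("[BOLD ON]", "[BOLD OFF]"),
    "italic": ("[ITALIC ON]", "[ITALIC OFF]"),
}
print(process(document, allowed_tags, stylesheet))
```

In this sketch, a document that fails validation stops processing with the collected diagnostics and is never rendered.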

what do I mean by rendering?
For whatever reason, the person who creates a stylesheet might specify that the text that makes up the content between a pair of <bold> tags should be rendered in italics, like this:

this is bold text

and that the text that makes up the content between a pair of <italic> tags should be rendered in boldface, like this:

this is italic text

everyone knows the difference...
"That's silly," you say. Everyone knows the difference between bold text and italic text. This person has it backwards. Well, maybe so and maybe not. The point is that if we were very capable programmers, we could write our program to render the data in accordance with the specifications in the stylesheet even if those specifications are unusual and possibly not what we would normally expect to see.

While it might be silly to have a stylesheet that causes data surrounded by <bold> tags to be rendered in italics, it might not be silly in some circumstances to cause that text to be rendered in a different color. A document that uses green text for emphasis might be more effective in some cases than the same document using ordinary bold for emphasis.
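As a small illustration of that point, the sketch below renders the same document with two different stylesheets. It is written in Python under stated assumptions: the stylesheets are plain dictionaries, the bracketed codes such as [GREEN ON] are hypothetical placeholders for whatever the output device understands, and render is a hypothetical helper.

```python
import re

TAG = re.compile(r"<(/?)(\w+)>")  # matches a start tag or an end tag

def render(document, stylesheet):
    """Replace each tag with the code the given stylesheet assigns to it."""
    def code_for(match):
        on_code, off_code = stylesheet[match.group(2)]
        return off_code if match.group(1) else on_code
    return TAG.sub(code_for, document)

# Two stylesheets for the same tag: one conventional, one unusual.
conventional = {"bold": ("[BOLD ON]", "[BOLD OFF]")}
green_emphasis = {"bold": ("[GREEN ON]", "[GREEN OFF]")}

memo = "The <bold>big fat gray fox </bold> jumped over the little fence."
print(render(memo, conventional))
print(render(memo, green_emphasis))
```

The document and the program stay the same; only the stylesheet changes.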

a more sophisticated program
So, we now have a requirement for a much more sophisticated program. Our original markup language used a hard-coded style definition to render text based on a hard-coded tag structure definition. We have changed it to one where both the allowable tag structure and the rendering specifications can change from one document to the next.

The earlier case where everything was hard-coded is somewhat analogous to the Hypertext Markup Language (HTML).

The new approach where both the tag structure definition and the rendering specification can vary from one case to the next is somewhat analogous to the Standard Generalized Markup Language (SGML).

what is SGML?
SGML is an ISO standard (ISO 8879:1986) which provides a formal notation for the definition of generalized markup languages. SGML is not a language in itself. Rather, it is a metalanguage that is used to define other languages.

HTML versus SGML
HTML implements some of the concepts derived from SGML, but in effect the DTD and the stylesheet are hard-coded into the browser software. Because each browser manufacturer has some flexibility in implementing the intended style, the same document will sometimes look different when rendered with two different browsers. This is a shortcoming of HTML.

Another shortcoming of HTML is that as it becomes desirable for browsers to accommodate new capabilities, it is first necessary to create a new version of the HTML specifications and then wait for the browser manufacturers to implement the new specification.

the wheels turn slowly
This procedure makes it very difficult for Web page designers to be confident that the content they include in their HTML pages will be properly rendered by every browser used to access the page. In fact, the author can usually be confident that this will not be the case. Web page designers are constantly faced with the problem of designing workarounds to compensate for the deficiencies in some versions of some browsers being used to view the page.

what the world needs now is...
What the Web community needs is an approach where a standard browser is simply a rendering engine that validates a document according to a given DTD and renders it according to a given stylesheet.

a package deal
The combination of the document, the DTD, and the stylesheet would constitute a package delivered by a server to the browser. The author of the document would provide the DTD and the stylesheet in addition to the data to be rendered. Then the author could be more confident that it would be rendered properly, especially for complex data.

And that brings us to XML or the eXtensible Markup Language.

Tune in next week, same time, and same station.

Richard G. Baldwin

coming attractions...

In the next article, I jump right into the hard core material and provide some definitive answers to the question, What is XML?

In the meantime, if you would like to see what another author has to say about the subject and look at some serious technical material, you might want to read Peter Flynn's FAQ.

Credits: These HTML pages were produced using the WYSIWYG features of Microsoft Word 97. The computer image used on this page was used with permission from the Microsoft Word 97 Clipart Gallery.

About the author

Richard Baldwin is a college professor and private consultant whose primary focus is a combination of Java and XML. In addition to the many platform-independent benefits of Java applications, he believes that a combination of Java and XML will become the primary driving force in the delivery of structured information on the Web.

Richard has participated in numerous consulting projects involving Java, XML, or a combination of the two. He frequently provides onsite Java and/or XML training at the high-tech companies located in and around Austin, Texas. He is the author of Baldwin's Java Programming Tutorials, which has gained a worldwide following among experienced and aspiring Java programmers. He has also published articles on Java Programming in Java Pro magazine.

Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.


-end-