Richard G Baldwin (512) 223-4758, baldwin@austin.cc.tx.us, http://www2.austin.cc.tx.us/baldwin/

Programming with HTML, XML, and SGML

Java/Web Programming, Lecture Notes # 820, Revised 08/22/99.


Preface

Students in Prof. Baldwin's Advanced Java Programming classes at ACC are not responsible for
knowing and understanding the material in this lesson.

The material was prepared as the beginning of a new tutorial that I hope to complete on Web programming in general.  It is not part of the Java academic curriculum at ACC. I am posting the material along with the Advanced Java Tutorial on a temporary basis.  At some point, I hope to gather all of the new material together into a different tutorial with a different slant.
 

Introduction

This lesson is not going to teach you how to write HTML.  There are numerous outstanding documents already available on the Web to satisfy that need and more.  You can find one of my favorites, Learning HTML 3.2 by Examples, by Jukka Korpela, at http://www.hut.fi/~jkorpela/HTML3.2/.

I'm going to be completely open and tell you up front that I really don't have any expertise in HTML, XML, or SGML.  My expertise is in Java programming.  However, because Java is used in a variety of Web applications, I have a very strong interest in everything that Java is used for.

One of the things that Java is being used for is the processing of XML files.  Therefore, in order to pursue my Java interests, it has been necessary for me to teach myself a little about XML, and along the way, I couldn't help but also learn a little about HTML and SGML.

I decided to share what I have learned with those who possibly know even less about the subject than I do.  If you are already knowledgeable about these topics, you may not find anything in this lesson that is new to you.  But again, you might find some tidbit of information that has escaped you in the past.  This is particularly true if you learned to use HTML by memorizing the rules instead of understanding the rules.

In this lesson, I will try to teach you how to understand HTML and XML.  I will do this in a roundabout way by teaching you something about the origins of HTML which is the Standard Generalized Markup Language (SGML).

I will also teach you something about a new variation of SGML called the eXtensible Markup Language (XML) which has the potential of becoming the predominant vehicle for serving information on the Web.

I believe that once you understand what I have to say about these topics, it will be easy for you to understand HTML because you will know about its roots. Then you will be able to use HTML from the standpoint of understanding rather than simply from a standpoint of memorization.

This lesson also won't teach you a lot about SGML, but hopefully, it will get you far enough along that path that you will understand what you are doing when you write HTML statements into a document.  Also, it will teach you quite a bit about XML which may be the "new wave" on the Web.
 

The Beginnings

Let's go way back to the early 1980s.  You have just purchased your shiny new Personal Computer from IBM and a Pascal compiler from Borland to go with it.  The Pascal development package includes a plain text editor that you can use to create Pascal source code.  The facilities of the editor make it easy to create plain text and save it in a file.

You decide to use the editor to create a memorandum for a project that you and an associate are working on.  You write the memorandum using the text editor, print a copy, and take it to your associate for review before distributing it.

The purpose of the memorandum is to convince the boss to purchase some additional equipment for the project.  Your associate has more of a sales mentality than you, so she takes out a red pencil and begins to mark up your plain text memorandum (note the use of the term "mark up"  here).  When she has finished, you look at what she has written with her red pencil.  She has suggested that some words in the memorandum should be in bold, some words should be underlined, and some words should be in italics.

But this is a problem.  Your plain text editor from the Borland Pascal 1.0 IDE doesn't have bold, underline, or italics capability.  After all, a Pascal compiler doesn't care about such things. And you don't have anything resembling a word processor.

So, you go back to your office to ponder the situation.  The text editor is the closest thing that you have to a word processor (remember, word processors had been invented only a few years earlier, and even they were still pretty crude by today's standards).

However, you do have a couple of things in your favor.  First, you discover that your printer does have the ability to display characters in plain, bold, italics, or underlined.

Second, you are a pretty fair programmer.  You know how to read the text file from the disk and modify it on the way to the printer.  In short, you know how to generate and insert the control codes that will cause the printer to print the characters in one of the four renderings listed earlier.  What you lack is a scheme for telling your program when to cause the printer to change from one rendering to another.

You hit upon a scheme.  You take note of the fact that nowhere in your memorandum are the left and right angle bracket characters used (< and >) so they are available to use as delimiters to tell your program to change the rendering of the other characters on the paper.

The scheme that you devise is to define some new character combinations that you will insert into your plain text document.  At the point where you want the printer to switch into bold mode, you enter the following text:
 

<bold>

At the point where you want the printer to switch from bold back to plain, you enter the following text:
 

</bold>

Then you write a program that will read the plain text document from the disk, with these new character combinations inserted, and transfer the characters in the file to the printer.  However, whenever the program encounters <bold>, it will not send those characters to the printer.  Rather, it will send the proper code to cause the printer to switch into bold mode.

Similarly, whenever your program encounters </bold>, it will not send those characters to the printer either.  Rather, it will send the proper code to cause the printer to switch from bold back to plain.
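
As a rough illustration, the filter program just described might be sketched as follows (in Java rather than Pascal, to match the rest of this tutorial).  This is only a sketch under stated assumptions: the class name BoldFilter and the method name toPrinterStream are hypothetical, and the control codes (ESC E for bold on, ESC F for bold off) are assumed Epson-style codes that your printer may not share.  Instead of driving a real printer, the sketch makes the escape character visible on the screen.

```java
public class BoldFilter {
    // Assumed printer control codes (Epson-style); substitute the codes for your printer.
    static final String BOLD_ON = "\u001bE";
    static final String BOLD_OFF = "\u001bF";

    /** Replaces each <bold> and </bold> tag with the corresponding control code. */
    public static String toPrinterStream(String text) {
        return text.replace("<bold>", BOLD_ON).replace("</bold>", BOLD_OFF);
    }

    public static void main(String[] args) {
        String memo = "The <bold>big fat gray fox</bold> jumped over the little fence.";
        String stream = toPrinterStream(memo);
        // Show the escape character as <ESC> rather than sending it to a printer.
        System.out.println(stream.replace("\u001b", "<ESC>"));
    }
}
```

The tags never reach the output; only the control codes and the ordinary characters do.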

When you view some of the text on your screen inside the editor it might look like the following:
 

The <bold>big fat gray fox</bold> jumped over the little fence.

However, when you view the text as printed on your printer, it might look like the following.
 

The big fat gray fox jumped over the little fence.

Congratulations!  You have just invented your own markup language (I told you earlier to pay attention to the use of the term markup), which may have been the predecessor of HTML, XML, and SGML.  Of course it needs a lot more work before it will be really useful.  Also, you need to invent some jargon to go with your invention.

You decide to call those things enclosed in angle brackets tags.  In fact, you decide to call the one that causes the rendering to switch to bold the start tag, and to call the other one the end tag.

You decide to call the following set of characters an element and to call the characters in between the tags the content.
 

<bold>big fat gray fox</bold>

So, you decide that an element is the combination of the start tag, end tag, and the content between them.

But, you still have a bit of a problem.  Your associate isn't satisfied yet.  Remember, she believes that the memorandum should look like the following. (Note that the word gray is not only bold, it is also in italics.  In particular, it is in bold-italics.)
 

The big fat gray fox jumped over the little fence.

"Not to worry," you say.  By now you're on a roll.  You simply cause your plain text document to look like the following:
 

The <bold>big fat <italic>gray</italic> fox</bold> jumped over the little fence.

What you have done is to invent another pair of tags:  <italic> and </italic>. To satisfy your associate, you also need to invent and use another pair of tags to cause some of the text to be underlined.

Obviously, you will have to modify your program to make it accommodate these "nested" elements, because you nested the italic element inside the bold element.  The logic of your program will become a little more complicated, but you can handle it.  You make the necessary modifications to your program, use your program to print the document, get agreement from your associate, submit the new document to the boss, get approval for the new equipment, and everyone comes out a winner.  Congratulations!
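
One common way to handle nested elements is to keep a stack of the elements that are currently open.  The following Java sketch takes that approach; it is an assumption about how the modified program might work, not a description of any particular program.  The class name NestedMarkup and the bracketed output notation (which simply lists the styles in effect for each run of text) are invented for the illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NestedMarkup {
    // Matches a start tag such as <bold> or an end tag such as </bold>.
    private static final Pattern TAG = Pattern.compile("<(/?)([a-z]+)>");

    /**
     * Wraps each run of content in brackets naming the styles active at that
     * point, for example [bold+italic:gray].  Throws IllegalArgumentException
     * if the elements are not properly nested.
     */
    public static String render(String text) {
        Deque<String> open = new ArrayDeque<>();  // open elements, innermost first
        StringBuilder out = new StringBuilder();
        Matcher m = TAG.matcher(text);
        int pos = 0;
        while (m.find()) {
            emit(out, text.substring(pos, m.start()), open);
            String name = m.group(2);
            if (m.group(1).isEmpty()) {
                open.push(name);                   // start tag: element is now open
            } else if (open.isEmpty() || !open.pop().equals(name)) {
                throw new IllegalArgumentException("Unexpected end tag </" + name + ">");
            }
            pos = m.end();
        }
        emit(out, text.substring(pos), open);
        if (!open.isEmpty()) {
            throw new IllegalArgumentException("Missing end tag for <" + open.peek() + ">");
        }
        return out.toString();
    }

    private static void emit(StringBuilder out, String content, Deque<String> open) {
        if (content.isEmpty()) {
            return;
        }
        if (open.isEmpty()) {
            out.append(content);                   // plain text, no style in effect
            return;
        }
        List<String> styles = new ArrayList<>(open);
        Collections.reverse(styles);               // list the outermost style first
        out.append('[').append(String.join("+", styles)).append(':')
           .append(content).append(']');
    }

    public static void main(String[] args) {
        System.out.println(render(
            "The <bold>big fat <italic>gray</italic> fox</bold> jumped over the little fence."));
    }
}
```

A real program would send the appropriate printer codes where this sketch writes brackets, but the bookkeeping for the nesting would be much the same.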
 

What Does All This Mean?

Now let's get a little more serious.  Possibly without realizing it, you have incorporated three separate bodies of information into a process that has produced the desired product.

The three bodies of information are:

  1. The raw text for the memorandum. (The information that you want the boss to see.)
  2. A definition of the types of tags allowed in the document and the relationship among those tags. (Information that your program uses to know how and when to send control codes to the printer.)
  3. The manner in which the raw text will be rendered on the printer as a result of the tags in the document. (A specification as to which printer codes are triggered by which tags.)

The raw text item is pretty obvious and doesn't require much of an explanation.  It is simply the text that makes up the words that you want to express in the memorandum.

Item 2 is a little more abstract.  In this case, you were the designer of the tag set so you probably hard-coded your program to be on the lookout for the <bold> and <italic> tags and to take some specific action when they are encountered.

Item 3 basically has to do with the action to take whenever one of the tags is encountered by your program.

A very important point, however, is that if you wanted to make your program very general, you could read three separate files of information and process them together in order to produce the desired result.

One file would contain the raw text that makes up the document along with the embedded tags (often called embedded markup or markup for short).

A second file would contain a definition of allowable tags and the relationships among them (such as nesting, for example).  We will refer to this file as a Document Type Definition or DTD to be consistent with modern terminology in this area.

The third file would specify the action to be taken in rendering the data as a result of encountering the allowable tags in the raw text file.  We will refer to this as a stylesheet to be consistent with modern terminology.

Then your program would need to perform two separate actions.  The first action would be to validate the raw text file against the DTD to confirm that all the tags contained in the raw text file are in conformance with the definitions in the DTD file.  If your program found tags that are not in conformance, processing would probably end at that point with some well-chosen diagnostic messages.

If the raw text file does validate against the DTD, then your program would use the stylesheet file to determine how to render the text onto the printer.
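
The two-step process described above (validate first, then render) can be sketched in Java as shown below.  Everything here is a simplifying assumption made for illustration: the "DTD" is reduced to a set of allowed tag names, the "stylesheet" is a map from each tag name to a pair of output codes, and the class and method names are hypothetical.  A real DTD would also describe the relationships among the tags.

```java
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkupPipeline {
    private static final Pattern TAG = Pattern.compile("<(/?)([A-Za-z]+)>");

    /** Step 1: every tag name used in the document must be allowed by the "DTD". */
    public static boolean validate(String document, Set<String> allowedTags) {
        Matcher m = TAG.matcher(document);
        while (m.find()) {
            if (!allowedTags.contains(m.group(2))) {
                return false;
            }
        }
        return true;
    }

    /** Step 2: replace each tag with the code that the "stylesheet" assigns to it. */
    public static String render(String document, Map<String, String[]> stylesheet) {
        Matcher m = TAG.matcher(document);
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (m.find()) {
            out.append(document, pos, m.start());
            String[] codes = stylesheet.get(m.group(2));
            if (codes == null) {
                throw new IllegalStateException("No style for <" + m.group(2) + ">");
            }
            // codes[0] is emitted for the start tag, codes[1] for the end tag.
            out.append(m.group(1).isEmpty() ? codes[0] : codes[1]);
            pos = m.end();
        }
        out.append(document, pos, document.length());
        return out.toString();
    }

    /** Validates against the DTD, then renders according to the stylesheet. */
    public static String process(String document, Set<String> dtd,
                                 Map<String, String[]> stylesheet) {
        if (!validate(document, dtd)) {
            throw new IllegalArgumentException("Document does not conform to the DTD");
        }
        return render(document, stylesheet);
    }

    public static void main(String[] args) {
        Set<String> dtd = Set.of("bold", "italic");
        Map<String, String[]> sheet = Map.of(
            "bold", new String[] {"*", "*"},
            "italic", new String[] {"_", "_"});
        System.out.println(process("The <bold>big</bold> <italic>gray</italic> fox", dtd, sheet));
    }
}
```

Because the allowed tags and the output codes are passed in as data, either one can be changed without touching the program.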

What do I mean by rendering the text onto the printer?  For whatever reason, the person who creates a stylesheet might specify that the text which comprises the content between a pair of <bold> tags should be rendered in italics, and the text which comprises the content between a pair of <italic> tags should be rendered in bold.

"That's silly," you say.  Everyone knows the difference between bold text and italic text and this person has it backwards.  Well, maybe so and maybe not.  The point is that if you were a good enough programmer, you could write your program to use the stylesheet to render the data to the printer in accordance with the specifications contained in the stylesheet even if that specification is unusual and possibly not what you would normally expect to see.

While it might be silly to have a stylesheet that causes data surrounded by <bold> tags to be rendered in italics, it might not be silly in some circumstances to cause that text to be rendered in a different color if the printer has color capability.  A document that uses red text for emphasis might be more effective in some cases than the same document using ordinary bold for emphasis.

So, we have now required you to write a much more sophisticated program.  We changed your new markup language from one that uses a hard-coded style definition to render text based on a hard-coded tag structure definition to one where both the allowable tag structure and the rendering specifications can change from one case to the next.

The first case where everything is hard-coded is somewhat analogous to the Hypertext Markup Language (HTML).

The second case where both the tag structure definition and the rendering specification can vary from one case to the next is somewhat analogous to the Standard Generalized Markup Language (SGML).  SGML is an ISO standard (ISO 8879:1986) which provides a formal notation for the definition of generalized markup languages. SGML is not a language in itself.  Rather, it is a metalanguage that is used to define other languages.

HTML implements some of the concepts derived from SGML but in effect the DTD and the Style Sheet are hard-coded into the browser software.  Because each browser manufacturer has some flexibility in implementing the intended style, the same document will sometimes look different when rendered with two different browsers. This is a shortcoming of HTML.

Another shortcoming of HTML is that as it becomes desirable for browsers to accommodate new capabilities, it is first necessary to create a new version of the HTML specifications and then wait for the browser manufacturers to implement the new specification.  This makes it very difficult for Web page designers to be confident that the content that they include in their HTML pages will be properly rendered by every browser used to access the page.  In fact, at this point in history, the author can usually be confident that this will not be the case so Web page designers are constantly faced with the problem of designing workarounds to compensate for the deficiencies in some versions of some browsers being used to view the page.

What the Web community needs, therefore, is an approach where a standard browser is simply a rendering engine that validates a document according to a given DTD and renders it according to a given Style Sheet.  The document, the DTD, and the stylesheet would constitute a package delivered by a server to the browser.  The author of the document would provide the DTD and the Style Sheet in addition to the data to be rendered.  Then the author could be more confident that it would be rendered properly (provided that the requirements specified by the DTD and the Style Sheet were within the specified capabilities of the standard browser).

And that brings us to XML or the eXtensible Markup Language.
 

What is XML?

My attempt to answer the question: "What is XML?"  will be based heavily on information obtained from The XML FAQ http://www.ucc.ie/xml/, Version 1.21 (3 February 1998). You are strongly encouraged to visit that site for additional and more detailed information.
 

The above mentioned FAQ is maintained on behalf of the World Wide Web Consortium's XML Special Interest Group by Peter Flynn (University College Cork), with the collaboration of Terry Allen, Tom Borgman (Harlequin Ltd), Tim Bray (Textuality, Inc), Robin Cover (Summer Institute of Linguistics), Christopher Maden (O'Reilly & Associates), Eve Maler (Arbortext, Inc), Peter Murray-Rust (Nottingham University), Liam Quin, Michael Sperberg-McQueen (University of Illinois at Chicago), Joel Weber (MIT), Murata, Makoto (Fuji Xerox Information Systems), and many other members of the XML Special Interest Group of the W3C as well as FAQ readers around the world.

XML is an acronym for "Extensible Markup Language".  It is considered to be extensible because it does not use a fixed format like HTML. Neither the DTD nor the stylesheet for an XML document is fixed.

XML derives from SGML.  Like SGML, XML is a metalanguage designed to make it possible for you to design your own customized markup language.  In fact, XML is a subset of SGML, with a couple of differences.  XML is designed such that an XML document with a DTD can be processed using an SGML processor.

The purpose of XML is to provide many of the capabilities of SGML in such a way that XML documents can be used on the Web in much the same way that HTML documents are now used.

According to the FAQ mentioned above:
 

XML is an abbreviated version of SGML, to make it easier for you to define your own document types, and to make it easier for programmers to write programs to handle them. It omits the more complex and less-used parts of SGML in return for the benefits of being easier to write applications, easier to understand, and more suited to delivery and interoperability over the Web. But it is still SGML, and XML files may still be parsed and validated the same as any other SGML file ... 

XML is very important to Web development because it will allow the development of user-defined document types on the Web.  It breaks the mold of fixed-format document types imposed by HTML while eliminating much of the complexity associated with SGML.

While XML makes SGML simpler, it retains the abilities of SGML that let you define your own document types.  In addition, unlike SGML, XML introduces a new feature which eliminates the requirement for a specific DTD.  While XML makes it possible for a document to have a DTD, this option also makes it possible for a document to define its own type "on the fly" without a requirement for a DTD.

This leads us to introduce two important terms ("valid" documents and "well-formed" documents) which we will discuss later.

According to the FAQ:
 

XML will allow groups of people or organizations to create their own customized markup languages for exchanging information in their domain (music, chemistry, electronics, hill-walking, finance, surfing, linguistics, mathematics, knitting, history, engineering, rabbit-keeping etc). 

HTML is at the limit of its usefulness as a way of describing information, and while it will continue to play an important role for the content it currently represents, many new applications require a more robust and flexible infrastructure. 

A very important point for those of you who may be interested in migrating from HTML to XML is that although XML syntax is very similar to HTML syntax, XML is case sensitive.

Case sensitivity has historically been a stumbling block for many aspiring C and C++ programmers and it could be just as troublesome here.  (It is only a stumbling block because the Pascal language that they were taught originally is not case sensitive, but then neither is HTML.) In short, you need to learn to be concerned about the case of everything that you enter into an XML document.

You should be able to convert an existing HTML document to an XML document (with no DTD) by making it "well-formed" and making a few other changes. This isn't difficult, and we will discuss what well-formed means later.

In addition, you will need to use &lt; and &amp; in place of "<" and "&" but many of you are already doing that anyway.
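
If you generate XML text from a program, this substitution is easy to automate.  The following Java sketch shows one way to do it; the class name XmlEscape and the method name escapeText are hypothetical.  Note that the ampersand must be handled first, because the replacement for "<" itself contains an ampersand.

```java
public class XmlEscape {
    /** Escapes the two characters that may not appear literally in XML text data. */
    public static String escapeText(String text) {
        // Replace '&' first so the '&' introduced by "&lt;" is not escaped a second time.
        return text.replace("&", "&amp;").replace("<", "&lt;");
    }

    public static void main(String[] args) {
        System.out.println(escapeText("if a < b && c"));
        // prints: if a &lt; b &amp;&amp; c
    }
}
```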

You need to make certain that all attribute values are in quotes.  You may or may not be doing that.  Browsers allow HTML authors to be very sloppy in this regard.

Perhaps most difficult of all, you must make certain that element names in start tags and end tags match with respect to upper and lower case.

Fortunately, parser software is available to help you find problems of this nature in your XML documents.  Once you learn to recognize the diagnostic messages produced by parser software, fixing the problem is usually relatively easy.

Also, according to the FAQ:
 

If you have created your HTML files conforming to one of the several HTML Document Type Definitions (DTDs), and they validate OK, then they can be converted as follows: 

Replace the DOCTYPE declaration and any internal subset (basically everything within the first set  of angled brackets  <!DOCTYPE HTML...>) 

with the XML Declaration 

<?XML version="1.0"   standalone="yes"?> 

This probably won't mean much to you at this point, but hopefully it will by the end of this lesson.

One question that may be of interest is: Once you have an XML document, can you use it directly with an SGML processor?  The writers of the FAQ give a qualified yes to this question.  According to the FAQ, the answer is yes, but:
 

At the moment there are few tools which handle XML files unchanged because of the format of these EMPTY elements, but this is changing. The nsgmls parser has an experimental XML conformance switch, and the first XML-specific editors and parsers are appearing (see the question on software). 

Throughout this lesson, I have been mentioning a Document Type Definition (DTD).  It's time to provide a credible definition for a DTD.  According to the FAQ:
 

A DTD is usually a file (or several files to be used together) which contains a formal definition of a particular type of document. This sets out what names can be used for elements, where they may occur, and how they all fit together. For example, if you want a document type to describe <LIST>s which contain <ITEM>s, part of your DTD would contain something like 

    <!ELEMENT item (#pcdata)> 
    <!ELEMENT list (item)+> 

This defines items containing text, and lists containing items. 

It's a formal language which lets processors automatically parse a document and identify where every element comes and how they relate to each other, so that stylesheets, navigators, browsers, search engines, databases, printing routines, and other applications can be used. 
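
To see a DTD like this one in action, the following Java sketch uses the XML parser that ships with current Java platforms (the javax.xml.parsers API, not the MSXML parser discussed later in this lesson) to validate a few small documents against the list/item declarations.  The class name ListDtdCheck is hypothetical.  Two assumptions are worth noting: the DTD is placed inside the document itself as an internal subset, and the keyword is written in upper case as #PCDATA, which is what the final XML 1.0 specification requires.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

public class ListDtdCheck {
    // The DTD from the text, written as an internal subset of the document.
    static final String DTD =
        "<!DOCTYPE list [\n"
        + "  <!ELEMENT item (#PCDATA)>\n"
        + "  <!ELEMENT list (item)+>\n"
        + "]>\n";

    /** Returns true if the given document body is valid according to the DTD above. */
    public static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true);   // ask the parser to check the DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity errors are only reported, not thrown, unless the handler throws them.
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return true;
        } catch (SAXException e) {
            return false;                  // not valid (or not well-formed)
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<list><item>a</item><item>b</item></list>"));
        System.out.println(isValid("<list></list>"));
        System.out.println(isValid("<list><entry>a</entry></list>"));
    }
}
```

The first document conforms to the DTD.  The second does not, because a list must contain at least one item, and the third does not, because the DTD declares no element named entry.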

Now let's take a closer look at the terms DTD, valid, and well-formed.

As mentioned early in this lesson, an SGML document is really the combination of three parts. Let's refer to the parts as files just so we will have something to call them (but they don't have to be separate physical files).

One file contains the content of the document (words, pictures, etc.).  This is the part that the author wants to expose to the client.

A second file is the DTD which meets the definition given above.

A third file is a stylesheet that establishes how the content that conforms to the DTD is to be rendered on the output device.  This is how the author wants the material to be presented to the client.  For example a tag with an attribute of "red" might cause something to be presented bright red according to one stylesheet and dull red according to another stylesheet.  (It might even be presented as some shade of green according to still another stylesheet.)

With HTML, the DTD and the stylesheet are essentially hard-coded into the browser.  With SGML, both a DTD and a stylesheet are required by the processor.

With XML, the DTD is optional but the stylesheet (or some processing mechanism that substitutes for a stylesheet) is required.

According to the FAQ, XML does not require a DTD:
 

Full SGML uses a Document Type Definition (DTD) to describe the markup (elements) available in any specific type of document. However, the design and construction of a DTD can be a complex and non-trivial task, so XML has been designed so it can be used either with or without a DTD.  DTDless operation means you can invent markup without having to define it formally. 

To make this work, a DTDless file in effect `defines' its own markup, informally, by the existence and location of elements where you create them. But when an XML application such as a browser encounters a DTDless file, it needs to be able to understand the document structure as it reads it, because it has no DTD to tell it what to expect, so some changes have been made to the rules. 

Now for valid.  A valid document is one that conforms to the DTD in every respect.  In other words, unless the DTD allows a tag with a name of "color", an XML document containing a tag with that name is not valid.  SGML processors require the document to be valid.  Because XML does not require a DTD, an XML processor cannot require validation of the document.

As of the time of this writing, I have only worked with one XML parser and that is one that is currently available from Microsoft.  If a DTD is provided with the XML document, the MSXML parser will attempt to validate the document according to the DTD.  If a DTD is not provided, the parser skips the validation step.

What about well-formed?  As I understand it, being well-formed is not a requirement of SGML.  Rather it was introduced as a requirement of XML, apparently to deal with the situation where a DTD is not available.

According to the FAQ:
 

For example, HTML's <IMG> element is defined as `EMPTY': it doesn't have an end-tag. Without a DTD, an XML application would have no way to know whether or not to expect an end-tag for an element, so the concept of `well-formed' has been introduced. This makes the start and end of every element, and the occurrence of EMPTY elements completely unambiguous. 

All XML documents must be well-formed.  According to the FAQ, if there is no DTD in use, the document must start with a Standalone Document Declaration (SDD) that looks like the following:
 

<?XML version="1.0" standalone="yes"?> 

Although I'm not absolutely certain at this time, I don't believe that the current version of the MSXML parser enforces this requirement.

To be well-formed, all elements that can contain character data must have both start and end tags.

All attribute values must be in quotes (single quotes or double quotes).  You can surround the value with apostrophes (single quote) if the attribute value contains a double quote.  An attribute value that is surrounded by double quotes can contain apostrophes.

EMPTY elements (those that contain no character data) must be written in one of the following two ways:
 

<foo/>
<foo></foo>

(Note that an EMPTY element can contain one or more attributes inside the start tag.) The first example shown above is probably preferable because parsing problems can arise for an empty element when the start tag is on one line and the end tag ends up on the next line as shown below:
 

<foo> 
</foo>

The problem here is that this element isn't really EMPTY.  It contains a newline.
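
You can confirm this with a few lines of Java.  The sketch below uses the standard javax.xml.parsers API (an assumption on my part; it is not the MSXML parser used elsewhere in this lesson) to count the child nodes of the root element.  The class name EmptyElementCheck and the method name childCount are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class EmptyElementCheck {
    /** Returns the number of child nodes of the root element of the given XML text. */
    public static int childCount(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
            return root.getChildNodes().getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(childCount("<foo/>"));         // no children
        System.out.println(childCount("<foo></foo>"));    // no children
        System.out.println(childCount("<foo>\n</foo>"));  // one text node holding the newline
    }
}
```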

For a document to be well-formed, it must not have markup characters (< or &) in the text data.  If such characters are needed, you can represent them using &lt; and &amp; instead.

The sequence ]]> must be written as ]]&gt; if it does not occur as the end of a section marked as CDATA.

Elements must nest properly. In other words, if one element contains another, the entire second element must be defined inside the start and end tags of the first element.
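
The rules above can be checked mechanically by any XML parser.  As a minimal sketch, the following Java code asks the parser built into the Java platform (javax.xml.parsers, used here as a stand-in for whatever parser you have available) whether a piece of text is well-formed.  The class name WellFormedCheck and the method name isWellFormed are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {
    /** Returns true if the parser accepts the text as a well-formed XML document. */
    public static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler throws on fatal errors without printing a message.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;   // a well-formedness rule was violated
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<B>bold</B>"));           // matching case
        System.out.println(isWellFormed("<B>bold</b>"));           // case mismatch
        System.out.println(isWellFormed("<TD WIDTH=100>x</TD>"));  // unquoted attribute value
        System.out.println(isWellFormed("<B><U>x</B></U>"));       // improper nesting
        System.out.println(isWellFormed("<P>a & b</P>"));          // bare ampersand
    }
}
```

Only the first of the five samples passes.  Each of the others breaks one of the rules discussed above.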

According to the FAQ:
 

Well-formed files with no DTD may use attributes on any element, but the attributes must all be of  type CDATA by default. 

Well-formed XML files with no DTD are considered to have &lt;, &gt;, &apos;, &quot;, and &amp; predefined and thus available for use even without a DTD. 

Valid XML files must declare them explicitly if they use them. 

Again, although I'm not absolutely certain, I don't believe that the MSXML parser enforces the last sentence in the above quotation.  In fact, my experience is that when attempting to define these entities in a DTD, the parser issues a warning to the effect that they are already defined.

Valid XML files are those which have a DTD and which conform to the DTD. XML files must be well-formed, but there is no requirement for them to be valid. A DTD is not required in which case validity is impossible to establish. However, if they have a DTD, they must conform to it which makes them valid.

A valid XML file has a statement similar to the second statement in the following example as the first or second statement in the file (excluding comments):
 

    <?XML version="1.0"?> 
    <!DOCTYPE myDocument SYSTEM "http://HostName/FileName.dtd"> 

This statement is known as a Document Type Declaration (as distinguished from a Document Type Definition). This particular Document Type Declaration indicates that the outermost containing element in the document begins with a tag named myDocument.

The format of the statement tells the parser where to find a file containing the DTD.  The keyword SYSTEM indicates that the DTD is in a separate or external file.  There are other formats as well.

An XML document can have an external DTD as above, an internal DTD (the DTD can simply be prepended onto the document), some combination of the two, or no DTD at all. Regardless, an XML document must always be well-formed.
 

Programming Example for XML Documents

This section contains a programming example that uses the Java programming language in conjunction with the Microsoft XML Parser to illustrate how you might process XML files in simple ways.
 

The Microsoft Parser

There are several parsers currently available for free download on the Web.  At this point in time (5/2/98), XML is relatively new to me and I have only used the Microsoft parser.

As of this writing on 5/2/98, an XML parser was available for free downloading from Microsoft at http://www.microsoft.com/workshop/author/xml/parser/.  Remember, Web addresses often change so this page may have moved by the time you read this and it may be necessary for you to search for it.  If that happens, please let me know and I will correct the link in the online version of this document.

In addition to the Microsoft parser, there is another parser named Lark available at http://www.textuality.com/Lark/.  I plan to download it soon and give it a try.

The Microsoft XML (MSXML) parser consists of a set of Java classes that experienced Java programmers can use to process XML documents.

Note that the program in this section involves some non-trivial programming constructs such as recursion, enumeration, and downcasting.  While the information on XML discussed here is fairly elementary, the Java programming constructs are probably not for beginning programmers.
 

Displaying an XML File in Indented Format

The first thing that we are going to do is to display, in indented format, the element tree structure of a very simple XML document having no DTD.  This is indicative of the kind of information that the MSXML parser provides in at least one of its operating modes.
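
To give you a feel for the general idea before we get to the MSXML-based program, here is a minimal sketch of an indented display written against the standard DOM classes that ship with current Java platforms (javax.xml.parsers and org.w3c.dom).  It is not the program discussed in this section and it does not use the MSXML classes; the class name IndentedTree and the method name describe are hypothetical.  The sketch walks the element tree recursively, printing each element name and each non-blank run of text indented by two spaces per nesting level.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class IndentedTree {
    /** Parses the XML text and returns its element tree as indented lines. */
    public static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            walk(doc.getDocumentElement(), 0, out);
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Recursively appends one line per element and per non-blank text node.
    private static void walk(Node node, int depth, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            indent(out, depth);
            out.append(node.getNodeName()).append('\n');
            NodeList children = node.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                walk(children.item(i), depth + 1, out);
            }
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                indent(out, depth);
                out.append('"').append(text).append("\"\n");
            }
        }
    }

    private static void indent(StringBuilder out, int depth) {
        for (int i = 0; i < depth; i++) {
            out.append("  ");
        }
    }

    public static void main(String[] args) {
        System.out.print(describe(
            "<HTML><HEAD><TITLE>XMLParse01a</TITLE></HEAD><BODY><P>end</P></BODY></HTML>"));
    }
}
```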

Because many of you will probably have written HTML and will be familiar with the standard tags used in HTML, we will use a document originally created as an HTML document using the WYSIWYG Composer capability of Netscape Communicator 4.04.

We will write a Java application to process that document, but first it will be necessary for us to make the document well-formed.  To make it well-formed, we need to do the things described in an earlier section that discussed converting an HTML document to an XML document, and we will do that manually.

The well-formed version of the document is shown below.  If you cut and paste this document into a separate file, it will probably still be compatible with your HTML browser.  It will display some text before and after a yellow table that has three rows and two columns.  There is some text in each of the cells in the table.  As you view this material, you should recognize the existence of start and end tags such as

<HEAD>...</HEAD>

There are several EMPTY elements such as:

<META NAME="Author" CONTENT="Richard G. Baldwin"/>

Note the "/" that I added immediately before the closing ">" in order to make the document well-formed.

You should recognize nested elements such as:

<HEAD>...<META.../>...</HEAD>

We aren't going to worry much about what these tags mean.  You can learn what they mean by visiting the site that I referenced earlier for learning how to use HTML.  Our objective in this case is simply to see and process a well-formed XML document.  The fact that it is also an HTML document is not coincidental because I planned it that way.  However, the fact that it is also an HTML document is in no way essential to what we are trying to show here.

Take a look at the following well-formed HTML document that does not have a DTD.
 

<HTML>
<HEAD>
   <META HTTP-EQUIV="Content-Type" 
         CONTENT="text/html; charset=iso-8859-1"/>
   <META NAME="Author" 
         CONTENT="Richard G. Baldwin"/>
   <META NAME="GENERATOR" 
         CONTENT="Mozilla/4.04 [en] (Win95; I) [Netscape]"/>
   <TITLE>XMLParse01a</TITLE>
</HEAD>
<BODY>

File Name:  XMLParse01a.xml.
<BR>XML file originally generated using an HTML WYSIWYG editor.</BR>

<CENTER><P>Simple yellow table containing three rows and two
columns.</P></CENTER>
<BR>&gtp</BR>
<TABLE BORDER="2" COLS="2" WIDTH="100%" BGCOLOR="#FFFF99"> 
<TR>
<TD>Row 1, Col 1</TD>

<TD>Row 1, Col 2</TD>
</TR>

<TR>
<TD>Row 2, Col 1</TD>

<TD>Row 2, Col 2</TD>
</TR>

<TR>
<TD>Row 3, Col 1</TD>

<TD>Row 3, Col 2</TD>
</TR>
</TABLE>
<BR></BR>
This is <B>bold text</B> after the table with <U>underlined</U> word.

<P>end</P>
</BODY>
</HTML>

Before we look at the Java program that I wrote to process this XML file, let's look at the output produced by the program.  This program used the MSXML parser to extract the information from the document and put it into a tree structure of nested Element objects.  I wrote a display handler to display that tree information in an indented notation showing the nesting characteristic of the elements in the document.

You should be able to correlate the information in the following output with the raw XML document shown above.  For example, all of the material in the XML document is nested inside an element with start and end tags as shown below:

<HTML>...</HTML>

In the XML document, the start tag is on the first line and the end tag is on the last line.  That structure is reflected in the indented tree shown below.  Note that ELEMENT HTML is the top-level element and all other elements are indented relative to that element.  Nesting is indicated by varying levels of indentation.

If you go back to the raw XML document and examine the first element of type META, you will see that this element has two attributes:

HTTP-EQUIV="Content-Type"
CONTENT="text/html; charset=iso-8859-1"

If you examine the parser output shown below, you will see that those two attributes have been extracted and delivered in a format that separates the NAME of each attribute from its VALUE and also associates each attribute with the element that "owns" it.  This makes the attribute easier to process than when it is in its raw form.
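The separation of attribute names from attribute values can be sketched with the standard org.w3c.dom API that ships with current JDKs.  Note that this is a stand-in chosen for illustration and is not the MSXML API.  The class name AttrDump and the method name attributesOf() are hypothetical.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class AttrDump {//hypothetical illustration class

  //Returns the attributes of the root element keyed by name.
  static Map<String,String> attributesOf(String xml){
    try{
      Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
      NamedNodeMap attrs =
        doc.getDocumentElement().getAttributes();
      //TreeMap gives a predictable (alphabetical) order
      Map<String,String> result = new TreeMap<>();
      for(int i = 0; i < attrs.getLength(); i++){
        Node attr = attrs.item(i);
        result.put(attr.getNodeName(),attr.getNodeValue());
      }//end for loop
      return result;
    }catch(Exception e){
      throw new RuntimeException(e);
    }//end catch
  }//end attributesOf

  public static void main(String[] args){
    String meta = "<META HTTP-EQUIV=\"Content-Type\" "
      + "CONTENT=\"text/html; charset=iso-8859-1\"/>";
    for(Map.Entry<String,String> a :
                             attributesOf(meta).entrySet()){
      System.out.println("**Attr Name=" + a.getKey()
                               + " Value=" + a.getValue());
    }//end for loop
  }//end main
}//end class
```

Because the map is sorted by name, this sketch lists CONTENT before HTTP-EQUIV, which differs from the document order shown in the MSXML output.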

If you move on down to the TABLE element, you will see an example of multiply nested elements with attributes and data values. The TABLE element has four attributes, BORDER, COLS, WIDTH, and BGCOLOR.

Nested inside the TABLE element are three elements of type TR.  This is the HTML terminology for a table row.

Nested inside each TR element are two elements of type TD.  This is the HTML terminology for table data.

Each TD element contains text data, designated as PCDATA.  For example, the text data contained in the first TD element is: "Row 1, Col 1".

So, the raw XML file shown above, which is somewhat haphazard, has been converted by the processor into a series of well-ordered pieces of information, each properly labeled and provided in a format that is suitable for processing.
 

ELEMENT HTML 
-ELEMENT HEAD 
--ELEMENT META 
--**Attr Name=HTTP-EQUIV Value=Content-Type
--**Attr Name=CONTENT Value=text/html; charset=iso-8859-1
--ELEMENT META 
--**Attr Name=NAME Value=Author
--**Attr Name=CONTENT Value=Richard G. Baldwin
--ELEMENT META 
--**Attr Name=NAME Value=GENERATOR
--**Attr Name=CONTENT Value=Mozilla/4.04 [en] (Win95; I) [Netscape]
--ELEMENT TITLE 
---PCDATA XMLParse01a
-ELEMENT BODY 
--PCDATA  File Name: XMLParse01a.xml. 
--ELEMENT BR 
---PCDATA XML file originally generated using an HTML WYSIWYG editor.
--ELEMENT CENTER 
---ELEMENT P 
----PCDATA Simple yellow table containing three rows and two columns.
--ELEMENT BR 
--ELEMENT TABLE 
--**Attr Name=BORDER Value=2
--**Attr Name=COLS Value=2
--**Attr Name=WIDTH Value=100%
--**Attr Name=BGCOLOR Value=#FFFF99
---ELEMENT TR 
----ELEMENT TD 
-----PCDATA Row 1, Col 1
----ELEMENT TD 
-----PCDATA Row 1, Col 2
---ELEMENT TR 
----ELEMENT TD 
-----PCDATA Row 2, Col 1
----ELEMENT TD 
-----PCDATA Row 2, Col 2
---ELEMENT TR 
----ELEMENT TD 
-----PCDATA Row 3, Col 1
----ELEMENT TD 
-----PCDATA Row 3, Col 2
--ELEMENT BR 
--PCDATA  This is 
--ELEMENT B 
---PCDATA bold text
--PCDATA  after the table with 
--ELEMENT U 
---PCDATA underlined
--PCDATA  word. 
--ELEMENT P 
---PCDATA end

Adding a DTD to the XML Document

The XML document in this example was parsed using the MSXML parser without benefit of a DTD. Therefore, the parser made no attempt to validate the document.

Now I am going to show you a simple DTD that could be used to validate this XML document.  Please note that this DTD is not generalized in any way.  I designed it specifically to match the XML document.  An actual DTD for use with HTML documents is a long and complex document.  You can find information on a DTD for HTML at  (sorry, but the link previously shown here is no longer valid as of 6/16/99).

Another good source of information is the book HTML Unleashed.  As of 5/2/98, excerpts from this book are available online and I will give you the URL a little later.

Please be aware that this section on the DTD barely skims the surface of what you need to know to be able to create a DTD.  If you need to create a DTD for an XML document, you will probably need to refer to sources of information beyond this one.  This section is intended solely as a very brief introduction into the creation and use of a DTD.

The complete listing for the custom DTD used in this example is shown below.  I will explain the individual parts following the listing.
 

<?xml version="1.0"?>
<!DOCTYPE HTML[

<!ENTITY % text "#PCDATA | B | U | BR | P | TABLE | CENTER">

<!ELEMENT HTML (HEAD,BODY)>
<!ELEMENT HEAD (META*,TITLE*)>
<!ELEMENT META EMPTY>
<!ATTLIST META
    HTTP-EQUIV  CDATA #IMPLIED
    CONTENT  CDATA #IMPLIED
    NAME  CDATA #IMPLIED
    CONTENT  CDATA #IMPLIED
    NAME  CDATA #IMPLIED
    CONTENT  CDATA #IMPLIED>
<!ELEMENT TITLE (#PCDATA)>
<!ELEMENT BODY (%text;)*>
<!ELEMENT BR (%text;)*>
<!ELEMENT CENTER (%text;)*>
<!ELEMENT P (%text;)*>

<!ELEMENT TABLE (TR)*>
<!ATTLIST TABLE
    BORDER  CDATA #REQUIRED
    COLS  CDATA #REQUIRED
    WIDTH  CDATA #REQUIRED
    BGCOLOR  CDATA #REQUIRED>
<!ELEMENT TR (TD,TD )>
<!ELEMENT TD (#PCDATA)>
<!ELEMENT B (#PCDATA)>
<!ELEMENT U (#PCDATA)>
]>

 

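A DTD can be included at the beginning of an XML file, provided as a separate file that is referenced at the beginning of the XML file, or supplied as a combination of the two.  This DTD should work either way, provided the XML document is updated to expect an external DTD in a separate file.  Be aware, however, that XML 1.0 allows a parameter-entity reference such as %text; inside another declaration only in an external DTD, so a strictly conforming parser may reject those declarations when the DTD is embedded in the document.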

Consider first the DOCTYPE statement in the DTD.  In the following fragment, I have deleted the entire body of the DTD and have shown the DOCTYPE element with the proper syntax from beginning to end.  As indicated, this is a DTD for use with XML documents whose outer element is of type HTML. (Again, lest there be any mistake, this is not really a DTD for a general HTML document.  It is a very specialized DTD for my XML document which just happens to have an outer element of type HTML because I created it using a WYSIWYG HTML editor and then modified it to make it well-formed.)
 

<!DOCTYPE HTML[
...
]>

The DOCTYPE statement is followed immediately by an entity declaration, so the next thing that we want to look at is an entity declaration.  An entity is something that is substituted in place of something else, sometimes for convenience, and sometimes out of necessity.

For example, there are certain situations where you are not allowed to include the ampersand character directly in the text of your XML document, but need that character to be there anyway.  There is a predefined entity that you can enter in place of the ampersand character using the following syntax:  &amp;  This is an entity reference.  Then, at the appropriate time, the system will automatically substitute an actual ampersand character in place of the entity reference so that it will be there when needed.

There are two sides to every entity: the declaration that gives the entity a name and a value, and the references to it by name in your document, each of which is replaced by the value of the entity.  As mentioned above, many of you will be familiar with the use of entity references such as &amp; in a document.  This particular entity reference means to replace the reference by the predefined value of the entity whose name is amp, which is the ampersand (&) itself.
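The substitution of the predefined amp entity can be seen in a few lines of Java.  This is a minimal sketch that assumes the JAXP parser bundled with current JDKs rather than MSXML.  The class name AmpersandDemo and the method name textOf() are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AmpersandDemo {//hypothetical illustration class

  //Returns the text content of the root element after the
  // parser has replaced all entity references.
  static String textOf(String xml){
    try{
      Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
      return doc.getDocumentElement().getTextContent();
    }catch(Exception e){
      throw new RuntimeException(e);
    }//end catch
  }//end textOf

  public static void main(String[] args){
    //The raw document contains the reference, not a bare &
    System.out.println(textOf("<P>AT&amp;T</P>"));//AT&T
  }//end main
}//end class
```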

Without going into detail, let me simply say that entity references can also appear in a DTD and in this case, the reference might look something like %text;.  Note the use of the percent sign in place of the ampersand that is normally used for an entity reference in the XML document itself. Entity references in the DTD begin with the percent character, not the ampersand character.

The following partial quotation from the book currently available online (5/2/98) at http://www.webreference.com/dlab/books/html/3-5.html#3-5-2-1 explains the required syntax of an entity declaration.
 

Let's consider ... the syntax of an entity declaration.  It uses the ENTITY statement that, like all other SGML statements, requires a ! after the start delimiter <.  After the ENTITY keyword comes the % character indicating that the entity in question is a parameter entity rather than a general entity. 

Separated from % by one or more spaces is the entity name that is later used to invoke the entity.   ... Also recollect that entity names are different from element names in that they are case sensitive. 

The last obligatory component of an entity declaration is the string enclosed in quotation marks (data string) that shows what this entity stands for and what it will expand to when invoked.

This quotation is from the book entitled HTML Unleashed, By Rick Darnell, et al., Sams.net Publishing.  Excerpts from the book are currently available at the URL mentioned above (5/2/98).

The following fragment shows the declaration for an entity named text.  This declaration says that the item to be substituted when the entity is referenced can consist of one, and only one, of the mutually exclusive items connected by the vertical bar character "|".  The "|" is the XML symbol for the exclusive-or connector.  Other connectors can be used as well, and we will see examples later.  See the above book, or any good XML reference document, for a list of available connectors.

The material inside the quotes is often referred to as a content model group.  The last six items in this particular group are the names of element types that will be defined later in the DTD.  This means that an entity reference that is replaced by text can contain one of these elements or can contain PCDATA.
 

<!ENTITY % text "#PCDATA | B | U | BR | P | TABLE | CENTER">

The first item in the declaration for the text entity (#PCDATA) needs a little more explanation.  Rather than try to explain it in my own words, I am going to provide a quotation from the same book mentioned above.
 

Besides element names, you can use the #PCDATA (Parsed Character DATA) keyword in model groups.  It refers to "usual" characters of the document without any markup tags and can be used to explicitly allow or disallow plain text within an element. 

It is different, however, from the CDATA keyword discussed earlier.  First, #PCDATA can be used only within a model group and not on its own as CDATA (that is, #PCDATA should be enclosed in parentheses even when it stands alone).  And second, #PCDATA does not imply ignoring markup; if a tag is encountered in the context where only #PCDATA is allowed, a compliant SGML parser should fix an error rather than ignore this tag. 

Together with the connectors and occurrence indicators listed, #PCDATA can limit the set of elements allowed inside another element without prohibiting plain text from appearing there. 

 
At this point in the DTD, we have an entity declaration for an entity named text.  An item of this type substituted into the DTD can consist of text as described above, or one of the elements listed in the entity declaration.  We will make heavy use of references to the text entity in our DTD because our concept of text appears at several places in our XML document.

We saw the exclusive or connector in the previous fragment.  We will see another connector in the next fragment.

This fragment specifies that an element of type HTML can (and must) contain two other elements of type HEAD and BODY.  These two elements must be nested in the HTML element in the order shown.  The connector used here is the comma.  The comma connector is used to specify the order in which elements must be nested in the type of element being defined.

The comma connector doesn't specify whether or not the element must occur, or how many times it can occur.  That is controlled by the use of an occurrence indicator that we will see in a later fragment.  The comma simply defines the order in which the elements must occur if they do occur.  Again, however, in this case the HEAD and BODY elements must occur in that order because they are not modified by an occurrence indicator.
 

<!ELEMENT HTML (HEAD,BODY)>

Now let's look at a line that specifies that two elements can occur in any quantity in a specified order without a requirement that they must occur.

The following fragment specifies that the elements META and TITLE may occur in the HEAD element of the XML document.  Each can occur zero, one, or more times (as defined by the * occurrence indicator).  If they do occur, all of the META elements must precede all of the TITLE elements.
 

<!ELEMENT HEAD (META*,TITLE*)>

In addition to the *, there are two other occurrence indicators that are used to show how many times an element can occur in a content model.  They are described below:
 

? - means that the element may occur either once or not at all.
+ - means that the element may occur one or more times.
* - means that the element may occur any number of times or not at all.
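The comma connector and the * occurrence indicator can be exercised with a validating parser.  The following is only a sketch under stated assumptions: it uses the JAXP validating parser bundled with current JDKs instead of MSXML, it embeds a cut-down DTD (BODY is simplified to plain text), and the names ContentModelCheck and isValid() are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class ContentModelCheck {//hypothetical illustration class

  //Cut-down DTD.  BODY is simplified for this sketch.
  static final String DTD =
      "<!DOCTYPE HTML [\n"
    + "<!ELEMENT HTML (HEAD,BODY)>\n"
    + "<!ELEMENT HEAD (META*,TITLE*)>\n"
    + "<!ELEMENT META EMPTY>\n"
    + "<!ELEMENT TITLE (#PCDATA)>\n"
    + "<!ELEMENT BODY (#PCDATA)>\n"
    + "]>\n";

  //Returns true if the markup is valid against the DTD.
  static boolean isValid(String markup){
    final boolean[] ok = {true};
    try{
      DocumentBuilderFactory factory =
                       DocumentBuilderFactory.newInstance();
      factory.setValidating(true);//turn on DTD validation
      DocumentBuilder builder = factory.newDocumentBuilder();
      builder.setErrorHandler(new DefaultHandler(){
        //Validity errors arrive here and are not fatal
        @Override
        public void error(SAXParseException e){
          ok[0] = false;
        }//end error
      });
      builder.parse(
        new InputSource(new StringReader(DTD + markup)));
    }catch(SAXException e){
      return false;//not even well-formed
    }catch(Exception e){
      throw new RuntimeException(e);
    }//end catch
    return ok[0];
  }//end isValid

  public static void main(String[] args){
    //HEAD then BODY, as the comma connector requires
    System.out.println(
      isValid("<HTML><HEAD/><BODY/></HTML>"));
    //Same elements in the wrong order
    System.out.println(
      isValid("<HTML><BODY/><HEAD/></HTML>"));
    //TITLE ahead of META violates (META*,TITLE*)
    System.out.println(isValid("<HTML><HEAD>"
      + "<TITLE>t</TITLE><META/></HEAD><BODY/></HTML>"));
  }//end main
}//end class
```

An empty HEAD is accepted because both META and TITLE carry the * indicator, while swapping HEAD and BODY is rejected because neither carries an occurrence indicator and the comma fixes their order.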
 

That brings us to the concept of an empty element.  As mentioned very early in this lesson, an empty element is one which doesn't have any content specified between its start and end tags.  Even though they don't have content, empty elements can have attributes specified inside the start tag.  Attributes can occur in any order within the start tag.

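The following fragment defines an empty element type named META along with a list of its allowable attributes.  The keyword EMPTY specifies that the element is empty.  The keyword ATTLIST is used to indicate the beginning of a list of allowable attributes.  The list has six entries but only three distinct attributes (HTTP-EQUIV, CONTENT, and NAME) because CONTENT and NAME are repeated.  XML 1.0 treats the first declaration of an attribute name as binding and ignores later ones, so the repeated entries are redundant.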
 

<!ELEMENT META EMPTY>
<!ATTLIST META
    HTTP-EQUIV  CDATA #IMPLIED
    CONTENT     CDATA #IMPLIED
    NAME        CDATA #IMPLIED
    CONTENT     CDATA #IMPLIED
    NAME        CDATA #IMPLIED
    CONTENT     CDATA #IMPLIED>

There are three items in the definition of an allowable attribute.  The first item is the name of the attribute that must be used if the attribute appears in the XML document.

The second item defines the type of the value that can be specified for the attribute in the XML document.  In this fragment, the allowable type of value for every attribute is CDATA.  According to HTML Unleashed:
 

CDATA means that the value of this attribute may be any string of characters (as well as an empty string) and should be ignored by the parser.  CDATA is used in situations where it is impossible to force more strict limitations on the attribute value with one of the following keywords...

There are three allowable types for an attribute that I will simply list here without providing an explanation:

  1. string type, such as CDATA
  2. tokenized types
  3. enumerated types, such as (true | false)

The third item in the attribute definition provides a default value for the attribute.  There are three possibilities here as well:

  1. #REQUIRED
  2. #IMPLIED
  3. literal

In the first case, the valid XML document must provide a value for the attribute.

In the second case, the XML document may provide a value for the attribute but is not required to do so.  In this case, if no value is provided, an application-dependent value will be used.  For example, for an IMPLIED attribute named BackgroundColor, an XML processor might accept a value if provided in the XML document, and might cause the background color to be green if an attribute value is not provided. A different XML processor might cause the same default background color to be red.  That is what I mean by "application-dependent value."

The third case allows for the specification of a literal value for the case where an attribute value is not provided by the XML document. There are several possibilities here as well. For example, the following would cause the parser to produce the string "true" as the value of the attribute if the XML document didn't provide a value for the enumerated attribute named Exit (so far, I haven't been able to get this to work properly with the MSXML parser):
 

Exit (true | false) "true"
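For comparison, here is a minimal sketch of the literal default for Exit using the JAXP parser bundled with current JDKs (an assumption made for illustration; this is not MSXML, so it says nothing about the MSXML problem mentioned above).  The element name DOC, the class name AttributeDefaults, and the method name exitValue() are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class AttributeDefaults {//hypothetical illustration class

  //Parses the given root tag together with a small embedded
  // DTD and returns the value of the Exit attribute.
  static String exitValue(String rootTag){
    String xml =
        "<!DOCTYPE DOC [\n"
      + "<!ELEMENT DOC EMPTY>\n"
      + "<!ATTLIST DOC Exit (true | false) \"true\">\n"
      + "]>\n"
      + rootTag;
    try{
      Document doc = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
      return doc.getDocumentElement().getAttribute("Exit");
    }catch(Exception e){
      throw new RuntimeException(e);
    }//end catch
  }//end exitValue

  public static void main(String[] args){
    //No value in the document: the DTD default is supplied
    System.out.println(exitValue("<DOC/>"));
    //A value in the document overrides the default
    System.out.println(exitValue("<DOC Exit=\"false\"/>"));
  }//end main
}//end class
```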

The next fragment shows that an element of type TITLE cannot contain any nested elements, but can contain text data of type PCDATA discussed earlier.
 

<!ELEMENT TITLE (#PCDATA)>

That brings us to one of the trickier aspects of creating this particular DTD.  Generally, the body of an HTML document can contain a variety of different element types ordered many different ways.  For example, the document that you are reading is an HTML document and it mixes text and tables routinely in no pre-planned order (the yellow boxes are actually tables having one row and one column).

SGML provides a connector "&" that allows the specification of unordered elements but it is not supported by XML. One way to make it possible for an element to contain many different types of elements in any order is as shown in the following fragment.
 

<!ELEMENT BODY (%text;)*>

Recall that we declared the text entity earlier.  In that declaration, we said that one insertion of a text entity could consist of either PCDATA or any one of several other element types.

The above statement uses the * occurrence indicator to specify that the BODY element can contain any number of insertions of the entity named text.  Since any one insertion can be any one of several different types, and any number of insertions is allowed, the end result is that the BODY element can contain any combination of the following items in any order: PCDATA, B, U, BR, P, TABLE, and CENTER.  The first item is text data and the remaining items are elements to be defined later in the DTD (recall that these were the items in the content model group in the declaration of the entity named text).

The definitions for three of these elements are shown in the next fragment.  As you can see, these three element types are also defined to contain text according to our definition of text.
 

<!ELEMENT BR (%text;)*>
<!ELEMENT CENTER (%text;)*>
<!ELEMENT P (%text;)*>
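The "any combination in any order" behavior can be checked with a validating parser.  This sketch assumes the JAXP parser bundled with current JDKs, not MSXML.  To keep it independent of how a given parser treats parameter-entity references in an embedded DTD, the content model is written out in expanded form and reduced to #PCDATA, B, and U.  The names MixedContentCheck and isValid() are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class MixedContentCheck {//hypothetical illustration class

  //Reduced, expanded form of (%text;)* for this sketch
  static final String DTD =
      "<!DOCTYPE BODY [\n"
    + "<!ELEMENT BODY (#PCDATA | B | U)*>\n"
    + "<!ELEMENT B (#PCDATA)>\n"
    + "<!ELEMENT U (#PCDATA)>\n"
    + "]>\n";

  //Returns true if the markup is valid against the DTD.
  static boolean isValid(String markup){
    final boolean[] ok = {true};
    try{
      DocumentBuilderFactory factory =
                       DocumentBuilderFactory.newInstance();
      factory.setValidating(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      builder.setErrorHandler(new DefaultHandler(){
        @Override
        public void error(SAXParseException e){
          ok[0] = false;//record the validity error
        }//end error
      });
      builder.parse(
        new InputSource(new StringReader(DTD + markup)));
    }catch(SAXException e){
      return false;
    }catch(Exception e){
      throw new RuntimeException(e);
    }//end catch
    return ok[0];
  }//end isValid

  public static void main(String[] args){
    //Text and elements freely mixed
    System.out.println(isValid("<BODY>This is <B>bold text"
      + "</B> with <U>underlined</U> word.</BODY>"));
    //A different order and repeated elements
    System.out.println(isValid(
      "<BODY><U>u</U><B>b</B>text<B>b2</B></BODY>"));
    //U nested in B is not allowed by B's content model
    System.out.println(isValid(
      "<BODY><B><U>x</U></B></BODY>"));
  }//end main
}//end class
```

The first two documents are accepted even though their contents differ in order and quantity.  The third is rejected because B is declared to hold text only.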

The definition of the TABLE element is contained in the next fragment.  A TABLE element is defined to contain any number of elements of type TR.  TR is HTML terminology for table row.  Thus, a table can contain any number of rows.

In addition, the TABLE element definition as shown here has four required attributes (a real HTML table has more attributes than this).  The first attribute specifies the width of the border when the table is rendered.  The second specifies the number of columns.  The third specifies the width of the table, and the fourth specifies the background color.
 

<!ELEMENT TABLE (TR)*>
<!ATTLIST TABLE
    BORDER  CDATA #REQUIRED
    COLS  CDATA #REQUIRED
    WIDTH  CDATA #REQUIRED
    BGCOLOR  CDATA #REQUIRED>

At least two of the above attributes (BORDER and BGCOLOR) have a #IMPLIED default value in a real HTML document.  I decided to make them all #REQUIRED here to give you exposure to required attributes.

From the above, we conclude that a TABLE element contains any number of nested TR elements.  The next fragment shows our definition of the TR (table row) element. In our case, a TR element contains exactly two TD  elements.  TD stands for table data in common HTML terminology.  This is consistent with the fact that each row of our table is subdivided into two columns.  The cell at the intersection of a row and a column is a TD element.
 

<!ELEMENT TR (TD,TD )>

To wrap up the discussion of the TABLE element, the next fragment shows that each TD element contains exactly one text item of type PCDATA.
 

<!ELEMENT TD (#PCDATA)>
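The TABLE, TR, and TD declarations can be tried together in a short program.  As before, this is a sketch that assumes the JAXP validating parser bundled with current JDKs rather than MSXML.  The names TableDtdCheck and validityErrors() are hypothetical, and the wording of the error messages is parser-specific.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class TableDtdCheck {//hypothetical illustration class

  static final String DTD =
      "<!DOCTYPE TABLE [\n"
    + "<!ELEMENT TABLE (TR)*>\n"
    + "<!ATTLIST TABLE\n"
    + "    BORDER  CDATA #REQUIRED\n"
    + "    COLS  CDATA #REQUIRED\n"
    + "    WIDTH  CDATA #REQUIRED\n"
    + "    BGCOLOR  CDATA #REQUIRED>\n"
    + "<!ELEMENT TR (TD,TD)>\n"
    + "<!ELEMENT TD (#PCDATA)>\n"
    + "]>\n";

  //Returns the validity error messages, empty if valid.
  static List<String> validityErrors(String markup){
    final List<String> errors = new ArrayList<>();
    try{
      DocumentBuilderFactory factory =
                       DocumentBuilderFactory.newInstance();
      factory.setValidating(true);
      DocumentBuilder builder = factory.newDocumentBuilder();
      builder.setErrorHandler(new DefaultHandler(){
        @Override
        public void error(SAXParseException e){
          errors.add(e.getMessage());
        }//end error
      });
      builder.parse(
        new InputSource(new StringReader(DTD + markup)));
    }catch(Exception e){
      throw new RuntimeException(e);
    }//end catch
    return errors;
  }//end validityErrors

  public static void main(String[] args){
    String start = "<TABLE BORDER=\"2\" COLS=\"2\" "
      + "WIDTH=\"100%\" BGCOLOR=\"#FFFF99\">";
    //Two cells per row: valid
    System.out.println(validityErrors(start
      + "<TR><TD>a</TD><TD>b</TD></TR></TABLE>"));
    //Three cells in a row violates (TD,TD)
    System.out.println(validityErrors(start
      + "<TR><TD>a</TD><TD>b</TD><TD>c</TD></TR></TABLE>"));
    //Missing #REQUIRED attributes
    System.out.println(validityErrors("<TABLE BORDER=\"2\">"
      + "<TR><TD>a</TD><TD>b</TD></TR></TABLE>"));
  }//end main
}//end class
```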

Finally, we have two more elements to define:

  1. B which in HTML terminology means bold
  2. U which in HTML terminology means underline

The definitions of these two elements in the DTD are shown in the next fragment.  Each element contains exactly one text item of type PCDATA.

That brings us back to where we started at the beginning of this lesson where you invented your own markup language to make it possible to cause certain parts of a plain text document to be printed in bold, italics, and underline.

The same concepts that you used then still hold.  Only the implementation has changed, and that is due to the need for much more capability than simply bold, italics, and underline.
 

<!ELEMENT B (#PCDATA)>
<!ELEMENT U (#PCDATA)>
]>

The Java Program

The next couple of sections will discuss the Java program that was used to produce the indented Element Object tree shown earlier.

This program will parse an XML file with or without a DTD and display an indented element tree on the standard output device identifying and showing the nesting relationship of all of the elements along with the attributes and their values for those elements.

If there is no DTD provided, the validation step is skipped.

If a DTD is provided, the XML file is validated.

Validation and parsing errors are displayed on the standard output device.

This program uses a hard-coded file name for the XML file but it wouldn't be difficult to upgrade it to use a command line argument for the file name. It also wouldn't be difficult to upgrade it to use a GUI for both input and output.

This program uses the Microsoft XML parser to parse the XML file and build a tree of objects where each object in the tree represents one element in XML terminology.  The code developed by this author traverses the tree and formats it for display on the standard output device.

The program was tested using JDK 1.1.5 and MSXML parser V1.8 under Win95.
 

Interesting Code Fragments

The first code fragment shows the package references required for use with the MSXML parser classes.
 

import java.util.*;
import java.net.URL;

//Import MSXML parser classes
import com.ms.xml.om.*;
import com.ms.xml.util.*;
import com.ms.xml.parser.*;

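The next fragment is in the main() method.  This fragment instantiates an object of the controlling class, creates an empty MSXML Document object, and defines the URL of the XML file to be parsed.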

class XMLParse01{//controlling class

  public static void main(String[] args){  

    XMLParse01 thisObj = new XMLParse01();
    Document doc = new Document();
    String urlName = 
                   "file:///c:/Baldwin/JavaProg/Combined/"
                                   +"Java/XMLParse01a.xml";

The next fragment makes a call to the method named loadDocument which we will be discussing soon.  This method loads the specified XML document into an object of type Document which is one of the classes in the MSXML package.  If the load is successful, this object is the tree of Element objects containing all of the  available information about the XML document.  If the load is not successful, validation or parser error messages will be displayed on the standard output device.
 

    thisObj.loadDocument(urlName,doc);

Assuming that we were successful in creating the tree, we begin processing the tree to extract information from the tree at this point.

The tree is a tree of objects of type Element which is one of the classes in the MSXML package.  Some of the objects are nodes and some are leaves.  The root node in the tree represents the outer element in the XML document. The leaves represent the inner-most elements which do not contain other nested elements.  The nodes represent elements that contain nested elements.

We will get a reference to the root object by invoking getRoot() on the tree object of type Document referred to by the reference variable named doc.

We will process this object by invoking our own processAnElement() method on the object.  In this simple case, processing consists simply of extracting and displaying information about each of the Element objects in the tree.  In a real-world situation, processing might involve substantially more, but this skeleton should be a good starting point for such a real-world program.

After we process the root object, we begin processing all of the nodes that are children of the root.  We will recursively traverse the tree extracting and displaying all of the available information associated with each node or leaf in the tree.

Once we return from the recursive tree-traversal process, we will display a termination message and terminate the program.  That is the end of the main() method.
 

    Element theRoot = doc.getRoot();

    thisObj.processAnElement(theRoot,0);

    thisObj.traverseTheTree(theRoot,0);

    System.out.println("end");
  }//end main

The next code fragment contains the entire loadDocument() method.  This is the workhorse method that causes the element tree to be constructed.

This method receives a reference to an MSXML Document object along with a String object containing the identification of an XML document file compatible with the URL constructor.

After getting a new URL object, this method invokes the MSXML load() method on the Document object passing the URL object as a parameter.

If the load is successful, this creates the tree of Element objects.  If the load is not successful, parsing or validation errors are displayed on the standard output device.

When this method returns successfully, an object containing an Element tree describing the XML document will be referenced by the Document reference variable created in main() and passed to this method as a parameter.

If there is a parser or validation error, the method uses the reportError() method of the MSXML package to report on the nature of the error.  These are the diagnostic messages provided by the program.

There is also a catch block for catching and reporting on standard exceptions if they occur.
 

  void loadDocument(String urlName,Document doc){
      try {
        //Instantiate a URL object
        URL url = new URL(urlName);

        doc.load(url);
      }//end try

      catch (ParseException e) {
        System.out.println("ParseException");
        doc.reportError(e, System.out);
      }//end catch

      catch(Exception e){
        //Catch and display any top-level exceptions
        System.out.println("Exception");
        System.out.println(e);
      }//end catch
  }//end loadDocument()
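Readers who do not have the MSXML classes can try the same load-and-report pattern with the JAXP parser bundled with current JDKs.  The following sketch is my assumption of a rough equivalent and is not the MSXML API.  The class name LoadWithJaxp and the helper tempFileUrl(), which exists only to give the demonstration a file to load, are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

class LoadWithJaxp {//hypothetical illustration class

  //Loads the document at the URL.  Returns null and prints
  // a diagnostic message if the load fails.
  static Document loadDocument(String urlName){
    try{
      DocumentBuilder builder =
        DocumentBuilderFactory.newInstance().newDocumentBuilder();
      //Keep the parser from also printing to stderr
      builder.setErrorHandler(new DefaultHandler());
      return builder.parse(urlName);
    }catch(SAXParseException e){
      System.out.println("ParseException at line "
        + e.getLineNumber() + ": " + e.getMessage());
    }catch(Exception e){
      //Catch and display any other exceptions
      System.out.println("Exception");
      System.out.println(e);
    }//end catch
    return null;
  }//end loadDocument

  //Demo helper: writes text to a temporary file and
  // returns a file: URL for it.
  static String tempFileUrl(String text){
    try{
      Path file = Files.createTempFile("XMLParse",".xml");
      file.toFile().deleteOnExit();
      Files.write(file,text.getBytes(StandardCharsets.UTF_8));
      return file.toUri().toString();
    }catch(IOException e){
      throw new UncheckedIOException(e);
    }//end catch
  }//end tempFileUrl

  public static void main(String[] args){
    Document good = loadDocument(
      tempFileUrl("<HTML><BODY>x</BODY></HTML>"));
    System.out.println(
      good.getDocumentElement().getNodeName());
    //The BODY end tag is missing, so the load fails
    Document bad = loadDocument(
      tempFileUrl("<HTML><BODY>x</HTML>"));
    System.out.println(bad == null);
  }//end main
}//end class
```

Unlike the MSXML version, this sketch returns the tree instead of filling in a Document object that is passed in, because the JAXP parser creates the Document itself.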

The next fragment implements a recursive algorithm to traverse the element tree, beginning at the root node Element object passed in as theRoot.

The parentLevel parameter is used to keep count of the recursion level.  This method causes each node and leaf in the tree to be processed.  In this case, processing means to extract all the information about each node and display it in an indented tree format.  The count of recursion level is used to decide how far to indent for each node or leaf being processed and displayed.

I will assume that you understand recursion and will make no attempt here to explain how recursion works.

We begin the recursion process by incrementing and saving the recursion counter.

Next, we get an Enumeration object that can be used to iterate on all of the Element nodes at this level.
 

Caution.  Here is a possible point of confusion:  The methods hasMoreElements() and nextElement() used below are declared in the standard Enumeration interface and have nothing to do with the MSXML class named Element.  The getElements() method, on the other hand, is invoked on an MSXML Element object and returns the Enumeration.  It is strictly coincidental that the Enumeration method names contain the word element while we are iterating on objects of type Element.


Next, we create a reference variable named refToNextElement to refer to the next Element object as we use a while loop to iterate on the list of Element nodes at this level.

Inside the while loop, we get and process the next element in the list of Element nodes at this level.  We must downcast from type Object to type Element.  Recall that Element is a class of the MSXML package. Object is the very top class in the Java class hierarchy.

Also inside the while loop, we invoke the processAnElement() method to display information about the Element node at this level in the tree.

Finally, we make a call to traverseTheTree() to recursively process the children of this Element node.  If this is a leaf node with no children, this call will return immediately because the conditional clause (children.hasMoreElements()) in the while loop will be false on the first attempt to iterate.
 

  void traverseTheTree(Element theRoot,int parentLevel){

    int thisRecursionLevel = ++parentLevel;
  
    //The Enumeration is named children because enum is a
    // reserved word as of Java 5.
    Enumeration children = theRoot.getElements();
    
    Element refToNextElement;
    while(children.hasMoreElements()){
      refToNextElement = (Element)children.nextElement();
      processAnElement(refToNextElement,thisRecursionLevel);
      traverseTheTree(refToNextElement,thisRecursionLevel);
    }//end while loop
  }//end traverseTheTree
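The same recursive traversal can be sketched against the standard org.w3c.dom API for readers without the MSXML classes.  This is an assumed stand-in and not the author's MSXML program: a DOM Node plays the role of the MSXML Element, a text node that contains only whitespace plays the role of WHITESPACE, and attributes are not displayed.  The names DomTreeDump, dump(), traverse(), and process() are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTreeDump {//hypothetical illustration class

  //Recursively visits every descendant of parent.
  static void traverse(Node parent,int parentLevel,
                                         StringBuilder out){
    int thisRecursionLevel = parentLevel + 1;
    NodeList children = parent.getChildNodes();
    for(int i = 0; i < children.getLength(); i++){
      Node child = children.item(i);
      process(child,thisRecursionLevel,out);
      traverse(child,thisRecursionLevel,out);
    }//end for loop
  }//end traverse

  //Appends one indented line describing the node.
  static void process(Node node,int level,StringBuilder out){
    boolean isText = node.getNodeType() == Node.TEXT_NODE;
    //Discard text that is nothing but whitespace
    if(isText && node.getNodeValue().trim().isEmpty())return;
    for(int cnt = 0; cnt < level; cnt++)out.append('-');
    if(isText){
      out.append("PCDATA ").append(node.getNodeValue().trim());
    }else if(node.getNodeType() == Node.ELEMENT_NODE){
      out.append("ELEMENT ").append(node.getNodeName());
    }else{
      out.append(node.getNodeName());
    }//end else
    out.append('\n');
  }//end process

  //Parses the text and returns the indented tree.
  static String dump(String xml){
    try{
      Node root = DocumentBuilderFactory.newInstance()
        .newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)))
        .getDocumentElement();
      StringBuilder out = new StringBuilder();
      process(root,0,out);
      traverse(root,0,out);
      return out.toString();
    }catch(Exception e){
      throw new RuntimeException(e);
    }//end catch
  }//end dump

  public static void main(String[] args){
    System.out.print(dump("<TR><TD>Row 1, Col 1</TD> "
                        + "<TD>Row 1, Col 2</TD></TR>"));
  }//end main
}//end class
```

As in the original, a leaf returns immediately from the recursive call because it has no children to iterate over.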

The next fragment shows the method where the code to process the data would normally be placed.

In this simple program, the only processing being done is to extract information about the elements and the attributes and to display the information in an indented tree format.

The MSXML parser has an element type WHITESPACE that represents all of the extraneous whitespace between elements in the raw XML file.  Normally, this is not of interest.

This processing algorithm purposely discards all elements of the WHITESPACE type to reduce clutter in the display.

A reference to the Element object representing the element to be processed in this method is passed in as the first parameter.

The recursion level is passed in as the second parameter.

The recursion level is used to display the required amount of indentation. Each level of indentation is indicated by one '-' character in the output to make the indentation easy to see on the screen.

This is a fairly long method so we will break it up and discuss it in parts.

The first fragment from this method shows the method signature along with an initial if statement used to bypass initial processing if the type of the element is WHITESPACE.

The getType() method is invoked on the Element object to get the type of the element.  (Here we are talking about element and type in the sense of the XML markup.) This type is then compared with a class constant of the Element class to determine if the type is WHITESPACE.  If it is not WHITESPACE, a for loop is used to provide the initial indentation in the output consistent with the incoming level parameter.
 

  void processAnElement(
                       Element elementToProcess,int level){

    if(elementToProcess.getType() != Element.WHITESPACE)
      for(int cnt = 0; cnt < level; cnt++)prnt("-");

There are a large number of possible element types identified by the parser, all of which exist as class constants in the Element class.

The next code fragment uses a switch statement to switch to the proper code to process each of the possible types provided by the MSXML parser.

In this simple case, the switch statement does nothing more than print the type.  Other information about the element, such as the name of the tag and the names and values of the attributes, is printed by the code that follows the switch statement.

    switch(elementToProcess.getType()){
      case Element.CDATA:prnt("CDATA ");break;
      case Element.COMMENT:prnt("COMMENT ");break;
      case Element.DOCUMENT:prnt("DOCUMENT ");break;
      case Element.DTD:prnt("DTD ");break;
      case Element.ELEMENT:prnt("ELEMENT ");break;
      case Element.ELEMENTDECL:prnt("ELEMENTDECL ");break;
      case Element.ENTITY:prnt("ENTITY ");break;
      case Element.ENTITYREF:prnt("ENTITYREF ");break;
      case Element.IGNORESECTION:prnt("IGNORESECTION ");
                                                  break;
      case Element.INCLUDESECTION:prnt("INCLUDESECTION ");
                                                    break;
      case Element.NAMESPACE:prnt("NAMESPACE ");break;
      case Element.NOTATION:prnt("NOTATION ");break;
      case Element.PCDATA:prnt("PCDATA ");break;
      case Element.PI:prnt("PI ");break;
      case Element.WHITESPACE:break;//Ignore whitespace
      default: System.out.println("default ");
    }//end switch
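
If you don't have the MSXML parser available, the same kind of switch can be sketched against the standard org.w3c.dom.Node interface that ships with the JDK.  Be aware that this is a different API from the one used in this lesson.  Its node-type constants are similar in spirit but not identical to those of MSXML.  For example, the standard DOM reports text as TEXT_NODE and has no separate WHITESPACE type.  The class name NodeTypeSketch and the labels that it returns are my own choices for illustration.

```java
import org.w3c.dom.Node;

class NodeTypeSketch {
  // Map a standard DOM node-type code to a short display label.
  // The labels are arbitrary; only the Node constants come from
  // the DOM API.
  static String typeName(short type) {
    switch (type) {
      case Node.CDATA_SECTION_NODE: return "CDATA";
      case Node.COMMENT_NODE: return "COMMENT";
      case Node.DOCUMENT_NODE: return "DOCUMENT";
      case Node.DOCUMENT_TYPE_NODE: return "DTD";
      case Node.ELEMENT_NODE: return "ELEMENT";
      case Node.ENTITY_NODE: return "ENTITY";
      case Node.ENTITY_REFERENCE_NODE: return "ENTITYREF";
      case Node.NOTATION_NODE: return "NOTATION";
      case Node.PROCESSING_INSTRUCTION_NODE: return "PI";
      case Node.TEXT_NODE: return "TEXT";
      default: return "OTHER";
    }
  }

  public static void main(String[] args) {
    System.out.println(typeName(Node.ELEMENT_NODE));
    System.out.println(typeName(Node.COMMENT_NODE));
  }
}
```

Returning the label instead of printing it inside each case keeps the switch free of break statements and makes the mapping easy to test.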

Once the switch statement has been executed to display the type of element, control moves on to code that is used to display additional information about the element and its attributes.

At this point, we have three possible decisions to make:

  1. Identify the element as WHITESPACE and continue to ignore it.
  2. Identify the element as type PCDATA and display the text value of the element contained in the PCDATA.
  3. Identify the element as neither of the above; get and display the name of the tag that represents the element.

    if(elementToProcess.getType() == Element.PCDATA)
      System.out.println(elementToProcess.getText());
    else if(elementToProcess.getType() 
                                     != Element.WHITESPACE)
      System.out.println(
                      elementToProcess.getTagName() + " ");

That completes the processing of the information that represents the element proper.

Now we need to process the attributes for this Element object. The scheme should be a familiar one by now.

We start by getting an Enumeration object describing the attributes of this Element object.

We then loop and process each attribute on the list of attributes represented by the Enumeration object.  Once more, note that the method named hasMoreElements() refers to the generic elements of an Enumeration object.  It does not refer to the type named Element in the MSXML package.

Inside the loop, we get and downcast each attribute element from type Object to type Attribute which is a class of the MSXML package.

A for loop inside the while loop is used to provide the proper amount of indentation based on the recursion level counter discussed earlier.

Finally, the getName() and getValue() methods of the Attribute class are used to get and display the name and value of each attribute.

Except for one completely trivial method named prnt() used to send strings to the standard output device, that completes the discussion of the program.
 

    Enumeration attrEnum = 
                          elementToProcess.getAttributes();

    while(attrEnum.hasMoreElements()){
      Attribute attribute = 
                         (Attribute)attrEnum.nextElement();
      for(int cnt = 0; cnt < level; cnt++)prnt("-");

      prnt("**Attr Name=" + attribute.getName() + " ");
      System.out.println("Value=" + attribute.getValue());

    }//end while loop
  }//end processAnElement()
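
The Enumeration pattern used above can be tried without the MSXML package.  The following is a minimal sketch using only java.util.  The class named NameValue is a hypothetical stand-in for the MSXML Attribute class, and AttributeEnumSketch is a name that I made up for illustration.  The sketch shows the same three steps: get an Enumeration object, downcast each item from type Object, and indent according to the level.

```java
import java.util.Enumeration;
import java.util.Vector;

// Hypothetical name/value holder standing in for the MSXML
// Attribute class.
class NameValue {
  final String name;
  final String value;

  NameValue(String name, String value) {
    this.name = name;
    this.value = value;
  }
}

class AttributeEnumSketch {
  // Return one indented line of text for each attribute.
  static String describe(Vector<NameValue> attrs, int level) {
    StringBuilder out = new StringBuilder();
    // The wildcard makes nextElement() return type Object, which
    // mirrors the downcast required in the program above.
    Enumeration<?> attrEnum = attrs.elements();
    while (attrEnum.hasMoreElements()) {
      NameValue attribute = (NameValue) attrEnum.nextElement();
      for (int cnt = 0; cnt < level; cnt++) out.append('-');
      out.append("**Attr Name=").append(attribute.name).append(' ');
      out.append("Value=").append(attribute.value).append('\n');
    }
    return out.toString();
  }

  public static void main(String[] args) {
    Vector<NameValue> attrs = new Vector<>();
    attrs.add(new NameValue("id", "42"));
    attrs.add(new NameValue("lang", "en"));
    System.out.print(describe(attrs, 2));
  }
}
```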

The following code fragment is a convenience method named prnt() that invokes the System.out.print method to display the String object passed in as a parameter.  This method was used to reduce the clutter in the switch statement in the processAnElement() method.
 

  void prnt(String data){
    System.out.print(data);
  }// end method prnt()

The last code fragment shows the closing brace on the controlling class for this program.  A complete unfragmented listing of the program is provided in the next section.
 

}//end class XMLParse01

Complete Program Listing

This section contains a complete listing of the program with extensive comments.
 

/*File XMLParse01.java
Revised 5/2/98
This program will parse an XML file with or without a DTD
and display an indented element tree on the standard 
output device identifying all of the elements along with
the attributes and their values for those elements.

If there is no DTD provided, the validation step is 
skipped.

If a DTD is provided, the XML file is validated.

Validation or parsing errors are displayed on the standard
output device.

This program uses a hard-coded file name for the XML file
but it wouldn't be difficult to upgrade it to use a command
line argument for the file name.

This program uses the Microsoft XML parser to parse the
XML file and build the tree.  The code added by this author
traverses the tree and formats it for display on the 
standard output device.

Tested using JDK 1.1.5 and MSXML parser V1.8 under Win95
**********************************************************/

import java.util.*;
import java.net.URL;

//Import MSXML parser classes
import com.ms.xml.om.*;
import com.ms.xml.util.*;
import com.ms.xml.parser.*;

class XMLParse01{//controlling class

  public static void main(String[] args){  
    //Instantiate an object of this type
    XMLParse01 thisObj = new XMLParse01();
    
    //Instantiate an object of the Document class which is
    // one of the classes in the MSXML package.  
    Document doc = new Document();
    
    //Specify the file to be parsed on the local hard drive
    // as a String object compatible with the constructor
    // for a URL object
    String urlName = 
                   "file:///c:/Baldwin/JavaProg/Combined/"
                                   +"Java/XMLParse01a.xml";

    //Load the specified document using the loadDocument()
    // method defined below.  Successful load will result
    // in a Document object containing an element tree
    // being referenced by the reference variable named
    // doc.
    thisObj.loadDocument(urlName,doc);

    //Now begin processing the Document object.
    
    //Use MSXML method to get the root element of the
    // element tree.
    Element theRoot = doc.getRoot();
    
    //Invoke the processAnElement() method to display the
    // root element. In this simple case, the processing
    // method is simply a method that displays the element
    // nodes in an indented tree on the standard output
    // device.
    thisObj.processAnElement(theRoot,0);
    
    //Recursively traverse the element tree extracting
    // and displaying information about the elements and
    // their attributes along the way.
    thisObj.traverseTheTree(theRoot,0);
    
    //Display termination message
    System.out.println("end");
  }//end main
  //-----------------------------------------------------//

  //This method receives a reference to a MSXML Document
  // object along with a String object containing the
  // identification of an XML document file compatible
  // with the URL constructor. When this method returns
  // successfully, an object containing an element tree
  // describing the XML document will be referenced by
  // the Document reference variable created in main()
  // and passed to this method as a parameter.
  void loadDocument(String urlName,Document doc){
      try {
        //Instantiate a URL object
        URL url = new URL(urlName);
        //Invoke the MSXML load method on the MSXML 
        // Document object passing the URL object as
        // a parameter.  If successful, this creates
        // the tree of Element objects.
        doc.load(url);
      }//end try
      catch (ParseException e) {
        System.out.println("ParseException");
        //Use the reportError() method of the MSXML package
        // to report on the type of error.  This is 
        // invoked whenever the XML file cannot be 
        // validated or parsed.
        doc.reportError(e, System.out);
      }//end catch
      catch(Exception e){
        //Catch and display any top-level exceptions
        System.out.println("Exception");
        System.out.println(e);
      }//end catch
  }//end loadDocument()
  //-----------------------------------------------------//

  //This method implements a recursive algorithm to 
  // traverse the element tree which begins with the 
  // Element object passed in as theRoot.  The parentLevel
  // parameter is used to keep count of the recursion level
  // which is used to properly indent the material being
  // displayed.
  void traverseTheTree(Element theRoot,int parentLevel){
    //Increment parentLevel to get new recursion level
    int thisRecursionLevel = ++parentLevel;
    
    //Get an Enumeration object that can be used to
    // iterate on all of the Element nodes at this level.
    // Note that getElements() is a method of the MSXML
    // Element type.  The methods hasMoreElements() and
    // nextElement() used below are declared in the
    // Enumeration interface and have nothing to do with
    // the MSXML type named Element.  It is coincidental
    // that we are using them to iterate on an Enumeration
    // object representing objects of type Element.
    Enumeration elemEnum = theRoot.getElements();

    //Create a reference variable to refer to each
    // Element object as we iterate on the list of Element
    // nodes at this level.
    Element refToNextElement;
    while(elemEnum.hasMoreElements()){
      //Get and process the next element in the list of
      // Element nodes at this level.  Downcast from Object
      // to type Element.  Element is a class of the MSXML
      // package.
      refToNextElement = (Element)elemEnum.nextElement();
      
      //Invoke the processAnElement() method to display
      // information about this Element node in the tree.
      processAnElement(refToNextElement,thisRecursionLevel);
      
      //Recursively process the children of this Element
      // node.  If this is a leaf node with no children,
      // this call will return immediately because the
      // conditional (elemEnum.hasMoreElements()) in the
      // while loop will be false on the first attempt
      // to iterate.
      traverseTheTree(refToNextElement,thisRecursionLevel);
    }//end while loop
  }//end traverseTheTree
  //-----------------------------------------------------//

  //This is where the code to process the data would
  // normally be placed.  In this simple program, the
  // only processing is to extract information and display
  // it in an indented tree format.  The MSXML parser has
  // an element type WHITESPACE that represents all of the
  // extraneous whitespace between elements in the XML
  // file.  This display algorithm purposely discards
  // all elements of that type to reduce clutter in the
  // display. The element to be processed in this method
  // is passed in as the first parameter.  The recursion
  // level is passed in as the second parameter.  It is
  // used to display the required amount of indentation.
  // Each level of indentation is indicated by one '-'
  // in the output to make it easy to see on the screen.
  void processAnElement(
                       Element elementToProcess,int level){
    //Ignore whitespace
    if(elementToProcess.getType() != Element.WHITESPACE)
      //Display the required indentation indicator
      for(int cnt = 0; cnt < level; cnt++)prnt("-");
      
    //Use a switch statement to switch to the proper
    // code to process each of the possible types
    // provided by the MSXML parser. In this simple case,
    // just print the type.  Then jump to the special
    // code provided outside the switch statement for
    // processing type PCDATA.
    switch(elementToProcess.getType()){
      case Element.CDATA:prnt("CDATA ");break;
      case Element.COMMENT:prnt("COMMENT ");break;
      case Element.DOCUMENT:prnt("DOCUMENT ");break;
      case Element.DTD:prnt("DTD ");break;
      case Element.ELEMENT:prnt("ELEMENT ");break;
      case Element.ELEMENTDECL:prnt("ELEMENTDECL ");break;
      case Element.ENTITY:prnt("ENTITY ");break;
      case Element.ENTITYREF:prnt("ENTITYREF ");break;
      case Element.IGNORESECTION:prnt("IGNORESECTION ");
                                                  break;
      case Element.INCLUDESECTION:prnt("INCLUDESECTION ");
                                                    break;
      case Element.NAMESPACE:prnt("NAMESPACE ");break;
      case Element.NOTATION:prnt("NOTATION ");break;
      case Element.PCDATA:prnt("PCDATA ");break;
      case Element.PI:prnt("PI ");break;
      case Element.WHITESPACE:break;//Ignore whitespace
      default: System.out.println("default ");
    }//end switch

    //Special code for processing type PCDATA
    if(elementToProcess.getType() == Element.PCDATA)
      //Get and display the text value of type PCDATA
      System.out.println(elementToProcess.getText());
    else if(elementToProcess.getType() 
                                     != Element.WHITESPACE)
      //If not PCDATA and not WHITESPACE, get and display
      // the name of the tag for this element
      System.out.println(
                      elementToProcess.getTagName() + " ");
      
    //Now process the attributes for this Element object.
    //Get an Enumeration object describing the attributes
    // of this Element object.
    Enumeration attrEnum = 
                          elementToProcess.getAttributes();
    
    //Loop and process each attribute individually.  Note
    // the method hasMoreElements() refers to generic 
    // elements and does not refer to type Element as
    // used elsewhere in this program.
    while(attrEnum.hasMoreElements()){
      //Get and downcast the next attribute element in the
      // list of attribute elements.
      Attribute attribute = 
                         (Attribute)attrEnum.nextElement();
                         
      //Indent the proper amount for the display
      for(int cnt = 0; cnt < level; cnt++)prnt("-");
      
      //Get and display the name and value of each 
      // attribute in the list of attributes for this 
      // object of type Element in the tree of objects of
      // type Element.
      prnt("**Attr Name=" + attribute.getName() + " ");
      System.out.println("Value=" + attribute.getValue());

    }//end while loop
  }//end processAnElement()
  //-----------------------------------------------------//
  
  //This is simply a convenience method that invokes the
  // System.out.print method to display the String object
  // passed in as a parameter.  This method was used to 
  // reduce the clutter in the switch statement in the
  // processAnElement() method
  void prnt(String data){
    System.out.print(data);
  }// end method prnt()
}//end class XMLParse01
//-------------------------------------------------------//
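
For readers who cannot obtain the MSXML parser, here is a rough sketch of a similar program written against the standard DOM classes that ship with the JDK (javax.xml.parsers and org.w3c.dom).  This is a different API from the one used in the listing above, so it should be viewed as an approximation and not as a translation.  The class name DomTreeSketch and its method names are my own.  The sketch parses XML from a String instead of a file, does not validate against a DTD, and treats text that contains nothing but whitespace as the equivalent of the WHITESPACE type.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomTreeSketch {
  // One '-' character per level of recursion.
  static String indent(int level) {
    StringBuilder sb = new StringBuilder();
    for (int cnt = 0; cnt < level; cnt++) sb.append('-');
    return sb.toString();
  }

  // Append a description of this node, then recurse on its children.
  static void render(Node node, int level, StringBuilder out) {
    if (node.getNodeType() == Node.TEXT_NODE) {
      String text = node.getNodeValue().trim();
      if (text.isEmpty()) return; // ignore whitespace-only text
      out.append(indent(level)).append("TEXT ").append(text)
         .append('\n');
    } else {
      out.append(indent(level)).append(node.getNodeName())
         .append('\n');
      // getAttributes() returns null unless the node is an element.
      NamedNodeMap attrs = node.getAttributes();
      if (attrs != null) {
        for (int i = 0; i < attrs.getLength(); i++) {
          Node attr = attrs.item(i);
          out.append(indent(level)).append("**Attr Name=")
             .append(attr.getNodeName()).append(" Value=")
             .append(attr.getNodeValue()).append('\n');
        }
      }
    }
    NodeList children = node.getChildNodes();
    for (int i = 0; i < children.getLength(); i++) {
      render(children.item(i), level + 1, out);
    }
  }

  // Parse the XML text and return the indented tree as a String.
  static String renderXml(String xml) {
    try {
      DocumentBuilder builder =
          DocumentBuilderFactory.newInstance().newDocumentBuilder();
      Document doc = builder.parse(new ByteArrayInputStream(
          xml.getBytes(StandardCharsets.UTF_8)));
      StringBuilder out = new StringBuilder();
      render(doc.getDocumentElement(), 0, out);
      return out.toString();
    } catch (Exception e) {
      // Parsing and configuration errors are rethrown unchecked.
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    String xml = "<book id=\"1\">\n  <title>XML</title>\n</book>";
    System.out.print(renderXml(xml));
  }
}
```

Unlike the MSXML program, this sketch handles the root and its descendants in a single recursive method, so there is no separate call to process the root before the traversal begins.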

-end-