What is SAX, Part 3

by Richard G. Baldwin
baldwin@austin.cc.tx.us
Baldwin's Home Page

Dateline: 06/14/99

prolog

In Part 2 of this series of articles on SAX, I promised to show you how to write a Java program that uses XML4J to parse a simple XML document.

I promised that the program would deliver a series of events to the appropriate event handler methods as the parser traverses the XML document, and that the event handler methods would extract and display information about the XML document.

This article

Discusses the general aspects of the program
Shows the output
Discusses the output

I will continue the discussion in the next article where I will show the actual Java code used to parse the XML file and to respond to and handle SAX parser events.

the XML File, a short book of poems

The XML file is shown below:

<?xml version="1.0"?>

<bookOfPoems>

<poem PoemNumber="1" 
      DummyAttribute="dummy value">
<line>Roses are red,</line>
<line>Violets are blue.</line>
<line>Sugar is sweet,</line>
<line>and so are you.</line>
</poem>

<poem PoemNumber="2"
      DummyAttribute="dummy value">
<line>Twas the night before Christmas,</line>
<line>And all through the house,
<line>Not a creature was stirring,</line>
<line>Not even a mouse.</line>
</poem>

</bookOfPoems>

As you can see from the above listing, the XML file used with this sample program represents the rudimentary aspects of a book of poems. It contains one verse each from two well-known poems.

the element structure

Sometimes I find it easier to visualize the overall element structure of an XML document by removing everything but the tags. The following is a representation of the element structure with the attributes and the content of each element removed.

<?xml version="1.0"?>
  <bookOfPoems>
    <poem>
      <line></line>
      <line></line>
      <line></line>
      <line></line>
    </poem>
    <poem>
      <line></line>
      <line>
      <line></line>
      <line></line>
    </poem>
  </bookOfPoems>
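A tags-only outline of this kind can also be produced mechanically. The following is a minimal sketch, not the program discussed in this article: it assumes the SAX2 DefaultHandler class and the JAXP SAXParserFactory that ship with current JDKs rather than XML4J, and the class and method names (ElementOutlineSketch, outline) are hypothetical. Because it relies on a parser, it works only on well-formed input, so it is shown here with a well-formed fragment.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical name; a sketch that prints only the element tags, indented by depth.
class ElementOutlineSketch {

    static String outline(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            private int depth = 0;
            // True when the most recent event was a start tag (no children seen yet).
            private boolean justOpened = false;

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                if (justOpened) {
                    out.append('\n');          // parent has a child: finish its line
                }
                indent();
                out.append('<').append(qName).append('>');
                depth++;
                justOpened = true;
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                depth--;
                if (!justOpened) {
                    indent();                  // end tag of an element with children
                }
                out.append("</").append(qName).append(">\n");
                justOpened = false;
            }

            private void indent() {
                for (int i = 0; i < depth; i++) {
                    out.append("  ");
                }
            }
        };
        // Attributes and character data are ignored, so only the tags remain.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<bookOfPoems><poem PoemNumber=\"1\">"
                + "<line>Roses are red,</line><line>Violets are blue.</line>"
                + "</poem></bookOfPoems>";
        System.out.print(outline(xml));
    }
}
```

Elements with no child elements are written on a single line, which matches the layout of the hand-made outline above.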

the first poem was correct

The XML markup for the first poem was syntactically correct.

an XML syntax error

A syntax error was purposely introduced into the second poem to illustrate the error-handling capability of SAX and the IBM parser.

The error is in the second line element of the second poem in the listing shown above (the line that reads And all through the house,). That element is missing its end tag (</line>).

handling parser events and errors

This program uses the IBM Parser for Java (XML4J) along with the XML file shown above to illustrate the trapping and handling of parser events along with customized error handling.

purpose of the program

The purpose of the program was to

Traverse the XML file
Display the elements
Display the attributes
Display the text of the poems

As mentioned earlier, the first poem had the correct XML syntax. The second poem was purposely missing an end tag midway through the poem.

processing results

The program was tested using JDK 1.2 from Sun under Win95 using the XML4J version 2.0 parser from IBM.

I manually inserted some line breaks to force the output material shown below to fit in this format. I also deleted some blank lines to reduce the overall size of the output listing.

The first part of the output from the program is shown below. This part deals only with the beginning of the Document element, the beginning of the bookOfPoems element, and the first poem element. A later section of output deals with the remainder of the XML file.

If you compare this output with the raw XML document shown above, you will see that the first poem was parsed and displayed successfully. The output produced by the program included

The beginning and ending of each element
The element names
The attribute values for the elements
The contents of each element (the text of the poem)

Start Document
Start element: bookOfPoems

Start element: poem
Attribute: PoemNumber, Value = 1, Type = CDATA
Attribute: DummyAttribute, Value = dummy value, 
           Type = CDATA

Start element: line
Roses are red,
End element: line

Start element: line
Violets are blue.
End element: line

Start element: line
Sugar is sweet,
End element: line

Start element: line
and so are you.
End element: line

End element: poem

Each portion of output was the result of an event handler being invoked by the parser. Each event handler extracted and displayed information about that portion of the XML document with which it was concerned when it was invoked.

For example, the first line of output that reads Start Document was the result of the parser detecting the beginning of the document and invoking the appropriate event handler.
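To make the idea of event handlers concrete, here is a minimal sketch of a handler that produces output in roughly the same shape as the listing above. It is not the program discussed in this article (that code is the subject of the next article). It assumes the SAX2 DefaultHandler class and the JAXP SAXParserFactory bundled with current JDKs instead of XML4J, and the names SaxTraceSketch and trace are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical name; a sketch of SAX event handling, not the article's program.
class SaxTraceSketch {

    // Parses the XML text and returns one line of trace per SAX event.
    static String trace(String xml) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startDocument() {
                out.append("Start Document\n");
            }

            @Override
            public void endDocument() {
                out.append("End Document\n");
            }

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                out.append("Start element: ").append(qName).append('\n');
                for (int i = 0; i < atts.getLength(); i++) {
                    out.append("Attribute: ").append(atts.getQName(i))
                       .append(", Value = ").append(atts.getValue(i))
                       .append(", Type = ").append(atts.getType(i))
                       .append('\n');
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                out.append("End element: ").append(qName).append('\n');
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver text in several chunks; this sketch
                // simply prints each non-blank chunk on its own line.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    out.append(text).append('\n');
                }
            }
        };
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<bookOfPoems><poem PoemNumber=\"1\">"
                + "<line>Roses are red,</line></poem></bookOfPoems>";
        System.out.print(trace(xml));
    }
}
```

The parser, not the program, decides when each method is called. The program only supplies the methods and records what it is told.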

The output shown above includes the result of detecting both the beginning and the end of each element, with two exceptions: for the Document element and the bookOfPoems element, only the beginning is shown.

The endings of the Document and bookOfPoems elements are not shown above because, as mentioned earlier, this output does not describe the entire document. This output only describes the beginning of the Document, the beginning of the bookOfPoems element, and the first poem element. Additional output is shown later.

handling the XML syntax error

As mentioned earlier, a syntax error was purposely introduced into the second poem in the XML file. The second poem was displayed as shown below. (This output is a continuation of the output shown above.)

In the following output, the problem occurs at the line that reads And all through the house, which is not followed by an End element: line entry.

Attribute: PoemNumber, Value = 2, Type = CDATA
Attribute: DummyAttribute, Value = dummy value, 
           Type = CDATA

Start element: line
Twas the night before Christmas,
End element: line

Start element: line
And all through the house,

Start element: line
Not a creature was stirring,
End element: line

Start element: line
Not even a mouse.
End element: line

systemID: 
file:/G:/Baldwin/AA-School/JavaProg/Combined
    /Java/Sax01.xml
[Fatal Error] 
Sax01.xml:17:7: "</line>" expected.
Terminating

Note that a fatal error occurred at the point where the parser was able to determine that the end tag was missing from one of the lines in the poem.

The error was detected, and error processing began, following the last line in the second poem. The output from error processing begins with the line that reads systemID:.

As you can see by comparing the position of the line that lacks an End element: line entry with the position of the systemID: line, this determination was not made until several lines beyond the actual missing tag. A customized error message was produced showing the line number and character number where the error was detected along with the nature of the error.
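Customized error handling of this kind can be sketched as follows. This is an illustration under stated assumptions, not the error handler from the program discussed here: it uses the SAX2 DefaultHandler and SAXParseException classes with the JAXP parser bundled in current JDKs rather than XML4J, the names SaxErrorSketch and parseAndReport are hypothetical, and the wording of the parser's own message differs from one parser to another.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical name; a sketch of customized handling of a fatal parse error.
class SaxErrorSketch {

    // Parses the XML text and returns a report of end-element events
    // followed by details of the first fatal error, if any.
    static String parseAndReport(String xml, String systemId) {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void endElement(String uri, String localName, String qName) {
                out.append("End element: ").append(qName).append('\n');
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                // Report where the parser was when it detected the problem.
                out.append("systemID: ").append(e.getSystemId()).append('\n');
                out.append("[Fatal Error] line ").append(e.getLineNumber())
                   .append(", column ").append(e.getColumnNumber())
                   .append(": ").append(e.getMessage()).append('\n');
                throw e; // stop parsing
            }
        };
        try {
            InputSource source = new InputSource(new StringReader(xml));
            source.setSystemId(systemId); // label used in error reports
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (SAXParseException e) {
            out.append("Terminating\n");
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The second line element is missing its end tag, as in the article.
        String xml = "<poem>\n"
                + "<line>Twas the night before Christmas,</line>\n"
                + "<line>And all through the house,\n"
                + "<line>Not a creature was stirring,</line>\n"
                + "</poem>";
        System.out.print(parseAndReport(xml, "file:///tmp/Sax01.xml"));
    }
}
```

In this sketch the fatal error is reported only when the parser reaches the </poem> end tag, which is later than the line where the end tag is actually missing. The file name passed as the system identifier is made up for the example.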

a non-validating parser was used

This delay in detecting the problem resulted from the fact that no DTD was provided and a non-validating parser was used. (Actually, the XML4J parser was used in its non-validating mode.) Without a DTD, the parser had no way to know that a <line> element may not contain another <line> element, so it initially treated the start tag that appeared ahead of the expected end tag as the beginning of a nested element. It wasn't until the parser later found an end tag that did not match the element that was still open that it was able to determine that there was a missing end tag.

Presumably, if there had been a DTD specifying that <line> tags may not be nested inside <line> tags, a validating parser would have recognized the error as soon as it occurred. If I have the time, I will try to demonstrate this in a subsequent article.

coming attractions...

My plan for the next article is to continue the discussion of this program. I will show you the actual Java code in the program that was used to produce the output shown above.

the XML octopus

Trying to wrap your brain around XML is sort of like trying to put an octopus in a bottle. Every time you think you have it under control, a new tentacle shows up. XML has many tentacles, reaching out in all directions. But that's what makes it fun. As your XML host, I will do my best to lead you to the information that you need to keep the XML octopus under control.

Credits

This HTML page was produced using the WYSIWYG features of Microsoft Word 97. The images on this page were used with permission from the Microsoft Word 97 Clipart Gallery.

About the author

Richard Baldwin is a college professor and private consultant whose primary focus is a combination of Java and XML. In addition to the many platform-independent benefits of Java applications, he believes that a combination of Java and XML will become the primary driving force in the delivery of structured information on the Web.

Richard has participated in numerous consulting projects involving Java, XML, or a combination of the two. He frequently provides onsite Java and/or XML training at the high-tech companies located in and around Austin, Texas. He is the author of Baldwin's Java Programming Tutorials, which has gained a worldwide following among experienced and aspiring Java programmers. He has also published articles on Java Programming in Java Pro magazine.

Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.

-end-