by Richard G. Baldwin
Dateline: 05/16/99
prolog
Baldwin's Home Page
so, what is XML?
the alphabet soup
from whence does XML derive?
what is the purpose of XML?
According to the FAQ mentioned above:
"XML is an abbreviated version of SGML, to make it easier for you to define your own document types, and to make it easier for programmers to write programs to handle them. It omits the more complex and less-used parts of SGML in return for the benefits of being easier to write applications, easier to understand, and more suited to delivery and interoperability over the Web. But it is still SGML, and XML files may still be parsed and validated the same as any other SGML file ... "
should I care?
XML is very important to Web development because it will allow the development of user-defined document types on the Web. It breaks the mold of fixed-format document types imposed by HTML while eliminating much of the complexity associated with SGML.
but, you say, creating a DTD is difficult
While XML simplifies SGML, it retains SGML's ability to let you define your own document types. In addition, unlike SGML, XML does not require a DTD. A document can still have a DTD, but it can also define its own type "on the fly" without one.
This leads us to introduce two important terms (valid documents and well-formed documents) which I will discuss later.
how universal is XML?
"XML will allow groups of people or organizations to create their own customized markup languages for exchanging information in their domain (music, chemistry, electronics, hill-walking, finance, surfing, linguistics, mathematics, knitting, history, engineering, rabbit-keeping etc).
HTML is at the limit of its usefulness as a way of describing information, and while it will continue to play an important role for the content it currently represents, many new applications require a more robust and flexible infrastructure."
caution: XML is case sensitive
Case sensitivity has historically been a stumbling block for many aspiring C and C++ programmers, and it could be just as troublesome for you. (It is only a stumbling block because the language that those programmers learned originally was not case sensitive. Of course, HTML is also not case sensitive.) In short, you need to be concerned about the case of everything that you enter into an XML document.
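To see how unforgiving this is, here is a minimal sketch using the parser in Python's standard library (xml.etree.ElementTree). The helper name is_well_formed and the element names are my own, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the standard-library parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The start-tag and end-tag names match exactly, including case.
print(is_well_formed("<Item>milk</Item>"))   # True

# Same letters, different case: the parser sees two different names.
print(is_well_formed("<Item>milk</item>"))   # False
```

An HTML browser would shrug off the second line. An XML parser will not.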
can I convert an HTML document to an XML document?
the old angle bracket problem
attributes must be in quotes
remember case sensitivity
help is available
one more requirement
"If you have created your HTML files conforming to one of the several HTML Document Type Definitions (DTDs), and they validate OK, then they can be converted as follows:
Replace the DOCTYPE declaration and any internal subset (basically everything within the first set of angled brackets) with the XML Declaration
<?xml version="1.0" standalone="yes"?>"
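The replacement step that the FAQ describes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name replace_doctype and the sample page are mine, the DOCTYPE is assumed to have no internal subset (so it ends at the first closing angle bracket), and the page is assumed to be otherwise well-formed already.

```python
import re
import xml.etree.ElementTree as ET

XML_DECLARATION = '<?xml version="1.0" standalone="yes"?>'


def replace_doctype(html_text):
    """Swap the first DOCTYPE declaration for the XML Declaration.

    Assumes the DOCTYPE has no internal subset, so the declaration
    ends at the first '>' character.
    """
    return re.sub(r"<!DOCTYPE[^>]*>", XML_DECLARATION, html_text,
                  count=1, flags=re.IGNORECASE)


# Hypothetical page: lowercase tags, every element closed.
html_text = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0//EN">\n'
    "<html><head><title>Hello</title></head>"
    "<body><p>Hello, world</p></body></html>"
)

xml_text = replace_doctype(html_text)
root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
print(root.tag)                 # html
```

Most real HTML pages will need more work than this one step before a parser accepts them.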
compatibility with SGML?
"At the moment there are few tools which handle XML files unchanged because of the format of these EMPTY elements, but this is changing. The nsgmls parser has an experimental XML conformance switch, and the first XML-specific editors and parsers are appearing (see the question on software)."
back to the Document Type Definition
"A DTD is usually a file (or several files to be used together) which contains a formal definition of a particular type of document. This sets out what names can be used for elements, where they may occur, and how they all fit together. For example, if you want a document type to describe <LIST>s which contain <ITEM>s, part of your DTD would contain something like
<!ELEMENT item (#PCDATA)>
<!ELEMENT list (item)+>
This defines items containing text, and lists containing items.
It's a formal language which lets processors automatically parse a document and identify where every element comes and how they relate to each other, so that stylesheets, navigators, browsers, search engines, databases, printing routines, and other applications can be used."
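To make the list-and-item idea concrete, here is a hedged sketch that prepends a small internal DTD to two documents and checks them. The parser in Python's standard library does not validate against a DTD, so the function follows_list_rules is a toy check that I wrote to mirror the two declarations. It is not a real validator, and the sample documents are my own.

```python
import xml.etree.ElementTree as ET

# Internal DTD: a list holds one or more items, an item holds text.
DTD = """<!DOCTYPE list [
  <!ELEMENT list (item)+>
  <!ELEMENT item (#PCDATA)>
]>"""


def follows_list_rules(document):
    """Toy check mirroring the two declarations above (not a validator)."""
    root = ET.fromstring(document)  # only checks well-formedness
    children = list(root)
    return (
        root.tag == "list"
        and len(children) >= 1
        and all(child.tag == "item" and len(child) == 0
                for child in children)
    )


good = DTD + "<list><item>bread</item><item>milk</item></list>"
bad = DTD + "<list><item>bread</item><note>urgent</note></list>"

print(follows_list_rules(good))  # True
print(follows_list_rules(bad))   # False
```

Note that the parser accepts both documents without complaint. Only a validating parser would reject the second one on the basis of the DTD.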
Now let's take a closer look at the terms DTD, valid, and well-formed.
the three parts
One file contains the content of the document (words, pictures, etc.). This is the part that the author wants to expose to the client.
A second file is the DTD, which meets the definition given above.
A third file is a stylesheet that establishes how the content that conforms to the DTD is to be rendered on the output device. This is how the author wants the material to be presented to the client.
For example, a tag with an attribute of "red" might cause something to be presented bright red according to one stylesheet and dull red according to another stylesheet. (It might even be presented as some shade of green according to still another stylesheet.)
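The separation of content from presentation can be sketched as follows. Everything here is hypothetical: the element and attribute names, the three dictionaries standing in for stylesheets, and the render function are my own inventions, and a real stylesheet language does far more than this.

```python
import xml.etree.ElementTree as ET

# The content file: says what the text is, not how it looks.
content = '<note color="red">Server down</note>'

# Three stand-in "stylesheets" that disagree about what red means.
bright_sheet = {"red": "#FF0000"}
dull_sheet = {"red": "#8B0000"}
contrary_sheet = {"red": "#008000"}  # a shade of green


def render(xml_text, stylesheet):
    """Produce output markup using whatever the stylesheet dictates."""
    element = ET.fromstring(xml_text)
    shade = stylesheet[element.get("color")]
    return f'<span style="color:{shade}">{element.text}</span>'


print(render(content, bright_sheet))
print(render(content, dull_sheet))
print(render(content, contrary_sheet))
```

The content never changes. Only the stylesheet decides what the client sees.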
the two extremes
XML, the middle ground
According to the FAQ, XML does not require a DTD:
"Full SGML uses a Document Type Definition (DTD) to describe the markup (elements) available in any specific type of document. However, the design and construction of a DTD can be a complex and non-trivial task, so XML has been designed so it can be used either with or without a DTD. DTDless operation means you can invent markup without having to define it formally.
To make this work, a DTDless file in effect `defines' its own markup, informally, by the existence and location of elements where you create them. But when an XML application such as a browser encounters a DTDless file, it needs to be able to understand the document structure as it reads it, because it has no DTD to tell it what to expect, so some changes have been made to the rules."
now for valid
what about well-formed?
According to the FAQ:
"For example, HTML's <IMG> element is defined as `EMPTY': it doesn't have an end-tag. Without a DTD, an XML application would have no way to know whether or not to expect an end-tag for an element, so the concept of `well-formed' has been introduced. This makes the start and end of every element, and the occurrence of EMPTY elements completely unambiguous."
all XML documents must be well-formed
<?xml version="1.0" standalone="yes"?>
(However, a couple of XML parsers that I have used don't seem to enforce this requirement.)
to be well-formed...
All attribute values must be in quotes (apostrophes or double quotes). You can surround the value with apostrophes (single quote) if the attribute value contains a double quote. An attribute value that is surrounded by double quotes can contain apostrophes.
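The quoting rules are easy to try out. The following is a small sketch using the parser in Python's standard library. The helper attribute_value and the sample elements are my own, for illustration only.

```python
import xml.etree.ElementTree as ET


def attribute_value(xml_text, name):
    """Return the named attribute, or None if the text is not well-formed."""
    try:
        return ET.fromstring(xml_text).get(name)
    except ET.ParseError:
        return None


# Double quotes outside, apostrophe inside.
print(attribute_value('<img alt="Baldwin\'s page"/>', "alt"))

# Apostrophes outside, double quotes inside.
print(attribute_value("<img alt='the \"octopus\" problem'/>", "alt"))

# No quotes at all: legal in HTML, rejected in XML.
print(attribute_value("<img width=100/>", "width"))  # None
```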
dealing with empty elements
(Note that an EMPTY element can contain one or more attributes inside the start tag.)
a subtle problem
The problem here is that this element isn't really empty. It contains a newline character.
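A short sketch shows the difference between an element that is truly empty and one that merely looks empty. The element name br and its id attribute are hypothetical, and I am using Python's standard-library parser, which reports an element's character content through the text property.

```python
import xml.etree.ElementTree as ET

# Two ways of writing an empty element, each with an attribute.
short_form = ET.fromstring('<br id="a"/>')
long_form = ET.fromstring('<br id="a"></br>')

# Looks empty, but the start tag and end tag are on separate lines.
with_newline = ET.fromstring('<br id="a">\n</br>')

print(short_form.text is None)   # True
print(long_form.text is None)    # True
print(repr(with_newline.text))   # '\n'
```

A validating parser checking against a DTD that declares the element EMPTY could object to the third form.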
markup characters, keep out
The sequence ]]> must be written as ]]&gt; if it does not occur as the end of a section marked as CDATA.
According to the FAQ:
"Well-formed files with no DTD may use attributes on any element, but the attributes must all be of type CDATA by default.
Well-formed XML files with no DTD are considered to have &lt;, &gt;, &apos;, &quot;, and &amp; predefined and thus available for use even without a DTD.
Valid XML files must declare them explicitly if they use them."
Again, some of the parsers that I have used don't seem to require that these items be predefined even with a DTD.
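Here is a hedged sketch of the five predefined entities at work in a document with no DTD, using Python's standard library. The escape function in xml.sax.saxutils replaces the ampersand and both angle brackets. The entity name author in the last test is one I made up to show what happens with a name that is not predefined.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# All five predefined entities, used without any DTD.
root = ET.fromstring("<t>&lt; &gt; &apos; &quot; &amp;</t>")
print(root.text)  # < > ' " &

# Escaping raw text before placing it inside an element.
raw = "AT&T <fast> ]]> done"
escaped = escape(raw)
print(escaped)    # AT&amp;T &lt;fast&gt; ]]&gt; done

# The parser turns the entities back into the original characters.
round_trip = ET.fromstring("<t>" + escaped + "</t>").text

# A CDATA section keeps markup characters out of the parser's way.
cdata = ET.fromstring("<t><![CDATA[a < b & c]]></t>").text

# An entity that is not predefined and not declared is an error.
try:
    ET.fromstring("<t>&author;</t>")
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
print(undeclared_accepted)  # False
```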
validity and well-formed requirements, recap
XML files must be well-formed, but there is no requirement for them to be valid. A DTD is not required, and without one, validity cannot be established. However, if a file has a DTD, it must conform to that DTD, which makes it valid.
where to find the DTD if used
This statement is known as a Document Type Declaration (as distinguished from a Document Type Definition). This particular Document Type Declaration indicates that the outermost containing element in the document begins with a tag named myDocument.
The format of the statement tells where to find a file containing the DTD. The keyword SYSTEM indicates that the DTD is in a separate or external file. There are other formats as well.
An XML document can have an external DTD as above, an internal DTD (the DTD can simply be prepended onto the document), some combination of the two, or no DTD at all. Regardless, an XML document must always be well-formed.
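As an illustration, here is a sketch of a document whose Document Type Declaration points at an external DTD, read with the minidom parser from Python's standard library. The file name myDocument.dtd is an assumption of mine. The file does not need to exist for this sketch because the parser does not validate and does not fetch the external file.

```python
from xml.dom import minidom

# The DTD file name is hypothetical; nothing is read from disk.
document = (
    '<?xml version="1.0"?>\n'
    '<!DOCTYPE myDocument SYSTEM "myDocument.dtd">\n'
    "<myDocument>Hello</myDocument>"
)

dom = minidom.parseString(document)
doctype = dom.doctype

print(doctype.name)      # myDocument
print(doctype.systemId)  # myDocument.dtd

# The declared name should match the outermost element.
print(dom.documentElement.tagName == doctype.name)  # True
```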
java and XML
Java is a strong contender for writing programs that do useful things with XML documents. An example is available showing how to convert an HTML document to an XML document, and then to use a Java parser from Microsoft to parse and analyze that document.
Richard G. Baldwin
Trying to get your arms around XML is sort of like trying to put an octopus in a bottle. Every time you think you have it under control, a new tentacle shows up. In other words, XML has many tentacles, reaching out in all directions. But, that's what makes it fun. I will discuss many of the tentacles in upcoming articles.
Copyright 2000, Richard G. Baldwin
Richard has participated in numerous consulting projects involving Java, XML, or a combination of the two. He frequently provides onsite Java and/or XML training at the high-tech companies located in and around Austin, Texas. He is the author of Baldwin's Java Programming Tutorials, which has gained a worldwide following among experienced and aspiring Java programmers. He has also published articles on Java Programming in Java Pro magazine.
Richard holds an MSEE degree from Southern Methodist University and has many years of experience in the application of computer technology to real-world problems.