
Let's Move to XML 

This material may seem hard going for an ordinary user, but I recommend that you not skim it: make the effort and read it, and come back to it again if necessary. It is written mainly for developers of Internet applications, but it is now safe to say that the time of XML has come for ordinary Internet and computer users as well. The creation of the Semantic Web (see "Makes Sense - Semantic Web") has finally fixed in the minds of software developers the idea that there is no getting around XML, which means that the programs they create will be as XML-based as possible. And for whom are those programs written? That's right, for you and me. That means we can use XML to the fullest. Don't get off the train; it's not too late.


If you're a Web developer, you have to deal with a variety of technologies for extending the capabilities of your pages: Netscape plug-ins, ActiveX controls, Dynamic HTML, Cascading Style Sheets (CSS), and so on. In a few cases you actually got what you were promised, but mostly these technologies only complicated your life because they behave inconsistently across browsers.

As one of the victims, I have to admit that in the end my reaction to browser extensions was exactly the same as to a migraine: turn off the lights, draw the curtains, lie down, and wait for it to pass.

However, Extensible Markup Language (XML) is a different matter. Like any new technology it takes some learning, but it should not give you a migraine. XML is here to stay. The main thing is that it should make your life easier, not harder.

The most important feature of XML and its accompanying Extensible Stylesheet Language (XSL) technology is the separation of formatting from content. This may seem familiar to anyone who has worked with CSS or with style sheets in Microsoft Word. However, if standard HTML is likened to a photograph of a building, then CSS is a set of instructions telling the photo lab how to process the photo. All the doors can be made red, all the walls pink, and the roof gray. But without access to the blueprint of the building, no fundamental changes can be made. XML, unlike HTML, lets you expose and manipulate the data itself.

HTML at the end of the road

The beauty of XML can only be understood by comparing it to HTML. Formalized in RFC 1866 in 1995 (although it naturally came into use earlier), HTML is the most popular markup language in the world. The term "markup," applied to a document, usually means everything that is not part of its content. For example, when this article was being prepared for publication, the editors of Network Magazine marked it up (using a good old "analog" red pen), inserting comments for the author and instructions for the layout designers on how to format various elements.

Surely all Web users have seen an HTML file in its raw form, where formatting tags are mixed with plain text. (Some of you may recall WordStar, which also used mostly paired tags; in the days of text-only monitors, a document could easily be ruined if you inserted an opening tag to switch to bold or underline and then forgot to turn it off with a closing tag at the end of the word.)

The main feature of HTML markup is, of course, the ability to insert links to external documents or to internal sections of the same document. It is worth noting that although HTML is most often provided by servers over HTTP, it can also be used on a CD-ROM or on a local network. Universal markup languages are not tied to any particular transport.

HTML has succeeded not only as an adaptable markup language but also as middleware (see D. Angel's article "Promotional Software" in this issue of LAN). Because they are cheap and ubiquitous, Web browsers make great clients; through HTML, they can communicate with a wide variety of servers.

However, HTML has run into certain difficulties. Its limited formatting capabilities have been worked around with CSS, Bitstream's TrueDoc initiative, and of course a host of browser-specific extensions, and its limited capabilities as middleware with Java, ActiveX, and so on. None of this, however, eliminates its fundamental shortcomings.

What you see is all you get.

In essence, HTML is a technology for presenting information: it describes how the browser should arrange text and graphics on the page. As a result, "what you see is all you get." There is no way to describe the data independently of how it is displayed (apart from the extremely weak keyword mechanism in the header of the Web page). This is the main reason why it is so hard to find the information you need with a search engine.

The client has no reasonable means of extracting data from a Web page for further processing. With a steady hand you can paste the contents of an HTML table into a spreadsheet, but that is hardly a solution. Furthermore, on any given Web page the client receives only one representation of a particular set of data.

Suppose you're viewing a list of eBay auctions ordered by the opening date of the auction. If you want to look at the same list, but sorted by the closing date of trading, then your browser will have to send a new request to the server. In turn, the server will have to re-send a full HTML page with a list of auctions. This kind of data manipulation leads to a significant increase in the number of accesses to Web servers and thus makes it difficult to further scale them.

Another problem with HTML is that it is a "flat" language, meaning that authors cannot use it to convey information about the hierarchy of the data. It is also inconsistent, which makes it difficult for software to parse. For example, while most opening tags (such as <B> or <H1>) require a corresponding closing tag, some (such as <P>) do not.

A simple solution to some of these problems would be to introduce additional HTML tags, such as <NAME>, <DATE>, or <PRICE>. With their help, the client could determine what the data is and display it in different ways or export it at the user's request. History, however, shows that introducing additional HTML tags can take years; consensus on what they should mean is rarely reached quickly, if at all. If you decide not to wait for the standard to change, keep in mind that you are creating something nonstandard of your own and thereby giving up one of the main advantages of HTML.

Therefore, in 1996, members of the World Wide Web Consortium (W3C, http://www.w3.org) working group returned to the consideration of the Standard Generalized Markup Language (SGML), of which HTML is a highly simplified descendant. Proposed in 1974 by Charles Goldfarb, SGML is a metalanguage – a system for describing other languages. For all its capabilities, it is too complex for most Web browsers: the SGML specification alone spans over 500 pages.

Simplifying SGML for use on the Web, the group proposed XML, which became a W3C Recommendation in February 1998. XML is a subset of SGML: any valid XML document is also a valid SGML document. And, like SGML, XML is a metalanguage for defining other markup languages for specific purposes. For example, the Synchronized Multimedia Integration Language (SMIL) is based on XML.

XML can be used to mark up ordinary documents in much the same way as HTML. However, XML excels at working with structured data, such as query results, meta-information about a Web site, or the elements and types of a schema.

An XML document looks a lot like HTML. It also consists of text fragments annotated with tags in angle brackets. However, unlike HTML, tag names are case-sensitive, and every opening tag must have a matching closing tag.

Calling a Spade a Spade

Rather than restricting the author to a fixed set of tags, XML lets the author invent whatever names seem useful. This capability is the key to actively manipulating data. As an example, compare how a list of names and addresses looks in HTML with how it is represented in XML.

Here's a snippet of HTML:

<H1>Editor Contacts</H1>
<H2>Name: Jonathan Angel</H2>
<P>Title: senior editor</P>
<P>Publication: Network Magazine</P>
<P>Street Address: 600 Harrison</P>
<P>City: San Francisco</P>
<P>State: California</P>
<P>Zip: 94107</P>
<P>Email: coder@yourdomain.com</P>

Tags place data on the screen, but don't tell you anything about its structure. Of course, you can figure out their structure yourself and even insert a long list of records into a spreadsheet, but what happens if one of the records doesn't contain lines with an email address or the street and city names are mixed up?

In the case of XML, the same fragment will be represented as follows (and saved in the EDITORS.XML file):

<?xml version="1.0" ?>
<editor_contacts>
  <editor>
    <first_name>Jonathan</first_name>
    <last_name>Angel</last_name>
    <title>senior editor</title>
    <publication>Network Magazine</publication>
    <address>
      <street>600 Harrison</street>
      <city>San Francisco</city>
      <state>California</state>
      <zip>94107</zip>
    </address>
    <e_mail>coder@yourdomain.com</e_mail>
  </editor>
</editor_contacts>

XML, which is only slightly more verbose than HTML, makes it much easier to determine what the data fields are and where they are located. In XML, tags cannot overlap as they can in HTML (where overlapping is discouraged but tolerated by most HTML parsers). However, they can be nested within each other. In fact, nesting is encouraged as a way to create a hierarchy of data (parent-child or sibling relationships). As you can see from the example above, elements such as <first_name> and <e_mail> contain data, while others (<address>) are present for structuring purposes only.
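To make this concrete, here is a minimal Python sketch of how a program might pull named fields out of the EDITORS.XML record. It assumes the record is available as a string and uses the standard library's xml.etree.ElementTree; the variable names and the absent fax field are illustrative assumptions, not part of the original example.

```python
import xml.etree.ElementTree as ET

# Same record as the EDITORS.XML example, inlined so the sketch is self-contained.
EDITORS_XML = """<?xml version="1.0" ?>
<editor_contacts>
  <editor>
    <first_name>Jonathan</first_name>
    <last_name>Angel</last_name>
    <title>senior editor</title>
    <publication>Network Magazine</publication>
    <address>
      <street>600 Harrison</street>
      <city>San Francisco</city>
      <state>California</state>
      <zip>94107</zip>
    </address>
    <e_mail>coder@yourdomain.com</e_mail>
  </editor>
</editor_contacts>"""

root = ET.fromstring(EDITORS_XML)
editor = root.find("editor")

# Fields are addressed by name and by position in the hierarchy,
# not by where they happen to appear on the page.
last_name = editor.findtext("last_name")
city = editor.findtext("address/city")

# A field that is absent (hypothetical <fax>) yields a default
# instead of shifting the remaining fields.
fax = editor.findtext("fax", default="(none)")

print(last_name, city, fax)
```

Because each value is looked up by element name, a record with a missing e-mail address or reordered fields would not throw off the other lookups.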

Element start and end tags are the main kind of markup used in XML, but not the only one. For example, elements can be assigned attributes. This capability is similar to HTML, where, for example, the <table> element can be assigned the align="center" attribute. In XML, an element can have one or more attributes associated with it, and when composing a document you can invent as many as you want, for example, <publication topic="networking" circulation="controlled">.
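As a rough illustration of how a program reads such attributes, the following Python sketch parses the publication element with the standard library's xml.etree.ElementTree. The element content and the frequency attribute are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Attribute names come from the example above; the element text is illustrative.
snippet = (
    '<publication topic="networking" circulation="controlled">'
    "Network Magazine</publication>"
)
publication = ET.fromstring(snippet)

topic = publication.get("topic")                    # one attribute by name
all_attributes = dict(publication.attrib)           # every attribute as a dict
missing = publication.get("frequency", "unknown")   # hypothetical attribute, with a default

print(topic, all_attributes, missing)
```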

XML documents can contain entity references. A reference is a string that begins with an ampersand and ends with a semicolon. Among other things, these references let you insert special characters that would otherwise confuse the parser. For example, to put a less-than sign (<) in a document, you insert the reference &lt;, and to insert the ampersand itself, the reference &amp;, and so on. So far, everything is the same as in HTML. However, XML entity references are much more powerful, because they can refer to author-defined sections of text in the same document or in a different document.

For example, entity references allow you to take an object-oriented approach when assembling a journal article:

<article>
&introduction;
&body;
&sidebar;
&conclusion;
&resources;
</article>
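The following Python sketch shows, under simplifying assumptions, how a parser expands both the predefined references and author-defined ones. The entities here are declared in an internal DTD subset and hold short made-up strings rather than whole article sections; xml.etree.ElementTree is from the standard library, and the intro, note, and end element names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Two author-defined entities are declared in the internal DTD subset;
# their replacement text is invented for this sketch.
DOC = """<?xml version="1.0"?>
<!DOCTYPE article [
  <!ENTITY introduction "XML separates content from formatting.">
  <!ENTITY conclusion "XML is here to stay.">
]>
<article>
  <intro>&introduction;</intro>
  <note>5 &lt; 7 &amp; 7 &gt; 5</note>
  <end>&conclusion;</end>
</article>"""

article = ET.fromstring(DOC)

# The parser substitutes each reference before the program sees the text.
intro_text = article.findtext("intro")
note_text = article.findtext("note")
end_text = article.findtext("end")

print(intro_text)
print(note_text)
print(end_text)
```

Entities that point to external files work on the same principle, but whether a given parser fetches them is implementation-dependent, so this sketch sticks to internal ones.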

Other types of XML markup are comments (delimited just as in HTML) and processing instructions. Processing instructions begin with <? and end with ?>. They tell the parsing software how to interpret a particular document or section of it. For example, the statement <?xml version="1.0"?> tells the XML parser that the document being processed is indeed written in XML. On the other hand, the <?rtf \page?> statement is used to invoke an RTF processor and insert a page break.

Finally, CDATA sections are parts of a document that are treated purely as character data and are not parsed. They look like this:

<![CDATA[

This text, even if it contains HTML elements such as <B>fast</B> or <H1>heading</H1>, is not parsed. Instead, it appears as is.

]]>
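A short Python sketch can confirm this behavior. It assumes a hypothetical note element wrapping the CDATA section and uses the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# The <note> wrapper is a made-up element; the CDATA content mimics the example.
doc = "<note><![CDATA[This is <B>fast</B> & unparsed]]></note>"
note = ET.fromstring(doc)

text = note.text          # the raw characters, angle brackets included
child_count = len(note)   # no child elements were created from <B>...</B>

# When written back out, the same characters are escaped instead.
serialized = ET.tostring(note, encoding="unicode")

print(text)
print(child_count)
print(serialized)
```

The parser hands the program the literal characters and builds no B element, which is the sense in which the section is not parsed.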

Style sheets

So far, I've sidestepped two important issues when discussing XML. The first is how XML elements should be formatted. (You've probably tried, but in vain, to find formatting instructions in the code snippets provided.) The second has to do with how browsers can understand non-standard tags like <publication>.

The answer lies in the use of style sheets. Cascading Style Sheets (CSS), which are moderately popular on the Web, allow you to change the formatting of well-known HTML tags and define new tags. Specifically, the Network Magazine Web server uses CSS style sheets to standardize the presentation of typical elements, such as <H1>, and to introduce new ones, such as sidebars.

CSS can also be used to format XML documents, but it is not a good choice. The main advantage of XML is that it exposes the structure of a document as a tree that can be manipulated. Unfortunately, CSS cannot interact with that tree and can only format XML documents "as is." You can display a document in any format you like, but you cannot selectively present its data without resorting to a scripting language. What's more, to use CSS you have to learn yet another syntax.

These limitations led to the creation of XSL. It is an XML application with its own semantics (a fixed set of elements), so it can be used to create style sheets (document templates) that any XML parser can understand.

XSL style sheets describe how XML documents should be converted to other formats, such as HTML or Rich Text Format (RTF). But XSL style sheets are more than just format converters; they also provide a mechanism for manipulating data. For example, data can be sorted, searched, deleted, or added directly in the browser.

Let's look at a simple style sheet that we could use for the Editor Contacts application presented earlier.

<?xml version="1.0" ?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">
<!-- Declares that the document is a style sheet associated with the xsl: namespace -->
  <xsl:template match="/">
  <!-- Apply the template to everything in the source XML document -->
    <HTML>
      <BODY>
        <H1>Editor Contacts</H1>
        <xsl:for-each select="editor_contacts/editor">
          <H2>Name: <xsl:value-of select="first_name"/>
          <xsl:value-of select="last_name"/></H2>
          <P>Title: <xsl:value-of select="title"/></P>
          <P>Publication: <xsl:value-of select="publication"/></P>
          <P>Street Address: <xsl:value-of select="address/street"/></P>
          <P>City: <xsl:value-of select="address/city"/></P>
          <P>State: <xsl:value-of select="address/state"/></P>
          <P>Zip: <xsl:value-of select="address/zip"/></P>
          <P>Email: <xsl:value-of select="e_mail"/></P>
        </xsl:for-each>
      </BODY>
    </HTML>
  </xsl:template>
</xsl:stylesheet>

When saved to disk under the name EDITORS.XSL (or any other name), this template will be applied to EDITORS.XML if you add the following line to it after the first:

<?xml-stylesheet type="text/xsl" href="editors.xsl" ?>

Ultimately, the text on the browser screen will look exactly like the HTML fragment presented earlier. However, XSL can also act like the mail-merge function of a word processor. Defined as part of the XSL namespace, the xsl:for-each element tells the processor to loop through all matching nodes in the source XML file. The xsl:value-of element inserts the value of an XML node into the HTML element. Thus, if you go back to EDITORS.XML and paste in dozens or hundreds of contact addresses, they will all be displayed without any change to the style sheet. Because the formatting information only needs to be sent once, XML and XSL save bandwidth.
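The mail-merge idea can be imitated in a few lines of Python. This is only a sketch of the looping and substitution logic, not an XSL processor: the Python standard library does not include one, so findall and findtext stand in for xsl:for-each and xsl:value-of. The helper names value_of and render are made up for the example, and the record is abbreviated.

```python
import xml.etree.ElementTree as ET
from html import escape

# Abbreviated copy of the EDITORS.XML record, inlined for self-containment.
EDITORS_XML = """<editor_contacts>
  <editor>
    <first_name>Jonathan</first_name>
    <last_name>Angel</last_name>
    <title>senior editor</title>
    <e_mail>coder@yourdomain.com</e_mail>
  </editor>
</editor_contacts>"""


def value_of(node, path):
    """Rough stand-in for xsl:value-of: the text at path, or an empty string."""
    return escape(node.findtext(path, default=""))


def render(xml_text):
    """Fill a fixed HTML template once per <editor> element."""
    root = ET.fromstring(xml_text)
    parts = ["<HTML><BODY>", "<H1>Editor Contacts</H1>"]
    # Rough stand-in for xsl:for-each select="editor_contacts/editor".
    for editor in root.findall("editor"):
        name = f"{value_of(editor, 'first_name')} {value_of(editor, 'last_name')}"
        parts.append(f"<H2>Name: {name}</H2>")
        parts.append(f"<P>Title: {value_of(editor, 'title')}</P>")
        parts.append(f"<P>Email: {value_of(editor, 'e_mail')}</P>")
    parts.append("</BODY></HTML>")
    return "\n".join(parts)


page = render(EDITORS_XML)
print(page)
```

Adding more editor elements to the input produces more blocks of output with no change to the template code, which is the property the style sheet relies on.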

XSL style sheets also resemble mail merge in that they let you selectively omit data fields from the display. In addition, the output can be sorted by any data field. To sort the contact database by the editor's last name in ascending alphabetical order, the xsl:for-each element should be modified as follows:

<xsl:for-each select="editor_contacts/editor" order-by="+last_name">
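The order-by attribute belongs to the working-draft dialect of XSL that this style sheet's namespace refers to; the final XSLT 1.0 Recommendation expresses sorting with a child xsl:sort element instead. The sorting itself is easy to picture in Python. The sketch below is an analogy rather than XSL, and every name in the data except the first record is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Only the first record comes from the article; the other two are hypothetical.
CONTACTS = """<editor_contacts>
  <editor><first_name>Jonathan</first_name><last_name>Angel</last_name></editor>
  <editor><first_name>Alex</first_name><last_name>Zimmer</last_name></editor>
  <editor><first_name>Pat</first_name><last_name>Brown</last_name></editor>
</editor_contacts>"""

root = ET.fromstring(CONTACTS)


def last_name_of(editor):
    """Sort key: the editor's last name, or an empty string if absent."""
    return editor.findtext("last_name", default="")


# Ascending order, the counterpart of order-by="+last_name".
ascending = [last_name_of(e) for e in sorted(root.findall("editor"), key=last_name_of)]

# Descending order, the counterpart of order-by="-last_name".
descending = [
    last_name_of(e)
    for e in sorted(root.findall("editor"), key=last_name_of, reverse=True)
]

print(ascending)
print(descending)
```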

XSL can also transform the output conditionally, depending on the values of various elements or attributes. What's more, it lets you query data using a wide variety of pattern operators, wildcards, filters, Boolean operators, and set expressions. XML and XSL are in no way intended to replace SQL, and few people are likely to want to store their databases directly in XML format. However, XSL opens up a variety of ways to search data once it has been loaded into the browser. You will no longer have to rely on the browser's primitive built-in Find command to locate information.

The considerable potential of XML as middleware is underpinned by the Document Object Model (DOM), version 1.0 of which was adopted as a recommendation by the W3C in October 1998. The DOM originated as a specification for ensuring portability of JavaScript scripts and Java programs between Web browsers and later evolved into an API for HTML and XML documents. It defines the logical structure of documents, how to access and manipulate them. Programmers can create documents, manipulate their structure, and add, modify, or delete elements and content.

The DOM has no effect on how XML and HTML documents should be written. Rather than defining a set of data structures, it presents documents through an object model, as a tree structure made up of nodes. There is no need to use the DOM simply to view XML documents in a browser; it comes into play when a script needs to modify an XML document or access its data. On the server, the DOM can be used to analyze XML files received from the client and respond to them accordingly. In addition, programmers can use the DOM as an intermediate layer for converting from a database format to XML. With DOM interfaces implemented correctly, users will never need to know that the data is stored in any format other than XML.
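As an illustration of what working through the DOM looks like, here is a small Python sketch using xml.dom.minidom, the standard library's lightweight DOM implementation. The phone element and its value are hypothetical additions invented for the example.

```python
from xml.dom.minidom import parseString

doc = parseString(
    "<editor_contacts><editor><last_name>Angel</last_name></editor></editor_contacts>"
)

# Access: navigate the node tree through DOM interfaces.
editor = doc.getElementsByTagName("editor")[0]
last_name = editor.getElementsByTagName("last_name")[0].firstChild.data

# Add: create an element and a text node, then attach them to the tree.
# The <phone> element and its number are made up for this sketch.
phone = doc.createElement("phone")
phone.appendChild(doc.createTextNode("555-0100"))
editor.appendChild(phone)
after_add = doc.documentElement.toxml()

# Delete: detach the node again.
editor.removeChild(phone)
after_delete = doc.documentElement.toxml()

print(last_name)
print(after_add)
print(after_delete)
```

The same method names (getElementsByTagName, createElement, appendChild, removeChild) are defined by the DOM specification itself, which is what makes scripts of this kind portable across implementations.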

Declaring the Structure of the Document

Okay, okay, I'll admit it: in moving on to XML as middleware, I skipped one important step. If XML elements and attributes are used solely for convenience on your own Web site (as you might use CSS), then it doesn't matter that you give them names whose meaning is nonstandard and known only to you. If, on the other hand, you want to provide data to the outside world and receive information from business partners, it matters a great deal. You should use elements and attributes the same way everyone else does, or at least document what you are doing.

To do this, you'll need to use Document Type Definitions (DTDs). Stored at the beginning of an XML file or externally as a *.DTD file, these definitions describe the information structure of the document. DTDs enumerate the permitted element names, define the attributes available for each element type, and describe how elements may be combined with one another.

Each line in a document type definition can contain an element type declaration, which names an element and defines the type of content that the element can hold. It looks like this:

<!ELEMENT element_name (data_type)>

For example, the declaration <!ELEMENT publication (#PCDATA)> defines an element named publication that contains character data (that is, text). The declaration <!ELEMENT special_report (article_1, article_2, article_3)> defines an element named special_report that contains the child elements article_1, article_2, and article_3 in the specified order, for example:

<special_report>
  <article_1>XML: the time has come</article_1>
  <article_2>XML transcends itself</article_2>
  <article_3>Managing networks and systems using XML</article_3>
</special_report>
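A validating parser checks such a content model automatically. Python's standard library parser does not validate against DTDs, so the sketch below checks the sequence by hand, purely to show what the declaration demands. The function name follows_sequence and the sample documents are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Content model from <!ELEMENT special_report (article_1, article_2, article_3)>
SPECIAL_REPORT_MODEL = ["article_1", "article_2", "article_3"]


def follows_sequence(element, model):
    """True if the element's children are exactly the named elements, in order."""
    return [child.tag for child in element] == model


good = ET.fromstring(
    "<special_report>"
    "<article_1>a</article_1><article_2>b</article_2><article_3>c</article_3>"
    "</special_report>"
)
# Same children, but article_2 comes first, which the model forbids.
bad = ET.fromstring(
    "<special_report>"
    "<article_2>b</article_2><article_1>a</article_1><article_3>c</article_3>"
    "</special_report>"
)

good_ok = follows_sequence(good, SPECIAL_REPORT_MODEL)
bad_ok = follows_sequence(bad, SPECIAL_REPORT_MODEL)

print(good_ok, bad_ok)
```

This covers only a strict comma-separated sequence; real content models also allow alternatives and repetition, which a hand-written check like this does not handle.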

After defining the elements, a DTD can also define attributes using the <!ATTLIST> declaration. It specifies an element, names an attribute associated with it, and then describes the attribute's valid values. For example, the following declaration associates the manufacturer attribute with the car element and restricts it to one of the listed values:

<!ATTLIST car manufacturer (Audi|Volvo|Volkswagen) #IMPLIED>

<!ATTLIST> lets you manage attributes in many other ways as well: setting default values, suppressing white space, and so on. A DTD can also contain <!ENTITY> declarations, which define entity references, and <!NOTATION> declarations, which specify what to do with non-XML binary data.

A serious and somewhat surprising limitation of DTDs is that they do not allow data to be typed, that is, restricted to a specific format (such as a date, an integer, or a floating-point number). As you've probably already noticed, DTDs also use a syntax different from that of XML and aren't very intuitive. For these reasons, DTDs are likely to be replaced by the more powerful and easier-to-use XML Schemas, which are currently under development. For more information about XML Schemas, see the working draft listed under Internet Resources and the sidebar "What's in your name?".

You may have heard the terms "well-formed" and "valid" applied to XML documents. A document is well-formed if every opening tag has a corresponding closing tag and no tags overlap. (Thus, most HTML documents are not well-formed.) A document is valid if it has a DTD and complies with its rules.
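Well-formedness is easy to test mechanically, because any conforming XML parser must reject a document that violates it. The Python sketch below relies on that: it simply tries to parse the text with the standard library's xml.etree.ElementTree. The function name is_well_formed and the sample strings are made up for the example, and validity against a DTD is not checked.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Properly nested tags.
nested = is_well_formed("<p><b>bold <i>and italic</i></b></p>")
# Overlapping tags: <b> is closed before the <i> opened inside it.
overlapping = is_well_formed("<p><b>bold <i>and italic</b></i></p>")
# HTML-style paragraphs with no closing tags.
unclosed = is_well_formed("<p>first paragraph<p>second paragraph")

print(nested, overlapping, unclosed)
```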

XML gets started

XML will be increasingly popular as an open and effective standard for business-to-business collaboration and e-commerce. XML data will be moved primarily by using HTTP, but it can also be propagated using message queuing technologies such as IBM's MQSeries or Microsoft's Message Queue Server.

However, for this to be possible, specific schemas need to be defined and implemented consistently. The W3C quite rightly decided not to get involved in this; as a result, dozens of industry standards organizations are defining XML DTDs and schemas. These include RosettaNet (focused on the IT supply chain; see http://www.rosettanet.org), CommerceNet (http://www.commercenet.com), XML/EDI Group (http://www.geocities.com/WallStreet/Floor/5815/), Open Applications Group (http://www.openapplications.org), XML.ORG (http://www.xml.org), and BizTalk (http://www.biztalk.org).

Microsoft's BizTalk is the most controversial: the company's supporters see it as an altruistic attempt to help XML take hold, while opponents see it as another attempt to subjugate the industry. (For more information about BizTalk, see "XML Transcends Itself.")

In my personal opinion (which, ironically, coincides with the forecast published on the BizTalk Web server), it is unlikely that any industry will manage to adopt a single set of semantic rules across its XML schemas. On the other hand, the variety may well be reduced to two or three competing schemas per industry, with mappings published for translating between them.

Business partners will have no trouble adopting XML and common schemas to communicate with each other. Fierce competitors, such as online bookstores and auction sites, may cling to their own schemas to the last before finally presenting data in a standard way. Ultimately, however, customers will force them to do so. Once XML applications make it possible to search and compare prices across different servers, those who refuse to standardize will simply lose business.

XML is here to stay, and it has a lot going for it. Incidentally, according to the W3C, XML may become a key element of digital television broadcasting in the form of the recently adopted "Boston" extension to the Synchronized Multimedia Integration Language (SMIL).

It may be too early to completely remake your Web server for XML. However, it is time to start working with it, especially since the necessary tools are already available. From an end-user perspective, Microsoft's Internet Explorer 5.0 supports XML, XSL, DTD, and XML Schemas, and Netscape Navigator/Mozilla 5.x will do so after its release.

Internet Resources

Tim Bray, co-editor of the Extensible Markup Language (XML) 1.0 specification, wrote an excellent introduction to XML, "XML and the Second-Generation Web," for Scientific American. It can be read at http://www.sciam.com/1999/0599issue/0599bosak.html. His article "Extending the Concept of a Document" can be found on the Web Techniques magazine server at http://www.webtechniques.com/archives/1998/12/bray/.

The World Wide Web Consortium's XML home page, with links to introductory articles, FAQs, and related standards, is located at http://www.w3.org/XML/.

If you want to keep up to date with developments in the field of XML, visit http://www.xml.com, published by Seybold Publications and O'Reilly.

XML Schemas

Designed to replace Document Type Definitions (DTDs). See the W3C working draft at http://www.w3.org/TR/xmlschema-1/.

What's in your name?

Extensible Markup Language (XML) allows you to create your own tags, document them using Document Type Definitions (DTDs) or XML Schemas, and then easily exchange data with other sources. All of this is good, but you may find that others use the same names for elements and attributes as you, but rely on different DTDs.

Turning to the popular bookstore example, it's almost certain that both Know Knew Books and Amazon.com will use tags with names like author, title, isbn, and price. At the same time, it is unlikely that they will use the same DTDs. This is a sure path to problems.

To avoid such conflicts, the W3C developed the concept of namespaces and the xmlns keyword. Thanks to them, the names of elements and attributes that would otherwise conflict with each other can be used in one document. Now they are distinguished by different namespace prefixes and are defined by different DTDs or schemas.

Here, for example, is an XML snippet using namespaces:

<inventory xmlns:storea="http://www.p-qc.com/books.dtd"
           xmlns:storeb="http://www.amazon.com/schema">
  <storea:magazine>
    <storea:title>Network Magazine</storea:title>
  </storea:magazine>
  <storeb:magazine storeb:title="Data Communications">
  </storeb:magazine>
</inventory>
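To see how a program keeps the two vocabularies apart, consider the following Python sketch, which reads a well-formed version of the inventory snippet with the standard library's xml.etree.ElementTree. The NS dictionary is a convenience defined for the example; the parser itself identifies each name by its namespace URI, not by the prefix.

```python
import xml.etree.ElementTree as ET

INVENTORY = """<inventory xmlns:storea="http://www.p-qc.com/books.dtd"
           xmlns:storeb="http://www.amazon.com/schema">
  <storea:magazine>
    <storea:title>Network Magazine</storea:title>
  </storea:magazine>
  <storeb:magazine storeb:title="Data Communications"/>
</inventory>"""

# Prefix-to-URI map used only for lookups in this sketch.
NS = {
    "storea": "http://www.p-qc.com/books.dtd",
    "storeb": "http://www.amazon.com/schema",
}

root = ET.fromstring(INVENTORY)

# Store A keeps the title in a child element.
title_a = root.findtext("storea:magazine/storea:title", namespaces=NS)

# Store B keeps the title in a namespaced attribute.
magazine_b = root.find("storeb:magazine", NS)
title_b = magazine_b.get("{http://www.amazon.com/schema}title")

# Internally each tag carries its full namespace URI.
tag_a = root.find("storea:magazine", NS).tag
tag_b = magazine_b.tag

print(title_a, title_b)
print(tag_a)
print(tag_b)
```

Although both stores use the name magazine, the two elements end up with different fully qualified names, so neither can be mistaken for the other.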