XML Web Pages Without Tears

Hugh Sparks
Version 1.8, February 5, 2005

The goal of this irreverent tutorial is to help you learn how to use XML and XSL to make your own web pages.

XML lets you create documents that separate information from the way it's displayed. You can design your own markup language to suit your own style and you can specify how your documents will look without learning hundreds of HTML tags or worrying about how different browsers will render them.

XML documents are much easier to understand and maintain because they aren't full of formatting goop. Instead, they are full of information-organizing goop. This is an improvement because you get to decide how the goop looks and what it means.

If you've ever looked at the source for a web page created by Microsoft FrontPage or Netscape Composer, you won't need further motivation to consider XML.

Prerequisites

You do have to bring a few bits to the table: We're going to talk about translating XML into HTML so you have to know a little HTML to understand what's going on. If you need more background, try one of the countless tutorials on HTML. A rather nice minimal treatment is by Kari Boyce.

To complicate your life further, we will be translating XML documents to a dialect of HTML called XHTML. If you already know some HTML, picking up the differences from the examples presented here will be easy. For more details, try this introduction to XHTML.

The primary tool for translating XML to XHTML is a thing called an XSL stylesheet. If you've never encountered computer programming in any form, the more advanced parts of XSL will probably squeak you.

Nothing can be explained to a stone.
-- John McCarthy

Pitfalls

Don't despair!
When you pursue the study of XML and related concepts you will be drowned in a morass of terminology, acronyms and abstract definitions. You will be discouraged or tempted to go out and buy a shelf of three-inch-thick books. (The standard unit of computer knowledge.)

Evil vested interests will try to convince you to use their marvelously complex GUI tools. Corrupted programmers will put references to proprietary programs in their puerile examples. They want you to create XML documents that only work with their employer's browser.

Ignore them all! I hope to convince you in 20 minutes that XML is an easier and better way to make web publications. The only software you'll need is a simple text editor. Far from being a snare for more proprietary technology, XML is a path to liberation from browser dependencies and diabolical marketing scams.


With XML, you can take over the world!

The extensible markup language

XML is a format for structured text

The extensible markup language (XML) is a way of describing any kind of structured information using plain-text documents.

Tags are the boundaries of elements

The structure is imposed on the text by using pairs of tags. Each element of the document has an opening tag, a span of text and a closing tag. The span of text may contain (almost) arbitrary characters as well as other elements.

An opening tag consists of a name in angle brackets: <MyTag>. A matching closing tag uses the same name in this structure: </MyTag>.

XML is case-sensitive: The opening tag <MyTag> does not match the closing tag </myTag>.

Elements must be properly nested

XML is quite picky about the way these tags are used: 1) Every opening tag must be followed by a matching closing tag. 2) Elements must not intersect.

This is legal:

	<One>Some text<Two>More text</Two></One>

This is not:

	<One>Some text<Two>More text</One></Two>

White space is preserved

XML documents can have extra white space and line breaks. This extra white space is preserved and passed on to any client program that processes the document.

	<one>
		Some text
		<two>
			More text
		</two>
	</one>

Document structure

A complete XML document has a prolog followed by exactly one body element. This body element is called the root of the document.

The prolog consists of declarations and processing instructions that describe the document and how it will be processed. The minimum prolog, which should begin every XML document, is the XML declaration. It is written like a special processing instruction:

	<?xml version="1.0"?> 

Processing instructions are always bounded by these special tags: <? processing instruction text ?>

Here's a complete XML document in one line:

	<?xml version="1.0"?><a>some text<b>more text</b></a>

You are free to invent tags in the body element to describe the structure of your text. XML has no pre-defined tags. It's all up to you.
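
As a slightly larger illustration, here is a sketch of a document that uses invented tags to describe a recipe. The tag names (recipe, title, ingredient, step) are made up for this example and have no meaning to XML itself:

```xml
<?xml version="1.0"?>
<recipe>
	<title>Toast</title>
	<ingredient>One slice of bread</ingredient>
	<step>Put the bread in the toaster.</step>
	<step>Wait until it pops up.</step>
</recipe>
```

The document has the prolog followed by exactly one root element, and every opening tag is closed in the proper order.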

Attributes

An opening tag is allowed to have attributes. Attributes are expressions of the form variable="any string". They appear inside the opening tag after the name. Any number of attributes can be given. Attributes have no meaning at all in XML. The programs that process the XML document may use them as parameters to control how the document is displayed or otherwise handled. An example of a tag with attributes is the expression for a web link in HTML. In this example, the <a> tag contains one attribute "href" whose value is "goop.html":

	<a href="goop.html">A page of goop</a>

Information inside an element can be placed in attributes or between the opening and closing tags. The decision is somewhat arbitrary, but most developers try to keep the essential content between the tags. Attributes are used for things like display hints, search keys or other modifiers "at right angles" to the actual data. The value of an attribute can only be a simple string, so all structured information must appear as sub-elements.
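
One way to apply this guideline is sketched below. The tag and attribute names are hypothetical: the essential content (title and authors) lives in sub-elements, while the attributes carry a search key and a display hint.

```xml
<book id="b42" cover="hardback">
	<title>Goop and Its Uses</title>
	<author>A. Writer</author>
	<author>B. Editor</author>
</book>
```

An attribute name can appear only once in a tag, so the two authors could not both be stored as author attributes. Repeated or structured data like this usually ends up in sub-elements.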

Sometimes an opening tag is used only to specify attributes or to act as a marker. There's no text between the opening and closing tags:

	<gack a="This is" b="my Gack"></gack>

In this case the closing tag may be abbreviated:

	<gack a="This is" b="my Gack"/>

Comments

Inside your XML document, you can have comments that are not processed as part of the document. A comment has the format:

	<!-- This is my comment -->

Namespaces

As we have emphasized, you can make up your own set of tags for marking up your documents. But what happens if you want to incorporate information marked up by other people? Suppose they make up tag names that conflict with yours?

To solve this problem, XML has the concept of a namespace. A namespace is simply a set of tags invented to mark up some kind of data. To avoid conflicts, every namespace must have a unique name. People commonly use URLs as namespace names because the domain name part of a URL is registered to an individual or organization. The rest of the namespace name is usually some path relative to the domain. For example, the XHTML namespace has the globally-unique name: "http://www.w3.org/1999/xhtml". Although this looks like the address of a web page, there is no need for such a document to exist. It is only important that the name be unique.

When you invent a set of tags, you can give the namespace a name like "http://mydomain.com/mytags". This name will be unique as long as you own the domain name "mydomain.com".

You assert the namespace for all the tags in your XML document by adding the attribute "xmlns" to the root element. For example:

	<?xml version="1.0"?>
	
	<myPage xmlns="http://mydomain.com/mytags">
		<p>
		This p belongs to me. My p-tag is used to mark sections
		that are preposterous.
		</p>
	</myPage> 

If you want to use some of your friend Joe's markup tags, you can put them in a sub-section of your document and assert his namespace for the scope of that section:

	<?xml version="1.0"?>
	
	<myPage xmlns="http://mydomain.com/mytags">
		<p>
		This p belongs to me. My p-tag is used to mark sections
		that are preposterous.
		</p>
		
		<p xmlns="http://joesdomain.com/joetags">
		This is one of Joe's p-sections. He uses p-tags for paragraphs.
		</p>
	</myPage> 

If you need to use quite a few of Joe's tags, specifying the xmlns attribute in every tag is a bit unwieldy. To save some effort, you can define a prefix for Joe's namespace in the root element of your document:

	<myPage xmlns="http://mydomain.com/mytags"
		xmlns:joe="http://joesdomain.com/joetags">
		
		<p>
		This is a preposterous assertion.
		</p>
		
		<joe:p>
		This is one of Joe's paragraphs.
		</joe:p>
	</myPage> 

Namespace definitions are usually placed in the root tag of the document, but you can define namespaces in the opening tag of any element and optionally specify a prefix for each. The namespace scope applies to the contents of that element and its sub-elements.

The prefixed tags can only be used inside the scope of the element where the prefix is defined. If you put an xmlns attribute in a tag without using a prefix, it becomes the default namespace for all tags inside that element that have no prefix.
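
The scope rules can be sketched with the same two hypothetical namespaces used above. Here the joe prefix is defined on an inner element (the section tag is invented for this example) rather than on the root:

```xml
<?xml version="1.0"?>
<myPage xmlns="http://mydomain.com/mytags">
	<p>This p is in my default namespace.</p>

	<section xmlns:joe="http://joesdomain.com/joetags">
		<joe:p>Inside section, the joe prefix is defined.</joe:p>
		<p>Unprefixed tags here still belong to my namespace.</p>
	</section>

	<!-- A joe:p element placed here would be an error: the joe prefix is out of scope. -->
</myPage>
```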

Entities and special characters

Entities allow you to make abbreviations for long strings of text used throughout your document. If you are familiar with the concept of macros found in many programming languages, this is a similar idea.

Some entities are predefined by the XML standard. These entities must be used when you want to use XML markup characters as part of your text. A single <, for example, may be written as &lt; and a single > may be written as &gt;.

Entity references always begin with & and end with a ; character. Here is a complete table of the predefined entities:

	&lt;	<	less than
	&gt;	>	greater than
	&amp;	&	ampersand
	&apos;	'	apostrophe
	&quot;	"	quotation mark 

Any Unicode character can be inserted by using a numeric character reference:

	&#D;		The character with decimal Unicode code point D
	&#xH;		The character with hexadecimal Unicode code point H

If you have a section of text that contains lots of special characters, it is easier to put the text inside of a CDATA section:

	<![CDATA[ Any !@*#$<^& text can go in here ]]>

The CDATA section has the opening tag: "<![CDATA[" and the closing tag: "]]>"
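
To make these rules concrete, here is a small sketch with invented tag names. Both note elements carry the same text, the first using predefined entities and the second using a CDATA section. The two copyright elements show the decimal and hexadecimal forms of the same character, the copyright sign (decimal 169, hexadecimal A9):

```xml
<?xml version="1.0"?>
<examples>
	<note>if a &lt; b &amp;&amp; b &gt; c</note>
	<note><![CDATA[if a < b && b > c]]></note>
	<copyright>&#169; 2005</copyright>
	<copyright>&#xA9; 2005</copyright>
</examples>
```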

Creating your own entities

You can define your own entities that expand into any desired strings. They can be used to make nice names for hexadecimal symbols or abbreviations for frequently used long strings.


Entities are defined inside the scope of a DOCTYPE declaration. This declaration must appear in your document before the root element.

Defining two entities: (The capital letters are required.)

	<!DOCTYPE myRootElement [
		<!ENTITY test1 "my test1">
		<!ENTITY test2 "my test2">
	]> 

As the name suggests, myRootElement should match whatever you've chosen as the name of the root element in your document.

Using these entities in your document source:

	This is &test1; of entities.
	This is &test2; of entities.  

Will result in:

	This is my test1 of entities.
	This is my test2 of entities.
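
Putting the pieces in order, a complete document with these definitions might look like the sketch below. The root element name myRootElement is just the placeholder used above:

```xml
<?xml version="1.0"?>
<!DOCTYPE myRootElement [
	<!ENTITY test1 "my test1">
	<!ENTITY test2 "my test2">
]>
<myRootElement>
	This is &test1; of entities.
	This is &test2; of entities.
</myRootElement>
```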

Here are a few commonly used entity definitions:

	<!ENTITY sharp "&#35;">
	<!ENTITY trade "&#8482;">
	<!ENTITY reg   "&#174;">
	<!ENTITY copy  "&#169;">
	<!ENTITY nbsp  "&#160;"> 

Entities may also be used to include the contents of a file in your document. First, define the external file in a SYSTEM entity:

	<!ENTITY myStuff SYSTEM "MyStuff.txt"> 

The entity reference: &myStuff; may be used anywhere in your document to insert the contents of the file. (Entity names are case-sensitive, so the reference must match the definition exactly.)

If you write mathematical documents or other jargon that requires lots of entity definitions, you can put all of them in an external file. For example, here is the content of the file "MyEntities.ent":

	<!ENTITY test1 "my test1">
	<!ENTITY test2 "my test2"> 

To include this file in your document use these expressions inside your DOCTYPE element:

	<!ENTITY % mystuff SYSTEM "MyEntities.ent">
	%mystuff; 

This looks very similar to the SYSTEM entity we used in the previous example, but in this case we are expanding the definition inside the DOCTYPE element itself. This special usage requires us to use the "%" character in the entity definition and in the expression that expands the entity.
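
A complete document using this technique might look like the following sketch. It assumes the file MyEntities.ent shown above sits in the same directory as the document:

```xml
<?xml version="1.0"?>
<!DOCTYPE myRootElement [
	<!ENTITY % mystuff SYSTEM "MyEntities.ent">
	%mystuff;
]>
<myRootElement>
	This is &test1; of entities.
</myRootElement>
```

Be aware that some XML processors, including some web browsers, decline to load external entities, so test this with the software you intend to use.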

More of a good thing

You may have noticed that the DOCTYPE and ENTITY elements look a bit out of place. That's because they're holdovers from parts of the older SGML standard. DOCTYPE has many additional capabilities for validating XML documents. Validation ensures that the right elements appear in the right order with the right kind of data inside.

But DOCTYPE is soon to be outdated and replaced by XML Schema. And XLink, XPath, and XPointer will someday improve your life to such an extent you'll hardly ever need to work for a living.

Unfortunately, most web browsers will only work with the tiny subset of XML covered in this introduction. If your goal is to define your own markup for web pages, you now know all you need to know about XML. Be sure to mention this at a party or nightclub.


"He knows XML"

The extensible stylesheet language

So that's all there is to XML?
Now that I've lulled you into complacency, it's time to reveal the really hard part: To use XML for a web page, you have to translate it into HTML. This can be fiendishly complex. Mwoo ha ha ha ha!

An extensible stylesheet language (XSL) document is used to define how the text in your XML document will be displayed in a web browser.

An XSL document "runs" when your XML source document is loaded by a web browser. XSL can make multiple passes through the source to extract, reformat and rearrange any of the tagged XML text. With the appropriate elements XSL can, for example, create a web page with table of contents, index, and glossary.

Actually, XSL can transform an XML document into almost anything. XSL is itself an XML document and can be processed by other XSL templates. XSL documents can perform arbitrary computations and include templates from remote sites on the web. You can see that this quickly leads to Things Man Was Not Meant to Know.


XSL can manipulate anything

Format of an XSL document

The xsl file should contain this framework:

	<?xml version='1.0'?>
	
	<xsl:stylesheet version="1.0" 
		xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
		xmlns="http://www.w3.org/1999/xhtml"
	>
		
		<!-- Your templates go here -->
		
	</xsl:stylesheet> 

The first line tells us that an XSL document is a type of XML document. The second element is the root element, which is an "xsl:stylesheet". The first xmlns (xml namespace) attribute defines a prefix "xsl" that will be used on all tags that are part of the XSL markup. The second xmlns says that tags with no prefix will become part of the XHTML output of the stylesheet.

Using templates

XSL is based on the idea of template-matching. Each distinct tag used in your xml document will have an associated template. Inside the body of the template, you can rearrange the text from the source document and add any additional text. We will be using XSL to create an HTML document that a web browser will display.

Here's the outline of a template:

	<xsl:template match="tagname">
		Your text and XSL expressions go in here.
	</xsl:template> 
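
For instance, a template for a hypothetical warning tag could replace it with some fixed XHTML markup. The tag name and the class attribute are invented for this sketch:

```xml
<xsl:template match="warning">
	<p class="warning">
		<b>Warning:</b> Here be dragons.
	</p>
</xsl:template>
```

Every warning element in the source would be replaced by this fixed paragraph. The next section shows how to pull in the text that appeared between the source tags.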

Using the text between source tags

Inside a template, you can substitute all the text that appeared between the tags of the source element using this expression:

	<xsl:value-of select="."/> 

Frequently, you want to run the text in the source element through all the templates in your stylesheet until nothing more happens. This allows you to use all sorts of nested formatting tags. The following expression will substitute the text between the source tags after it has been recursively processed in this manner:

	<xsl:apply-templates/> 

If the root element of your XML document is called a page, the template to create an HTML document might look like this:

	<xsl:template match="page">
		<html>
		<head/>
		<body>
			<xsl:apply-templates/>
		</body>
		</html>
	</xsl:template> 
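
To see the difference between the two expressions, consider this sketch. The tag names para and em are invented for the example, and the templates would go inside the stylesheet framework shown earlier:

```xml
<!-- Source fragment: <para>Some <em>important</em> text</para> -->

<!-- Nested tags are processed: the em template below gets its turn -->
<xsl:template match="para">
	<p><xsl:apply-templates/></p>
</xsl:template>

<!-- Only the plain text between the em tags is copied -->
<xsl:template match="em">
	<i><xsl:value-of select="."/></i>
</xsl:template>
```

With these templates the fragment should come out as <p>Some <i>important</i> text</p>, ignoring any namespace declarations. If the para template used xsl:value-of instead, the em template would never run and the paragraph would contain only the flat text "Some important text".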

Using the value of attributes

In addition to getting at the text between the tags, it is possible to get the attribute values assigned in the opening tag. Suppose that your "page" tag has a name attribute that you want to appear as the title of the resulting html document:

	<page name="This is my page"> 

The following expression will return the value of the name attribute inside the template for page:

	<xsl:value-of select="@name"/> 

Expanding our previous page template example, here is how this expression might be used:

	<xsl:template match="page">
		<html>
		<head><title>
			<xsl:value-of select="@name"/>
		</title></head>
		<body>
			<xsl:apply-templates/>
		</body>
		</html>
	</xsl:template> 

Creating output tags that have attributes

We can also use templates to create html output tags that have attribute values. Suppose we want to generate an html link expression. Our source xml has this format:

	<link name="My link name" url="goop.html"/> 

We want to transform this xml expression into:

	<a href="goop.html">My link name</a> 

A template to construct this link looks like this:

	<xsl:template match="link">
		<a href="{@url}">
			<xsl:value-of select="@name"/>
		</a>
	</xsl:template> 

The construct with the curly braces is called an attribute value template. Whatever expression you put between the curly braces becomes the value of the attribute. In this case, the value of the attribute href will be the value of the XML attribute url. To obtain this value, we use the XPath expression: @url
We will cover more details about XPath later.

If you need to build an attribute value from something more elaborate than a simple expression, such as text combined with other XSL elements, you can use an alternative method. The following example uses the <xsl:attribute> element to produce the same result as the previous template:

	<xsl:template match="link">
		<a>
			<xsl:attribute name="href">
				<!-- Any text or XSL can go here -->
				<xsl:value-of select="@url"/>
				<!-- Any text or XSL can go here -->
			</xsl:attribute>
			<xsl:value-of select="@name"/>
		</a>
	</xsl:template> 
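
As a sketch of combining fixed text with an XSL expression, the variation below puts a directory name in front of the url. The directory pages/ is an assumption made up for this example:

```xml
<xsl:template match="link">
	<a>
		<xsl:attribute name="href">
			<!-- Fixed text, wrapped in xsl:text to control white space -->
			<xsl:text>pages/</xsl:text>
			<!-- Followed by the value of the source url attribute -->
			<xsl:value-of select="@url"/>
		</xsl:attribute>
		<xsl:value-of select="@name"/>
	</a>
</xsl:template>
```

The xsl:text element keeps stray line breaks and tabs out of the attribute value. For the link element shown earlier, this should yield href="pages/goop.html".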

XPath expressions

In the previous examples, we have been using expressions of the form: select="something". The "something" inside the quotes is called an XSL expression. The most common XSL expressions are examples of XPath, which we will now define in more detail.

As we observed in the introduction, well-formed XML documents have a tree structure: One root element that contains a hierarchy of nested elements.

In XPath-speak, we will refer to elements as named nodes. A node is simply an element named by its opening tag. XPath expressions are path names for nodes. The idea of a path name is borrowed from the syntax of file path names in your computer's file system.

The uppermost unnamed level of an XML document is named by a "/". To reach any node in the document, we specify each node separated by more "/" characters. For example the path to the "introduction" of some XML document might be:

	/page/introduction 

(The tag names have no special significance. They are just made up for these examples.)

Frequently there will be more than one node with the same name at a given level. To reach the third paragraph in the first section we would use:

	/page/section[1]/paragraph[3] 

To reach an attribute, we put the "@" symbol in front of the attribute name. To get the name of the page (assuming it has one) we would use the path:

	/page/@name

Paths that begin with "/" are called absolute paths. As you might assume following the file system metaphor, relative paths are also allowed. Inside a template, the path is relative to the node the template matches. If we were in the template for a section, the path to the third paragraph would be:

	paragraph[3]

The following table shows a small subset of common XPath expressions:

	/           Root of the document
	.           The contents of the current node
	..          Parent node
	a           The node a
	/a/b/c/d    Absolute path from the root to d
	a/b/c/d     Relative path to d
	../x        The node x in the parent
	//x         x at any depth in the whole document
	.//x        x at any depth below the current node
	x//y        Any y somewhere below x
	x[n]        The nth x in the current node
	a/b/c/@e    A path to the value of attribute e in node c
	x[@y]       An x with the y attribute
	x[@y="z"]   An x with the attribute y="z"

We will see XPath expressions used in two places: As the value of the match attribute in a template and as the value of a select in the xsl:value-of element.

In the case where a path is used as the value of a template match attribute, the template will be applied to all nodes that have that path.

When a path is used in the value-of construct, it returns the text from the named node. If the node contains other nodes, the value will be the text from the node and all its sub-nodes concatenated together. If the path names a series of nodes with the same name, XSLT 1.0 returns only the text of the first one. (Neither result is usually what you want, so most value-of paths should be qualified until they specify a single element.)

By using appropriate XPath expressions in value-of elements, a template can pick out pieces from any part of the whole source document, not just the nodes that match the template.
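
As a sketch of this idea, the template below matches a hypothetical footer tag but fetches its text from elsewhere in the document using absolute paths. The page, section and paragraph names follow the made-up examples above:

```xml
<xsl:template match="footer">
	<p>
		End of <xsl:value-of select="/page/@name"/>.
		The first section begins:
		<xsl:value-of select="/page/section[1]/paragraph[1]"/>
	</p>
</xsl:template>
```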

A good way to learn about XPath is to study lots of examples. After reading through this section, compare the source of this document to its stylesheet. (Links to do this appear in the next section.)

Using namespaces

If you defined a namespace for your source document, you will need to specify that namespace in the stylesheet heading as well. The templates will look nicer if you define a short prefix. In this example, we introduce the prefix "my" for your namespace:

	<xsl:stylesheet version="1.0"
		xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
		xmlns:my="http://mydomain.com/mytags"
		xmlns="http://www.w3.org/1999/xhtml">
	

The templates must be modified to match tags from your namespace:

	<xsl:template match="my:page">
		...
	</xsl:template>

	<xsl:template match="my:link">
		...
	</xsl:template> 

The components of XPath expressions must also use your prefix:

	<xsl:value-of select="/my:page/my:chapter/my:paragraph[1]"/> 

Attribute names normally don't need a prefix because they don't automatically belong to the default namespace. By default, attribute names don't belong to any namespace at all. Here's an example of going after the font attribute value in some paragraph:

	<xsl:value-of select="/my:page/my:chapter[1]/my:paragraph[5]/@font"/> 

If you design a set of tags and use them exclusively in your XML source document, you don't really need to define a new namespace. The new tags will not belong to any namespace and won't require a prefix in the XSL document.

You must use prefix and namespace declarations when you combine markup from more than one namespace in the same document. Otherwise, there would be no means to resolve conflicts between tags with the same names but different interpretations.
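
A minimal sketch of such a stylesheet, assuming the hypothetical namespaces for my tags and Joe's tags from the XML section, gives each kind of p its own template. The class name preposterous is invented for the example:

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
	xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
	xmlns:my="http://mydomain.com/mytags"
	xmlns:joe="http://joesdomain.com/joetags"
	xmlns="http://www.w3.org/1999/xhtml">

	<!-- My p-tags mark preposterous sections -->
	<xsl:template match="my:p">
		<p class="preposterous"><xsl:apply-templates/></p>
	</xsl:template>

	<!-- Joe's p-tags are ordinary paragraphs -->
	<xsl:template match="joe:p">
		<p><xsl:apply-templates/></p>
	</xsl:template>

</xsl:stylesheet>
```

The two templates never collide because each match names a different namespace. If the extra prefixes show up as declarations in the output, adding exclude-result-prefixes="my joe" to the xsl:stylesheet element should keep them out.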

Another reason to consider using a namespace is to enable migration to a new markup in the future: If you decide someday to change all your documents to some new-and-improved markup, you can do so with an appropriate XSL transformation. But if the new markup uses any of your old tags, you'll need a namespace prefix to distinguish them:

	<xsl:template match="myOld:page">
		<page>
			...Rearrange stuff from myOld page...
		</page>
	</xsl:template>

More XSL

XSL has variables, if-then statements, for-loops and all the string handling functions you could ever want: It is a programming language. Rather than presenting contrived examples, I will take the lazy way out and suggest that you study the source and stylesheet used to produce this page.

The references at the end of this document include pointers to more complete XML and XSL presentations. The W3Schools tutorials are brief and sufficient for many users.
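
For a first taste before diving into that source, here is a minimal sketch that uses a variable, an if-then test and a for-loop to build a simple table of contents. It assumes a page made of section elements that each have a name attribute. These names are invented for the example and are not taken from the stylesheet for this page:

```xml
<xsl:template match="page">
	<!-- Variable: count the sections once and reuse the number -->
	<xsl:variable name="total" select="count(section)"/>

	<!-- If-then: only make a table of contents for more than one section -->
	<xsl:if test="$total &gt; 1">
		<ul>
			<!-- For-loop: one list item per section -->
			<xsl:for-each select="section">
				<li>
					<xsl:value-of select="position()"/>
					<xsl:text>. </xsl:text>
					<xsl:value-of select="@name"/>
				</li>
			</xsl:for-each>
		</ul>
	</xsl:if>
</xsl:template>
```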

Putting it all together

The XSL transformation may be performed inside the web browser or on the server. These two strategies are called client-side and server-side processing respectively.

Client side processing

One big advantage of client-side processing is that you don't need any special software to publish your XML documents. Most ISPs provide some space for client web pages. All you need to do is put your XML and XSL files in the directory provided by your ISP.

An XML document is linked to its stylesheet by adding a special processing instruction right after the XML heading:

	<?xml version="1.0"?>
	<?xml-stylesheet type="text/xsl" href="http://www.csparks.com/stylesheet.xsl"?>

The href can be a complete url as shown or just the name of an XSL file in the same directory as the XML document.

If you simply want to play around with XML and XSL on your own machine, link the XML documents to their stylesheets using a local file name:

	<?xml version="1.0"?>
	<?xml-stylesheet type="text/xsl" href="stylesheet.xsl"?>

You can now double-click on the XML document and your browser will execute the transformation and display the page.

The biggest disadvantage of client-side processing is that quite a few people have browsers that are too old to display XML using XSL transformations. Pages also take longer to display because both the XML and XSL documents must be downloaded before the page can be generated.

Server side processing

Server-side processing transforms the XML document on the web server. This has the advantage of being independent of the client's browser software. It is also faster because only the resulting HTML gets downloaded to the client.

Unfortunately, none of the popular web-servers can directly handle XML/XSL documents. To do server-side processing, you must be able to install and configure software on the server machine. Details of one approach may be found at: Server Side XML Without Tears

You can forget about doing this if you use one of the "big name" national ISPs. Smaller local ISPs often provide Unix shell accounts where you can telnet to your web page directory, edit files, and compile your own server-side software. With this level of service you can probably set up some kind of XML processing.

A few enlightened ISPs offer servers that are pre-configured to support XML. If your ISP does this, you are in luck.

Many specialized web-hosting services offer XML support. You will have to pay them as well as your ISP, but the cost for a small web site (250 MB) is less than U.S. $10.00 a month. To see some examples, do a web search with the search string:

	"Web hosting" Tomcat Cocoon 

Tomcat and Cocoon are two popular software tools used for server-side XML processing.

If you have a broadband connection and your ISP allows you to run your own services, you can configure your own web site using whatever software you like. Tomcat and Cocoon, for example, are available for both Linux and Windows.

A complete example

The following XML document and stylesheet provide a very simple example to test your browser.

To try these out, create two plain-text files and paste in the contents as shown below. You should be able to double-click on the test.xml file and see the results in your browser.

Contents of the file test.xml:

	<?xml version="1.0"?>
	<?xml-stylesheet type="text/xsl" href="test.xsl"?>
	
	<page name="This is my page">
		<p>Here is a paragraph.</p>
		<p>Here is another paragraph.</p>
	</page> 

Contents of the file test.xsl:

	<?xml version="1.0"?>
	<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

	<xsl:template match="page">
		<html>
		<head><title>
			<xsl:value-of select="@name"/>
		</title></head>
		<body>
			<xsl:apply-templates/>
		</body>
		</html>
	</xsl:template>	

	<xsl:template match="p">
		<p><xsl:apply-templates/></p>
	</xsl:template>
	
	</xsl:stylesheet> 
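
If everything works, the transformation should produce HTML roughly equivalent to the following. This is a hand-derived sketch rather than captured output: the exact white space, and extras such as an added meta tag, depend on the XSL processor.

```xml
<html>
<head><title>This is my page</title></head>
<body>
	<p>Here is a paragraph.</p>
	<p>Here is another paragraph.</p>
</body>
</html>
```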

Studying the stylesheet for this document

This document is produced with a stylesheet that uses a few XSL tags not presented here. You will be able to figure most of them out from the context and by seeing the results.

Create a new browser window using the File/New menu item. Next, select "View Source" from the browser menu. You will be horrified to see that this document is not written in XML at all! Actually, you are seeing a translation from the original XML to HTML done auto-magically by my server. This was done so that people with old browsers don't email me with complaints that they can't see my web pages.

To send the real XML document directly to your browser, open this link in the new browser window. Depending on your browser, you will either see no change, a blank window, or a big mess of unformatted xml. In any case, select "View/Source" from the browser menu. A new window will appear that shows the XML source to this page just as it was written.

To see the style sheet XSL source, follow this link. Your browser may display the stylesheet directly, but it is better to do a "View/Source" to see the style sheet with the original formatting.

Compare the XSL templates to the XML source tags they match and study how the results look in this document. After about 20 minutes of study, you will feel a blast of blinding revelation: You will understand how XSL is processing this document.


After comparing the XML and XSL documents

The opposition

Some clever minds insist that learning XML is a waste of time, so before undertaking the 20 minutes you'll need to study this document, you should first take a look at Don't Learn XML.

If you've already read The extensible markup language section above, it's too late for you.

Other authorities say XML is great but little people like us shouldn't use it to design our own markup. Although I disagree, there is considerable merit in the alternatives they suggest: DocBook and XHTML.

Going forward

I hope you see how easy it is to create web documents using XML and why they are easier to maintain. If you've created a home page using HTML, try converting it to XML with markup you design.

Other stuff

If you care to suggest improvements, I'd like to hear from you. My home page has other projects and dementias.

References


This web page is best viewed with a computer