Server-side XML Without Tears

Hugh Sparks
Version 1.3, December 31, 2003

Introduction: Why do this?

You like the idea of using your own markup for web pages. You've created some XML documents and XSL stylesheets that suit your bizarre tastes. When you publish these documents, you discover that many people are still using Microsoft Internet Explorer 5.x or KDE Konqueror and can't appreciate your vital missives.

If this sounds familiar, you need server-side XML processing: This allows you to use all the XML/XSL tricks and tools but still serve up plain-vanilla HTML to the vast unwashed.

A unique feature of the treatment presented here is the ability to mix XML documents with normal HTML documents on your website. If your site contains XML documents configured for client-side processing, these techniques will allow you to do server-side processing without changing anything. This "transparency" means there are no special directories or weird URLs. You can switch back to client-side processing in the future by simply turning off the servlet engine.

If you're just starting out and want to set up server-side XML processing with Apache, Tomcat and Cocoon, this is the quick fix.


The Agenda

1) Install and configure recent versions of:

	Java SDK
	Jakarta Tomcat
	Jakarta Tomcat Connector JK2
	Apache Cocoon 

2) Get the Apache web server to handle XML documents by passing them to Cocoon WITHOUT rebuilding Apache from source. We will use the stock Redhat httpd rpm.

3) Minimize fun: You will not be able to play around with the Tomcat or Cocoon demos, databases, dancing bears or anything else but processing XML files through XSL stylesheets. As you can see, we are serious people here.

No Dancing Bears


You will need your own Linux server or permission to install and configure software on someone else's machine.

You need to know about processing XML documents with XSLT. A quick introduction is at XML Web Pages Without Tears.

Although no programming is required, experience with Linux, the Bash shell, and software installation from tarballs is required. Frankly, it would astonish me to hear that you're not a typical software geek if you've read this far.

What is all this stuff?

Pandora's Box

Java servlets are programs that run on the web server to generate responses when special web documents are requested by a client browser.

Java applets, in contrast, are programs that get downloaded to run on the client when special web documents are requested by a browser.

The Apache web server doesn't know anything about Java, so it must use an extension to run servlets. Such a program is called a servlet container, or sometimes a servlet engine.

We'll be using Jakarta Tomcat as our servlet container. Tomcat can also function as a fairly complete web server, fill out tax forms, and sort laundry, but we won't be using these extra capabilities here.

Apache Cocoon is a Java servlet that processes XML documents using XSLT stylesheets. It can do a lot more. In fact, it can do more than man was meant to know.

We have to get Apache talking to Tomcat when special URLs are recognized. The software that does this is called a connector. There are several different connectors described at the Jakarta Tomcat Connectors site. The method presented here, JK2, uses an Apache module called mod_jk2.

Linux distribution dependencies

This article was developed on a Redhat 9 system, but the components described here are installed from tar.gz files rather than rpms. In most cases, this is the only format available on the sites that originate the software. One exception is the Apache web server, which is installed from the binary rpm supplied by Redhat.

In the discussion that follows, I'll use the expression $serverRoot to designate the location of the Apache configuration and module directories. You don't actually have to make a definition for serverRoot, just keep in mind where your distribution wants these files.

In the Redhat distribution, $serverRoot would refer to /etc/httpd.

Installing the Java SDK

I'm using:


This version includes "NutBeans", Sun's attempt at making an interactive GUI for Java developers. If you're not amused by such things, they also offer a much smaller "no beans" version. If you're not going to do Java programming, this might be a better choice.

The file is a "shar" archive. To install the software, make the bin file executable and run it like a program:

	chmod +x j2sdk-1.4.2-nb-3.5-linux.bin
	./j2sdk-1.4.2-nb-3.5-linux.bin

The installer will offer to put everything in the directory:


For Redhat-ish reasons, I changed this to:


Go into the /usr/java directory and create a symbolic link:

	cd /usr/java
	ln -s j2sdk1.4.2 javaHome 

Create the file: /etc/profile.d/ containing:

	# - Path variables for Java SDK

	export JAVA_HOME=/usr/java/javaHome
	export PATH=$PATH:$JAVA_HOME/bin
	export JAVA_OPTS=-Djava.awt.headless=true 

The headless option enabled in the last definition above is only required if you operate a headless server: a machine that is not running X-Windows. On a headless machine, Java programs cannot perform graphical operations without this feature.

Q: Why would someone want Java to do graphics without a monitor?
A: Many Java servlets generate off-screen images that get downloaded to the client browser.

Continuing, we fix the permissions and install the definitions:

	chmod a+rx /etc/profile.d/
	source /etc/profile.d/ 

To test your path, try:

	which java
	which javac 

If you get good answers, you are ready to go on.
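If you would rather have a script do the looking, the same check can be sketched as below. The helper name have_cmd is my own invention for illustration; it relies only on the standard shell builtin "command -v", which reports whether a name resolves on the current PATH.

```shell
# have_cmd NAME: succeed if NAME can be found on the current PATH.
# (have_cmd is a hypothetical helper name, used only for illustration.)
have_cmd() {
	command -v "$1" >/dev/null 2>&1
}

# Report on the two programs tested by hand above.
for tool in java javac; do
	if have_cmd "$tool"; then
		echo "$tool: found"
	else
		echo "$tool: not found, check your /etc/profile.d file and PATH"
	fi
done
```

A "not found" answer usually means the profile script was not sourced in the current shell.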

Installing Jakarta Tomcat

I'm using:


Unpack the archive and create a symbolic link:

	cd /usr/local/src
	tar xzf jakarta-tomcat-4.1.29.tar.gz
	ln -s jakarta-tomcat-4.1.29 tomcat 

Create the file: /etc/profile.d/ containing:

	# - Path variables for the Tomcat servlet container

	export TOMCAT_HOME=/usr/local/src/tomcat
	export LD_ASSUME_KERNEL=2.2.5 

The definition for LD_ASSUME_KERNEL is some kind of stability hack suggested by the release notes. On my server, Tomcat had to be restarted once a day to prevent hang-ups before I read about this trick.

Fix the permissions and load the file:

	chmod a+x /etc/profile.d/
	source /etc/profile.d/ 

To test your path, try:


If you get an answer, you are ready to go on. Start tomcat in "testing mode" run 

Wait for the messages to stop. It takes longer than you think. Fire up your browser and try:


If you see the welcome screen from Tomcat, you have success.

Use <control>C to stop the server before proceeding to the next step.

Installing Apache Cocoon

Recent and future versions of Cocoon are being distributed in source form only. It is surprisingly easy to build Cocoon, so don't be squeaked.
I'm using:


Unpack the archive and create a symbolic link:

	cd /usr/local/src
	tar xzf cocoon-2.1.3-src.tar.gz
	ln -s cocoon-2.1.3 cocoon 

Create the file: /etc/profile.d/ containing:

	# - Path variables for cocoon

	export COCOON=/usr/local/src/cocoon
	export PATH=$PATH:$COCOON 

Fix the permissions and load the file:

	chmod a+x /etc/profile.d/
	source /etc/profile.d/ 

To test your path, try:


If you get an answer, you are ready to go on.

This version of cocoon has worked well for me, but the "unstable" parts create lots of error messages during the compile and also at runtime when starting. I like to get rid of as much of this as possible by removing the unstable blocks before I build. This eliminates all the build and runtime errors. Doing this also reduces the size of Cocoon from about 150 megs down to 50 megs.

To eliminate the foof, we need to create two files in the top level cocoon directory. These contain overrides for the build script that will eliminate unwanted components.

Create $COCOON/ containing:


Create $COCOON/ containing:


Now we can build Cocoon from the source:


Test cocoon using its built-in servlet container:
(This test doesn't require Tomcat.) servlet 

Wait for the messages to stop, fire up your browser and try:


If you see the welcome screen, you have success.

Use <control>C to stop the servlet before proceeding to the next step.

Cocoon Welcome

Updating the xml parser in Tomcat

You only need to do this with versions of Tomcat older than 4.1.29.

First remove the old XML parser:

	rm $TOMCAT_HOME/common/endorsed/xercesImpl.jar 

Then copy updated files from Cocoon:

	cd $COCOON/build/webapp/WEB-INF/lib
	cp xerces-*.jar xalan-*.jar xml-apis.jar \

Testing Cocoon with Tomcat

Move the cocoon webapp to tomcat's webapps directory:

	mv $COCOON/build/webapp $TOMCAT_HOME/webapps
	mv $TOMCAT_HOME/webapps/webapp $TOMCAT_HOME/webapps/cocoon 

Start tomcat in your shell window: run 

Wait for the messages to stop, fire up your browser and visit:


When you see the welcome screen, you have success.
Use <control>C in the shell window to stop tomcat.

Starting Tomcat as a service at boot time

When using Tomcat in production on a server, it is considered wise to run it under a user account rather than root.

To prepare for this arrangement, we create a tomcat user and group:

	useradd tomcat 
	# Non-redhat systems may require an additional command to create the group. 

Modify the ownership of the distribution files:

	chown -R tomcat:tomcat $TOMCAT_HOME 

If you tinker with Tomcat or Cocoon while logged in as root (shame!) you must remember to do this chown step again to make sure all the files remain accessible to the tomcat user.

Create the file: /etc/rc.d/init.d/tomcat containing:

	# Startup script for the Jakarta Tomcat servlet container
	# chkconfig: 345 20 80
	# description: Starts the Tomcat servlet engine
	. /etc/init.d/functions
	. /etc/profile.d/

	if [ ! $TOMCAT_HOME ] ; then
		echo "Please define TOMCAT_HOME in /etc/profile.d/"
		exit 1
	fi

	case "$1" in
		start)
			# Repair ownership in case something was edited as root
			chown -R tomcat:tomcat $TOMCAT_HOME/*
			action $"Starting Apache Tomcat: " \
				su -l tomcat -c '$TOMCAT_HOME/bin/'
			if [ $? = 0 ] ; then
				touch $TOMCAT_LOCK
			fi
			;;
		stop)
			action $"Stopping Apache Tomcat: " \
				su -l tomcat -c '$TOMCAT_HOME/bin/'
			rm -f $TOMCAT_LOCK
			# Discard Tomcat's cached state between runs
			rm -rf $TOMCAT_HOME/work
			;;
		status)
			if [ -e $TOMCAT_LOCK ] ; then
				echo $"Tomcat appears to be running"
			else
				echo $"Tomcat is not running"
			fi
			;;
		restart)
			$0 stop
			sleep 2
			$0 start
			;;
		*)
			echo $"Usage: $0 {start|stop|status|restart}"
			exit 1
			;;
	esac

	exit 0 

Note that the script fixes the tomcat ownership for all webapps each time it starts. This is a very quick step if you don't compile in all the examples and it prevents truly obscure errors when you forget and edit something as root.

Note that the $TOMCAT_HOME/work directory gets deleted when we shut down. This is where Tomcat keeps its dynamic state information, compiled jsp pages and other detritus. I have found that getting rid of these files between runs prevents many peculiar and irritating behaviors such as persistently serving old versions of modified web pages and spraying on my furniture.

About the Tomcat startup script

This script implicitly depends on several environment variables. These were defined by /etc/profile.d scripts listed in the java and tomcat installation procedures:

	Defined in /etc/profile.d/

		JAVA_HOME	: Location of java installation
		JAVA_OPTS	: Options for starting java
	Defined in /etc/profile.d/

		TOMCAT_HOME	 : Location of Tomcat installation
		LD_ASSUME_KERNEL : Stability fix suggested by the release notes 
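Because the dependency is implicit, a missing variable tends to show up as a mysterious startup failure. Here is a minimal sketch of a sanity check you could run by hand before starting the service. The function name check_env is hypothetical and is not part of the startup script shown above.

```shell
# check_env NAME...: print "missing: NAME" for every named variable that
# is unset or empty. Returns 0 when all are set, 1 otherwise.
# (check_env is a hypothetical helper, not part of the tomcat script.)
check_env() {
	missing=0
	for name in "$@"; do
		# Indirect lookup of the variable called $name
		eval "value=\${$name:-}"
		if [ -z "$value" ]; then
			echo "missing: $name"
			missing=1
		fi
	done
	return $missing
}

# Typical use, after sourcing the /etc/profile.d scripts:
#   check_env JAVA_HOME JAVA_OPTS TOMCAT_HOME LD_ASSUME_KERNEL
```

The check only tells you that a variable has some value, not that the value points at a real installation.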

Test the Tomcat startup script

In the shell window, execute:

	service tomcat start 

You should be able to visit these urls in your browser:

	Tomcat	http://localhost:8080

	Cocoon	http://localhost:8080/cocoon 
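Tomcat takes its time coming up, so the first visit may fail if you are too quick. If you want a script to wait instead of you, a generic retry loop is one way to do it. This is only a sketch: retry_until is a name I made up, and the commented curl line assumes curl is installed and that Tomcat is still answering on port 8080 as shown above.

```shell
# retry_until TRIES DELAY COMMAND...: run COMMAND until it succeeds,
# at most TRIES times, sleeping DELAY seconds between attempts.
# Returns 0 on the first success, 1 if every attempt failed.
# (retry_until is a hypothetical helper name.)
retry_until() {
	tries=$1
	delay=$2
	shift 2
	while [ "$tries" -gt 0 ]; do
		if "$@"; then
			return 0
		fi
		tries=$((tries - 1))
		if [ "$tries" -gt 0 ]; then
			sleep "$delay"
		fi
	done
	return 1
}

# Example: poll the Tomcat welcome page for up to about a minute.
#   retry_until 30 2 curl -s -f -o /dev/null http://localhost:8080/
```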

To install the tomcat script for automatic activation at boot time:

	chkconfig --add tomcat 

Leave tomcat running and proceed to the next step.

Building mod_jk2 from source

We now build the jk2 connector, an Apache web server module. This module lets Apache redirect requests for designated documents to Tomcat. All other documents are handled in Apache as usual.

If you peruse the Apache web site, you will find four different connector projects. All of them do pretty much the same thing as far as we're concerned.

		mod_jserv	Obsolete, damned, blasted.
		mod_webapp	Deprecated, despised, shamed.
		mod_jk		Tolerated. Probably immoral.
		mod_jk2		Approved. Enabled by default. 

I used mod_webapp for a year with no problems at all. Next, to be fashionable, I tried several versions of mod_jk, which never worked quite as well. Early this year, I tried mod_jk2 and found that it wouldn't stay running overnight. After reverting to mod_jk for a while, I finally switched back to the newest version of mod_jk2. I feel better already.

Users of Deprecated Software

I'm using:


Unpack the archive to obtain:


We will call this directory $con for short:

	export con=/usr/local/src/jakarta-tomcat-connectors-jk2-2.0.2.src 

Some libraries in Redhat 9 need new symbolic links:

	cd /usr/lib
	ln -s 

You must have the httpd-devel rpm installed, so take care of that if necessary.

Go into the "native2" directory:

	cd $con/jk/native2 

Run the "pre-configure" script:


Run configure:

	./configure --with-apxs2=/usr/sbin/apxs 

Now do the make:


The make script asks you to run libtool:

	libtool --finish /usr/lib/httpd/modules 

The result is in:


Installing and configuring mod_jk2

Configuring Apache

Copy to the Apache modules directory:

	cp $con/jk/build/jk2/apache2/ $serverRoot/modules 

Create the file: $serverRoot/conf.d/mod_jk2.conf containing:

	LoadModule jk2_module modules/ 

Edit $serverRoot/conf/httpd.conf:
You must explicitly configure the port number for your host:


If you use any virtual hosts, they each need port numbers:

	NameVirtualHost *:80

	<VirtualHost *:80>
		DocumentRoot /var/www/html
	</VirtualHost>

	<VirtualHost *:80>
		DocumentRoot /var/www/html/hardinge
	</VirtualHost>

Configuring mod_jk2

I found many aspects of the documentation about mod_jk2 a bit frustrating. By now, no doubt, that unfortunate state has been rectified by the dedicated writers who contribute to the project.

The jk2 module uses the configuration file $serverRoot/conf/. The following example worked for me. I chose to keep the shared memory file and log file in the Redhat directory for the Apache logs. For other Linux distributions, there would be different appropriate locations.

WARNING: Folklore ahead

Create the file: $serverRoot/conf/ containing:


	info=Ajp13 channel forwarding over a tcp socket

	info=Shared memory for multiprocessing

	info=Status worker

	info=Display jk2 status page

	info=Display Cocoon welcome page 

The [shm:] section sets up a "shared memory" file used when running with multiple processes. We're not doing that here, but configuring the file prevents multiple error messages in the log file. You have to create this file by hand:

	dd if=/dev/zero of=/var/log/httpd/mod_jk2.shm bs=1048576 count=1 
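The dd command above writes one block of 1048576 zero bytes, giving a file of exactly 1 megabyte. As a rough sketch of how to confirm the result, the snippet below does the same thing to a scratch file and reports its size. The function names and the scratch path are mine, chosen so the example does not touch the real file in /var/log/httpd.

```shell
# make_zero_file PATH: create a 1 MiB file of zero bytes, as above.
# (Hypothetical helper; dd's transfer statistics are discarded.)
make_zero_file() {
	dd if=/dev/zero of="$1" bs=1048576 count=1 2>/dev/null
}

# file_size PATH: print the size of PATH in bytes.
file_size() {
	wc -c < "$1" | tr -d ' '
}

scratch=$(mktemp)
make_zero_file "$scratch"
echo "size: $(file_size "$scratch") bytes"
rm -f "$scratch"
```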

Some developers like to put this file in the $TOMCAT_HOME/work directory. This seems like a good idea, but I like to blast the work directory every time Tomcat restarts. Because Apache and mod_jk2 may still be running, it doesn't seem like a good idea to delete this file.

The [uri:/cocoon/*] section tells Apache to send all URLs that begin with "cocoon/" to Tomcat.

Restart Apache to load the new configuration:

	service httpd restart 

The [status:] section configures a url where you should see a status report in your browser window:


Configuring the Tomcat side

The $TOMCAT_HOME/conf/server.xml that comes with the distribution will work 'out of the box' but I use the following minimal configuration. It gets rid of all the examples and disables the Tomcat web server. It only allows Tomcat to service requests sent through the mod_jk2 connector. This is, IMHO, a security advantage.

If you want to preserve the original server.xml file, rename or move it somewhere else.

Create the file: $TOMCAT_HOME/conf/server.xml containing:

	<Server port="8005" shutdown="SHUTDOWN" debug="0">
	<Service name="Tomcat-Standalone">
		<Connector className="org.apache.coyote.tomcat4.CoyoteConnector"
			port="8009" minProcessors="5" maxProcessors="75"
			enableLookups="true" redirectPort="8443"
			acceptCount="10" debug="0" connectionTimeout="20000"
			protocolHandlerClassName="org.apache.jk.server.JkCoyoteHandler"/>
		<Engine name="Standalone" defaultHost="localhost" debug="0">
			<Logger className="org.apache.catalina.logger.SystemErrLogger"/>
			<Host name="localhost" debug="0" appBase="webapps"
				unpackWARs="true" autoDeploy="true">
				<Context path="" docBase="ROOT" debug="0" reloadable="true"/>
			</Host>
		</Engine>
	</Service>
	</Server>

Edit the file: $TOMCAT_HOME/conf/ so it contains only this line:


The location of the shared memory file must agree with the value set in workers2.properties.

Make sure the distribution files still belong to tomcat:

	chown -R tomcat:tomcat $TOMCAT_HOME 

Restart tomcat to load the new configuration:

	service tomcat restart 


In a web browser window try this URL:


If you see the Cocoon welcome page, all is well.

Configuring log rotation

Both Cocoon and Tomcat like to pile up numerous huge log files on your server.

Tomcat log files can be difficult to manage. By default, Tomcat creates and rotates several logfiles by itself, but never deletes the oldest ones. We have mitigated this problem in the server.xml file shown above. It configures the Engine container to use SystemErrLogger, which goes to catalina.out by default.

Since we aren't using Tomcat to normalize the axis of the Earth or bring back the Elder Gods, we don't need the Administrator and Manager applications. To eliminate them and their annoying log files, simply remove or rename these files:


With these changes, we end up with only one log file:


To manage this file, create: /etc/logrotate.d/tomcat.rotate containing:

	{	copytruncate
		rotate 5

Cocoon has a well-behaved logger that will rotate under the control of a configuration file:


After becoming weary of editing this large file every time I updated Cocoon, I decided to go with the default settings and let logrotate take care of the mess. Create the file: /etc/logrotate.d/cocoon.rotate containing:

	{	copytruncate
		rotate 5

Note the use of full path names in the logrotate scripts. I found that the shell script variables defined in /etc/profile.d for tomcat and cocoon are not available to the logrotate program.

At this point you might want to stop Tomcat, clean out the logfiles and restart to get the "one log to rule them all" configuration:

	service tomcat stop
	rm -f $TOMCAT_HOME/logs/*
	service tomcat start 

Making your website a Cocoon sub-site

By making your main website a Cocoon subsite, you can mix xml files served by Cocoon with all your other web documents served by Apache.

Stop tomcat:

	service tomcat stop 

We will make cocoon the default webapp by editing $TOMCAT_HOME/conf/server.xml. Inside the <Host> element, change the value of the docBase attribute to read:


In the top-level cocoon directory, create a symbolic link to your Apache web site:

	cd $TOMCAT_HOME/webapps/cocoon
	ln -s /var/www/html xml 

The name of this symbolic link must match a trigger url configured for mod_jk2.
Edit the file $serverRoot/conf/ and change the trigger [uri:/cocoon/*] so it reads:

	info=Access xml documents on the website

You must restart Apache for this change to take effect:

	service httpd restart 

Any xml file on your website will get sent to cocoon if it has the "xml/" prefix:


Note that there is no "xml" directory on your website. We'll get rid of the "xml/" prefix completely in the next section.
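To make the division of labor concrete, here is a toy model of the decision the connector makes. It assumes the trigger is written as [uri:/xml/*] and that there are no other triggers; handler_for is a made-up name and the real matching is done inside mod_jk2, not in a shell script.

```shell
# handler_for PATH: print which server would answer a request path,
# assuming a single trigger that matches everything under /xml/.
# (handler_for is a hypothetical helper, for illustration only.)
handler_for() {
	case "$1" in
		/xml/*) echo "tomcat" ;;
		*)      echo "apache" ;;
	esac
}

handler_for /xml/test.xml    # prints tomcat
handler_for /index.html      # prints apache
```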

Achieving complete transparency

At this point, you can integrate xml files with the other documents on your website. Normal html will be handled by Apache while xml files will go to Cocoon via Tomcat. The remaining annoyance is that pesky "/xml" path element in the URL: This gives away all your secrets!

The motive for hiding the "/xml" trigger is more than cosmetic: You would like to organize your website so that someday, when the majority of client browsers support xml, you will be able to make them do all the work. Toward this end, we will now hide the "/xml" path element using Apache's mod_rewrite feature.

Using mod_rewrite with Tomcat connectors has one or two pitfalls that have discouraged some developers. By following these guidelines, you will avoid all difficulties.

The first pitfall concerns the order of URL processing: the rewrite rules must be applied before the Tomcat connector sends the request to Cocoon. This can be ensured by loading mod_rewrite before we load mod_jk2. If you are using the Redhat 9 httpd package, the default /etc/httpd/conf/httpd.conf file loads mod_rewrite before it includes the module configuration files in /etc/httpd/conf.d, so all is well. If you are using your own Apache configuration file, you must ensure that mod_rewrite loads before mod_jk2.

The second pitfall concerns virtual hosts. The method for dealing with these is given in the configuration examples that follow.

We will be adding some directives to the end of your /etc/httpd/conf/httpd.conf file:

Rewrite directives without virtual hosts

	RewriteEngine on
	RewriteRule (.*)\.xml$ xml/$1.xml [P] 

Rewrite directives with virtual hosts

	NameVirtualHost *:80

	<VirtualHost *:80>
		DocumentRoot /var/www/html
		RewriteEngine on
		RewriteRule (.*)\.xml$ xml/$1.xml [P]
	</VirtualHost>

	<VirtualHost *:80>
		DocumentRoot /var/www/html/host1Root
		RewriteEngine on
		RewriteRule (.*)\.xml$ xml/$1.xml [P]
	</VirtualHost>

	<VirtualHost *:80>
		DocumentRoot /var/www/html/host2Root
		RewriteEngine on
		RewriteRule (.*)\.xml$ xml/$1.xml [P]
	</VirtualHost>

The only difference is the placement of the rewrite directives. When using virtual hosts, you must configure mod_rewrite in each virtual host that needs to handle xml files.

In either case, the RewriteRule that does the magic is the same. It simply prepends the "xml/" path element onto any URL that ends with ".xml".

The "[P]" flag on the end of the rule makes the browser display the original URL rather than the rewritten version.
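If regular expressions make your eyes glaze over, the substitution can be tried out on the command line. This sketch uses sed to apply the same pattern and replacement to a plain string. It only imitates the text transformation: the function name rewrite_path is hypothetical, and the exact string Apache hands to the pattern (for instance, whether it has a leading slash) depends on where the rule is placed.

```shell
# rewrite_path PATH: apply the pattern (.*)\.xml$ and the replacement
# xml/$1.xml to PATH, leaving paths that do not match untouched.
# (rewrite_path is a hypothetical helper; sed stands in for mod_rewrite.)
rewrite_path() {
	printf '%s\n' "$1" | sed 's#^\(.*\)\.xml$#xml/\1.xml#'
}

rewrite_path test.xml          # prints xml/test.xml
rewrite_path notes/test.xml    # prints xml/notes/test.xml
rewrite_path index.html        # prints index.html (no match)
```

Note that the match is case-sensitive, so a name ending in ".XML" passes through unchanged.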

This method is so successful, there is no way to see the original xml file in a web browser. In order to force client-side processing for testing, we add this rewrite rule:

	RewriteRule (.*)\.XML$ xml/$1.XML [P] 

You will also need this match pattern in your sitemap.xmap:

	<map:match pattern="**.XML">
		<map:generate src="{1}.xml"/>
		<map:serialize type="xml"/>
	</map:match>

With these changes, you can request an xml file by changing the URL so it ends with the capital letters ".XML". The file will be sent directly to your browser without server-side processing.
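The name juggling in that match pattern is easy to get backwards, so here is a small sketch of it. The "**" wildcard captures everything before ".XML" as {1}, and the generator then reads {1}.xml. The function name source_for is my own and the shell case statement merely mimics the Cocoon matcher.

```shell
# source_for PATH: for a request ending in ".XML", print the file the
# generator would read ({1}.xml); print nothing for other requests.
# (source_for is a hypothetical helper, for illustration only.)
source_for() {
	case "$1" in
		*.XML) printf '%s.xml\n' "${1%.XML}" ;;
	esac
}

source_for test.XML          # prints test.xml
source_for notes/test.XML    # prints notes/test.xml
```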

Members of the audience that are not tranced-out at this point may note that a simpler match pattern will work:

	<map:match pattern="**.XML">
		<map:read mime-type="text/xml" src="{1}.xml"/>
	</map:match>

This rule will send the raw xml file to the client browser, but it will not allow the browser to view the document source.

Configuring your sitemap.xmap

You don't need to edit the default Cocoon sitemap in any way. Instead, create a sub-sitemap in your website directory.

Your needs will vary, but here is a minimal sitemap that will process all XML documents through a single XSL stylesheet.

Create the file: /var/www/html/sitemap.xmap containing:

	<?xml version="1.0" encoding="UTF-8"?>
	<map:sitemap xmlns:map="">
	<map:pipelines>
	<map:pipeline>

		<map:match pattern="**.xml">
			<map:generate src="{1}.xml"/>
			<map:transform src="test.xsl"/>
			<map:serialize type="html"/>
		</map:match>

	</map:pipeline>
	</map:pipelines>
	</map:sitemap>

Testing the whole thing

If you've come this far, you're ready to test everything. Create an xml document: /var/www/html/test.xml containing:

	<?xml version="1.0"?>
	<?xml-stylesheet type="text/xsl" href="test.xsl"?>
	<page name="My XML Web Page">
		<p>Here we see the little man</p>
		<p>Behind the little curtain.</p>
		<p>If Tomcat doesn't drive you nuts,</p>
		<p>Cocoon will almost certain.</p>
	</page>

Note: The xml-stylesheet tag in the example above is not used by our server-side processing. We include this tag to illustrate how the same documents could be set up for either client-side or server-side processing.

Create an XSL stylesheet: /var/www/html/test.xsl containing:

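	<?xml version="1.0"?>
	<xsl:stylesheet version="1.0" xmlns:xsl="">

	<xsl:template match="page">
		<html>
			<head><title><xsl:value-of select="@name"/></title></head>
			<body>
				<h3><xsl:value-of select="@name"/></h3>
				<xsl:apply-templates/>
			</body>
		</html>
	</xsl:template>

	<xsl:template match="p">
		<i><xsl:value-of select="."/></i><br/>
	</xsl:template>

	</xsl:stylesheet>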

Stop and restart everything to make sure you have a clean slate:

	service tomcat stop
	service httpd stop
	service tomcat start
	service httpd start 
	It takes Tomcat about 20 seconds to get going...

Now fire up your browser and visit:


You should see the little man behind the curtain.

To force client-side processing, use this URL:


Everything should look the same.

Inspect the result of all your efforts



Credits and Apologies

William Beard, Dancing Bears.
Arthur Rackham, Pandora's Box.
Walt Disney Studios©, Cheshire Cat.
Twentieth Century Fox©, Alien Egg.
Gustave Dore, Scene from Dante's Inferno.
Joe Nolte, Aebleskivers.
Matt Pranger, Log Rolling.
Glen Baxter©, New York Art Critics.