XML – Lingua Franca or Lost Cause?

XML should have been the solution to so many problems: just structure documents semantically and you can rejigger them any way that you want. But there is more to it than that…

No one knows how many web pages there are on the Internet, but they certainly contain an unconscionable amount of information. The sad fact is that, despite the best efforts of search engines such as Google and Bing, the great majority of that information is amorphous, unstructured, hence tough to retrieve.

Try to find a simple fact such as the cost of a loaf of bread 60 years ago in England and you will be offered all manner of unrelated information, but surprisingly few actual figures.

Semantic mark-up

This is because HyperText Markup Language (HTML), the near-universal scheme for marking up web pages, is all about form rather than content.

Search engines can only return hits containing the words of your search term in collocation, rather than those based on their meaning. Wonderful though HTML has been, it has in this sense outlived its usefulness. If we are to be able to access information in any meaningful and efficient way, it needs to be marked up to structure that content.

Mark-up languages such as HTML, Rich Text Format (RTF), TeX (still popular in academia), even Portable Document Format (PDF) and PostScript (PS), are great at formatting and appearance, but next to useless when you need to structure information.

The first mark-up language intended to impart semantic structure to document content was Standard Generalized Markup Language (SGML), way back in the 1960s. Although still supported by specialist products such as Framemaker, it proved too generic for widespread adoption, and Extensible Markup Language (XML) was spun out as a subset in 1998.

Since then XML has become pervasively common, used for a great variety of different types of information, from the property list (.plist) files that fill your Preferences folders, to the ‘new’ document formats used by office software suites including Apple’s iWork.

How it works

Superficially, XML source code looks similar to HTML, in that it is formed from text – Unicode to represent the character sets of all languages – with embedded tags such as <genus>Absconditella</genus>.

Unlike HTML, the range of tags is limitless, being defined by a cited Document Type Definition (DTD) or Schema (in .xsd files). That particular tag might be used in biological content, to set the genus under consideration.

Mac property list files, being unconcerned with biological nomenclature, consist of dictionaries containing named keys such as <key>AMRemotePort</key> and associated data such as an integer value <integer>30341</integer>.

These domain-specific tagsets make XML so potent at structuring content. A linguist might perform detailed grammatical analysis by marking up parts of speech such as <conjunct>but</conjunct>, in a quest to understand the use of conjunctions.

A table of historical food prices might contain the price of a loaf in a given year as
<item>
<line>loaf of bread</line>
<year>1952</year>
<cost_lsd>6d</cost_lsd>
<cost_p>0.025</cost_p>
</item>

making it easy to retrieve the cost, either in pre-decimal pounds, shillings, and pence (tagged cost_lsd) or modern decimal pounds (cost_p).

The fundamental limitation to content retrieval in XML is set by the level at which content is marked up, thus the DTD and Schemas employed. A text on the recent social history of the UK will probably opt for a generic book format, such as the popular DocBook standard, perhaps containing the fragment
<para>... but the cost of a loaf in 1952 was only 6d (£0.025)...</para>
which does not help the linguist’s interest in conjunctions, nor does it make it any easier to locate the price of a loaf. Indeed it is disturbingly similar to HTML’s <p>...</p> equivalent, and is only weakly semantic.

Transforms

Being concerned only with structuring content, and not the format in which it might be displayed, XML is not a complete solution to anything. Copy a property list file from a Preferences folder, change its extension to .xml, and open it in Safari, with its excellent XML support. It will be displayed as raw source, tags and all, as the browser knows no better way to style a property list for the screen.

If you want an XML document to be output in a more accessible way, you need to transform it to HTML, PDF, or another format that includes styling. For a quick solution in Safari, you may get away with a CSS stylesheet, but in most cases it is preferable to use an XSL stylesheet in an XML development environment like oXygen.

On the right, native XML, transformed to HTML and rendered on the left, in Safari.
On the right, native XML, transformed to HTML and rendered on the left, in Safari.

Editing XML

So the XML model of content publishing is necessarily complex. Text content is developed using any of the wide range of applications that work in, or can export, XML source. But the editor needs to be fully aware of the mark-up tags being used, as defined in the DTD and Schema files.

Experienced HTML authors may be happy writing raw code because they are working with a fixed tag library; in XML the tags are domain-specific and can be both extensive and complex. Coding up raw XML is a highly specialist pursuit.

Oxygen now sports a WYSIWYG editor for XML documents.
Oxygen now sports a WYSIWYG editor for XML documents.

Generic text editors such as BBEdit are therefore of only limited use when creating XML content. Specialised editors that are specifically designed for custom tagsets have never proved particularly successful either, with the main OS X contender, Syntext Serna, lingering in limbo at present, and only oXygen succeeding because of its cross-platform power and versatility.

You are more likely to prefer creating content using an application that already saves files in XML, such as Microsoft Word 2011 or Pages, or one that has XML support, such as Adobe InDesign.

Developing XML

Finely granular content is then ideally suited to storage in a database; whilst FileMaker Pro and SQL-family servers such as Postgres all cope with XML data, they are unsuited to the hierarchical structure most widely found in complex XML content.

BaseX is a powerful XML-native database which works excellently with its hierarchical structure.
BaseX is a powerful XML-native database which works excellently with its hierarchical structure.

XML-native databases such as BaseX, eXist, sedna, and the commercial heavyweight MarkLogic 8 are available and mature, but unusual and unlikely to be available on web hosting services.

XML is thus most commonly found in bespoke publishing systems, where mark-up schemas have been developed for the purpose, and authoring and output have become specially adapted.

Creating multiple formats from the same content is then very efficient: once suitable output stylesheets have been developed, there is minimal effort required to produce print, web, and electronic book versions, for example. With judicious design, tables of contents, indexes, and electronic links are adjusted automatically for each version, resulting in large savings in time and cost.

For the great majority of the web, and documents in general, XML has not proved the solution that many had envisaged. Its quest to be all things to all people has placed great emphasis on customisation of DTDs, Schemas, XSL stylesheets, authoring tools, and content storage.

The level of skill and knowledge required to put all these together and make them work is beyond that expected of regular users. Accordingly XML thrives internally in applications, in OS X’s property lists, and in larger businesses.

It remains to be seen whether someone might come along with an innovative but friendly front end that will start to transform the way that we handle everyday content such as websites. After 17 years, though, the prospects do not seem bright.

Technique: DTD or Schema?

Document Type Definitions (DTDs) were inherited from SGML, and were initially intended to be the main way of setting up domain-specific and custom document structures. Crafted in rather atypical XML, they set out the elements that can be contained in each given document type, and for each element its attributes.

A biological genus might there appear in the element definition
<!ELEMENT genus (#PCDATA)>
where PCDATA is ‘parsed character data’, or a text field. Unfortunately DTDs are founded on this data type, rather than distinguishing between text, numbers, and the like, and are applied within a single, common namespace.

Although DTDs remain popular if you are concerned only with publishing larger and coarsely granular documents, since 2009 such definitions have tended to be put into XML Schema files, bearing the extension .xsd, instead. This is because supported data types are richer and stronger, and such Schemas are more readily put to use in Java and other environments.

XML Schema files are standard XML documents, in which the equivalent definition of a ‘genus’ element could read
<xs:element ref="genus" type=“xs:string” minOccurs="0"/>
when nested inside a complex element container.

Schemas can use their own namespace, which makes them better suited to more complex situations that require multiple definition files. They also have a much richer range of data types, based on 19 primitive types, making them better suited to databases, for instance.

RELAX NG, standing for the longwinded ‘REgular LAnguage for XML Next Generation’, is an alternative type of schema that can be expressed in standard XML form, where it resembles an XML Schema. However it offers a very different ‘simple outline XML’ format (SOX) that some may prefer when working with what might otherwise be very deeply nested elements. The genus element might then look like
element genus { text }

Technique: XSL Stylesheets

If there is any point in putting your content into XML, you need to deliver it as styled documents, such as HTML web pages and PDF files for print and local reading.

Simpler jobs, such as turning a basic XML article into a vanilla web page, can be accomplished quickly using CSS stylesheets, but in most real-world circumstances you will need a more sophisticated system of Extensible Stylesheet Language (XSL) stylesheets. These come in two main forms: XSL Transformations (XSLT), and XSL Formatting Objects (XSL-FO).

oXygen supports a wide range of transform and processing options.
oXygen supports a wide range of transform and processing options.

An XSLT stylesheet is an XML document that acts as a template of rules to determine how XML content will be marked up in an output file, such as HTML to be rendered by a web browser. By convention, genus and species names are set in italic, so the instructions for formatting the genus element into HTML might simply read
<i><xsl:apply-templates/></i>
which would generate the HTML
<i>Absconditella</i>

A separate stylesheet is thus required for each output format, to generate the appropriate mark-up tags and the like. Conversion is accomplished offline, although some browsers, including Safari and Firefox, contain their own XSLT rendering engine that permits client-side conversion.

XSL-FO takes this a step further, in supporting sophisticated page output as might be required for pre-press PDF. It too starts with the XML source and XSLT transform files, from which an XSL-FO document is produced and passed to the FO processor to create usable output.

This additional step allows individual pages to be rendered, XSL-FO being strongly page orientated. You can use multiple columns, footnotes, tables, bookmarks, and output can contain tables of contents, indexes, and other composed content. It has only crude typographic controls, though, being intended for technical documentation rather than design style.

Updated from the original, which was first published in MacUser volume 28 issue 15, 2012.