Words on Macs: 3 From text to documents

Having acquired our text content, the next task is to structure and format it so that it can be used and accessed.

Why structure text?

Building structure into your documents, particularly anything longer than a few pages, is not only important for the reader. Large plain text files can be searched, but even the smartest of indexing and search engines can only return all the hits that they get. If you have entitled a chapter “Ceramics”, then you would expect a search for the term ‘ceramics’ to return that chapter high in its list of hits. However, it can only do so if it knows that the word was a chapter title.
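As a minimal sketch of why that matters, the following Python ranks chapters for a search term, giving extra weight to hits in a chapter title. The chapter records, the weight of 10 and the function names are all assumptions made for illustration, not how any particular search engine works.

```python
import re

# Hypothetical boost applied when the term appears in a chapter title.
TITLE_WEIGHT = 10


def words(text):
    """Split text into lower-case words, ignoring punctuation."""
    return re.findall(r"\w+", text.lower())


def score(chapter, term):
    """Count hits for the term, weighting title hits more heavily than body hits."""
    term = term.lower()
    title_hits = words(chapter["title"]).count(term)
    body_hits = words(chapter["body"]).count(term)
    return TITLE_WEIGHT * title_hits + body_hits


def search(chapters, term):
    """Return titles of chapters containing the term, best match first."""
    hits = [c for c in chapters if score(c, term) > 0]
    hits.sort(key=lambda c: score(c, term), reverse=True)
    return [c["title"] for c in hits]


# Illustrative content only.
book = [
    {"title": "Kilns", "body": "Kilns fire ceramics. Some ceramics need two "
                               "firings, and glazed ceramics more."},
    {"title": "Ceramics", "body": "An overview of fired clay."},
    {"title": "Glass", "body": "Blown and cast glass."},
]

print(search(book, "ceramics"))
```

Without the title field, the chapter on kilns would come first simply because it mentions the word more often.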

Structure is essential for more sophisticated operations such as the construction of tables of contents, footnotes and their numbering, cross references, and more. In books which use hierarchical chapter and section numbering, even chapter-based page numbering, you must structure your content properly, or face a long and tedious task of numbering these manually.

Making substantial changes to a document also becomes much easier when it is structured properly, and maintained in an editing environment which handles that structure. If you were writing a textbook and decided that you had to move the current chapter 8 forwards to follow on from chapter 3, it is far easier if you can let the app handle all the changes required to implement that.
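The idea behind both automatic numbering and painless reorganisation can be sketched in a few lines of Python. Numbers are never stored, only derived from each item's position, so moving a chapter renumbers everything after it. The Chapter class, the function names and the sample titles are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Chapter:
    title: str
    sections: List[str] = field(default_factory=list)


def numbered_outline(chapters):
    """Derive hierarchical numbers from position; none are typed in by hand."""
    lines = []
    for c_num, chapter in enumerate(chapters, start=1):
        lines.append(f"{c_num} {chapter.title}")
        for s_num, section in enumerate(chapter.sections, start=1):
            lines.append(f"{c_num}.{s_num} {section}")
    return lines


def move_chapter(chapters, from_number, after_number):
    """Return a new list with chapter `from_number` placed after chapter `after_number`."""
    reordered = list(chapters)
    chapter = reordered.pop(from_number - 1)
    # Removing a later chapter leaves the target's index unchanged;
    # removing an earlier one shifts the target down by one place.
    insert_at = after_number if from_number > after_number else after_number - 1
    reordered.insert(insert_at, chapter)
    return reordered


# Illustrative content only.
book = [
    Chapter("Origins", ["Clay", "Kilns"]),
    Chapter("Glazes"),
    Chapter("Ceramics", ["Earthenware"]),
]

for line in numbered_outline(move_chapter(book, 3, 1)):
    print(line)
```

Here the third chapter is moved to follow the first, and its section becomes 2.1 without anyone editing a number.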

Even relatively minor operations, such as changing the font or size of text used to style chapter or section titles, are made much simpler if your editing app knows which headings are chapter titles, and which are section headings. Again, making such changes manually is tedious and prone to error. This is a common fault in untrained users of Microsoft Word who do not set up and apply styles to different types of text content, but format each heading, etc., individually.

There are more specific circumstances in which text has to be structured very carefully. This applies to most technical documents, such as methods which are required for compliance with standards, anything scientific or engineering, many textbooks, catalogues, documents assembled from database content, and anything which needs to be stored in structured form (e.g. in a database).

Which format should I use?

No format is perfect for every purpose, and no document editor is perfect either. Each format has its strengths and uses, and its disadvantages. There is also a very strong link between the editor and format.

The most important considerations include:

  • required output format(s) ((X)HTML for the web, PDF for publishing, ePub for electronic books),
  • need to structure content (for retrieval, databases),
  • control over layout (pagination, typography),
  • non-text content (figures, tables, audio, movies, links),
  • extensions (indexing, tables of contents, footnotes, references, cross-references),
  • collaboration and mobility (platforms, annotation and editorial tools),
  • linked content, e.g. from a database,
  • industry or corporate policy (see below),
  • personal preference.

In many situations, a job or organisation will require that all text-containing documents are edited in a common format, often using the same app, normally Microsoft Word. Although Apple used to maintain all its technical documentation in Word, it is by no means the best solution for many documents. Requiring anyone to use a single tool for all types of text documents is as senseless as limiting every tradesman (electricians, carpenters, plumbers, etc.) to using one common tool such as a hammer.

What is important is ensuring that those who need to read the document source can work with a common format. That is not difficult, even if they use different platforms and apps.

XML

In theory, XML should be the only format that any of us uses, as it is so versatile and capable.

Extensible Markup Language (XML, Wikipedia) was derived from an even more generic and powerful system, Standard Generalised Markup Language (SGML), which had proved itself a little too general for common use by normal mortals. SGML is still supported by Adobe’s venerable FrameMaker technical publishing platform, which is no longer available for OS X.

In XML, document types are defined by DTD files, which can do almost anything. For example, here is a fragment which uses a special grammar DTD:

<Subject><SubjectS>A baby</SubjectS><ClauseEmbed>
<Subject><SubjectS>who</SubjectS></Subject>
<Finite>
<Negative>won’t</Negative>
</Finite>
<Predicator>stop crying</Predicator>
</ClauseEmbed>
</Subject>

Contents set in XML can be as structured as you wish, and generally the reason for using XML is to preserve that structure. However the output format is quite separate from structure. Although XML source can be used to generate almost any output format, you need to apply a transform which defines the rules which format document content. For example, you might want section titles in XML to appear using a set heading style in HTML, in which case that must be written into the rules for generating HTML from your XML source.
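Such a transform can be sketched as follows. In practice these rules are usually written in XSLT; Python's standard library is used here only to keep the example self-contained. The source vocabulary (chapter, title, section, para) and the function names are assumptions made for illustration, not part of any standard document type.

```python
import xml.etree.ElementTree as ET

# Hypothetical source vocabulary: <chapter>, <title>, <section>, <para>.
SOURCE = """<chapter>
  <title>Ceramics</title>
  <section>
    <title>Earthenware</title>
    <para>Fired at relatively low temperatures.</para>
  </section>
</chapter>"""


def to_html(element, depth=1):
    """Apply the formatting rules to one structural element, returning HTML elements."""
    out = []
    for child in element:
        if child.tag == "title":
            # Rule: a title becomes a heading whose level reflects nesting depth.
            heading = ET.Element(f"h{min(depth, 6)}")
            heading.text = child.text
            out.append(heading)
        elif child.tag == "para":
            # Rule: a paragraph of content becomes an HTML paragraph.
            para = ET.Element("p")
            para.text = child.text
            out.append(para)
        elif child.tag == "section":
            # Rule: sections add no markup themselves, only one level of depth.
            out.extend(to_html(child, depth + 1))
    return out


def transform(xml_text):
    """Parse structured XML and serialise the HTML generated by the rules."""
    root = ET.fromstring(xml_text)
    body = ET.Element("body")
    body.extend(to_html(root))
    return ET.tostring(body, encoding="unicode")


print(transform(SOURCE))
```

The source says nothing about how a title should look; that decision lives entirely in the rules, which is why the same source can feed several different outputs.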

There are many major document systems which are supported by XML, including Docbook, DITA, and TEI. These come with stylesheets to support standard transformations to HTML, PDF, and other formats, and can be customised to work exactly how you wish. There are also dedicated XML databases like BaseX which can store content in XML’s tree structure.

You can still edit XML in the traditional way when you need, in Oxygen.

However, XML is not friendly, can be ponderous at times, and has lacked good visual editing tools. This has changed recently, as the best OS X environment for working with XML, Oxygen, now has good support for a form of WYSIWYG editing. Despite that, XML tends to be preferred for long, complex, technical documents, and where a single common source is to generate several output formats, for example web, print, and electronic books. In those circumstances it should be your first port of call.

Oxygen now sports a WYSIWYG editor for XML documents.

TeX

Pronounced tech with the ch as in a softened version of Scottish loch, TeX (Wikipedia) was one of the first typesetting languages intended to be used to convert input text into printed papers and books. As it was also one of the first major open source projects, almost all of its tools are mature and completely free. The major distribution for OS X is MacTeX, from the TeX Users Group, and is centred on the TeXShop app.

TeX uses a system of document classes which act as stylesheets, and the great majority now work with an overarching system called LaTeX, originally intended to make use easier than raw TeX, but without removing its underlying power and flexibility. There is a huge range of different stylesheets available to support specific document structures and output destinations, particularly academic journals, dissertations, and books.

The following excerpt uses the stylesheets originally developed for Edward Tufte’s books on data visualisation:


\begin{figure}
\includegraphics{Fig3-51.jpg}
\caption[April Gornik, \textit{Cloud Lake} (composite) (2000)][6pt]{\textbf{April Gornik, \textit{Cloud Lake} (composite) (2000), oil on canvas, 193 x 241 cm. (Private collection).} This composite view confirms that the painter has added to the reflected image a bank of cloud apparently in front of the trees.}
\label{fig:fig351}
\end{figure}

Apart from these very rare examples of the manipulation of a reflected image, Turner's (Figures~\ref{fig:fig315} and~\ref{fig:fig317}) and possibly Sisley's (Figure~\ref{fig:fig326}), outside the works of \cez\ and Peter Doig I have been unable to find other instances of a well-known painter intentionally altering reflections so as to conflict with optical principles. Doig's paintings are considered in detail in Chapter~\ref{ch:doig}.

TeX has survived because of its strong and persistent use in academic publishing, its ability to cope with huge and very elaborate content with hundreds of illustrations, complex tables and mathematical notation, and its preservation of structure within content: stylesheets typically contain provision for all the bells and whistles expected in technical publication, including tables of contents, indexes, references, footnotes and sidenotes. It also produces typographically very high quality output.

TeXShop, like all good TeX environments, requires you to edit in text source code, then to typeset when ready.

However TeX uses its own programming language, and although there have been real time editing environments (including the once-wonderful Textures for Mac OS Classic), it is still used by editing the source ‘code’ and then running that through the formatting engine to view output, typically in PDF. Modern TeX environments make this very quick, but it is not quite interactive, and errors in the source can be very frustrating, particularly for new users. It is not fault-tolerant, and good graphical design tools for tables are lacking.

TeXShop then generates an absolutely gorgeous PDF file, ready to view or print.

TeX remains a major format which should be a serious consideration if you want print-quality PDF output and need its power. But you must be prepared to learn its language, at least sufficient to use prepared document classes, or it would prove hugely frustrating and counter-productive.

HTML

This is how the excerpt looked when rendered from HTML in Safari.

There is of course no single HyperText Markup Language (HTML, Wikipedia), it having fragmented into different versions, with or without XML (XHTML), stylesheets (CSS), and more. Despite that, it is, after plain text, the most common format for text-based documents, even if most reside on servers which you access using a browser.

Here is some example vanilla XHTML:


<p>Transcribed from the 1901 Cassell and Company edition by David Price,
email ccx074@coventry.ac.uk.  Proofing by David, Dawn Smith, Uzma,
Jane Foster, Juliana Rew, Marie Rhoden and Jo Osment.</p>
<h1>SEVEN DISCOURSES ON ART<br />
by Joshua Reynolds</h1>
<h2>INTRODUCTION</h2>
<p>It is a happy memory that associates the foundation of our Royal
Academy with the delivery of these inaugural discourses by Sir Joshua
Reynolds, on the opening of the schools, and at the first annual meetings
for the distribution of its prizes.</p>

‘Plain’ HTML is extensively supported by editors and viewers, but is surprisingly little-used except for the production of web pages. This is puzzling, because when used in conjunction with appropriate stylesheets, it can retain a reasonable amount of structure within content. However for most it may represent just the wrong level of compromise: significant effort in marking up, but insufficient reward in comparison with TeX or a proper layout app, and poor control over design and typography. It is therefore most usually an output format generated from another more specialised master.
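To illustrate the structure that heading markup retains, here is a minimal sketch which uses Python's standard html.parser module to pull headings out of HTML and build an indented outline. The class and function names are hypothetical, and the sample is only a cut-down pair of headings, not a full document.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1 to h6 elements."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._text = []
        elif tag == "br" and self._level is not None:
            # Treat a line break inside a heading as a space.
            self._text.append(" ")

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            # Collapse runs of whitespace, including line breaks in the source.
            text = " ".join("".join(self._text).split())
            self.headings.append((self._level, text))
            self._level = None


def table_of_contents(html_text):
    """Return one outline line per heading, indented by heading level."""
    collector = HeadingCollector()
    collector.feed(html_text)
    collector.close()
    return ["  " * (level - 1) + text for level, text in collector.headings]


SAMPLE = """<h1>SEVEN DISCOURSES ON ART<br />
by Joshua Reynolds</h1>
<h2>INTRODUCTION</h2>
<p>It is a happy memory.</p>"""

for line in table_of_contents(SAMPLE):
    print(line)
```

This works only because the headings were marked up as headings; text merely set in a large bold font would be invisible to it.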

MarsEdit's source code editor gives access to most vanilla HTML tools.
MarsEdit also provides a browser preview, making it excellent for blogs.

I write this blog using MarsEdit, an excellent HTML editor particularly for blogs.

Markdown

iAWriter, which runs beautifully on iOS too, uses Markdown.

Markdown (Wikipedia) has become popular relatively recently, and is seen as a quick and dirty way to put a little structure and formatting into documents which remain essentially text only. As a convenient intermediate between plain text and the likes of HTML and RTF, it provides an easy way to the latter two, without burdening the user with complex markup. There is no standard for it yet, and different implementations offer non-standard extensions.
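To show how light that markup is, here is a rough sketch which converts a tiny subset of Markdown to HTML: headings marked with #, emphasis with single asterisks, strong emphasis with double asterisks, and paragraphs separated by blank lines. It is an illustration under those assumptions only, with hypothetical function names, and does not reproduce the behaviour of any real Markdown implementation.

```python
import html
import re


def inline(text):
    """Escape HTML special characters, then convert **strong** and *emphasis*."""
    text = html.escape(text, quote=False)
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)
    return text


def markdown_to_html(text):
    """Convert blank-line separated blocks into headings or paragraphs."""
    blocks = []
    for block in re.split(r"\n\s*\n", text.strip()):
        # Join the lines of a block, as line breaks within it are not significant here.
        content = " ".join(line.strip() for line in block.splitlines())
        heading = re.match(r"(#{1,6})\s+(.*)", content)
        if heading:
            level = len(heading.group(1))
            blocks.append(f"<h{level}>{inline(heading.group(2))}</h{level}>")
        else:
            blocks.append(f"<p>{inline(content)}</p>")
    return "\n".join(blocks)


SAMPLE = """# Ceramics

Glazes are *not* optional,
and **never** were."""

print(markdown_to_html(SAMPLE))
```

The source remains perfectly readable as plain text, which is much of Markdown's appeal.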

Ulysses is another minimalist editor with Markdown support, available for iOS too.

Although ideal for quick notes and putting light formatting into plain text, it lacks features to support tables, embedded images other than through links, and so on. Editors which specialise in supporting Markdown include iAWriter and Ulysses, which also run on iOS and integrate with iCloud, so that you can work on the same documents wherever you are.

RTF

Rich Text Format (RTF, Wikipedia) was originally intended for exchange with Microsoft applications, particularly Word, before the latter’s binary file format was released. Many editors support import, export, and storage in RTF, and after plain text it remains the most popular format for movement between editors and platforms.

RTF was not designed for users to edit directly, as this sample demonstrates:


Tools: Text Encoding Converter, Encoding Master or Peep to convert between codepages. TextSoap to clean up ligatures and other mess.\par
{\*\shppict {\pict {\*\nisusfilename textencodconv.png}\picw1202
\pich815 \picwgoal24040 \pichgoal16300 {\*\picprop {\sp {\sn fillOpacity}{\sv 65536}}}{\sp {\sn fShadow}{\sv 0}}{\*\nisuspicprops {\sp {\sn nisusDontClipToLine}{\sv 1}}{\sp {\sn dxWrapDistLeft}{\sv 0}}{\sp {\sn dyWrapDistTop}{\sv 0}}{\sp {\sn dxWrapDistRight}{\sv 0}}{\sp {\sn dyWrapDistBottom}{\sv 0}}}\pngblip
[here is a large image in textified code]
}}\par
{\qc \fs22 Use TextSoap to clean up remaining mess in text files, with its built-in and custom scripts.\fs24 \par
}\par
{\f2\fs26\b How does Unicode encode text?}\par

Bean is free, and a very good editor based on RTF.

RTF does not contain good structured content, but focusses on formatting for display and printing. It is invariably edited using a graphical front end, and never as source code, and is therefore most useful for file exchange and output. Bean is a fine and free RTF-based editor for OS X, and of course it is supported by bundled TextEdit. Nisus Writer Pro is very sophisticated, with book tools, and has excellent support for mixing languages and scripts.

Nisus Writer Pro has excellent book and other power tools, and is ideal for multi-language documents.

Proprietary

The major layout apps, Adobe InDesign and QuarkXPress, originated as tools for the design and layout of shorter documents, offering strong support for typographic controls, working with images, and generating output which can go straight for print production. Over the years they have gained support for some structuring of content, XML formats, database links, and a range of output including electronic publications.

Adobe InDesign does now work with XML structure, but remains focussed on layout rather than content.

Unlike Adobe FrameMaker (Windows), the emphasis in these apps is still on appearance and design. For example if you try exporting a book developed in InDesign to its XML-based exchange format, you will not be able to do much with that content as XML. However when it comes to supporting output for print, or for electronic distribution, these are complete and thoroughly professional publishing platforms.

Microsoft Word 2015 preview bristles with power tools, if you know how to get the best out of them.

Microsoft Word has long been the standard word processor for commerce, government, and most larger organisations. Although it has a larger toolset than any other word processor, getting the best out of those tools is not always easy, and you may well find that you get on better with a different app instead. Its new XML-based document format, like that of the other office suites, is not amenable to use as XML-structured content, for example in an XML database, or for standard XML transforms.

Apple's Pages is an eminently usable layout app aimed at the non-professional.

Apple’s Pages is very pleasant and more natural in use, and capable of producing very good-looking documents of considerable length and complexity, although it is not intended for more technical work. Nisus and other apps have their purposes and advantages too, and use common document formats such as RTF and RTFD.

Overall, how attractive an app is matters less than whether it locks you in to a product which then becomes dangerously indispensable. When you do need to do something for which that product is weak, you will find it very hard to move to a different one. There is thus a lot to be said for the permanence of the major text document formats, and remaining as agnostic over platform (hardware or software) as possible.

PDF

Adobe Portable Document Format (PDF, Wikipedia) was intended to provide a common format in which content, layout, and typography would be fixed, and essentially immutable, for instance to support electronic output for publication. Although you can edit PDF documents, and are intended to do so with PDF forms, of course, there are many constraints which make it a format for output rather than document editing and development.

Early implementations of PDF stored text content in embedded text passages, making it feasible to hack raw PDF code (a subset of PostScript). Unfortunately the great majority of apps now write PDF content in binary format, which has stopped that rather perverse pursuit.

PDF does not necessarily contain Unicode text either, and sometimes old PDF documents containing non-Roman scripts can be a challenge to open using modern systems. More recently, tools and apps for converting between PDF and editable document formats have become plentiful, though these can be infuriating when working with multi-column text and unconventional page layouts. As PDF places little or no emphasis on the structure of content, but is most concerned with its layout and appearance, extracting text and other formats may not be a wholly satisfying or efficient exercise. It is always better to retain original document formats, rather than relying on a PDF yielding its source easily.

Plain Text

Plain text is inherently unstructured, and has no markup of any kind which could introduce structure or format. It is the most basic level of text document, and the lowest common denominator when all else fails. Of course if you are writing a novel or script which will only be laid out when being prepared for publication, you may prefer it for its sheer simplicity. For anything else it is inadequate.

BBEdit has superb features for working with plain text, and all forms of marked up text.

The most powerful text editor for OS X remains BBEdit, although there are several others which can be customised almost as extensively.