Text file formats

Many of our documents contain text in some form. The type or format of that text makes a big difference to what we can do with that text, and how it is displayed. This article explains the main types of text document.

Plain text

A document containing plain text contains only Unicode characters, and is edited as plain text. In this strict sense, all the characters in the file are content, and none are used as metadata to structure the content in any way, for instance as mark-up. Thus every character in the contents of the file is rendered literally, according to its Unicode meaning, and none are treated any differently, or concealed from the user.

This is complicated by the fact that some Unicode code points change the way in which other characters are displayed. These include combining characters, most commonly diacritical marks that are combined with another character to produce a compound character. Compound characters can also be the result of typographic manipulation. Thankfully these are unusual in most languages.
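To make combining characters more concrete, here is a minimal sketch in Python using its standard unicodedata module. The variable names are my own, and the example only covers one accented letter, but it shows how the same visible character can be stored as one code point or as two.

```python
import unicodedata

# "é" written two ways: a single precomposed code point,
# or a plain "e" followed by a combining acute accent.
precomposed = "\u00e9"
decomposed = "e\u0301"

# They look the same when rendered, but are different sequences of code points.
print(precomposed == decomposed)          # False
print(len(precomposed), len(decomposed))  # 1 2

# combining() returns a non-zero class for combining characters.
print(unicodedata.combining("\u0301"))    # 230

# Normalising to NFC composes the pair; NFD decomposes it again.
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True
```

This is one reason why two plain text files that look identical can still differ when compared byte for byte.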

Plain text files without any mark-up are used extensively. Although they commonly have the distinctive extension of txt or text, custom extensions are also used for special purposes, such as source code in different programming languages.

As plain text can’t contain any type of styling, but often benefits from techniques such as colouring content according to its syntactic role, it’s best edited using a dedicated editor. Examples include BBEdit and CotEditor.

Saving any form of styled or rich text as a plain text file inevitably strips all attributes and metadata; if you want to preserve those, the document must either be converted to a marked-up form, or saved as rich text. Syntactic colouring is also a feature of the text editor being used, and not preserved in plain text. Open the same file with a different text editor, and such colouring could be different or absent altogether. Furthermore, few (if any) text editors that support syntactic colouring can save those styled files as rich text, or another format which does preserve the colouring.

Syntactic colouring of Swift source code in Xcode.

If you save the styled (rich) text from the window behind in plain text format, it will lose all its colours.

Marked-up plain text

In addition to containing the Unicode characters of the document contents, these documents contain metadata in the form of mark-up commands, still composed of Unicode text, used to structure the contents for rendering, display, or content analysis. The mark-up language imposes restrictions on the use of certain sequences of characters to distinguish the content from its mark-up, and other conventions, and those may cause errors when used to display plain text which isn’t marked up according to those conventions.
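HTML is a convenient illustration of those restrictions, as it reserves characters such as < and & for its mark-up. The sketch below, which assumes Python and its standard html module, shows plain text being escaped so that it can be displayed safely inside an HTML document. The sample string is invented for the purpose.

```python
import html

# Plain text that happens to contain characters HTML reserves for mark-up.
plain = 'if a < b && b > c: print("ok")'

# html.escape() replaces &, < and > (and, by default, quotes)
# with character references that a browser renders literally.
escaped = html.escape(plain)
print(escaped)
# if a &lt; b &amp;&amp; b &gt; c: print(&quot;ok&quot;)

# html.unescape() recovers the original plain text.
print(html.unescape(escaped) == plain)  # True
```

Without that escaping step, a browser would try to interpret parts of the text as tags or entities, and the result could be displayed incorrectly.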

Popular mark-up languages include LaTeX, HTML, XML, and Markdown and its variants. To distinguish between these, files normally bear a language-specific extension, such as tex, html, xml or plist, and md or markdown.

Markdown is far simpler than regular mark-up languages, and easy to render in real time, as shown here on the right.

While some WYSIWYG editors are available for specific types of mark-up, these formats are also commonly edited as source, complete with their mark-up. This distinguishes them from rich text and complex XML-based formats such as OpenXML and OpenDocument formats, which are almost exclusively edited in rendered form.

Rich text

Although rich text files appear to be simply marked-up plain text, they are not intended to be edited as plain text, but in their rendered form using a dedicated editor. macOS includes extensive support for the internal representation of rich text as Attributed Strings, Unicode text with associated attributes. Those attributes include visual styles with font selection and styling for display, accessibility features, and hyperlinks.

Rich text also supports the embedding of non-text content including images. There are two different approaches to accomplishing this: the whole content can be presented in a single RTF file (rtf), or the text content can be kept separately from images in an RTF Directory (rtfd) bundle, the latter being preferred by many Mac apps.

If you look inside an RTF file containing embedded images, you’ll see each image as a huge block of hexadecimal, as a \pict item, and hyperlinks within the document are declared in place as fields. An RTFD package normally contains a single RTF source file named TXT.rtf, with each of the images stored as a separate file. Within the RTF file those images are called in using code such as \NeXTGraphic followed by the file name. RTFD packages work well on Macs, where they prevent the RTF source from becoming too large with many embedded images. However, they don’t move so well across platforms, where a package isn’t recognised for what it is, and apps working with rich text may be unaware of how to handle RTFD.
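To give an idea of why embedded images inflate RTF source, here is a rough sketch in Python of how image bytes end up as hexadecimal inside a \pict group. The helper rtf_pict is hypothetical, and it assumes a PNG image marked with the \pngblip control word. A real RTF writer would normally add further details such as the picture dimensions.

```python
def rtf_pict(image_bytes: bytes) -> str:
    """Wrap PNG bytes as a hex-encoded RTF \\pict group (illustrative helper)."""
    hex_data = image_bytes.hex()
    return "{\\pict\\pngblip " + hex_data + "}"

# The first eight bytes of any PNG file, standing in for a whole image.
png_signature = b"\x89PNG\r\n\x1a\n"

group = rtf_pict(png_signature)
print(group)  # {\pict\pngblip 89504e470d0a1a0a}

# Each byte becomes two hexadecimal characters, so the image data doubles in size.
print(len(png_signature), len(png_signature.hex()))  # 8 16
```

As every byte of image data is written as two characters, a 1 MB image adds around 2 MB to the RTF source, which is the bloat that RTFD avoids by keeping images in separate files.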

Complex XML formats

XML is also used in more complex document formats that you’d never want to edit in their text source. These include OpenXML and OpenDocument, widely used by word processors, spreadsheets, and other apps. Because XML is so verbose, these formats normally compress the document source, storing it as a Zip archive rather than as readable text.
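Both of those formats package their XML inside a Zip archive. The sketch below, using Python’s standard zipfile module, builds a tiny archive in memory to show the effect. The member name content.xml and the repeated paragraph are purely illustrative, and this is not a valid document in either format.

```python
import io
import zipfile

# Some deliberately repetitive XML, as verbose mark-up tends to be.
xml_source = (
    b"<?xml version='1.0' encoding='UTF-8'?><doc>"
    + b"<p>Hello</p>" * 200
    + b"</doc>"
)

# Compress it into a Zip archive held in memory.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("content.xml", xml_source)
container = buffer.getvalue()

# The result is binary, starting with the Zip signature, and far smaller.
print(container[:2])                     # b'PK'
print(len(container) < len(xml_source))  # True

# Reading the member back recovers the original XML exactly.
with zipfile.ZipFile(io.BytesIO(container)) as archive:
    names = archive.namelist()
    recovered = archive.read("content.xml")
print(names, recovered == xml_source)    # ['content.xml'] True
```

This is also why opening such a document in a text editor shows apparent gibberish rather than XML: you would need to unzip it first to see the source.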

Filename extensions and QuickLook

As I’ve shown above, different text formats are distinguished by the extension added to their filename. You can’t convert between different types just by changing their extension, though: if you try that with an RTF file and give it the extension of text, then all that happens is it will be opened by a text editor assuming you want to edit its source code, which you almost certainly don’t want to do.

If you’re uncertain which format a file is in, select it in the Finder and use QuickLook to preview it, either as a thumbnail in the Finder window or a larger preview by pressing the Spacebar. If QuickLook gets it right, then you’ll have a better idea of its format, and whether it’s basically plain or rich text, or something else. If QuickLook gets it wrong, then you may need to open it using a good text editor like BBEdit and work out just what format it is.
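If you’d rather check for yourself, the opening bytes of a file are often enough to tell these broad formats apart. The following is a rough heuristic in Python rather than a reliable detector. The function name sniff_text_format and its category labels are my own, and it makes no attempt to recognise Markdown or LaTeX, which look like ordinary plain text from their first few bytes.

```python
def sniff_text_format(data: bytes) -> str:
    """Guess a broad format from a file's leading bytes (a rough heuristic)."""
    if data.startswith(b"{\\rtf"):
        return "rich text (RTF)"
    if data.startswith(b"PK\x03\x04"):
        return "Zip container (possibly an XML-based document)"
    # Skip a UTF-8 byte order mark, if present, before looking for mark-up.
    if data.startswith(b"\xef\xbb\xbf"):
        data = data[3:]
    head = data.lstrip()[:64].lower()
    if head.startswith(b"<?xml"):
        return "XML"
    if head.startswith(b"<!doctype html") or head.startswith(b"<html"):
        return "HTML"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown or binary"
    return "plain text (or lightly marked-up text such as Markdown)"

print(sniff_text_format(b"{\\rtf1\\ansi Hello}"))  # rich text (RTF)
print(sniff_text_format(b"Just some words.\n"))
# plain text (or lightly marked-up text such as Markdown)
```

To try it on a real file, read its contents as bytes and pass them in. An RTF file that has been given the extension text would still be reported as rich text, as only its contents are examined.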

Giving text files the correct extension for their format is clearly important.