Convert text between file formats, including webarchives

QuickLook makes it easy to preview most files, and TextEdit will display the text content of many formats. There are times, though, when it’s more convenient to extract the text content and save it in a different format, for example turning a Safari Web Archive or Word document into Rich Text. Thankfully, there’s a tool to do that in Terminal, textutil.

textutil is one of the older command-line tools, and was introduced in Mac OS X 10.4 Tiger twenty years ago. Despite that, it remains one of the most underused in modern macOS. It works by tapping into the macOS text system, and can read and write any of the following nine formats:

  • plain text (txt)
  • HTML (html)
  • Rich Text, RTF (rtf)
  • RTFD (rtfd)
  • Microsoft Word .doc and .docx (doc, docx)
  • Wordprocessing Markup Language, WordML (wordml)
  • OpenDocument Text, ODT (odt)
  • Safari Web Archive, webarchive (webarchive).

The name given in parentheses is that used in these commands.

The quality of format conversions is high, essentially the same as you’ll see in Apple’s apps. A Word .doc file converted to RTF using textutil, for example, closely matches the original.

If the original file contains embedded images or other non-textual content, though, those aren’t included in the output.

Display information

This is the simplest option, used as
textutil -info filename
where filename is the path and file name.

This displays basic information about the file, including its type, its length in characters, and any metadata.
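
To inspect several files at once, the command above can be wrapped in a small shell function. This is only a sketch: show_info is a hypothetical helper name, not something textutil provides, and it assumes nothing beyond the -info option described here.

```shell
# Print textutil's summary for every path given as an argument.
# show_info is a hypothetical helper name, not part of textutil.
show_info() {
  for f in "$@"; do
    # Test with -e rather than -f, because RTFD documents are folders (packages)
    if [ -e "$f" ]; then
      textutil -info "$f"
    else
      printf 'skipping %s: not found\n' "$f" >&2
    fi
  done
}
```

Once the function is defined in a Terminal session, a call such as show_info ~/Documents/*.docx reports on each matching file in turn.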

Format conversion

This extracts the text content of a file in one of its supported formats, and writes that out in a different format, as in
textutil -convert rtf filename
where filename is the path and file name. The output file will then have its extension replaced appropriately, for example
textutil -convert rtf myfile.html
will create the file myfile.rtf containing a Rich Text representation of the HTML file myfile.html. If you want to create a different output file, use a command like
textutil -convert rtf filename -output filename2.rtf

Only text content is written to the output file.
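
Because -output names a single file, converting a whole folder is most easily done in a loop. The sketch below converts every Safari Web Archive in one folder to RTF in another, and adds a second helper that sends plain text to standard output so it can be piped to wc for a word count. The function names convert_webarchives and count_words, and the variables src_dir and out_dir, are hypothetical; the -stdout option is taken from my reading of man textutil and is worth checking there before you rely on it.

```shell
# Convert every Safari Web Archive in one folder to RTF in another.
# convert_webarchives, src_dir and out_dir are hypothetical names.
convert_webarchives() {
  src_dir=$1
  out_dir=$2
  mkdir -p "$out_dir" || return 1
  for f in "$src_dir"/*.webarchive; do
    # If nothing matches, the pattern is left unexpanded, so skip it
    [ -e "$f" ] || continue
    name=$(basename "$f" .webarchive)
    textutil -convert rtf -output "$out_dir/$name.rtf" "$f"
  done
}

# Count the words in a document by converting it to plain text on
# standard output (assumes the -stdout option listed in man textutil).
count_words() {
  textutil -convert txt -stdout "$1" | wc -w
}
```

For example, convert_webarchives ~/Downloads ~/Documents/converted leaves the original webarchives untouched and writes one .rtf file per archive into the second folder, creating it if necessary.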

Joining files

textutil’s other main feature is joining text-based files to form a single file consisting of the input files concatenated in order. For example,
textutil -cat rtf -output filename.rtf -- file1.rtf file2.rtf file3.rtf
concatenates the three files file1.rtf, file2.rtf and file3.rtf into the single file filename.rtf in Rich Text format. You can also include implicit conversions such as
textutil -cat rtf -output filename.rtf -- file1.html file2.rtf file3.html
where the first and last input files, file1.html and file3.html, are converted to RTF before concatenation. Note that the -- before the list of input files consists of two hyphen characters, not a dash.
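
If you join files often, the same command can be wrapped in a function that takes the output file first and the inputs after it. This is a minimal sketch: join_to_rtf is a hypothetical helper name, and it uses only the -cat and -output options shown above.

```shell
# Join two or more documents into a single RTF file.
# join_to_rtf is a hypothetical helper name, not part of textutil.
join_to_rtf() {
  out_file=$1
  shift
  if [ "$#" -lt 2 ]; then
    printf 'usage: join_to_rtf output.rtf input1 input2 ...\n' >&2
    return 1
  fi
  # Everything after the two hyphens is passed as an input file, in order
  textutil -cat rtf -output "$out_file" -- "$@"
}
```

A call such as join_to_rtf book.rtf chapter1.html chapter2.rtf chapter3.html joins the three inputs in the order given. If you use a wildcard like chapter*.html instead, the shell expands it in sorted order, so chapter10.html comes before chapter2.html unless the numbers are zero-padded.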

Further options

Advanced options detailed in man textutil and textutil -help include:

  • change text encoding from the default of Unicode UTF-8,
  • change font and size,
  • exclude HTML elements,
  • specify metadata.
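
The following sketch shows how some of these options might be combined with -convert. The option names -font, -fontsize, -excludedelements, -title and -encoding are taken from my reading of man textutil, so check them against the man page on your own Mac. The function names, font, size and element list are purely illustrative.

```shell
# Plain text to RTF in a chosen font and size.
txt_to_styled_rtf() {
  textutil -convert rtf -font Helvetica -fontsize 14 "$1"
}

# Rich text to HTML, leaving out span and font elements and setting a title.
# The element list is passed as one quoted argument.
rtf_to_clean_html() {
  textutil -convert html -excludedelements '(span, font)' -title "$2" "$1"
}

# Any supported format to plain text encoded as UTF-16 rather than UTF-8.
to_utf16_txt() {
  textutil -convert txt -encoding UTF-16 "$1"
}
```

For example, rtf_to_clean_html notes.rtf 'Meeting notes' should write notes.html alongside the original, following the extension replacement described under Format conversion.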

In macOS Tahoe you may also encounter warnings relating to font availability and substitution.

Summary

  • textutil -info filename for information;
  • formats txt, html, rtf, rtfd, doc, docx, wordml, odt, webarchive;
  • textutil -convert rtf filename for conversion;
  • textutil -cat rtf -output filename.rtf -- file1.rtf file2.rtf file3.rtf for concatenation.

To make this accessible from the GUI, I am working on a wrapper app named Textovert.