Converting between the major text document formats isn’t always as simple and accessible as it should be. I’ve tended to use specific apps which (I think) handle some formats better than others, but there are some formats, like webarchives, which aren’t so easy to work with. That was until I discovered that macOS has a very good conversion tool built into it: textutil.
textutil uses the conversion libraries within Cocoa to effortlessly convert the text content of documents between any of the following: text, HTML, RTF, RTFD, Word .doc, Word .docx, WordML, ODT, and Webarchive.
The next time that you want to strip out the text content of a web page, simply save it as HTML or a Webarchive, and type a command like
textutil -convert txt myWebPage.webarchive
You’ll find its text written to myWebPage.txt.
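If you have a whole folder of saved pages, the same command can be wrapped in a loop. The following is only a sketch: the function name convert_webarchives is my own invention, and it assumes textutil is on your PATH, as it is on a standard macOS install.

```shell
# Convert every .webarchive in a directory to plain text.
# convert_webarchives is a hypothetical helper name, not part of textutil.
convert_webarchives() {
    dir="${1:-.}"                          # default to the current directory
    if ! command -v textutil >/dev/null 2>&1; then
        echo "textutil not found (it ships with macOS only)" >&2
        return 1
    fi
    for f in "$dir"/*.webarchive; do
        [ -e "$f" ] || continue            # no matches: the glob stays literal
        textutil -convert txt "$f"         # writes the .txt next to the original
    done
}
```

Calling convert_webarchives ~/Downloads would then leave a .txt file beside each webarchive in that folder.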
There are some things that textutil doesn’t handle: as the name suggests, it doesn’t convert any embedded images or graphics. There are the usual issues over the formatting of tables, equations, and other special blocks of text, although the following example shows that it does pretty well with these.
The original document was in Word .doc format, above, and converted to RTF, below, surprisingly well.
The main use for conversion is simple to invoke:
textutil -convert format input
format is one of txt, html, rtf, rtfd, doc, docx, wordml, odt, or webarchive, and input is the full name of the file to be converted.
You can also use it to concatenate two or more files, with or without conversion, using
textutil -cat format -output outputfile inputfiles
inputfiles is a list of files, such as *.rtf, outputfile is the name of the file to be created, and format is as above.
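As a rough illustration of the concatenation form, here is a small wrapper that takes the format, the output file, and then the input files. The name concat_docs and the argument check are my own additions; only the textutil command line itself comes from the usage above.

```shell
# Concatenate several documents into one file in the given format.
# concat_docs is a hypothetical helper name, not part of textutil.
# Usage: concat_docs format outputfile inputfile...
concat_docs() {
    if [ "$#" -lt 3 ]; then
        echo "usage: concat_docs format outputfile inputfile..." >&2
        return 2
    fi
    format="$1"
    outputfile="$2"
    shift 2                                # what remains is the input list
    textutil -cat "$format" -output "$outputfile" "$@"
}
```

With this in place, concat_docs rtf combined.rtf *.rtf joins every RTF file in the current folder. The shell expands *.rtf in sorted filename order, so that is the order in which the files are handed to textutil.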
There are a lot of additional options, detailed in man textutil. Two of the more useful ones specify the font and size to be used when converting plain text to RTF:
textutil -convert rtf -font Times -fontsize 11 filename.text
will set the text from filename.text in 11 point Times when converting it into RTF, for example.
You can also use this tool to show information about a file in any of the supported formats:
textutil -info myFile.docx
for example might return
Type: Office Open XML format
Size: 126039 bytes
Length: 3408 characters
Author: Howard Oakley
Last Editor: Howard Oakley
Created: 2015-03-06 09:06:00 +0000
Last Modified: 2015-03-06 10:48:00 +0000
Contents: Two very short synopses...
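If you only want one of those fields, say in a script, the output can be filtered. This is a minimal sketch which assumes that each line has the "Key: value" shape shown in the sample above. It ignores any leading indentation in case the real output is indented, and the name info_field is hypothetical.

```shell
# Print the value of one field from `textutil -info` output read on stdin.
# Assumes "Key: value" lines; info_field is a hypothetical helper name.
info_field() {
    awk -v key="$1" '
    {
        line = $0
        sub(/^[ \t]+/, "", line)          # drop any leading indentation
        prefix = key ":"
        if (index(line, prefix) == 1) {   # line starts with "Key:"
            value = substr(line, length(prefix) + 1)
            sub(/^[ \t]+/, "", value)     # drop spaces after the colon
            print value
            exit                          # stop at the first match
        }
    }'
}
```

For the sample above, textutil -info myFile.docx | info_field Author would print Howard Oakley. Matching on the whole key followed by a colon means that values which themselves contain colons, such as the dates, come through intact.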
If you use textutil to edit the metadata of a document, note that its metadata options affect only the metadata embedded in the document data itself, such as the Title of a Word .doc or .docx file.
textutil cannot change the extended attributes of the files which it converts.
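To give an idea of what setting embedded metadata looks like, here is a sketch that converts a document to .docx while setting its title and author. The -title and -author options are among those listed in man textutil. The function name convert_with_metadata and the values passed to it are placeholders of mine.

```shell
# Convert a document to .docx, setting its embedded Title and Author.
# convert_with_metadata is a hypothetical helper name, not part of textutil.
# Usage: convert_with_metadata file title author
convert_with_metadata() {
    file="$1"
    title="$2"
    author="$3"
    # Quoting keeps titles and names containing spaces as single arguments.
    textutil -convert docx -title "$title" -author "$author" "$file"
}
```

For example, convert_with_metadata report.doc "Annual Report" "Howard Oakley" should write report.docx with those two values embedded, which you can then check using textutil -info report.docx.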