If you have thousands of laid-out pages but now need their content to be structured, all is not necessarily lost.
For over a decade, we had been completing forms laid out in Adobe InDesign. Although key elements were placed in separate text boxes, the forms were set up to look good in print, and their content lacked any structure. The task was to turn more than 7500 individual files into a database which we could search, sort, and analyse – something conventionally viewed as requiring prolonged labour and great cost.
I had just three days, and the tools to hand on my iMac.
Searching for structure
When confronted by apparently unstructured content, the first quest is to find a format in which it can be rendered with some structure, even if that is not as fine-grained as you might wish. The most obvious routes are often the least useful, though: exporting these forms as XML threw the entire content away, and HTML was little better, as it simply concatenated the text into a meaningless series of paragraphs. Conversions from PDF files are commonly as unhelpful.
But InDesign and similar apps usually support an exchange format, in this case Adobe InDesign Markup Language (IDML), an open XML format which Adobe documents in a specification and an accompanying explanatory cookbook.
IDML files turn out to be zipped archives, within which each text element is stored in a separate file in the /Stories folder. As we had used the same master layout and text boxes in every document, individual content within every IDML file was to be found in XML files of exactly the same name: for example, the text placed in the Surname box was always placed in the file named Story_u9dc.xml.
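Because an IDML file is an ordinary zip archive, you can inspect its stories without unpacking it to disk. The following is a minimal sketch in Python rather than the AppleScript used for the original job; the function names list_story_files and read_story are hypothetical, and it assumes the story files sit under Stories/ with an .xml extension, as described above.

```python
import zipfile


def list_story_files(idml_path):
    """Return the names of the story files inside an IDML package."""
    with zipfile.ZipFile(idml_path) as package:
        return sorted(
            name
            for name in package.namelist()
            if name.startswith("Stories/") and name.endswith(".xml")
        )


def read_story(idml_path, story_name):
    """Return the raw XML bytes of one story, e.g. 'Stories/Story_u9dc.xml'."""
    with zipfile.ZipFile(idml_path) as package:
        return package.read(story_name)
```

Running list_story_files over a handful of documents is a quick way to confirm that the same story names really do recur in every file before you rely on them.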
If you have laid out text boxes in a different order in each document, you will have a tougher task of rummaging through the many files in /Stories to locate the content that you want.
Extracting the content from each of the files in /Stories is messier. As they are XML, if you are a dab hand with XSL Transformations and have a tool such as the Oxygen XML Editor available, you may find it best to use that combination. However, the XML dressing is unlikely to be of much interest, so an alternative approach is to use AppleScript or another scripting language to strip out all except for the text contained between the
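As an illustration of the scripting route, here is a minimal Python sketch that discards the XML dressing and keeps only the text. It assumes that the text of a story is held in elements named Content, which is how IDML story files are normally laid out; check that against your own files. The function name story_text is hypothetical, and joining the pieces with a space is an arbitrary choice.

```python
import xml.etree.ElementTree as ET


def story_text(story_xml):
    """Join the text of every Content element found in one story file."""
    root = ET.fromstring(story_xml)
    # Assumption: text runs live in un-namespaced <Content> elements.
    pieces = [element.text or "" for element in root.iter("Content")]
    return " ".join(piece.strip() for piece in pieces if piece.strip())
```

If line breaks within a box matter to you, join the pieces with a newline instead, or handle the break elements explicitly.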
The solution that I used was a sequence of two AppleScripts: the first to open each document in InDesign and save it in IDML format, the second to unzip each IDML file into a temporary folder, and strip out each item of content from the requisite story files, ready to write into my database.
Conversion to IDML format proved the slower step, and the script had to be fed around 150 files at a time, or the Finder choked when generating the list of files to process. With care, I could push more than a thousand files an hour through, but this step alone consumed the first of the three days allowed.
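Feeding a long job through in fixed-size batches is simple to arrange in any scripting language. This Python sketch only illustrates the batching itself, not the Finder limit, which was specific to the AppleScript; the name batches is hypothetical, and the default of 150 is taken from the figure above.

```python
def batches(items, size=150):
    """Yield successive slices of items, each holding at most size entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 7500 files in batches of 150 means 50 runs of the conversion script.
file_names = ["form_{:04d}.indd".format(n) for n in range(7500)]
batch_count = sum(1 for _ in batches(file_names))
```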
My second script used the do shell script feature of AppleScript with the shell command unzip to open each IDML file into its constituents. Injudicious choice of some file names caused that to choke occasionally, a not infrequent issue with scripts that assemble and execute shell commands on filenames gleaned from the Finder.
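The usual remedy is to quote every filename before it reaches the shell. In AppleScript that is the job of quoted form of; the Python sketch below shows the same idea using shlex.quote from the standard library. The function name unzip_command is hypothetical, and the -q, -o and -d options are those of the common Info-ZIP unzip, so check them against the version on your system.

```python
import shlex


def unzip_command(idml_path, destination):
    """Build an unzip command line that is safe to hand to a shell."""
    return "unzip -q -o {} -d {}".format(
        shlex.quote(idml_path),    # protects spaces, quotes, parentheses
        shlex.quote(destination),
    )


# A name that would break a naively assembled command.
command = unzip_command("Form's (final).idml", "/tmp/out dir")
```

An alternative that sidesteps quoting altogether is to avoid the shell, either by passing the arguments as a list to subprocess.run or by reading the archive with a zip library.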
You may be more fortunate and discover a short cut to ease your own migration. For example, Productive Computing offers a range of plug-ins to capture data from Mac apps direct into FileMaker Pro. In the next and last article in this series I will explain how that can be accomplished using AppleScript.
Tips and Tools
Rummaging around in elaborate document formats and stripping out structured content needs good tools and efficient techniques.
One key method is to create a set of test documents that will make it clearer exactly which content ends up where. Some laid out pages place very similar or identical material in two or more locations. Entering content that labels exactly what is where will leave you in no doubt where that content can be found in the files that you then process.
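Once a test document has a distinctive label typed into each box, a short script can report which story file holds which label. This is a minimal Python sketch under the same assumptions as the rest of the workflow: the IDML file is a zip archive, stories live under Stories/, and their text sits in elements named Content. The function name map_labels_to_stories and the label strings are hypothetical.

```python
import xml.etree.ElementTree as ET
import zipfile


def map_labels_to_stories(idml_path, labels):
    """Return {label: [story file names containing that label]}."""
    found = {}
    with zipfile.ZipFile(idml_path) as package:
        for name in package.namelist():
            if not (name.startswith("Stories/") and name.endswith(".xml")):
                continue
            root = ET.fromstring(package.read(name))
            # Assumption: story text is held in <Content> elements.
            text = " ".join(e.text or "" for e in root.iter("Content"))
            for label in labels:
                if label in text:
                    found.setdefault(label, []).append(name)
    return found
```

A label that turns up in more than one story file flags exactly the duplicated material mentioned above, and a label that is missing altogether tells you that box is stored somewhere you have not yet looked.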
Where master pages or a template have been in long use, perhaps through several different versions of the page layout application, you will also need to test a cross-section of documents of different ages.
AppleScript remains an excellent choice of scripting language, with many surprisingly elegant and concise features. However, Apple’s free vanilla script editor is not suited to this type of development: the error-handling and debugging facilities of Script Debugger make it an essential investment if you want to do anything beyond the most trivial scripting.
Sadly Apple has backed off improving support for AppleScript in its own Xcode SDK, although that remains the environment of choice when you are developing proper applications that employ AppleScript.
Updated from the original, which was first published in MacUser volume 29 issue 12, 2013.