Layout to database 2: Building a Database by Script

Having found order in your InDesign documents (the previous article in this two-part series), you now need to turn their data into a database.

I have now used the first of three days allowed to rescue content from thousands of InDesign forms. As I entered the second day, I had converted each document into a nest of IDML content, and next needed to build that into the database.

Sometimes it is easiest to import data, as individual records or a collated list of records, using a common file format. FileMaker Pro has good support for this, accepting plain text separated by tabs or commas, or structured XML. As my records were largely text, sometimes several hundred words long, the prospect of assembling them into files for import was not attractive.

XML would probably have worked, and full details of the FMPXMLResult DTD are available to help that. But I am not that fluent in writing XSL Transforms that would have made this more efficient, and feared that I could lose a day or more trying to get those to work.

Scripting Import

It made best sense to keep to AppleScript: I would unpack the IDML, strip out the content required for each field, and script the insertion of data into a new record. This would be made slightly more complex as I had already designed the database with two major tables, one for personal details, the other for forms associated with each individual, linked by a unique ID field in the table of personal details.

Keep dates in databases using properly checked fields.
Keep dates in databases using properly checked fields.

This had the advantage that the resulting database could be accessed in full, or with personal details withheld, helping compliance with the Data Protection Act.

This time my script was fed by dropping IDML files onto it. Iterating through each in turn, it found the key personal identifiers in the converted form, then got FileMaker to search to see if that individual already existed in the table of personal details. If it did, it looked up the unique ID for that individual. If this was to be a new record in that table, it assembled the required information to complete the record and wrote them into a new record. This is coded along the lines of:
tell application “FileMaker Pro Advanced”
tell database “clienttrans1”
set newRec to create new record at table “clients”
tell newRec
set cell “surname” to theSurname -- and so on.

Cleaning up Content

One snag with working from the IDML files is the amount of munged content within them: non-ASCII characters such as smart quotes and the degree symbol become unrecognisable, and use of superscripts and subscripts embeds unwanted XML overhead.

Converting these to Unicode characters or stripping them out is made considerably easier with AppleScript’s text item delimiters feature. For example, to convert the > escape code to the greater than symbol >, you can use
set text item delimiters to {“>”}
set theListD to text items of theTextField -- strips the text into a list of text, broken wherever the escape code occurs
set text item delimiters to {“>”}
set theConvertedTextField to theListD as text -- reassembles the list into a block of text, with > appearing where the escape code had been before.

For those not used to this feature in AppleScript, it may appear to be a bizarrely indirect way of modifying text, but in practice, once you have got your head around the concept, it is very powerful and flexible. By the early afternoon of the third day, all my thousands of records were stored safely in the destination database.

However not all fields were in as good order as I wanted: I decided to come back to write further scripts to crawl through the database, cleaning, tidying, and using simple content recognition to split content out further.

Straightening Dates

Converting dates as text into a common format is a frequent requirement when cleaning up imported data. First split the components into a list using AppleScript’s text item delimiters, then work through the individual items, converting them into a standard format, such as 9 Feb 84 to 09/02/1984. Day numbers are easy to render, by converting them to an integer and prefixing a zero character if they are less than 10.

One ingenious method to convert a month name to its number is to find it in a string consisting of full month names padded out with hyphens so that each consists of ten characters:
ignoring case
set theMonthNum to (offset of (“-“ & the MonthName) in month_names) div 10 + 1
end ignoring

AppleScript to unscramble dates, viewed in Script Debugger.
AppleScript to unscramble dates, viewed in Script Debugger.

Years can be converted into four-digit format using a similar technique to that for the date, provided that you can make assumptions about which will precede 2000; if you are handling dates of birth for a wide age range, this will not be possible. In such circumstances, you will probably be best to ask the user to decide which millenium to use. Although tedious, if it allows you to convert the great majority of records automatically, it will have saved you worthwhile time.

Updated from the original, which was first published in MacUser volume 29 issue 13, 2013.