The Internet is frighteningly ephemeral. The site you found so useful today may have vanished by next week, and the chances of it still being there in a couple of years may be slim. This article looks at how best to keep a more permanent record of individual pages, rather than whole sites, including longer-term archiving.
Other browsers offer different options, but in current releases of Safari your choices for saving web pages include:
- File/Save As…/Page Source to save the page as a single HTML source file.
- File/Save As…/Webarchive to save the page as a single Webarchive file.
- File/Export As PDF… to save the page as a single PDF file, in display format.
- File/Print…/Save as PDF to save the page as a single PDF file, in print format.
Saving as Page Source gives the smallest and least complete version of the four, as it contains only the HTML source of the page, and omits all linked and generated content. For relatively plain pages consisting entirely of text this can be useful, but few pages are like that now. The saved file can be opened in Safari or another browser and, so long as none of the linked content has gone missing or changed, you should see the original page reconstituted faithfully.
This is almost certainly not what you’d want as a more permanent record such as an archive.
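To get a feel for how much a source-only save leaves out, you could list the external content that a saved HTML file still depends on. What follows is a minimal sketch using Python's standard `html.parser`; the `EXTERNAL_ATTRS` table, the function name and the sample markup are all illustrative assumptions, and the list of tags is far from exhaustive.

```python
from html.parser import HTMLParser

# Attributes that commonly point at content stored outside the HTML file.
# This mapping is an illustrative assumption, not an exhaustive list.
EXTERNAL_ATTRS = {
    "img": "src",
    "script": "src",
    "link": "href",
    "iframe": "src",
    "video": "src",
    "audio": "src",
    "source": "src",
}


class ExternalReferenceFinder(HTMLParser):
    """Collect (tag, URL) pairs for content that a source-only save omits."""

    def __init__(self):
        super().__init__()
        self.references = []

    def handle_starttag(self, tag, attrs):
        wanted = EXTERNAL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            # Skip empty values and data: URLs, which carry their content inline.
            if name == wanted and value and not value.startswith("data:"):
                self.references.append((tag, value))


def find_external_references(html_text):
    """Return the external references found in a string of HTML source."""
    finder = ExternalReferenceFinder()
    finder.feed(html_text)
    finder.close()
    return finder.references


# Hypothetical page source, made up for this example.
sample = """<html><head><link rel="stylesheet" href="style.css">
<script src="https://example.com/app.js"></script></head>
<body><p>Plain text survives.</p><img src="photo.jpg" alt="">
<img src="data:image/png;base64,AAAA" alt=""></body></html>"""

for tag, url in find_external_references(sample):
    print(tag, url)
```

Every reference it reports is something that has to remain available online, unchanged, for the saved page to be reconstituted as it was.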
A Webarchive is an opaque file format that assembles the entire content of the page, including embedded images and other content, but not linked downloadable files, into a single file that's incomprehensible to the human reader.
Although this format is peculiar to Safari, it has limited support in some other browsers, which can read it, and in a few utilities which can, for instance, convert a Webarchive to PDF. However, with a single exception, those utilities are now old and their future is uncertain. It's probably best to consider this format as being proprietary to Safari.
I’ve not been able to find any description of the format which could be used to implement a third-party reader, and as it doesn’t appear to comply with any open standard, the format may well have changed over the last few years. In the past it has been well-supported by the macOS API, but all the existing calls to work with Webarchive files are now marked as deprecated. Apple could remove them whenever it wishes, after which they would be unavailable in future versions of macOS.
The good news, though, is that Webarchives are now making their way into Apple’s open source WebKit. Currently, support is limited to writing Webarchives from WKWebView, not reading them, and requires macOS 11 or later, but hopefully that will be extended in the future.
Because Webarchives contain all the embedded content, and are encoded in a text representation of binary data, they’re by far the largest of the four options. They’re probably the best way to save web pages for the time being, but their format may well not be supported in just a few years, so they’re unsuitable for archives intended to last longer than that.
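If you want to see what a Webarchive has bundled, one approach is to try reading it as a property list. The sketch below rests on assumptions this article doesn't establish: that the file parses as a property list (Python's `plistlib` detects XML or binary form for itself), and that it uses the key names `WebMainResource`, `WebSubresources`, `WebResourceURL`, `WebResourceMIMEType` and `WebResourceData`. Those names come from observation rather than any published specification, so they could change. The function name and the tiny synthetic archive are made up for illustration.

```python
import plistlib


def summarize_webarchive(archive_bytes):
    """Return (url, mime_type, size_in_bytes) for each resource in an archive.

    Assumes the archive is a property list using the key names below;
    that layout is an observation, not a published specification.
    """
    archive = plistlib.loads(archive_bytes)
    resources = []
    main = archive.get("WebMainResource")
    if main is not None:
        resources.append(main)
    resources.extend(archive.get("WebSubresources", []))

    summary = []
    for resource in resources:
        data = resource.get("WebResourceData", b"")
        summary.append((
            resource.get("WebResourceURL", ""),
            resource.get("WebResourceMIMEType", ""),
            len(data),
        ))
    return summary


# A tiny synthetic archive built here for illustration, not one saved by Safari.
example_archive = plistlib.dumps(
    {
        "WebMainResource": {
            "WebResourceURL": "https://example.com/",
            "WebResourceMIMEType": "text/html",
            "WebResourceData": b"<p>Hello</p>",
        },
        "WebSubresources": [
            {
                "WebResourceURL": "https://example.com/logo.png",
                "WebResourceMIMEType": "image/png",
                "WebResourceData": b"\x89PNG",
            }
        ],
    },
    fmt=plistlib.FMT_BINARY,
)

for url, mime_type, size in summarize_webarchive(example_archive):
    print(url, mime_type, size)
```

To try it on a real file, read the file's bytes (from a hypothetical `page.webarchive`, say) and pass them to `summarize_webarchive`. If that raises an error, the assumptions above don't hold for that file.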
In case you weren’t aware, there are two different routes in Safari to turn a webpage into a PDF document: directly using the Export As PDF… menu command, and indirectly via the Print… command then saving as PDF from the Print dialog. The results are quite different.
Exporting as PDF creates a document in which the entire web page is on a single PDF page, which may not be what you had in mind. The advantage of this is that the PDF is one continuous page without any breaks, and is a faithful representation of what you see in your browser, complete with its original layout and frames. The disadvantage is that this won’t print at all well, imposing page breaks in the most awkward of places. Very long pages can also prove ungainly, and difficult to manipulate in PDF utilities.
Printing to PDF breaks the web page up into printable pages, and splits up frames. What you end up with isn’t what you see online, but could at a push be reassembled into something close to the original. That isn’t too bad when the placement of frames isn’t important to their reading, but if two adjacent columns need to appear next to one another, this layout is likely to disappoint. It is the best, though, for printing, with headers and footers and page numbering too.
Recently some have remarked that PDFs generated from Safari lose their embedded links. That shouldn’t happen: well-constructed web pages should preserve all their original links when viewed as PDFs.
PDF documents are easy to create in macOS, as Safari uses Quartz graphics to render web pages, and PDFKit is also part of Quartz graphics. The two alternatives result from Safari rendering its displayed content through PDFKit, or rendering its content as prepared for printing.
While PDF is one of the preferred formats for archiving laid-out documents, it’s worth bearing in mind that standard macOS PDF isn’t compliant with any of the PDF/A standards for archival documents. You’d need a high-end PDF editor such as Adobe’s Acrobat (Pro) CC to prepare and save to any of those.
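A quick way to see whether a PDF even claims to be PDF/A is to look for the PDF/A identification property, `pdfaid:part`, in its XMP metadata. The sketch below is only a heuristic under stated assumptions: it searches the raw bytes, so it relies on the metadata being stored uncompressed, and it reports what the file declares rather than validating compliance, which needs a proper PDF/A validator. The function name and sample byte strings are made up for illustration.

```python
import re

# Matches the PDF/A identification property in XMP metadata, in either
# element form (<pdfaid:part>1</pdfaid:part>) or attribute form (pdfaid:part="1").
PDFA_PART = re.compile(rb"pdfaid:part\s*(?:>|=\s*[\"'])\s*(\d)")


def claimed_pdfa_part(pdf_bytes):
    """Return the PDF/A part number a file declares, or None if it declares none.

    This only reads the declaration; it does not check that the file complies.
    """
    match = PDFA_PART.search(pdf_bytes)
    return int(match.group(1)) if match else None


# Synthetic fragments standing in for real files, for illustration only.
declares_pdfa = b"%PDF-1.7 ... <pdfaid:part>2</pdfaid:part> ..."
plain_pdf = b"%PDF-1.3 ... no archival metadata here ..."

print(claimed_pdfa_part(declares_pdfa))
print(claimed_pdfa_part(plain_pdf))
```

A result of `None` for a PDF saved from Safari would be consistent with the point above, although it can't by itself tell you why the declaration is missing.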
- Save As…/Page Source is of limited use, mainly for text-only pages without embedded content such as images.
- Save As…/Webarchive is excellent for day-to-day use, being complete and faithful, but isn’t an open standard and its future may be limited. It’s therefore not recommended for archival purposes.
- Export As PDF… is excellent for day-to-day use, complete and faithful, but for serious archival use needs to be converted to comply with an archival standard in the PDF/A series.
- Print…/Save as PDF is an alternative most suitable if you’re going to want to print the document out.