Just over a week ago, those of us who watch the UK’s figures for Covid-19 were shocked. It was eventually revealed that, as a result of an error in handling reports of positive test results, official figures had underestimated total numbers of positive cases in England by just under 16,000 over the previous week or two.
Since then, many have expressed opinions over the cause of this error. Although some details remain vague, and I haven’t seen any official account explaining exactly what went wrong, the gist seems to be this. PCR testing in the UK is performed by different laboratories, including some that are operated by the NHS and Public Health England (PHE), and others which are run by private enterprise, including commercial arms of universities. They send reports of all those who test positive to PHE, who then collate them to produce daily figures which are published on the official website.
Some of the reports which PHE receives are sent as plain text files using commas to separate their data fields (CSV). PHE collate these using the Microsoft Excel .xls format, and appear to generate their final totals using Excel too. Until the middle of September, PHE’s system appeared to work. Although there were very high numbers of cases back in March, April and into May, limited testing capacity at that time meant that few had PCR tests, and the system never had to cope with more than 6,500 positive test results in a day. Prior to 24 September, the highest number of positive tests reported in a single day was 6,201 on 1 May.
Testing capacity and policy have since changed, which brings reported cases closer to those predicted from surveillance studies. Then on 24 September, with the UK well into its second wave of infections, the number of positive cases in a day passed 6,500 for the first time. Oddly, for the next week, the number of positive cases each day didn’t change much, and by 2 October was hovering around a peak of about 7,000 cases per day.
That Friday night, 2 October, it was realised that the process for handling incoming positive results had effectively capped the total by omitting a total of 15,841 positive results.
So what was the real cause of this catastrophe: was it using the old .xls format, an old version of Excel itself, or a clumsy system for collating the data? My answer is that it was none of those, but a complete absence of cross-checks, something which I and many others have advocated for years. Back in 2013, I wrote an article for MacUser magazine which I have republished here. It explains the well-known dangers not just of using spreadsheets but of handling numbers on computers.
PHE should have been running a system which incorporated cross-checks. For example, in addition to receiving reports on all positive test results from laboratories, it should have required those labs to provide a daily total of positive tests which had been reported under that system. One quick way to check that the total number of positives on any day is correct is to compare the total of reports received with the sum of the lab totals. Had PHE implemented such a cross-check, then this error would have been obvious on 24 September, not 2 October. Those eight days would have made a huge difference to both contact tracing and understanding what is happening in the second wave.
The fact that PHE didn’t incorporate any such cross-checks indicates an approach to data which can only be described as cavalier. In healthcare computing in particular, it’s essential to design systems which are more robust, as the consequences of failure or error are so much more serious. While I’d feel sorry for a small business discovering that a preventable spreadsheet error caused underpayment of tax, no one in healthcare computing should ever risk that sort of failure, as it can lead to suffering and death. Few who have commented on this particular error seem to have been aware of this: healthcare computing is different.
There are more fundamental problems in PHE’s epidemiological data, though, which have puzzled me since their earliest reports: their need for repeated adjustments of historical data.
Originally, PHE Covid-19 results were spread across two separate sites, one which provided figures on testing, the other which charted case and death numbers and the like. The testing site repeatedly referred to adjustments being made to historical data to correct errors in them. This lives on even among the huge error detected on 2 October: in the adjustments made to figures as a result, PHE reported that they had removed a case from 3 March, added 4 more cases to the total for 31 March, eight for 18 April, and many more.
This fiddling around with historical data amounted to a total of 49 cases in one day. Although in the grand scheme of days with more than 10,000 fresh cases, this is small fry, I can’t understand how the testing labs or PHE can possibly still be correcting errors from six months and more ago, however small. I think that questions remain over the whole process by which PHE gathers and analyses these crucial data. Without proper cross-checks and data quality standards appropriate to healthcare computing, the next glaring error can’t be far away, and may not be as obvious to detect.
It’s not a question of the tools, but how they are being used. Back in the day when we all used more limited versions of Excel and .xls files, we didn’t all lose data in the way that PHE has here. All tabulations and calculations are vulnerable to error. If no one implements a robust system of cross-checks they will fail again.
In view of the magnitude and consequences of this error, there must be an immediate public inquiry in which PHE’s whole approach to handling of vital healthcare data is examined by people who do know what they’re doing. Otherwise the clock is already ticking away to the next catastrophe.