hoakley June 18, 2020 General, Macs, Technology

Good online charts can confuse not clarify: Covid-19 examples

There surely hasn’t been a time when online charts have been more important, or more frequently studied. During this pandemic, we have been treated to some of the best charts ever: those produced by John Burn-Murdoch and others for the Financial Times, and the extensive compilation at Worldometer stand out among many others. But from the outset they have had problems which have clouded our view of what has been happening. Two issues I discuss in this article are the data they use, and how they tackle noise to reveal trends.

I concentrate here on the reporting and charting of fresh cases of Covid-19. These are among the most important figures, as they can show trends in the spread of the disease earlier than other data such as deaths, which typically lag the date of positive testing by 2-4 weeks. Although there are many anomalies with published figures, there’s clear agreement on their definition as the number of people who test positive using a recognised method for detecting the virus on swabs.

If you want to watch for a ‘second wave’, or the impact of removal of lockdown restrictions, or the effect of holidays and festivals, fresh cases are the figures to watch.

Fresh cases of Covid-19

The most readily available figures for fresh cases in any country are those which it submits daily to the World Health Organisation, which are then collated and published. Unfortunately, they’re also more than slightly misleading. These are normally taken straight from the number of positive lab tests for that day, irrespective of when the original swabs were taken.

Delays between swabs being taken, on the actual date of the test, and the return of results vary considerably by lab, day of the week, date, and country. In some cases, test results are returned the same day, in others delays may amount to as much as a week or more. The best solution is for positive test results to be dated according to when the swabs were originally taken, effectively the moment of diagnosis. Few countries provide figures for those, and where they do, they are less accessible. In the UK, for example, Public Health England publishes these daily, but only for England and not the whole UK.

These better figures come at a disadvantage too: because they record positives on the day that swabs were taken, each day updates totals from previous days, and it takes a few days for the last test results to feed into daily totals. For example, the final total for today, 18 June, probably won’t be known until around 23 June, depending on reporting delays. But what you then end up with is well worth the wait.
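The difference between the two ways of dating a positive result can be sketched in a few lines of Python. The records below are invented for illustration, and the names `results`, `by_report_date` and `by_specimen_date` are hypothetical rather than taken from any published dataset or tool.

```python
from collections import Counter
from datetime import date

# Hypothetical positive results as (date swab taken, date result reported).
results = [
    (date(2020, 6, 15), date(2020, 6, 15)),  # same-day result
    (date(2020, 6, 15), date(2020, 6, 17)),  # two-day delay
    (date(2020, 6, 16), date(2020, 6, 17)),
    (date(2020, 6, 16), date(2020, 6, 20)),  # four-day delay
    (date(2020, 6, 17), date(2020, 6, 18)),
]

def by_report_date(records):
    """Count positives on the day each result was reported."""
    return Counter(reported for _, reported in records)

def by_specimen_date(records, known_on):
    """Count positives by swab date, using only results reported by `known_on`."""
    return Counter(swab for swab, reported in records if reported <= known_on)

# Totals by specimen date as known on 18 June, and again once late results arrive.
early = by_specimen_date(results, known_on=date(2020, 6, 18))
final = by_specimen_date(results, known_on=date(2020, 6, 23))

print(by_report_date(results)[date(2020, 6, 17)])            # results reported on 17 June
print(early[date(2020, 6, 16)], final[date(2020, 6, 16)])    # 16 June total before and after late results
```

In this toy example the total for 16 June grows as the delayed result arrives, which is the revision of earlier days described above.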

Figures for England have another peculiarity: they don’t appear to include all the positive cases, as I have shown before. Quite why this is occurring seems buried in the methodology, but for the moment they still seem the best figures to use, where they’re available.

Charting, noise, and hebdomadal cycles

A glance at the raw figures for fresh cases in England by date reveals that they contain a lot of noise, and evidence of weekly or hebdomadal cycles, with figures for Saturdays and Sundays invariably being much lower than those for weekdays. One way to present these is by plotting cumulative cases over time.

This approach, which can use either a linear or logarithmic Y axis, is popular among those modelling disease transmission for its sigmoid shape. This example is taken from Worldometer, but like others it helps little with detail, and is very hard to read in terms of change, which is what we’re generally most interested in.

The great majority of charts shown on the Web therefore use smoothed lines, sometimes superimposed over bars or points representing the original values. The example below is for one city in England, Southampton, and uses the most common method of smoothing, a seven-day moving (or rolling) average.

Worldometer offers you the choice of either a three- or seven-day moving average, which seems attractive.

In general, because of the hebdomadal cycle, sites have adopted seven-day moving averages, which do take out much of the effect of the day of the week. They also bring their own problems.

A seven-day moving average is simply calculated by adding together the figures of seven consecutive days, and dividing that by seven. What, though, does that average represent most accurately? The smoothed number of cases at the start of the seven-day period, at its middle, or end? Mathematically, it should generally be most representative of the point in the middle. That would necessarily mean that no moving average could be given for the first or last three days in the time series. Look up at the charts which have been published, and you’ll notice that their smoothed lines continue right up to the most recent data point. I suspect in most cases the moving average isn’t being taken for the middle timepoint in the seven, but for its last.
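A minimal Python sketch makes the two placements concrete. The function `moving_average` and its `align` parameter are hypothetical names chosen for this illustration, and the steadily rising series is invented; nothing here claims to reproduce how any particular site computes its lines.

```python
def moving_average(values, window=7, align="centre"):
    """Return a list the same length as `values`.

    Each window's mean is placed at the window's middle ("centre") or at its
    last day ("trailing"); days with no complete window are None.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    if align not in ("centre", "trailing"):
        raise ValueError("align must be 'centre' or 'trailing'")
    half = window // 2
    out = [None] * len(values)
    for end in range(window - 1, len(values)):
        # The window covers days end-window+1 .. end inclusive.
        mean = sum(values[end - window + 1 : end + 1]) / window
        position = end - half if align == "centre" else end
        out[position] = mean
    return out

# An invented, steadily rising series: 100 more cases each day.
ramp = [100 * day for day in range(10)]

centred = moving_average(ramp, align="centre")
trailing = moving_average(ramp, align="trailing")

print(centred)   # no values for the first and last three days
print(trailing)  # runs to the last day, but each value matches the ramp three days earlier
```

On a straight-line rise the centred average reproduces the series exactly, while the trailing version reports the value from three days before, at the price of appearing to reach the most recent day.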

This has significant effects, particularly when the number of fresh cases is changing most, as seen in two examples taken from the data for England.

When cases were rising fastest, seven consecutive raw values were 1034, 1208, 2017, 2020, 2265, 2611, 2656, which average 1973, with a median of 2020, which also happens to be the middle of the series. Using that average to represent the first point of those seven is a clear overestimate, and for the last point it’s a clear underestimate. The same happens when case numbers are falling. Consecutive raw values were 3422, 2577, 2084, 3076, 2988, 2755, 2764, with an average of 2809 and median of 2764. Again, the average misrepresents, this time being too low for the first in that series, and too high for the last.
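The arithmetic for those two windows can be checked with Python's standard `statistics` module. The variable names are just labels for this sketch; the values are the ones quoted above.

```python
from statistics import mean, median

rising = [1034, 1208, 2017, 2020, 2265, 2611, 2656]
falling = [3422, 2577, 2084, 3076, 2988, 2755, 2764]

for label, window in (("rising", rising), ("falling", falling)):
    average = mean(window)
    print(label, "average:", round(average), "median:", median(window))
    # Positive means the average is above that day's raw value, negative below.
    print("  average minus first day:", round(average - window[0]))
    print("  average minus last day: ", round(average - window[-1]))
```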

The effect of this smoothing error is visible in all charts which use moving averages: when the figures are rising steeply, the smoothed line lags the change by 2-3 days, and when they are falling, there’s a similar lag.

Seven-day moving averages do little to address problems of noise in the data. The resulting lines wander a lot, leading the viewer into drawing false conclusions about trends which disappear as rapidly as they became apparent.

This plot of the raw data for England is faithful but hard to read. Beyond its marked hebdomadal cycles and peak, it’s not easy to see much meaningful detail, or discern more subtle trends. My favourite treatment for this type of time series combines the raw data, shown as points, with a smoothed line using LOESS or local regression. In this, subsets of the data are used for a series of weighted least squares regressions, from which a line of best fit is constructed. A smoothing parameter is used to determine how closely the fitted curve matches the raw data, and is normally adjusted by eye.
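For readers who would like to see the idea in code, here is a bare-bones Python sketch of local linear regression with tricube weights. It is an illustration under stated assumptions, not the implementation used by any of the apps mentioned in this article: the function name `loess`, the parameter `frac` and the synthetic data are all invented here, and the sketch omits the robustness iterations that some LOESS implementations add.

```python
def loess(xs, ys, frac=0.3):
    """Local linear regression with tricube weights; returns fitted values at xs.

    `frac` is the smoothing parameter: the fraction of points used in each
    local fit. Larger values give a smoother curve.
    """
    n = len(xs)
    k = max(3, min(n, round(frac * n)))  # number of points in each local fit
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point (x0 included).
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1.0
        # Tricube weights, falling to zero at distance h.
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        # Weighted least squares fit of a straight line through the neighbours.
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted

# Invented daily counts: a linear trend with lower figures on two days each week.
days = list(range(42))
cases = [1000 + 20 * d - (300 if d % 7 in (5, 6) else 0) for d in days]

smooth = loess(days, cases, frac=0.5)
print([round(value) for value in smooth[:7]])
```

Trying a few values of `frac` and comparing the curves against the raw points is the by-eye adjustment of the smoothing parameter described above.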

Here’s my chart of the data for fresh cases in England, showing the raw data as red points, and the LOESS curve in black. This takes out both the noise and the hebdomadal effect, and suggests a series of trends as follows:

a sigmoid growth phase between 1 March and 6 April, with a maximum linear growth rate of about 1500 cases/week
a peak on 6 April
a linear fall from 9 April to 5 May at a rate of about 500 cases/week
a less steep linear fall from 10-31 May at a rate of about 250 cases/week
a least steep linear fall from 2-7 June
incomplete data from 8 June appearing as a false increase in gradient up to the last data point on 14 June.

England entered full lockdown between 21 and 24 March, two weeks before fresh cases peaked on 6 April. Those measures have been progressively eased since the middle of May, which may correlate with the slower reduction in fresh cases into June.

One important test of how reasonable this LOESS smoothing is comes from charting the residuals, the differences between the raw data values and those in the smooth curve, day by day.

This shows clearly the hebdomadal cycles which smoothing has removed, and how they have varied in magnitude through the time series. Trying to use moving averages to remove those large swings isn’t going to be successful. These residuals have an average (0.66) and median (3.7) close to zero, indicating the smoothed line is unbiased, and it can be seen to track the raw data points closely without any consistent time lag. Despite some large residuals, the standard error is only 23.
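A residual check of this kind might be sketched in Python as follows. The function name `residual_summary` and the short example series are invented for illustration. The article doesn't say exactly how its standard error was calculated, so the sketch reports both the standard deviation of the residuals and the standard error of their mean, as an assumption rather than a reconstruction.

```python
from math import sqrt
from statistics import mean, median, stdev

def residual_summary(raw, smoothed):
    """Summarise residuals (raw minus smoothed) for a fitted curve."""
    residuals = [r - s for r, s in zip(raw, smoothed)]
    sd = stdev(residuals)
    return {
        "residuals": residuals,
        "mean": mean(residuals),      # near zero suggests no overall bias
        "median": median(residuals),  # near zero suggests no skewed bias
        "sd": sd,
        "standard_error_of_mean": sd / sqrt(len(residuals)),
    }

# Invented example: five raw values around a flat smoothed line.
summary = residual_summary([10, 12, 8, 11, 9], [10, 10, 10, 10, 10])
print(summary["mean"], summary["median"], round(summary["sd"], 3))
```

Plotting `summary["residuals"]` against date would show any remaining weekly pattern, and a mean or median far from zero would suggest the smoothed line is biased.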

If you want to try better methods of fitting time series data, macOS has no shortage of fine apps with which to do it. DataGraph is available from the App Store, Igor Pro from WaveMetrics, and the statistical programming language R is free to download. If it gets you away from those horrid seven-day moving averages, so much the better.

11 Comments


1

Fred K on June 18, 2020 at 12:40 pm

I live in Alabama in the United States and I’ve found an incredibly well designed site, bamatracker.com. In my opinion, for the amount of information depicted, it’s incredibly intuitive and easy to navigate.

- 2
  
  hoakley on June 18, 2020 at 3:57 pm
  
  Thank you.
  That’s an excellent compilation, although it is limited by the data available, and those horrible seven-day moving averages again!
  Howard.
  
3

Fred K on June 18, 2020 at 12:45 pm

I should add however that it may use methods that, as you explain, are not the best.

4

thoolb on June 19, 2020 at 9:35 pm

Just in case: You might want to look at the diagrams of https://ourworldindata.org/coronavirus-data
They offer various display options, e.g. toggle linear/logarithmic curves, toggle chart/map/table view, filter/order options in tables and download for data sources.

- 5
  
  hoakley on June 19, 2020 at 10:17 pm
  
  Thank you. Another impressive array of charts which fall victim to the problems which I detail above.
  The data there are taken from the same blurred daily figures which report positive cases on the day of the test report, not the day that the sample was taken. Interestingly, I’ve now found some high quality data for New York City which correctly attributes positive test results, and will be showing that here next week. Some is available, but it’s generally much harder to find than the same old figures reported to the ECDC and WHO, which are almost universally used.
  Where attempts are made to fit curves, those use moving averages which are plotted for the last day – this is done because it’s popular in economics, and because it’s computationally simpler. The fact that it doesn’t show trends or remove the hebdomadal cycles properly is ignored for the sake of expediency.
  It’s also perhaps worth noting that the UK test figures given here appear much lower than those officially reported, I suspect as a result of the poor reporting procedure used. That’s largely the UK’s fault, but I’d have hoped that the figures that are reported were better understood if they were going to be presented.
  Howard.
  
  - 6
    
    thoolb on June 20, 2020 at 2:39 am
    
    You seem to have an objective perspective and presentation in mind. Besides the complexity of corona aspects, I rather think that any data comparison, regardless of data string, table or visual diagram, requires a more than two-dimensional interface to get closer to a complete or objective view – and then may not be ‘readable’ but require a potentially overwhelming, complex navigation tool on screen or as a physical print object. Actually any comparison, even of only two data sets, most often represents a limiting reduction and does not take into account all aspects of the network in which it is embedded. For instance a geographical comparison of ‘plant height’ and ‘temperature’ may be mathematically correct but misleading without ‘humidity’, ‘light’ and ‘nutrients’. Or ‘maximal human age’ can lack the aspect of felt, experienced time and speed, in any comparison. Finally, can data show the ‘true’ reason for a human’s death, or are they only a mathematically focused calculation?
    
    - 7
      
      hoakley on June 20, 2020 at 5:45 am
      
      No. I have in mind the fundamental rules for any data analysis.
      First, get the highest quality and most meaningful data you can. I’ve also been looking at those from Brazil, which are much harder to deal with for a host of reasons. When you know that the times associated with measurements are inaccurate, you need to seek better data. Unfortunately, almost everyone presenting charts for Covid-19 takes their data from ECDC/WHO, which are manifestly unsuitable. You don’t have to know much about it when you see countries reporting backlogs and corrections in single days – for example, one day when a country reported -500 or more fresh cases. This isn’t data collection, it’s accounting which aims to correct or fiddle the balance sheet.
      Next, explore your data. Look at random noise, short term cycles, long term trends. With time series like these, you can actually see things like the hebdomadal cycle without even going to spectral analysis. But no one seems to have done that before choosing how they’re going to present the data, so any meaning gets lost by inappropriate presentation methods.
      Finally, you leap ahead to making comparisons, which isn’t even on the horizon at this stage. And if you have been diligent about knowing your data thoroughly, you will immediately realise that making comparisons requires first and foremost comparable data. As each country gathers its data very differently, and testing policies aren’t even consistent over time (let alone between countries), you know that you’re going to have to be exceedingly careful about making comparisons. If any are valid.
      Howard.
      (I’ve only been doing this sort of work for 40 years now.)
      
8

Bob on June 21, 2020 at 10:12 pm

Thanks for the article, Howard. I’ve been performing my own “armchair” graphing of the World-O-Meter data and there is quite a bit of variance. For new case rate, I have resigned myself to a rolling average of four days for smoothing, and an understanding that I cannot look at this data over a short time period (7 days) and expect it to be meaningful. I then use linear least squares approximation over a period of time to boil it down to one number: a slope. The time period in question provides the context. I have found this to be meaningful, but perhaps I’m blowing statistical smoke in my own face.

For mortality rate, the media is fond of using the calculation of number of deaths divided by number of cases. This value has been nonsensical, especially early on, given that active cases haven’t resolved, yet they’re a part of the equation; a weird kind of recursive prognostication resulting in an artificially low mortality rate, perhaps to avoid panic. I prefer computing mortality rate as the number of deaths divided by the number of resolved cases, where “resolved” means either death, or cured. After all, we don’t know the outcome of unresolved cases. I started tracking these numbers on 3/15 and it was shockingly high for the U.S.: 40%. It is now down to about 11%, and 9% globally. At first, I attributed this initially high rate and slow descent to the death of at-risk victims with pre-existing aggravating conditions. In other words, the most at-risk fell victim first. But then I reasoned that given a population with a spectrum of health conditions from gravely ill to the picture of health, statistically speaking, infections should occur evenly across this spectrum. Therefore this wouldn’t account for the high-to-low trend. So I don’t know what to think.

- 9
  
  hoakley on June 21, 2020 at 10:37 pm
  
  Thank you.
  I have avoided looking at deaths completely, because they’re so delayed and actually more complex.
  Beware of figures for those who have ‘recovered’ – they’re the least reliable of all, as most don’t correlate well with the number of positive cases, implying that a lot of cases just get lost somewhere.
  I’ll be looking more at fresh cases in the coming week, using data from New York City and Brazil, in addition to England.
  Howard.
  
10

Leonardo on June 22, 2020 at 7:15 am

Dear Howard Oakley, your chart showing the raw data as red points, and the LOESS curve in black is beautiful. I would add another dimension to this chart in the form of a band whose vertical axis depicts the standard deviations or the confidence interval.
The band could be positioned below the points/curve and would show us at each point in time how well the curve fits the points and at which times we have a more regular development of events vs the parts where the course changes.
This information is already visible in your chart but you have to take a rather close look to estimate the order of changes/deviation whereas the suggested band would tell you immediately.
Leonardo

- 11
  
  hoakley on June 22, 2020 at 3:28 pm
  
  Thank you, Leonardo.
  There are two good reasons for not trying to show error bars or confidence intervals, apart from their effect on readability.
  First, error bars are most valuable when the points plotted are averages (or other estimates), to show dispersion about that estimator. In this case, the points are individual data points, therefore there’s no estimate of individual error available.
  Second, there’s no confidence interval involved. Normally in line fitting, you’re dealing with two underlying issues, the ‘real’ or underlying data and error which is added/multiplied to it. In this case, that’s not what the smooth curve represents. It’s one part of a time series decomposition, which tries to remove error, short-term cycles such as the marked hebdomadal cycle which is of high magnitude, and any other short-term effects which may not be so visible in the raw data.
  The hebdomadal cycle isn’t error, nor is it anything to do with confidence intervals, but an effect other than the trend which you’re trying to expose. You can perform spectral analysis to reconstruct that cycle, and could conceivably add that to the trend and then establish an estimate of variation from that combination, but that’s going more than an extra mile. What is important, and established in the residual analysis, is that cycles, error and anything else ‘removed’ to produce the smooth trend aren’t biased. Their magnitude isn’t actually relevant to the plot of trend, which is what confidence intervals try to indicate.
  Howard.
  

