Making subjective reviews more objective

With a steady stream of new smart phones, each claiming a new and better camera, we are getting used to reading reviews which try to compare images produced by different in-phone cameras.

Although many are produced diligently, few recognise the problems inherent in such comparisons, and I am afraid most (perhaps all) fall into the trap of bias when making subjective decisions. No matter how hard reviewers try to distance themselves from bias, unless they perform comparisons blind (metaphorically, of course), their judgements cannot be trusted.

The good medical trial

Evaluating the efficacy of medicines is a far more critical task, and the only thoroughly robust form of clinical trial is one which is randomised, and blinded to both patients and observers. The way this is normally done is for an independent organisation to prepare apparently identical medication packs for each patient, bearing unique codes. That organisation holds a key showing which codes contain the medication being evaluated, and which the placebo (or other comparison).

Neither the patients nor any of the staff involved in treating them (or running the trial) know who is being treated with what. Only at the very end, when the last patient has completed the trial, are the codes ‘broken’, so that the data can be analysed. Similar techniques of ‘blinding’ are in widespread use in most areas of research, to eliminate observer bias.
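To make that concrete, here is a minimal sketch in Python of how such an allocation key might be generated. The file name, code format, and simple coin-toss allocation are illustrative assumptions, not any real trial system.

import csv
import random
import secrets

def make_allocation_key(n_patients, key_path="allocation_key.csv"):
    # Each pack gets an opaque code; what it contains is chosen at random.
    rng = random.SystemRandom()
    rows = [("pack_code", "contents")]
    for _ in range(n_patients):
        pack_code = secrets.token_hex(4).upper()          # printed on the pack
        contents = rng.choice(["medication", "placebo"])  # hidden from patients and staff
        rows.append((pack_code, contents))
    # The key stays with the independent organisation until the codes are ‘broken’.
    with open(key_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)

make_allocation_key(200)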

Audio compression

When lossy audio compression, including MP3, was first becoming popular, I had to write an appraisal. Recognising the dangers of bias, I set up a series of ‘blind’ listening tests, in which I and others listened to tracks which had not been compressed, and to others which had been compressed using MP3 and then decompressed for playback.

It was not hard to do: each track was given a long numeric name which was almost identical to all the others, and did not reveal whether it had been compressed. Evaluators usually rated tracks in pairs, and in the great majority of cases it quickly became apparent that they were unable to distinguish those which had been passed through the MP3 codec, unless the compression ratio was high.
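Done today, that coding step could be scripted along the following lines. This is only a minimal sketch in Python, not what I actually used at the time; the folder names, the key file, and the naming scheme are illustrative assumptions.

import csv
import random
import shutil
from pathlib import Path

def blind_tracks(originals, mp3_versions, out_dir="listening", key_path="key.csv"):
    # Pool both sets of files, shuffle them, then give each a long numeric name
    # that is almost identical to all the others and reveals nothing about the codec.
    rng = random.SystemRandom()
    sources = [("original", p) for p in originals] + [("mp3", p) for p in mp3_versions]
    rng.shuffle(sources)
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    key = [("file", "version", "source")]
    for i, (version, src) in enumerate(sources):
        name = f"100000000{i:03d}.wav"   # names differ only in their last few digits
        shutil.copy(src, out / name)
        key.append((name, version, str(src)))
    # Only the person running the test keeps this key.
    with open(key_path, "w", newline="") as f:
        csv.writer(f).writerows(key)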

The only persistent exception to that inability to distinguish was with Indonesian gamelan music, with its rich tintinnabulation; that required much lower compression ratios, or the instruments sounded badly distorted. Whilst I was about it, I also looked at whether my listeners could detect any consistent differences in the uncompressed sample rate (44.1, 48, or 96 kHz). Despite my own insistence that I could hear differences at 96 kHz, no one actually could, reliably.

Since then, more people have used blind listening tests to good effect. Most recently, for instance, such methods have established that paying huge sums for extremely high-quality digital audio leads is a complete waste of money, as no human can detect any difference in audio quality.

Comparing cameras

There are even greater problems when you try to review or compare any form of digital camera. Even when you are able to save images in some form of ‘raw’ format, image files are processed very extensively before you get to see them. In many cases, any improvements apparent in a newer camera could be attributable not to any change in the camera itself, but to the firmware and software used to process its images. Thus if a similar update is applied to the older camera (something which the manufacturer could initially withhold to bolster new product sales), any difference between old and new could vanish.

Good comparisons between cameras take near-identical shots from each, and the reviewer then compares the images for resolution, sharpness, and other properties deemed to be important.

What the reviewer should be doing is getting someone else to give each image file a number, and to keep a record of which numbers come from which camera. The assignment of those numbers should be random: this can be done using a generated set of random digits, with each digit determining whether the new camera’s image in a pair is given the odd or the even file number, for instance.
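As a rough sketch of that coding step in Python, assuming the near-identical shots have already been paired up; the folder, file numbering, and key file are illustrative assumptions, not any particular reviewer’s workflow.

import csv
import random
import shutil
from pathlib import Path

def code_image_pairs(pairs, out_dir="blind_images", key_path="key.csv"):
    # pairs: list of (old_camera_file, new_camera_file) for matching shots.
    rng = random.SystemRandom()
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    key = [("file_number", "camera")]
    for i, (old_shot, new_shot) in enumerate(pairs):
        odd, even = 2 * i + 1, 2 * i + 2
        digit = rng.randrange(10)        # the random digit for this pair
        if digit % 2:                    # odd digit: new camera gets the odd number
            numbered = [(odd, "new", new_shot), (even, "old", old_shot)]
        else:                            # even digit: new camera gets the even number
            numbered = [(odd, "old", old_shot), (even, "new", new_shot)]
        for number, camera, src in numbered:
            shutil.copy(src, out / f"{number:04d}{Path(src).suffix}")
            key.append((f"{number:04d}", camera))
    # The key is kept by whoever did the renaming, never shown to the viewers.
    with open(key_path, "w", newline="") as f:
        csv.writer(f).writerows(key)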

The viewer(s) should then see only the file numbers, against which they will express their preference. If most or all of the preferred images came from the new camera, then the reviewer is justified in reporting that the new camera was preferred. If preferences were roughly 50:50, then there does not appear to be any discernible difference. There are formal statistical analyses which could be used, but in general they require large numbers of such comparisons, and are not worth the candle for this sort of review.
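For anyone who does want a formal check, the simplest is a two-sided binomial (sign) test on the preference counts. This minimal sketch assumes SciPy is available and uses illustrative counts rather than real results.

from scipy.stats import binomtest

preferred_new = 14    # illustrative counts, not real data
total_pairs = 20
result = binomtest(preferred_new, total_pairs, p=0.5)
print(f"New camera preferred in {preferred_new} of {total_pairs} pairs, p = {result.pvalue:.3f}")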

There is another lesson which I learned from the gamelan music: although it is important to test products under review on the things which most users do much of the time, it is also important to perform ‘stress testing’ with images which more commonly produce poor results. These could usefully include scenes lit over a very high dynamic range, and bright ‘dayglo’ colours which can bleed, for example.

Suggestion

I would not dare denigrate the many thorough and diligent reviews of in-phone cameras, or other subjective comparisons. However, I think there is scope to eliminate observer bias. It is not hard to accomplish, and I for one would be greatly reassured to read that reviewers had taken such measures. I think most reviewers would also feel happier for doing so.