Should the rules and targets we set up be precise, clear and sophisticated? Or should they be vague, ambiguous and crude? I used to think that the answer was obvious — who would favour ambiguity over clarity? Now I am not so sure.
Ponder the scandal that engulfed Volkswagen in late 2015, when it emerged that the company had been cheating on US emissions tests. What made such cheating possible was the fact that the tests were absurdly predictable: a series of pre-determined manoeuvres on a treadmill. VW’s vehicles, kitted out with sensors as all modern cars are, were programmed to recognise the choreography of a laboratory test and switch to a special testing mode, one that made the engine sluggish and thirsty but that filtered out pollutants.
The trick was revealed when a non-profit group strapped emissions monitors to VW cars and drove them from San Diego to Seattle. In some ways, that’s a crude test: outside the laboratory, no two journeys can be compared precisely. But the cruder test was also the test that revealed the duplicity.
The VW case seems like a strange one-off. It isn’t. Consider the “stress tests” applied by regulators to large banks. These stress tests are disaster scenarios in which a bank calculates what would happen in particular gloomy situations. But, in 2014, US regulators started to notice that banks had made very specific, narrow bets designed to pay off gloriously in specific stress-test scenarios. There is no commercial logic for these bets — but they certainly make it easier to pass the stress test. VW all over again — with the difference that what the banks were doing was apparently perfectly legal.
If tests and targets can fail because they are too predictable, they can also fail because they are too narrow. A few years ago, UK ambulance services were set a target to respond to life-threatening situations within eight minutes of receiving an emergency call. Managers soon realised that they could hit the target more easily if they replaced a two-person ambulance with an independent pair of paramedics on bikes. And many responses were written down as seven minutes and 59 seconds, but few as eight minutes and one second — suspiciously timely work.
Perhaps we’d be better off handing over the problem to computers. Armed with a large data set, the computer can figure out who deserves to be rewarded or punished. This is a fashionable idea. As Cathy O’Neil describes in her recent book, Weapons of Math Destruction, such data-driven algorithms are being used to identify which prisoners receive parole and which teachers are sacked for incompetence.
These algorithms aren’t transparent — they’re black boxes, immune from direct scrutiny. The advantage of that is that they can be harder to outwit. But that does not necessarily mean they work well. Consider the accuracy of the recommendations that a website such as Amazon serves up. Sometimes these suggestions are pretty good, but not always. At the moment, Amazon is recommending that I buy a brand of baby bottle cleanser. I’ve no idea why, since all my children are of school age.
A teacher-firing algorithm might look at student test scores at the beginning and end of each school year. If the scores stagnate, the teacher is presumed to be responsible. It’s easy to see how such algorithms can backfire. Partly, the data are noisy. In a data set of 300,000, analysts can pinpoint patterns with great confidence. But with a class of 30, a bit of bad luck can cost a teacher his or her job. And perhaps it isn’t bad luck at all: if the previous year’s teacher somehow managed to fix the test results (it happens), then the new teacher will inherit an impossible benchmark from which to improve.
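To put rough numbers on the gap between 300,000 and 30, here is a minimal sketch of how the uncertainty in an average score shrinks as the group grows. It assumes, purely for illustration, that each student's score change is an independent draw with a standard deviation of 10 points; that figure and the function name are my own inventions, not taken from any real evaluation system.

```python
import math


def standard_error(score_sd: float, n_students: int) -> float:
    """Standard error of the average score change across n_students,
    assuming independent draws with standard deviation score_sd."""
    return score_sd / math.sqrt(n_students)


# Hypothetical spread of individual score changes, in test points.
ASSUMED_SCORE_SD = 10.0

for n in (30, 300_000):
    se = standard_error(ASSUMED_SCORE_SD, n)
    print(f"n = {n:>7}: standard error of the average is about {se:.3f} points")
```

Under these assumptions, the average for a class of 30 is about 100 times noisier than the average for a data set of 300,000, because the uncertainty falls only with the square root of the group size. A swing of a couple of points in one class's average is therefore well within the range of luck.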
Just like humans, algorithms aren’t perfect. Amazon’s “you might want to buy bottle cleanser” is not a serious error. “You’re fired” might be, which means we need some kind of oversight or appeal process if imperfect algorithms are to make consequential decisions.
Even if an algorithm flawlessly linked a teacher’s actions to the students’ test scores, we should still use it with caution. We rely on teachers to do many things for the students in their class, not just boost their test scores. Rewarding teachers too tightly for test scores encourages them to neglect everything we value but cannot measure.
The economists Oliver Hart and Bengt Holmström have been exploring this sort of territory for decades, and were awarded the 2016 Nobel Memorial Prize in Economics for their pains. But, all too often, politicians, regulators and managers ignore well-established lessons.
In fairness, there often are no simple answers. In the case of VW, transparency was the enemy: regulators should have been vaguer about the emissions test to prevent cheating. But in the case of teachers, more transparency rather than less would help to uncover problems in the teacher evaluation algorithm.
Sometimes algorithms are too simplistic, but on occasions simple rules can work brilliantly. The psychologist Gerd Gigerenzer has assembled a large collection of rules of thumb that perform very well in predicting anything from avalanches to heart attacks. The truth is that the world can be a messy place. When our response is a tidy structure of targets and checkboxes, the problems really begin.
Written for and first published in the Financial Times.