We live in a data-rich world. It’s natural to think that the more data we have, the easier it will be to do business. But big data has its pitfalls.
Treat it right and it’s an immense asset – misuse it and it can ruin reputations.
Elizabethan politician and philosopher Francis Bacon said “Nam et ipsa Scientia potestas est,” which neatly illustrates the benefits and risks involved. If you don’t read Latin, the data contained in that sentence is valueless. We need to know how to make proper use of that data to translate the sentence as “Knowledge itself is power.”
When you’re working with big data, it’s essential to get right the two big ‘A’s that determine effective usage: algorithms and assumptions.
A big data system is only as good as the algorithms used to access and manage that data. And the designers of those algorithms need to make accurate assumptions about the users of the systems and the deductions that can be drawn.
Algorithms are sets of rules that manipulate, analyse and respond to data. A great illustration of getting them wrong was the Flash Crash of 2010. In 36 minutes, starting at 2.32 in the afternoon New York time, over a trillion dollars was temporarily wiped off the value of US stocks.
To blame was a collection of algorithms programmed to buy and sell in reaction to stock market data. The algorithms primarily belonged to high-frequency traders, and some of them were badly written. This made it possible for sales during one minute to be based on a percentage of sales in the previous minute, creating a feedback loop that spiralled out of control.
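To see why a rule of this kind can run away, here is a minimal sketch in Python. It is a toy model, not a reconstruction of any real trading system: the function name `simulate_feedback`, the participation rates and the starting volume are all hypothetical, and it assumes every algorithm's trades count towards the volume the others react to in the next minute.

```python
def simulate_feedback(initial_volume, participation_rates, minutes):
    """Return the total traded volume for each minute of a toy market.

    Assumption (illustrative only): each algorithm trades a fixed
    fraction of the previous minute's total volume, and those trades
    make up the whole of the next minute's volume.
    """
    volumes = [initial_volume]
    for _ in range(minutes):
        previous = volumes[-1]
        # Every algorithm reacts to the same figure: last minute's volume.
        volumes.append(sum(rate * previous for rate in participation_rates))
    return volumes


# Three hypothetical algorithms whose rates add up to less than 1:
# each minute's volume is smaller than the last, so activity fades.
damped = simulate_feedback(1000.0, [0.3, 0.3, 0.3], 10)

# The same three algorithms with rates adding up to more than 1:
# each minute's volume is 1.5 times the last, so activity snowballs.
runaway = simulate_feedback(1000.0, [0.5, 0.5, 0.5], 10)

print(f"damped:  {damped[0]:.0f} -> {damped[-1]:.0f}")
print(f"runaway: {runaway[0]:.0f} -> {runaway[-1]:.0f}")
```

In this sketch the only difference between a market that settles and one that spirals is whether the combined rates add up to more than 1. At a combined rate of 1.5, volume grows more than fiftyfold in ten simulated minutes, long before a human could step in.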
Big data systems can act far quicker than human beings, which is why in the 36 minutes that it took traders to work out what was happening and to pull the plug on the algorithms, all hell broke loose.
This was a costly example of the rule that big data is only as good as the algorithms that handle it. Make a mistake, and the high speed and repetitive nature of algorithms mean that they can do a lot of damage before the effects are noticed.
However well-written the algorithms are, though, they are only as good as the assumptions made about the data. For example, the writers of a performance management system used in some US schools assumed that improvements in student grades over the year were an acceptable measure of the quality of teaching. But there was no evidence that staff performance could be measured this way – it was simply a case of using what was easy to measure, rather than what was effective.
To make matters worse, because the system was based on comparing current grades with those at the end of the previous year, when the students had different teachers – and quite possibly were at a different school – the algorithm was susceptible to errors in, or manipulation of, that data.
In one case, the same teacher scored 6 per cent one year and 96 per cent the next. He had done nothing different, but the algorithm conjured up a wildly different result, showing that its score bore very little relation to what it was supposed to measure: the quality of the teaching.
This algorithm used a proxy – a measure the designers assumed would reflect what the system was supposed to quantify, though in practice it didn’t.
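The weakness of a grade-improvement proxy can be shown with a few lines of Python. This is a hedged sketch rather than the scoring method any real system used: the function `average_gain` and all of the grades are invented for illustration, and it assumes the score is simply the average change in grade between two years.

```python
def average_gain(previous_grades, current_grades):
    """Average change in grade per student between two years.

    Assumption (illustrative only): the teacher's score is driven by
    this single number, with no other adjustment.
    """
    gains = [now - before for before, now in zip(previous_grades, current_grades)]
    return sum(gains) / len(gains)


# Grades achieved this year by four hypothetical students.
current = [72, 68, 75, 70]

# Last year's grades as they should have been recorded.
honest_previous = [65, 62, 70, 63]

# The same students, but with last year's grades inflated
# (by error or manipulation) before this teacher ever met them.
inflated_previous = [80, 78, 85, 79]

print(average_gain(honest_previous, current))    # 6.25
print(average_gain(inflated_previous, current))  # -9.25
```

The teaching is identical in both cases. Only the baseline has changed, yet the same class swings from an apparent gain of 6.25 points to an apparent loss of 9.25.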
The most dangerous algorithms generate their own proxies. Such systems pull in all available data and build an internal model that reflects what has happened so far. But there need not be any logic in the measures used. The algorithm could decide, for example, that staff members whose names start with C deserve 10 per cent higher rewards than anyone else, if past data coincidentally supported this. And because such a big data system is a black box where not even the developers understand how the conclusions are reached, no one can challenge the outcome or correct for such errors.
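A self-generated proxy of this kind is easy to reproduce in miniature. The sketch below is deliberately naive and entirely hypothetical (the names, the rewards and the functions `learn_reward_by_initial` and `predict_reward` are made up), but it shows how a model that simply mirrors past data will turn a coincidence into a rule.

```python
from collections import defaultdict


def learn_reward_by_initial(history):
    """Build a 'model' mapping each first initial to its average past reward.

    Nothing here asks whether an initial should matter: the model
    just reflects whatever pattern the past data happens to contain.
    """
    rewards = defaultdict(list)
    for name, reward in history:
        rewards[name[0].upper()].append(reward)
    return {initial: sum(values) / len(values) for initial, values in rewards.items()}


def predict_reward(model, name, default):
    """Reward the model would give a new staff member, by initial alone."""
    return model.get(name[0].upper(), default)


# Invented past data in which the two people whose names start with C
# happened to receive 10 per cent more than everyone else.
history = [
    ("Carol", 1100),
    ("Chen", 1100),
    ("Ahmed", 1000),
    ("Beth", 1000),
    ("Dev", 1000),
]

model = learn_reward_by_initial(history)

print(predict_reward(model, "Cleo", default=1000))  # 1100.0
print(predict_reward(model, "Bob", default=1000))   # 1000.0
```

Here the rule is at least visible, because the model is a small dictionary that anyone can read. In a real black-box system the equivalent pattern would be buried among thousands of learned parameters, which is what makes it so hard to challenge.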
These problems are not insuperable, and are well worth overcoming for the huge benefits big data techniques can bring.
But getting the big ‘A’s right does mean having more transparency than is currently common.
We need to make the assumptions and the workings of the algorithm available for inspection and correction if the public is to have confidence in big data – and if business is to get the most benefit from it.