Tom Chatfield reports for the BBC:
You may be familiar with the statistic that 90% of the world’s data was created in the last few years. It’s true. One of the first mentions of this particular formulation I can find dates back to May 2013, but the trend remains remarkably constant. Indeed, every two years for about the last three decades the amount of data in the world has increased by about 10 times – a rate that puts even Moore’s law of doubling processor power to shame.
One of the problems with such a rate of information increase is that the present moment will always loom far larger than even the recent past. Imagine looking back over a photo album representing the first 18 years of your life, from birth to adulthood. Let’s say that you have two photos for your first two years. Assuming a rate of information increase matching that of the world’s data, you will have an impressive 2,000 photos representing the years six to eight; 200,000 for the years 10 to 12; and a staggering 200,000,000 for the years 16 to 18. That’s more than three photographs for every single second of those final two years.
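To make the album arithmetic concrete, here is a minimal Python sketch (mine, not the article's) that reproduces the stated growth, a tenfold increase every two years starting from two photos, and checks the closing photos-per-second figure.

```python
# Reproduce the photo-album thought experiment: 2 photos for years 0-2,
# then ten times as many for each subsequent two-year period.
SECONDS_PER_YEAR = 365.25 * 24 * 3600

photos = 2
for start in range(0, 18, 2):
    print(f"Years {start}-{start + 2}: {photos:,} photos")
    final_period = photos  # after the loop, holds the years 16-18 count
    photos *= 10

# 200,000,000 photos spread over the final two years:
per_second = final_period / (2 * SECONDS_PER_YEAR)
print(f"Photos per second in years 16-18: {per_second:.2f}")  # ≈ 3.17
```

The run confirms the article's figures: 2,000 photos for years six to eight, 200,000 for ten to twelve, 200,000,000 for sixteen to eighteen, or roughly 3.17 per second.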
This isn’t a perfect analogy with global data, of course. For a start, much of the world’s data increase is due to more sources of information being created by more people, along with far larger and more detailed formats. But the point about proportionality stands. If you were to look back over a record like the one above, or try to analyse it, the more distant past would shrivel into meaningless insignificance. How could it not, with so many times less information available?
Here’s the problem with much of the big data currently being gathered and analysed. The moment you start looking backwards to seek the longer view, you have far too much of the recent stuff and far too little of the old. Short-sightedness is built into the structure, in the form of an overwhelming tendency to over-estimate short-term trends at the expense of history.
To understand why this matters, consider the findings from social science about ‘recency bias’, which describes the tendency to assume that future events will closely resemble recent experience. It’s a version of what is also known as the availability heuristic: the tendency to base your thinking disproportionately on whatever comes most easily to mind. It’s also a universal psychological attribute. If the last few years have seen exceptionally cold summers where you live, for example, you might be tempted to state that summers are getting colder – or that your local climate may be cooling. In fact, you shouldn’t read anything whatsoever into the data. You would need to take a far, far longer view to learn anything meaningful about climate trends. In the short term, you’d be best not speculating at all – but who among us can manage that?
The same tends to be true of most complex phenomena in real life: stock markets, economies, the success or failure of companies, war and peace, relationships, the rise and fall of empires. Short-term analyses aren’t only invalid – they’re actively unhelpful and misleading. Just look at the legions of economists who lined up to pronounce events like the 2008 financial crisis unthinkable right up until it happened. The very notion that valid predictions could be made on that kind of scale was itself part of the problem.
It’s also worth remembering that novelty tends to be a dominant consideration when deciding what data to keep or delete. Out with the old and in with the new: that’s the digital trend in a world where search algorithms are intrinsically biased towards freshness, and where so-called link rot infests everything from Supreme Court decisions to entire social media services. A bias towards the present is structurally engrained in almost all the technology surrounding us, not least thanks to our habit of ditching most of our once-shiny machines after about five years.
What to do? This isn’t just a question of being better at preserving old data – although this wouldn’t be a bad idea, given just how little is currently able to last decades rather than years. More importantly, it’s about determining what is worth preserving in the first place – and what it means to cull information meaningfully in the name of knowledge.
What’s needed is something that I like to think of as “intelligent forgetting”: teaching our tools to become better at letting go of the immediate past in order to keep its larger continuities in view. It’s an act of curation akin to organising a photograph album – albeit with more maths. When are two million photographs less valuable than two thousand? When the larger sample covers less ground; when the questions that can be asked of it are less important; when the level of detail on offer instils not useful scepticism, but false confidence.
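What might such curation look like in practice? Purely as an illustration (the article proposes no algorithm), here is a minimal Python sketch of one naive policy: give every fixed-length epoch the same record budget, so that dense recent data is subsampled while sparse old data is kept whole. The function thin_archive, its parameters and the sampling rule are all hypothetical assumptions, not the author's method.

```python
# Hypothetical sketch of "intelligent forgetting": thin a timestamped archive
# so each epoch keeps at most the same number of records, preventing the
# dense recent past from drowning out the sparse distant past.
from collections import defaultdict
import random

def thin_archive(records, epoch_years=2, per_epoch=1000, seed=0):
    """records: iterable of (year, payload) pairs.
    Returns a chronological sample capped at per_epoch records per epoch."""
    rng = random.Random(seed)
    epochs = defaultdict(list)
    for year, payload in records:
        epochs[int(year // epoch_years)].append((year, payload))
    kept = []
    for bucket in epochs.values():
        if len(bucket) <= per_epoch:
            kept.extend(bucket)  # sparse (usually old) epochs survive whole
        else:
            kept.extend(rng.sample(bucket, per_epoch))  # dense epochs are subsampled
    return sorted(kept, key=lambda r: r[0])

# e.g. an archive in which each two-year epoch holds ten times more records:
archive = [(2 * e + 0.5, f"item-{e}-{i}")
           for e in range(6) for i in range(2 * 10 ** e)]
sample = thin_archive(archive)  # every epoch now contributes at most 1,000 records
```

The trade-off is made explicit: fine-grained recent detail is given up so that the distant past keeps a proportionate voice, much as a curated album might allot each period of a life the same number of pages.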
Many data sets are irreducible and most precious when complete: gene sequences; demographic data; the raw, hard knowledge of geography and physics. The softer the science, however, the more likely scale is to correlate inversely with quality – and the more important time itself becomes as a filter. Either we choose carefully what endures, matters and meaningfully captures our receding past – or its imprint is silently supplanted by the present’s growing noise.
Time cuts several ways, for there is another crucial sense in which it remains a limiting factor: the availability of human time and attention. Corporations, individuals and governments alike have orders of magnitude more information available today than they did even a few years ago. Yet they don’t have any more available attention, board members, chief executives, elected officials or hours in the day. Better and better tools exist to help decision-makers ask meaningful questions of the information they possess – but you can only analyse what remains accessible. Mere accumulation is no kind of answer. In an era of bigger and bigger data, what you choose not to know matters just as much as what you do.