A Blog by Jonathan Low

 

Apr 18, 2014

It's Not About Fixing the Algorithm: Big Data and the Real World

Call it context or situational awareness or experience, but the crucial ingredient in making big data big probably relies more on pulling your head out of the computer screen than on aggregating all the right numbers.

A particularly popular example of this lies in the review of Google Flu Trends. The service was established with all of the usual societal benefits in mind - and as a subtle way of reinforcing the underlying value of both the data and the company that generated it - but recent evaluations have found that it consistently overestimated the incidence of flu.

Adherents argue that this is a result of newness and that as the data set becomes more longitudinal - in other words, as we have more experience - the algorithm can be tweaked and the accuracy improved. Others maintain, however, that this has more to do with assumptions built into the model than with any deficiencies in the data itself.

We live in a world in which more and more decisions of greater import have to be made with less information in shorter periods of time. The perfect is the enemy of the good. We have to learn to interpret using our instincts, experience and common sense, not wait for the ultimate data set. JL

Mikkel Krenchel and Christian Madsbjerg report in Wired:

Big data is nothing without “thick data,” the rich and contextualized information you gather only by getting up from the computer and venturing out into the real world.
In a generation, the relationship between the “tech genius” and society has been transformed: from shut-in to savior, from antisocial to society’s best hope. Many now seem convinced that the best way to make sense of our world is by sitting behind a screen analyzing the vast troves of information we call “big data.”
Just look at Google Flu Trends. When it was launched in 2008, many in Silicon Valley touted it as yet another sign that big data would soon make conventional analytics obsolete.
But they were wrong.
Not only did Google Flu Trends largely fail to provide an accurate picture of the spread of influenza, it will never live up to the dreams of the big-data evangelists. Because big data is nothing without “thick data,” the rich and contextualized information you gather only by getting up from the computer and venturing out into the real world. Computer nerds were once ridiculed for their social ineptitude and told to “get out more.” The truth is, if big data’s biggest believers actually want to understand the world they are helping to shape, they really need to do just that.

It Is Not About Fixing the Algorithm

The dream of Google Flu Trends was that by identifying the words people tend to search for during flu season, and then tracking when those same words peaked in real time, Google would be able to alert us to new flu pandemics much faster than the official CDC statistics, which generally lag by about two weeks.
For many, Google Flu Trends became the poster child for the power of big data. In their best-selling book Big Data: A Revolution That Will Transform How We Live, Work, and Think, Viktor Mayer-Schönberger and Kenneth Cukier claimed that Google Flu Trends was “a more useful and timely indicator [of flu] than government statistics with their natural reporting lags.” Why even bother checking the actual statistics of people getting sick, when we know what correlates to sickness? “Causality,” they wrote, “won’t be discarded, but it is being knocked off its pedestal as the primary fountain of meaning.”
But, as an article in Science earlier this month made clear, Google Flu Trends has systematically overestimated the prevalence of flu every single week since August 2011.

And back in 2009, shortly after launch, it completely missed the swine flu pandemic. It turns out that many of the words people search for during flu season have nothing to do with flu, and everything to do with the time of year when flu season usually falls: winter.

Now, it is easy to argue – as many have done – that the failure of Google Flu Trends simply speaks to the immaturity of big data. But that misses the point. Sure, tweaking the algorithms, and improving data collection techniques will likely make the next generation of big data tools more effective. But the real big data hubris is not that we have too much confidence in a set of algorithms and methods that aren’t quite there yet. Rather, the issue is the blind belief that sitting behind a computer screen crunching numbers will ever be enough to understand the full extent of the world around us.

Why Big Data Needs Thick Data

Big data is really just a big collection of what people in the humanities would call thin data. Thin data is the sort of data you get when you look at the traces of our actions and behaviors. We travel this much every day; we search for that on the Internet; we sleep this many hours; we have so many connections; we listen to this type of music, and so forth. It’s the data gathered by the cookies in your browser, the FitBit on your wrist, or the GPS in your phone. These properties of human behavior are undoubtedly important, but they are not the whole story.
To really understand people, we must also understand the aspects of our experience — what anthropologists refer to as thick data. Thick data captures not just facts but the context of facts. Eighty-six percent of households in America drink more than six quarts of milk per week, for example, but why do they drink milk? And what is it like? A piece of fabric with stars and stripes in three colors is thin data. An American Flag blowing proudly in the wind is thick data.
Rather than seeking to understand us simply based on what we do as in the case of big data, thick data seeks to understand us in terms of how we relate to the many different worlds we inhabit. Only by understanding our worlds can anyone really understand “the world” as a whole, which is precisely what companies like Google and Facebook say they want to do.

Knowing the World Through Ones and Zeroes

Consider, for a moment, the grandiosity of some of the claims being made in Silicon Valley right now. Google’s mission statement is famously to “organize the world’s information and make it universally accessible and useful.” Mark Zuckerberg recently told investors that, along with prioritizing increased connectivity across the globe and emphasizing a knowledge economy, Facebook was committed to a new vision called “understanding the world.” He described what this “understanding” would soon look like: “Every day, people post billions of pieces of content and connections into the graph [Facebook’s algorithmic search mechanism] and in doing this, they’re helping to build the clearest model of everything there is to know in the world.” Even smaller companies share in the pursuit of understanding. Last year, Jeremiah Robison, the VP of Software at Jawbone, explained that the goal with their fitness-tracking device Jawbone UP was “to understand the science of behavior change.”
These goals are as big as the data that is supposed to achieve them. And it is no wonder that businesses yearn for a better understanding of society. After all, information about customer behavior and culture at large is not only essential to making sure you stay relevant as a company, it is also increasingly a currency that in the knowledge economy can be traded for clicks, views, advertising dollars or, simply, power. If in the process, businesses like Google and Facebook can contribute to growing our collective knowledge about ourselves, all the more power to them. The issue is that by claiming that computers will ever organize all our data, or provide us with a full understanding of the flu, or fitness, or social connections, or anything else for that matter, they radically reduce what data and understanding mean.
If the big data evangelists of Silicon Valley really want to “understand the world” they need to capture both its (big) quantities and its (thick) qualities. Unfortunately, gathering the latter requires that instead of just ‘seeing the world through Google Glass’ (or in the case of Facebook, Virtual Reality) they leave the computers behind and experience the world first hand. There are two key reasons why.

To Understand People, You Need to Understand Their Context

Thin data is most useful when you have a high degree of familiarity with an area, and thus have the ability to fill in the gaps and imagine why people might have behaved or reacted like they did — when you can imagine and reconstruct the context within which the observed behavior makes sense. Without knowing the context, it is impossible to infer any kind of causality and understand why people do what they do.
This is why, in scientific experiments, researchers go to great lengths to control the context of the laboratory environment – to create an artificial place where all influences can be accounted for. But the real world is not a lab. The only way to make sure you understand the context of an unfamiliar world is to be physically present yourself to observe, internalize, and interpret everything that is going on.

Most of ‘the World’ Is Background Knowledge We Are Not Aware of

If big data excels at measuring actions, it fails at understanding people’s background knowledge of everyday things. How do I know how much toothpaste to use on my toothbrush, or when to merge into a traffic lane, or that a wink means “this is funny” and not “I have something stuck in my eye”? These are the internalized skills, automatic behaviors, and implicit understandings that govern most of what we do. It is a background of knowledge that is invisible to ourselves as well as those around us unless they are actively looking. Yet it has tremendous impact on why individuals behave as they do. It explains how things are relevant and meaningful to us.
The human and social sciences contain a large array of methods for capturing and making sense of people, their context, and their background knowledge, and they all have one thing in common: they require that the researchers immerse themselves in the messy reality of real life.
No single tool is likely to provide a silver bullet to human understanding. Despite the many wonderful innovations developed in Silicon Valley, there are limits to what we should expect from any digital technology. The real lesson of Google Flu Trends is that it simply isn’t enough to ask how ‘big’ the data is: we also need to ask how ‘thick’ it is.
Sometimes, it is just better to be there in real life. Sometimes, we have to leave the computer behind.
