A Blog by Jonathan Low

 

Feb 11, 2015

Wrong Tools, Wrong Box? Analysis of Big Data Suffers From Outmoded Statistical and Interpretive Methodologies

Big data presents managers with a fantastic opportunity to better understand and even anticipate customers' desires, as well as evaluate their own organization's abilities to address those aspirations.

There's just one problem, as the following article explains: the kinds of data being collected by a new generation of devices about an even newer range of issues is ill-suited to interpretation by existing statistical and data-mining methodologies.

Part of the issue is volume: we are unaccustomed to the range, depth and speed of information flowing into institutional databases. Most organizations are trained and equipped to ask a specific set of questions based on established methodologies designed for a time when events and collection moved at a more majestic, which is to say slower, pace. A time when longitudinal series could be assembled patiently. When we had time to contemplate.

But when was the last time anyone you know believed they had that sort of license? Indeed.

Eventually, methods will catch up with need. In the interim, it is worth rethinking the sort of questions being asked. This also requires reimagining what sort of outcomes might be possible rather than forcing assumptions through the prism of historical experience. It may now be necessary to invest some time in re-evaluating what is possible - and how to get wherever that leads. The future is developing; the urgent question is whether managers are capable of recognizing it. JL

Scott Thurm reports in the Wall Street Journal:

Opportunistic data collection is leading to entirely new kinds of data that aren’t well suited to the existing statistical and data-mining methodologies.
Companies are getting a flood of new data from customers, vendors and the factory floor. What are some ways to think about big data—and to reduce uncertainty when building a business strategy around it?
Amy Braverman, principal statistician at the Jet Propulsion Laboratory, comes up with ways to analyze and make accessible troves of information from NASA’s space-borne instruments.
Here are edited excerpts of their conversation.
Making Sense of It All
MR. THURM: What do you do when you don’t have all the data, or it’s not all neatly organized?
DR. BRAVERMAN: Two things. Data collection is so different from what it used to be. Spacecraft are collecting information on thousands and thousands of variables about the health of the Earth and the atmosphere. Freeways have built-in sensors, so you can go to your smartphone and see how many people are getting on at your on-ramp and getting off. At the supermarket, every time you scan something, it goes into a database.
This opportunistic data collection is leading to entirely new kinds of data that aren’t well suited to the existing statistical and data-mining methodologies. So point number one is that you need to think hard about the questions that you have and about the way the data were collected, and build new statistical tools to answer those questions. Don’t expect that the existing software package you have is going to give you the tools you need.

Point number two is having to deal with distributed data. There’s been a lot of work done on distributed computing, where you have a computation, and you want to farm the computation out in pieces, so that you can get it done faster. What do you do when the data that you want to analyze are actually in different places?
There are lots of clever solutions for doing that. But at some point, the volume of data is going to outstrip the ability to do that. You’re forced to think about how you might, for example, reduce those data sets so that they’re easier to move.
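To make the "reduce the data before you move it" idea concrete, here is a minimal sketch (ours, not from the interview): each site collapses its raw readings into a small summary, and only the summaries travel to a central place to be combined, so the raw data never has to leave its location. The sites, readings, and helper names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Summary:
    """A tiny, shippable stand-in for a site's raw data."""
    count: int
    total: float

def local_summary(values):
    """Run at each site: collapse raw readings into a small summary."""
    return Summary(count=len(values), total=sum(values))

def combine(summaries):
    """Run centrally: merge the per-site summaries into a global mean."""
    count = sum(s.count for s in summaries)
    total = sum(s.total for s in summaries)
    return total / count if count else float("nan")

# Hypothetical sensor readings held at three different locations.
site_a = [2.1, 2.4, 2.0]
site_b = [3.3, 3.1]
site_c = [2.8, 2.9, 3.0, 2.7]

global_mean = combine([local_summary(s) for s in (site_a, site_b, site_c)])
print(round(global_mean, 3))  # global mean computed without pooling raw data
```

The same pattern extends to any statistic that can be rebuilt from per-site pieces (counts, sums, sums of squares); statistics that can't be decomposed that way are exactly where the harder methodological work she mentions comes in.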
What We Don’t Know
MR. THURM: One of the projects you work on deals with satellite-captured data on CO2 concentrations in the atmosphere.
DR. BRAVERMAN: The satellites we work with are in polar orbit, which means they are flying from north to south and then back up the backside of the Earth. A satellite that’s in polar orbit with the Earth turning underneath it is always going to cross the equator at the same time every day. I think for CO2, it’s 1:30 in the afternoon.
If we want to draw conclusions about the global distribution of carbon dioxide, we have to know that the data we’re collecting pertains to 1:30 in the afternoon. Plants breathe. And they exhale. Typically, they’re doing photosynthesis during the day. At 1:30 in the afternoon, the plants are going to be sucking up a lot of carbon dioxide.
It would be a mistake to draw a conclusion about the global distribution of carbon dioxide, based on those data, for all times of day.
You have to be aware of the biases that may be imparted to the data that you have, relative to the data you wish you had.
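A small illustration of the overpass-time bias Dr. Braverman describes (ours, not from the interview): if CO2 follows a daily cycle and the satellite only ever samples at 1:30 in the afternoon, that sample will systematically differ from the true all-day average. The sine-wave cycle and the numbers below are purely illustrative, not real measurements.

```python
import math

def co2_at(hour, baseline=400.0, amplitude=5.0):
    """Toy diurnal cycle: lowest in mid-afternoon, when plants draw down CO2."""
    return baseline - amplitude * math.sin(math.pi * (hour - 6.0) / 12.0)

hours = [h / 2 for h in range(48)]                      # every 30 minutes over one day
true_daily_mean = sum(co2_at(h) for h in hours) / len(hours)
overpass_sample = co2_at(13.5)                          # the satellite only sees 1:30 p.m.

print(f"true daily mean : {true_daily_mean:.2f} ppm")
print(f"1:30 p.m. sample: {overpass_sample:.2f} ppm")
# The fixed-time sample sits below the full-day average: the data you have
# are biased relative to the data you wish you had.
```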
