Sunday, June 14, 2015

The limits of Big Data

Original post:  Apr 15, 2014

Today's post comes from an article on big data in the New York Times:  http://www.nytimes.com/2014/04/07/opinion/eight-no-nine-problems-with-big-data.html?_r=0

I am an enthusiastic supporter of big data. On the one hand, I truly believe that statistics, properly applied, can provide strong supporting data for business decisions. On the other hand, there are many times when "big data" just seems like another buzzword, used as a substitute for the "and then a miracle happens" magical thinking in certain unwieldy process maps!

What attracted me to this article was the way it tempers the great expectations often attached to big data. It's important to understand some of big data's limitations in order to grasp its power more fully.

Here is one example:

The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two. Likewise, from 1998 to 2007 the number of new cases of autism diagnosed was extremely well correlated with sales of organic food (both went up sharply), but identifying the correlation won’t by itself tell us whether diet has anything to do with autism.
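To make the quoted point concrete, here is a minimal Python sketch (my own illustration, not something from the article). It assumes two made-up series that each decline over time for unrelated reasons; the starting values, slopes, noise levels, and the 72-period length are all hypothetical. The raw series come out strongly correlated simply because both trend downward, while the period-to-period changes, which carry no shared trend, usually show little correlation.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def trending_series(start, slope, noise_sd, n, rng):
    """A straight-line trend plus independent Gaussian noise."""
    return [start + slope * t + rng.gauss(0, noise_sd) for t in range(n)]


def differences(series):
    """Period-to-period changes, which remove a linear trend."""
    return [b - a for a, b in zip(series, series[1:])]


def compare(seed, n_periods=72):
    """Return (correlation of raw series, correlation of their changes)."""
    rng = random.Random(seed)
    # Hypothetical stand-ins for two unrelated quantities that both fall.
    series_a = trending_series(100.0, -0.5, 2.0, n_periods, rng)
    series_b = trending_series(60.0, -0.4, 2.0, n_periods, rng)
    raw_r = pearson_r(series_a, series_b)
    diff_r = pearson_r(differences(series_a), differences(series_b))
    return raw_r, diff_r


if __name__ == "__main__":
    raw_r, diff_r = compare(seed=42)
    print(f"correlation of raw series:      {raw_r:.2f}")
    print(f"correlation of period changes:  {diff_r:.2f}")
```

Differencing is only one simple way to strip out a shared trend, and I chose it here for brevity. The takeaway is the same as the article's: a high correlation between two trending series says very little on its own.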

Another critical point:

A sixth worry is the risk of too many correlations. If you look 100 times for correlations between two variables, you risk finding, purely by chance, about five bogus correlations that appear statistically significant — even though there is no actual meaningful connection between the variables. Absent careful supervision, the magnitudes of big data can greatly amplify such errors.
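The "about five" figure follows from the usual 5% significance threshold: 100 tests times 0.05 gives an expected five false positives. Below is a small simulation sketch of that arithmetic (again my own illustration, with hypothetical names and parameters). It generates pairs of completely independent random variables, tests each pair for correlation, and counts how many look "significant". The p-value uses the Fisher z-transformation, which is a normal approximation rather than an exact test, so treat the counts as approximate.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def approx_p_value(r, n):
    """Approximate two-sided p-value for r via the Fisher z-transformation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))


def count_false_positives(n_tests=100, n_points=50, alpha=0.05, seed=0):
    """Count 'significant' correlations among pairs of unrelated variables."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        # Both variables are pure noise, so any "finding" is bogus.
        xs = [rng.gauss(0, 1) for _ in range(n_points)]
        ys = [rng.gauss(0, 1) for _ in range(n_points)]
        if approx_p_value(pearson_r(xs, ys), n_points) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    counts = [count_false_positives(seed=s) for s in range(50)]
    print("false positives per 100 tests, first ten runs:", counts[:10])
    print("average over 50 runs:", sum(counts) / len(counts))
```

Any single run of 100 tests can land well above or below five, but the average across many runs should sit close to it. With thousands of variables to compare, which is routine in big data work, the number of bogus hits grows in proportion.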

In my opinion, the most important part of big data is having someone with the ability to look at the various correlations and draw meaningful conclusions from them that can make a significant business impact. Knowing that Netflix users really like sophisticated drama and Kevin Spacey means nothing until you make the $100 million investment to create the US version of "House of Cards"!
