
Monday, July 18, 2016

Nine point guide to spotting dodgy stats

The Guardian published a closer examination of the many ways facts and figures are twisted and distorted, sometimes beyond recognition. Statistics can lend powerful support to an argument, but it is also important to look past the headline and gain greater context before drawing your own conclusion.

An early paragraph helps explain their rationale:

Every statistician is familiar with the tedious “Lies, damned lies, and statistics” gibe, but the economist, writer and presenter of Radio 4’s More or Less, Tim Harford, has identified the habit of some politicians as not so much lying – to lie means having some knowledge of the truth – as “bullshitting”: a carefree disregard of whether the number is appropriate or not.

Guardian: Our nine-point guide to spotting a dodgy statistic

Thursday, May 5, 2016

Math Stumps Your Doctor

Bloomberg View has an article discussing the changing nature of healthcare. Doctors are now being presented with complex statistics that require real mathematical skill to parse. Here is an example:

Take the famous hypothetical example of a test that is 95 percent accurate for a disease that affects 0.1 percent of the population. Imagine you’re a doctor and your patient tests positive. What is the chance that she has the disease? Most people’s intuitive answer is a rather dire 95 percent. This is wrong in a big way. Despite the ominous test result, the patient is unlikely to be sick. 
“Even doctors and medical students are prone to this error,” wrote Aron Barbey, a cognitive neuroscientist at the University of Illinois, in a paper on risk literacy published last month in the journal Science.
Some people do get the right answer: that the patient has about a 2 percent chance of having the disease.
To be honest, I did not know that answer. The article goes on to illustrate one way to arrive at it, and once I read the explanation, the logic became clear.

But there’s also an intuitive approach that requires no formula at all. Imagine 1,000 people getting the test. On average, one will have the disease. The 5 percent error rate means that about 50 of the 999 healthy people will test positive. Now it’s easy to see that the group of false positives is about 50 times bigger than the group of real positives. In other words, just 2 percent of the people testing positive are likely to be sick.
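The counting argument above matches the formal Bayes calculation. Here is a minimal sketch in Python, assuming (as the example implies) that "95 percent accurate" means both sensitivity and specificity are 0.95:

```python
# Posterior probability of disease given a positive test, via Bayes' theorem.
# Assumes "95 percent accurate" means sensitivity = specificity = 0.95.
prevalence = 0.001   # 0.1% of the population has the disease
sensitivity = 0.95   # P(test positive | sick)
specificity = 0.95   # P(test negative | healthy)

true_positive = sensitivity * prevalence                # sick and flagged
false_positive = (1 - specificity) * (1 - prevalence)   # healthy but flagged

posterior = true_positive / (true_positive + false_positive)
print(f"P(sick | positive test) = {posterior:.1%}")  # about 1.9%
```

The denominator is everyone who tests positive, sick or not; because healthy people vastly outnumber sick ones, their 5 percent false-positive slice swamps the true positives.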

I found the discussion of inside versus outside views interesting.

In an interview, Barbey said that when dealing with conditional probabilities, people often make the mistake of focusing on just the population statistics (what he calls the outside view) or just the patient’s individual statistics (the inside view). In the ALS story, the doctor saw only the outside view, focusing on the low rate of the disease in the whole population. In the problem of the test that's 95-percent accurate, people often take the inside view, ignoring the rarity of the disease.

Here is the link to the full article:  http://www.bloombergview.com/articles/2016-05-05/math-stumps-your-doctor-too

Tuesday, January 26, 2016

The difference between data science and statistics

Original post:  Oct 15, 2015

Some of us really enjoy working with data. Others might have special names for these individuals, but we soldier on, nonetheless.

With the advent of big data and a proliferation of sites dedicated to the art and science of parsing data, there are interesting new career paths opening up. One such category is "data science". Practitioners of these dark arts are called "data scientists". But what does that mean?

This graphic might help explain the concept a little more clearly:


Here is that article's definition of statistics compared to data science:

Statistics was primarily developed to help people deal with pre-computer data problems like testing the impact of fertilizer in agriculture, or figuring out the accuracy of an estimate from a small sample. Data science emphasizes the data problems of the 21st Century, like accessing information from large databases, writing code to manipulate data, and visualizing data.

Why did a specialized need for such people arise? Part of the answer grew out of the computing power developed to deal with massive amounts of data:

Several factors prompted these innovations: First, people needed to work with datasets, which we now call big data, that are larger than pre-computational statisticians could have imagined. Second, industry focused increasingly on making predictions about markets, customer behavior and more for commercial uses. The inventors of data science borrowed from statistics, machine learning and database management to create a whole new set of tools for those working with data.
Statistics, on the other hand, has not changed significantly in response to new technology. The field continues to emphasize theory, and introductory statistics courses focus more on hypothesis testing than statistical computing.
The article goes on to provide a brief description of the data scientist:

Statistician and data visualizer Nathan Yau of Flowing Data suggests that data scientists typically have 3 major skills:
(1) They have a strong knowledge of basic statistics and machine learning—or at least enough to avoid misinterpreting correlation for causation, or extrapolating too much from a small sample size.
(2) They have the computer science skills to take an unruly dataset and use a programming language (like R or Python) to make it easy to analyze.
(3) They can visualize and summarize their data and their analysis in a way that is meaningful to somebody less conversant in data.
Andrew Gelman, a statistician at Columbia University, writes that it is “fair to consider statistics… as a subset of data science” and probably the “least important” aspect. He suggests that the administrative aspects of dealing with data like harvesting, processing, storing and cleaning are more central to data science than hard core statistics.
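Skill (2) above, taming an unruly dataset with code, can be as simple as normalizing inconsistent entries. A minimal Python sketch, using a made-up CSV of enrollment records (the column names and values are hypothetical):

```python
import csv
import io

# Hypothetical messy export: inconsistent casing, stray whitespace, blank fields.
raw = """name, state ,enrolled
  Alice Smith , mn ,yes
BOB JONES,MN, YES
carol lee , wi ,
"""

rows = list(csv.reader(io.StringIO(raw)))
header = [h.strip() for h in rows[0]]  # 'name', 'state', 'enrolled'

cleaned = []
for fields in rows[1:]:
    record = dict(zip(header, (f.strip() for f in fields)))
    record["name"] = record["name"].title()     # consistent casing
    record["state"] = record["state"].upper()   # consistent state codes
    record["enrolled"] = record["enrolled"].lower() == "yes"  # blank -> False
    cleaned.append(record)
```

After this pass, every record has the same shape and can be grouped, counted, or plotted without special-casing the formatting quirks.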
As we try to bring aspects of this important new field to Medtronic, it will be fascinating to watch how existing personnel can grow into these types of skills. We may also need to bring in some others to help us learn these new skills and cultivate centers of excellence built around the meaningful presentation of data findings!

Monday, October 19, 2015

Rich data, poor data

Original post:  Mar 12, 2015

Yesterday, I posted about a math whiz who is trying to help clinicians understand who gets cancer and why (Dr. Data). I believe that there are many mysteries that are ultimately "solvable" with the proper application of statistical analysis. The trick is in getting to that proper analysis. Sometimes you don't have any data to analyze. Sometimes you have so much that it is difficult to separate the wheat from the chaff.

We have often heard that healthcare is different from other industries. It is, but it isn't always a positive. There have been many articles written recently that talk about the wide variance in cost between medical procedures. I plan to post next week about one article I saw that found the exact same blood procedure could be priced at $10 or $10,000 depending on where you had the procedure done.

When you have an industry that considers itself an outlier, one of the best ways to help it change is to take models developed in other industries and adapt them to fit the specialized nature of healthcare. I believe the same holds for data analysis.

This article from FiveThirtyEight by Nate Silver discusses the use of data analysis in sports. Sabermetrics in baseball helped pave the way for detailed analysis in all the major sports. Theories that once were held only by fringe elements are now considered mainstream. It's hard to believe that when I was a boy, only a handful of people understood what "on base percentage" was and, more importantly, why it figured so prominently in baseball! Today, we routinely accept nuggets of information like the distance each player is estimated to have run in the course of a soccer match ticking across the bottom of our screens.

Mr. Silver goes on to explain why statistical analysis has done so well in sports.
  1. Sports has awesome data. It's not just that there are reams of statistics everywhere you turn; the data is also rich. Silver defines rich data as accurate, precise, and subject to rigorous quality control.
  2. In sports, we know the rules. With a common understanding of the goals and objectives, we can generally agree on which measures show gains or losses.
  3. Sports offers fast feedback and clear marks of success. "Winning", as defined by the rules of the game, becomes the ultimate arbiter of success. With regular intervals of games adding rich sources of new data, you can test theories quickly and determine whether or not they are achieving the results you expected.

Here is a link to the full article:  Rich Data, Poor Data | FiveThirtyEight

Sunday, June 14, 2015

The limits of Big Data

Original post:  Apr 15, 2014

Today's post comes from an article on big data in the New York Times:  http://www.nytimes.com/2014/04/07/opinion/eight-no-nine-problems-with-big-data.html?_r=0

I am an enthusiastic supporter of the power of big data. On the one hand, I truly believe that there is significant power in the proper application of statistics to help provide the supporting data for business decisions. On the other hand, there are many times when "big data" just seems like another buzzword that is used as a substitute for the "and then a miracle happens" magical thinking in certain unwieldy process maps!

What attracted me to this article was the way that it tempers the great expectations often applied to big data. It's important to note some of its limitations in order to grasp its power more fully.

Here is one example:

The first thing to note is that although big data is very good at detecting correlations, especially subtle correlations that an analysis of smaller data sets might miss, it never tells us which correlations are meaningful. A big data analysis might reveal, for instance, that from 2006 to 2011 the United States murder rate was well correlated with the market share of Internet Explorer: Both went down sharply. But it’s hard to imagine there is any causal relationship between the two. Likewise, from 1998 to 2007 the number of new cases of autism diagnosed was extremely well correlated with sales of organic food (both went up sharply), but identifying the correlation won’t by itself tell us whether diet has anything to do with autism.

Another critical point:

A sixth worry is the risk of too many correlations. If you look 100 times for correlations between two variables, you risk finding, purely by chance, about five bogus correlations that appear statistically significant — even though there is no actual meaningful connection between the variables. Absent careful supervision, the magnitudes of big data can greatly amplify such errors.
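The "about five bogus correlations" figure follows from the conventional p < 0.05 significance threshold: under the null hypothesis, roughly 5 percent of tests come up "significant" by chance alone. A quick simulation sketch in Python, correlating pairs of purely random variables and counting false alarms (using the Fisher transformation to get approximate p-values):

```python
import math
import random

random.seed(42)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def p_value(r, n):
    # Fisher transformation: atanh(r) * sqrt(n - 3) is ~N(0, 1) under the null.
    z = math.atanh(r) * math.sqrt(n - 3)
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

n, trials = 100, 2000
false_hits = 0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]
    ys = [random.gauss(0, 1) for _ in range(n)]  # independent of xs
    if p_value(pearson_r(xs, ys), n) < 0.05:
        false_hits += 1

print(f"{false_hits / trials:.1%} of independent pairs looked 'significant'")
```

Every variable here is pure noise with no connection to any other, yet about one pair in twenty clears the significance bar, which is exactly the trap the article describes.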

In my opinion, the most important part of big data is having someone with the ability to look at the various correlations and draw meaningful conclusions from them that can make a significant business impact. Knowing that Netflix users really like sophisticated drama and Kevin Spacey means nothing until you make the $100 million investment to create the US version of "House of Cards"!

Why I don't watch cable news

Original post:  Nov 15, 2013

Statistics and probability are amazing tools. But, like all tools, they can be manipulated for malicious purposes.

I am especially wary of the way news is presented on TV. It seems like these shows naturally gravitate towards emotional stories that are designed to scare or outrage you. These shows are rarely lying. Rather, they are doing something far more pernicious. They are taking something that has a grain of truth and turning it into something completely unrecognizable.

One example involves scientific data. A study may show that a certain behavior raises the rate of a cancer from one in a million to two in a million. Stated that way, it still seems like a nearly infinitesimal chance that it will affect the life of the average person. Yet that same finding presented on a cable news show would likely be shouted as "Are you in danger of doubling your cancer risk? Find out at 11!"
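The gap between the honest statement and the scary headline is the difference between relative and absolute risk, which takes one line of arithmetic to see:

```python
baseline = 1 / 1_000_000   # one in a million
elevated = 2 / 1_000_000   # two in a million

relative_increase = elevated / baseline - 1   # what the headline shouts
absolute_increase = elevated - baseline       # what you actually face

print(f"Relative risk increase: {relative_increase:.0%}")   # 100% ("doubled!")
print(f"Absolute risk increase: {absolute_increase:.6%}")   # 0.000100%
```

Both numbers are true; only one of them makes for good television.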

I thought this cartoon did a good job of showing how things get twisted. It's almost like the scientific community is battling a bad case of the "telephone" game.