Tuesday, January 26, 2016

The difference between data science and scientists

Original post:  Oct 15, 2015

Some of us really enjoy working with data. Others might have special names for these individuals, but we soldier on, nonetheless.

With the advent of big data and a proliferation of sites dedicated to the art and science of parsing data, there are interesting new career paths opening up. One such category is "data science". Practitioners of these dark arts are called "data scientists". But what does that mean?

This graphic might help explain the concept a little more clearly:


Here is that article's definition of statistics compared to data science:

Statistics was primarily developed to help people deal with pre-computer data problems like testing the impact of fertilizer in agriculture, or figuring out the accuracy of an estimate from a small sample. Data science emphasizes the data problems of the 21st Century, like accessing information from large databases, writing code to manipulate data, and visualizing data.

Why did a specialized need for such people arise? Part of the answer lies in the computers we built to deal with massive amounts of data:

Several factors prompted these innovations: First, people needed to work with datasets, which we now call big data, that are larger than pre-computational statisticians could have imagined. Second, industry focused increasingly on making predictions about markets, customer behavior and more for commercial uses. The inventors of data science borrowed from statistics, machine learning and database management to create a whole new set of tools for those working with data.

Statistics, on the other hand, has not changed significantly in response to new technology. The field continues to emphasize theory, and introductory statistics courses focus more on hypothesis testing than statistical computing.

The article goes on to provide a brief description of the data scientist:

Statistician and data visualizer Nathan Yau of Flowing Data suggests that data scientists typically have 3 major skills:
(1) They have a strong knowledge of basic statistics and machine learning—or at least enough to avoid misinterpreting correlation for causation, or extrapolating too much from a small sample size.
(2) They have the computer science skills to take an unruly dataset and use a programming language (like R or Python) to make it easy to analyze.
(3) They can visualize and summarize their data and their analysis in a way that is meaningful to somebody less conversant in data.
Andrew Gelman, a statistician at Columbia University, writes that it is “fair to consider statistics… as a subset of data science” and probably the “least important” aspect. He suggests that the administrative aspects of dealing with data like harvesting, processing, storing and cleaning are more central to data science than hard core statistics.
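
To make the second and third skills on that list, and the "cleaning" work Gelman mentions, a little more concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the sample data, the column names, and the helper names `clean_rows` and `summarize` are my own assumptions, not anything from the article. It simply tidies a small, messy export and boils it down to a per-region summary that a less data-minded reader could follow.

```python
import csv
import io
import statistics

# Hypothetical messy export: inconsistent casing, stray whitespace,
# and missing or non-numeric values. Not real data.
RAW = """region, units_sold
 north ,12
South,
north,15
SOUTH, 9
east,n/a
"""


def clean_rows(text):
    """Yield (region, units) pairs, skipping rows without a usable number."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    for row in reader:
        if len(row) != 2:
            continue  # blank or malformed line
        region = row[0].strip().lower()  # normalize whitespace and casing
        value = row[1].strip()
        if not value.isdigit():
            continue  # missing ("") or non-numeric ("n/a") count
        yield region, int(value)


def summarize(text):
    """Group cleaned rows by region and report the row count and the mean."""
    groups = {}
    for region, units in clean_rows(text):
        groups.setdefault(region, []).append(units)
    return {
        region: {"n": len(values), "mean": statistics.mean(values)}
        for region, values in sorted(groups.items())
    }


summary = summarize(RAW)
for region, stats in summary.items():
    print(f"{region}: n={stats['n']}, mean={stats['mean']}")
```

Reporting `n` next to each mean is a small nod to the first skill as well: a region backed by a single row should not be read the same way as one backed by many.
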

As we try to bring aspects of this important new field to Medtronic, it will be fascinating to watch how existing personnel grow into these skills. We may also need to bring in outside help to learn them and to cultivate centers of excellence built around the meaningful presentation of data findings!
