Tuesday, June 16, 2015

Like herding cats

Original post:  Aug 20, 2014

We expect amazing things out of technology. One of the latest buzzwords is "big data". It refers to the large streams of collected information that can ultimately yield incredibly astute insights. When it's done right, it can almost feel like magic.

Google uses the power of its vast treasure trove of search requests to provide you with the autocomplete function. It can sometimes feel as if it's reading your mind. In fact, it uses sophisticated algorithms to predict your likely goal based on what millions of others have requested in the recent past.
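A toy version of this kind of prediction can be sketched in a few lines of Python. This is only an illustration of the idea (rank completions by how often past queries with the same prefix were issued), not Google's actual method, and the query log below is hypothetical:

```python
from collections import Counter

def build_counts(query_log):
    """Count how often each past query was issued."""
    return Counter(query_log)

def suggest(counts, prefix, k=3):
    """Return up to k past queries starting with prefix, most frequent first."""
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    matches.sort(key=lambda item: -item[1])  # stable sort keeps ties in log order
    return [q for q, _ in matches[:k]]

# Hypothetical query log standing in for millions of real searches.
log = ["weather today", "weather tomorrow", "weather today",
       "web design", "weather radar"]
counts = build_counts(log)
print(suggest(counts, "wea"))  # "weather today" ranks first (seen twice)
```

Real systems add far more signal (recency, location, your own history), but the core is the same: the crowd's past behavior is the predictor.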
[Image: google.png]
This recent article in the NY Times discusses how difficult it can be to get to the pristine data that can yield these types of results. Most of the work that falls under the umbrella term "big data" is actually something far less attractive:

Yet far too much handcrafted work — what data scientists call “data wrangling,” “data munging” and “data janitor work” — is still required. Data scientists, according to interviews and expert estimates, spend from 50 percent to 80 percent of their time mired in this more mundane labor of collecting and preparing unruly digital data, before it can be explored for useful nuggets.

“Data wrangling is a huge — and surprisingly so — part of the job,” said Monica Rogati, vice president for data science at Jawbone, whose sensor-filled wristband and software track activity, sleep and food consumption, and suggest dietary and health tips based on the numbers. “It’s something that is not appreciated by data civilians. At times, it feels like everything we do.”

It continues:

“It’s an absolute myth that you can send an algorithm over raw data and have insights pop up,” said Jeffrey Heer, a professor of computer science at the University of Washington and a co-founder of Trifacta, a start-up based in San Francisco.

One major problem is the proliferation of data sources. All types of information are swirling around with very little in common: sensors, probes, reports, e-mail, tweets. Think of the challenge of gathering your own information, then multiply it by millions (or billions) of users. How can you create some semblance of order and organize it properly without manual intervention?
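Much of the "janitor work" is exactly this kind of reconciliation. A minimal sketch of one common step, mapping differently named fields from different sources onto a shared schema (the source names and field names here are hypothetical, purely for illustration):

```python
# Per-source field mappings: each hypothetical source calls the same
# facts by different names; we rename them into one common schema.
FIELD_MAPS = {
    "sensor": {"ts": "timestamp", "val": "value"},
    "report": {"reported_at": "timestamp", "reading": "value"},
}

def normalize(source, record):
    """Rename a record's fields according to its source's mapping."""
    mapping = FIELD_MAPS[source]
    return {mapping.get(key, key): value for key, value in record.items()}

print(normalize("sensor", {"ts": "2014-08-20", "val": 7}))
print(normalize("report", {"reported_at": "2014-08-20", "reading": 7}))
# Both records come out in the same shape: {"timestamp": ..., "value": ...}
```

The hard part in practice is not the renaming itself but discovering and maintaining these mappings across thousands of sources, which is why so much of the work remains manual.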

Here is an example from the article from the world of healthcare:

Data formats are one challenge, but so is the ambiguity of human language. Iodine, a new health start-up, gives consumers information on drug side effects and interactions. Its lists, graphics and text descriptions are the result of combining the data from clinical research, government reports and online surveys of people’s experience with specific drugs.

But the Food and Drug Administration, National Institutes of Health and pharmaceutical companies often apply slightly different terms to describe the same side effect. For example, “drowsiness,” “somnolence” and “sleepiness” are all used. A human would know they mean the same thing, but a software algorithm has to be programmed to make that interpretation. That kind of painstaking work must be repeated, time and again, on data projects.
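The interpretation the article describes is often implemented as a synonym table that collapses vendor terms into one canonical label. A minimal sketch (the table below is illustrative, not a real FDA or NIH vocabulary):

```python
# Illustrative synonym table: several reported terms map to one
# canonical side-effect label, as in the article's example.
CANONICAL = {
    "drowsiness": "drowsiness",
    "somnolence": "drowsiness",
    "sleepiness": "drowsiness",
}

def canonicalize(term):
    """Map a reported side-effect term to its canonical label, if known."""
    return CANONICAL.get(term.strip().lower(), term)

for term in ["Somnolence", "sleepiness", "drowsiness"]:
    print(term, "->", canonicalize(term))  # all three map to "drowsiness"
```

A human builds and maintains tables like this for every concept, which is exactly the "painstaking work" repeated on project after project.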

Data experts try to automate as many steps in the process as possible. “But practically, because of the diversity of data, you spend a lot of your time being a data janitor, before you can get to the cool, sexy things that got you into the field in the first place,” said Matt Mohebbi, a data scientist and co-founder of Iodine.

[Image: funnel.jpg]

These are some of the issues we'll have to work through before we can attain the real promise of big data in healthcare.
