
Monday, June 1, 2015

Statistics vs. Data Science

The house of empiricism stands divided. Bayesians and frequentists have waged bitter war for decades over the heart of the common analyst, and all the while data science stands poised to devour the ground on which they fight. With visionary voices on all sides of the fracas, which path should a neophyte choose? This is the turmoil into which science graduate students everywhere are thrown, unarmed with the tools necessary to critically differentiate the various approaches. The choice is often informed by early experiences and the normative influences of their field (not exactly an examination of merit). With each approach leading to slightly different conclusions given the same data, is it any wonder the public is losing faith in science's ability to really know anything?

This problem of deciding what we know isn't exactly new; it has bedevilled us since ancient Greece. Our origins as scientists began with philosophers pondering how we actually know anything at all. The tip of the iceberg is the p-value debate, with such amazing headlines as "Psychology Journal Bans P-Values", but the problem runs much deeper. The now tarnished and reviled p-value is treated as a probabilistic estimate of how likely we are to be right about a hypothesis, when in reality it only tells us how surprising our data would be if the null hypothesis were true: an estimate of how likely we are to be a certain type of wrong (a false positive), not of how likely we are to be right. One might argue the issue is with education: if we all understood exactly what a p-value tells us, we could return to the comfortable confines of the status quo. But that argument is facile. Null hypothesis testing itself is outmoded and should be abandoned in favour of model building and selection, using real-world performance and cross-validation to refine our understanding of the world at large. This treads into the domain of data science.
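To make the contrast concrete, here is a minimal sketch (my illustration, not part of the original argument) of two ways of interrogating the same toy data set: a classical test that reports a p-value, and a cross-validated comparison of a predictive model against a trivial baseline. The libraries (numpy, scipy, scikit-learn) and the simulated data are assumptions I'm adding for illustration.

```python
# A minimal sketch, assuming numpy, scipy, and scikit-learn are available.
# The toy data and model choices are illustrative assumptions only.
import numpy as np
from scipy import stats
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
x = rng.normal(size=(200, 1))
y = 0.3 * x[:, 0] + rng.normal(size=200)  # a weak but real effect

# Null hypothesis test: the p-value is the chance of a slope at least this
# extreme if the true slope were zero, i.e. a handle on one type of error
# (false positives), not the probability that our hypothesis is right.
result = stats.linregress(x[:, 0], y)
print("p-value:", result.pvalue)

# Model building and selection: does a model with the predictor forecast
# held-out data better than an intercept-only baseline?
effect_score = cross_val_score(LinearRegression(), x, y, cv=5,
                               scoring="neg_mean_squared_error").mean()
null_score = cross_val_score(DummyRegressor(strategy="mean"), x, y, cv=5,
                             scoring="neg_mean_squared_error").mean()
print("CV score, model with effect:", effect_score)
print("CV score, intercept only:   ", null_score)
```

The first question asks how surprising the data are under a null; the second asks which description of the world predicts better, which is closer to what most of us actually want to know.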

Data science is the young upstart field. Using sleek new tools like deep nets and Python, these analysts eschew traditional rigour in favour of gobbling huge quantities of rapidly consumable data and churning out interesting models. Their techniques and laissez-faire attitude have a deep appeal, but it's easy to wonder if the field isn't laden with false promises. The statistics orthodoxy would have us believe so.

The issue comes down to experimental design: how we actually perform science. A puritanical statistician would likely turn their nose up in disgust at the quality of the data we as scientists often produce. Many scientists, I'm sure, feel a great deal of shame about the quality of their data, or have become numb to the shame, or worse, become derisive of those statisticians who make us feel this shame. I was formerly in the latter category; fleeing from data-shame led me into the warm embrace of machine learning and data science, a field that would welcome me with my dirty, haphazardly collected data. Machine learning would help me squeeze whatever drops of knowledge I could out of my data set, and wouldn't judge me for its imperfections. I thought machine learning would solve all my problems, but I was wrong. I realized today that the puritans have a point.

We invest a huge amount of money in publicly funded science, and this money should not be lost on poorly conceived and orchestrated research. The Natural Sciences and Engineering Research Council of Canada (NSERC) spent just over a billion dollars last year, and the National Science Foundation in the US spent seven and a half billion. That sure seems like a lot of money (although $30 per person in Canada really isn't so bad). Shouldn't we be doing the very best we can? The answer is obviously yes, but this is where the complexity of the story begins. Optimizing knowledge output is about more than just maximizing how much we can comfortably squeeze out of a data set.

My take on statistical orthodoxy begins with thrift. Up until a few decades ago, data and analyses were expensive, so statisticians learned to be thrifty with both by placing the burden of effort on the scientist and the analyst. Data science and machine learning have turned this on its head: now data and analysis are cheap, and training scientists to statistical proficiency is prohibitively expensive. Needless to say, I think both camps have points. We need to think critically about where the balance lies between training budding scientists to be vigilant about things like collinearity and pseudo-replication and training them to collect it all and let the computer do the worrying. It is my strong belief that enough mediocre data will eventually match some smaller amount of great data in terms of inferential utility. If you accept that some amount of mediocre data can match a smaller amount of great data, then the camp you fall closer to is likely data science.
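A rough back-of-envelope sketch of that belief, under a simplifying assumption I'm adding (the mediocre data is merely noisier, not biased or confounded): the standard error of a mean scales as the noise divided by the square root of the sample size, so tripling the noise can be offset by collecting nine times as many samples. The numbers below are purely illustrative.

```python
# A back-of-envelope sketch under an assumption of my own: the "mediocre"
# data differs from the "great" data only in measurement noise (no bias,
# no confounding). Standard error of a mean ~ sigma / sqrt(n).
import numpy as np

sigma_great, n_great = 1.0, 100          # clean, carefully collected data
sigma_mediocre = 3.0                     # three times noisier data

se_great = sigma_great / np.sqrt(n_great)
n_needed = n_great * (sigma_mediocre / sigma_great) ** 2
se_mediocre = sigma_mediocre / np.sqrt(n_needed)

print("standard error, great data:    ", se_great)      # 0.10
print("mediocre samples needed:       ", n_needed)       # 900
print("standard error, mediocre data: ", se_mediocre)    # 0.10
```

If the flaw in the data is bias rather than noise, no volume of extra samples rescues it, which is part of what the orthodox camp is worried about.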

We need to think of scientists and the organizations that fund us as finite pools of resources. Each hour we spend training students to be statisticians is an hour they cannot spend gathering data or learning about a related field. And if we admit that humans are fallible and that, despite our best efforts, we still end up with mediocre data, then we should focus training on tools robust to suboptimal design. There really is no reason these days not to collect it all and use modern computational tools to separate the wheat from the chaff. But I do have a much deeper appreciation for the stance of the statistically orthodox. The slow, deliberate approach to knowledge acquisition has served us well for centuries, but in this modern data glut it may be a case of the swift and the dead.
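As one concrete (and entirely my own) example of a tool that is reasonably robust to a "collect it all" design, the sketch below uses lasso regression with a cross-validated penalty to pull a couple of real predictors out of a pile of correlated, mostly irrelevant ones. The simulated data and the choice of scikit-learn are assumptions for illustration, not a prescription.

```python
# A sketch, assuming scikit-learn; the simulated "measure everything"
# data set and the choice of lasso are illustrative only.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.RandomState(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=n)   # a deliberately collinear pair
y = 2.0 * X[:, 0] - 1.0 * X[:, 5] + rng.normal(size=n)

model = LassoCV(cv=5).fit(X, y)    # penalty strength chosen by cross-validation
kept = np.flatnonzero(model.coef_)
print("predictors kept:", kept)    # typically a small set including columns 0 and 5
```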

- Chris

P.S. I recognize that most statisticians are not hard-line orthodox folks like the ones I portray in this essay. But many scientists I've met do seem to live in fear of such shadowy boogeymen and teach their students to make the warding gesture of providing p-values despite the arguments against them.

Acknowledgement: This blog post was inspired by a conversation with Dr. Tom Chapman, though the views expressed in the article are solely mine, unless you like them, in which case you can give him credit too if you want.