Monday, June 1, 2015

Statistics vs. Data Science

The house of empiricism stands divided. Bayesians and frequentists have waged bitter war for decades over the heart of the common analyst. All the while, data science stands poised to devour the ground on which they fight. With visionary voices on all sides of the fracas, which path should a neophyte choose? This is the turmoil into which science graduate students everywhere are thrown, unarmed with the tools necessary to critically differentiate the various approaches. The choice is often informed by early experiences and the normative influences of their field (not exactly an examination of merit). With each approach leading to slightly different conclusions given the same data, is it really any wonder the public is losing faith in science's ability to know anything?

This problem of deciding what we know isn't exactly new; it has bedevilled us since ancient Greece. Our origins as scientists began with philosophers pondering how we actually know anything at all. The tip of the iceberg is the p-value debate, with such amazing headlines as "Psychology Journal Bans P-Values", but the problem is much deeper. The now tarnished and reviled p-value is treated as a probabilistic estimate of how likely we are to be right about a hypothesis, when in reality it is only an estimate of how likely we are to be a certain type of wrong (strictly, the probability of seeing data at least this extreme if the null hypothesis were true). One might argue the issue is with education: if we all understood exactly what a p-value tells us, we could return to the comfortable confines of the status quo. But that argument is facile. Null hypothesis testing itself is outmoded and should be abandoned in favour of model building and selection, using real-world performance and cross validation to refine our understanding of the world at large. This treads into the domain of data science.
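As a rough illustration of what model selection by cross validation can look like, here is a minimal R sketch. Everything in it is made up for the example: the simulated data, the two candidate formulas, and the helper `cv_rmse`. It compares a straight-line model with a quadratic one by their average out-of-fold prediction error, with no p-value in sight.

```r
# Simulated data with a mild curve in it (purely illustrative)
set.seed(1)
n <- 200
x <- runif(n, -2, 2)
y <- 1 + 2 * x + 0.5 * x^2 + rnorm(n, sd = 0.5)
dat <- data.frame(x = x, y = y)

# Randomly assign each row to one of 5 folds
folds <- sample(rep(1:5, length.out = n))

# Hypothetical helper: mean out-of-fold RMSE for a model formula.
# It assumes the response column is called `y`.
cv_rmse <- function(formula, data, folds) {
  errs <- sapply(sort(unique(folds)), function(k) {
    fit  <- lm(formula, data = data[folds != k, ])
    pred <- predict(fit, newdata = data[folds == k, ])
    sqrt(mean((data$y[folds == k] - pred)^2))
  })
  mean(errs)
}

rmse_linear    <- cv_rmse(y ~ x, dat, folds)
rmse_quadratic <- cv_rmse(y ~ x + I(x^2), dat, folds)

# The model with the lower held-out error is the one we would keep
c(linear = rmse_linear, quadratic = rmse_quadratic)
```

The point of the design is that both models are judged on data they never saw during fitting, which is the "real-world performance" stand-in available to most of us.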

Data science is the young upstart field. Using sleek new tools like deep nets and python, these analysts eschew traditional rigour in favour of gobbling huge quantities of rapidly consumable data and churning out interesting models. Their techniques and laissez-faire attitude have a deep appeal, but it's easy to wonder if the field isn't laden with false promises. The statistics orthodoxy would have us believe so.

The issue comes down to experimental design: how we actually perform science. A puritanical statistician would likely turn their nose up in disgust at the quality of the data we as scientists often produce. Many scientists, I'm sure, feel a great deal of shame about the quality of their data, or have become numb to the shame, or worse, become derisive of those statisticians who make us feel this shame. Formerly I was in the latter category. Fleeing from data-shame led me into the warm embrace of machine learning and data science; this was a field that would welcome me with my dirty, haphazardly collected data. Machine learning would help me squeeze whatever drops of knowledge I could out of my data set, and wouldn't judge me for its imperfections. I thought machine learning would solve all my problems, but I was wrong. I realized today that the puritans have a point. We invest a huge amount of money in publicly funded science, and this money should not be lost on poorly conceived and orchestrated research. The Natural Sciences and Engineering Research Council of Canada (NSERC) spent just over a billion dollars last year, and the National Science Foundation in the US spent seven and a half billion. This sure seems like a lot of money (although $30 per person in Canada really isn't so bad); shouldn't we be doing the very best we can? The answer is obviously yes, but this is where the complexity of the story begins. Optimizing knowledge output is about more than just maximizing how much we can comfortably squeeze out of a data set.

My take on statistical orthodoxy begins with thrift. Up until a few decades ago data and analyses were expensive. Statisticians were able to be thrifty with data and analysis by placing the burden of effort on scientist and analyst. With data science and machine learning, this issue has been turned on its head: now data and analysis are cheap, and training scientists to statistical proficiency is prohibitively expensive. Needless to say, I think both camps have points. We need to think critically about where the balance lies between training budding scientists to be vigilant about things like collinearity and pseudo-replication vs. training scientists to collect it all and let the computer do the worrying. It is my strong belief that enough mediocre data will eventually match some smaller amount of great data in terms of inferential utility. If you accept that some amount of mediocre data can match a smaller amount of great data, then the camp you fall closer to is likely data science.

We need to think of scientists and the organizations that fund us as finite pools of resources. Each hour we spend training students to be statisticians is an hour they could be gathering data or learning about a related field. And if we admit that humans are fallible and despite our best efforts we still end up with mediocre data, then we should focus training on tools robust to suboptimal design. There really is no reason these days not to collect it all and use modern computational tools to separate the wheat from the chaff. But I do have a much deeper appreciation for the stance of the statistically orthodox folks. The slow, deliberate approach to knowledge acquisition has served us well for centuries, but in this modern data glut it may be a case of the swift and the dead.

- Chris

P.S. I recognize that most statisticians are not hard-line orthodox folks like I portray in this essay. But many scientists I've met seem to live in fear of such shadowy boogeymen and teach their students to do the warding gestures of providing p-values despite arguments against it.

Acknowledgement: This blog post was inspired by a conversation with Dr. Tom Chapman, though the views expressed in the article are solely mine, unless you like them, in which case you can give him credit too if you want.
 

Tuesday, April 14, 2015

Breaking Up With Powerpoint

I’m breaking up with powerpoint. I’ve known this day would come for a while now, but it’s shocking it’s finally here. There are academic arguments for its abandonment, but none really compelled me. The honest truth is I’ve finally found something better.

Two weeks ago I did something new: I wrote and delivered a presentation on graph theory and interactive data analysis to a mixed crowd of upper year undergrads and grad students. What was special about this presentation was that powerpoint was nowhere to be seen, not a familiar microsoft trapping in sight. This presentation was a beamer presentation. I wrote it in R Markdown supplemented with some raw LaTeX, and now I don’t think I can ever go back. I’m breaking up with powerpoint, and I think you should too.

Background

If you’re not familiar yet with R Markdown, I recommend you go back and read my introduction to Knitr and R Markdown. It has become so natural to do the bulk of my work without leaving the comfortable confines of R Studio that I keep looking for more tasks I can do without switching software. Presentations were a logical next step. Previously I had dipped my toes into the problem of authoring presentations with markdown. I used ioslides (another presentation format offered by R Markdown), but was unsatisfied with the level of customization I could achieve (with my primitive knowledge of javascript). So I tried again, but this time with beamer. Beamer, for those who haven’t spent much time swimming in the LaTeX pools, is a convenient package for rendering LaTeX code as pdf slide-decks. I first encountered beamer when I originally tried learning LaTeX, but never had enough time or drive to master it. However, now with the added ease offered by R Markdown I decided to give it another shot.

These are the results:

The code is available from my github

Other than the relatively uninspiring title page the document came out beautifully. Figures rendered wonderfully, code seamlessly integrated into the slides, natural sub-sectioning, I can’t wait to write more like it. I recommend you quickly scroll through the document to see just how simple the document turned out to be (after code headers).

Getting Started

I won’t lie, there were quite a few gotchas [1] along the way, but you get the opportunity to learn from my mistakes. To start a beamer presentation in R Studio, create a new markdown document as I discussed in the post about markdown, but instead of choosing the default settings, click the panel labelled presentation, then select beamer and click ok.

R Studio throws in some demonstration slides to give you a taste for how to make your presentation. You can go ahead and delete those (though keep the yaml block at the very top, the stuff enclosed by the --- lines) because I’ll walk you through how to write a really simple presentation.

First Gotcha, YAML Headers, and Themes

A problem I ran into (yet haven’t done my due diligence and reported) was that I couldn’t resize the code in my document. As there isn’t a burgeoning community of R Markdown -> Knitr -> Beamer users, tracking down which component of the pipeline isn’t working right and finding a fix is challenging. I found references to a workaround by Yihui Xie (the creator of Knitr) for getting the code to the right size, but it didn’t work for me, and supposedly is no longer necessary anyway. He was using Knitr -> Beamer, so the issue could be in R Markdown. I created a workaround that made the code font smaller but left the output font gargantuan; it was sufficient for my purposes. You can grab the modified template I used from my github by running:

library(RCurl)
gistUrl <- "https://gist.githubusercontent.com/cfhammill/b5ba7767d7729bd676a2/raw/987d43694eda1fc263efdd38af03f846db80e690/resizeTemplate.beamer"

write(getURL(gistUrl), "resizeTemplate.beamer") 

Then you can add template: resizeTemplate.beamer to your yaml header. Also if you’re interested in using a theme to beautify your document you can add that in the header as well:

---
title: "A title"
author: "Your Name"
date: "Today's date"
output:
  beamer_presentation:
    theme: "Boadilla"
    template: resizeTemplate.beamer
---

I used the theme boadilla but there are many others to choose from. To find the theme that’s right for you check out the gallery by Ian Blaines to see one presentation rendered in many different themes.

Slides

Once that is set up, you can start writing your presentation. By default, a new slide begins at every level 2 header or horizontal rule (a line of dashes). To create two slides (plus your title slide) you can add the following code to get a titled slide and an untitled slide:

## Slide 1

Some Slide Contents!

------------------

Untitled slide 2
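If you’d rather try the whole thing from the R console instead of the knit button, something like the following should work. It is only a sketch: the file name `deck.Rmd` and the temporary directory are my own choices for illustration, and the actual knitting step needs the rmarkdown package, pandoc, and a LaTeX installation, so it is wrapped in checks and a `try()`.

```r
# Assemble the YAML header and the two slides from above as plain text
deck <- c(
  "---",
  'title: "A title"',
  'author: "Your Name"',
  "output:",
  "  beamer_presentation:",
  '    theme: "Boadilla"',
  "---",
  "",
  "## Slide 1",
  "",
  "Some Slide Contents!",
  "",
  "------------------",
  "",
  "Untitled slide 2"
)

# Write it to a throwaway .Rmd file (hypothetical name and location)
rmd_path <- file.path(tempdir(), "deck.Rmd")
writeLines(deck, rmd_path)

# Knit to a beamer pdf only if the tooling is present
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  try(rmarkdown::render(rmd_path, quiet = TRUE))
}
```

If the render succeeds, the pdf lands next to the .Rmd file, with a title slide followed by the titled and untitled slides.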

Images

Next thing you might want to try is to add some images into your documents.

To add pictures, you can use the default markdown code:

![](path/to/pic.png)

But I found myself unsatisfied with the default sizing and positioning. I wanted a centred picture of a certain size. To achieve that I needed to write some raw LaTeX

\centering \scalebox{0.45}{\includegraphics{path/to/pic.png}}

The \centering command indicates the line should be centered, and since LaTeX treats included graphics as large characters, that will center your image. The \scalebox command resizes the image as you’d expect (with numbers larger than one expanding it). All in all, not too complicated.

Bullet Points and Sequence

To have a series of bullet points in your slide you just need to create a bulleted list the default markdown way

# Bulleted Slide

- Isn't
- This 
- Easy

And you’ll get a nice bulleted slide. If you’re like me and want some but not all of your bullet points to come in sequentially, you can add incremental: true to your yaml block under beamer_presentation:, but I found it easier to leave that out and specify manually where I’d like my bullets to be sequential. To force sequential bullets (or, if incremental is true, to force static bullets, which isn’t documented on R Markdown’s webpage) you just need to add the greater than sign before the bullet.

# Sequential Bullet Slide

>- Wait for it
>- .
>- ..
>- ...
>- Point!

Images can be made sequential too by putting them in a sequential bullet.

Bullet Spacing

This applies to line spacing in general: markdown ignores extra white space by default, so trying to force extra space between points isn’t as easy as one might hope (although there is probably a way to do it with your LaTeX header or yaml header). The solution I found was to manually include LaTeX line-breaks:

# Spaced Out Bullets

- Point 1 \newline
- Point 2 \newline
- Point 3 \newline

This is useful if you, like me, try to keep text to a minimum, where using white space effectively is key.

Resizing Font

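To resize font in your document you can use LaTeX’s font sizing codes, e.g. {\large your text}, {\Large your text}, {\tiny your text}, etc. Note that these commands are switches rather than commands that take an argument, so the braces go around the whole group; writing \large{your text} also enlarges whatever follows it.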

This was useful for me to make better use of the slide space with sparse text (lots of line spacing and a bigger font), and for emphasis without using headers which can trigger some unwanted stylistic changes.

Outro

With that you now know about as much as I do about creating presentations with Beamer via Knitr via R Markdown. It’s pretty straightforward, and if you ever do presentations that involve code, equations, and figures I can’t recommend it enough. I hope you’re inspired to try your next presentation without powerpoint.

-Chris

Bonus Trick For Those interested: in the presentation, the red X and green check mark were made using grid graphics directly from within R. I previously wrote a little about using ggplot2 in unexpected ways, this used some of those lessons. By using the grid package directly you can draw whatever you like on a plot canvas, check out the presentation code for how I did it.
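For a flavour of what drawing with grid directly looks like, here is a small sketch. To be clear, this is not the code from the presentation; the coordinates, colours, and object names are my own guesses at something similar. It builds a red X and a green check mark as grid objects and draws them on a blank page.

```r
library(grid)

# Red X: two crossing line segments in the left half of the page
red_x <- segmentsGrob(
  x0 = c(0.10, 0.10), y0 = c(0.30, 0.70),
  x1 = c(0.40, 0.40), y1 = c(0.70, 0.30),
  gp = gpar(col = "red", lwd = 8)
)

# Green check: a short down-stroke followed by a long up-stroke
green_check <- linesGrob(
  x = c(0.60, 0.70, 0.90), y = c(0.50, 0.30, 0.70),
  gp = gpar(col = "forestgreen", lwd = 8)
)

# Bundle both marks so they can be drawn (or reused) as one object
marks <- gTree(children = gList(red_x, green_check))

grid.newpage()
grid.draw(marks)
```

Coordinates are in grid’s default "npc" units, where 0 and 1 are the edges of the page, so the marks scale with whatever slide or figure size you end up with.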


  1. “Gotcha” is a programming term for a little irksome quirk of a language or tool that causes it to perform in unexpected or counterintuitive ways.

Tuesday, March 3, 2015

Getting Started With R Markdown

Intro

For today’s post, I’d like to continue on the R Markdown theme started in my last post, and give a brief introduction to authoring documents using R Markdown and Knitr. If you’re completely new to Knitr and R Markdown I recommend reading the “Back to basics” paragraph in my last post to get a feel for the context and purposes of these tools.

These tools fill essentially the same niche for the R community as ipython notebooks do in the python community. My experience with ipython notebooks is minimal relative to my markdown experience and so I don’t feel qualified to compare the two, but I will say that ipython notebooks aren’t the only way to share code and rationale all in one document.

My affection for R Markdown and Knitr is strong, but there were some growing pains. When I started learning R Markdown, I had a tiny bit of experience with Knitr already but nothing resembling expertise. The resultant challenge was having to learn both tools concurrently, flipping back and forth between documentation that, at the time, felt more like a technical showcase. I don’t want others to have to share that struggle, so here’s my attempt at streamlining the plunge into “reproducible research” [1].

Sources and Suggestions

Before we jump in to authoring our first (second, thirtieth) document using R Markdown, I feel I need to pay homage to the materials I learned from. If a topic is not covered in this introduction you will (hopefully) be able to find it in one of these sources.

  • R Studio’s R Markdown Documentation: This page is the official documentation for R Markdown. It is a treasure trove of information on how to achieve different stylistic outcomes using R Markdown.
  • Yihui Xie’s Knitr Documentation: This page is the official documentation for Knitr (in essence R Markdown is just a convenient interface to Knitr). This page is relatively comprehensive but I found it hard to use.
  • Pandoc’s Documentation: R Markdown uses the markdown conventions of pandoc, so if something is absent from the R Markdown documentation, it’s worth examining the pandoc documentation.

These three sources (plus lots of trial and error) are sufficient to learn how to produce high quality documents with R Markdown. Also I implore you to use R Studio when writing your documents. The people behind R Studio created R Markdown, and because of their deep interest in the format they have provided many conveniences you’d miss if you tried authoring from your text editor and command line. These instructions were written assuming you’re using R Studio; if you aren’t, you will have to determine some of the housekeeping steps for yourself.

Getting Started

Alright! Now let’s get authoring! First thing you’re going to want to do is create a new file in R Studio. If you click the new file button, you’ll notice that there are options other than R Script for the type of file to create. Below the R Script option is the R Markdown option, choose that one. This gives you the option to give your document a title and your name and choose the output format for your document. By default Knitr knits (creates the final document) to .html, but it can also create .pdf and .docx files if you’d like. R Studio puts some demonstration markdown in the document. You can go ahead and erase that; we’ll be writing our document from scratch. Now save this document somewhere. By default the document knits to the same directory as the file. The file now exists as a .Rmd file, pretty easy to remember and easy to keep separate from your R scripts.

Give it a title!

R Markdown supports a couple of different heading schemes, the two I use rely on the number sign, the equal sign, and the dash. To give your document a title try one of the following:

# My title

or

My title
=========

Both work, but note, there can’t be a space between the beginning of the newline and the number sign or equal signs otherwise the special meaning of those characters is lost.

To add sub-headings, either use

## My sub-heading

or

My sub-heading
---------------

At first I tended to use the number sign method, but now I tend to use the underlining method. The advantage to the number sign method is that you can have more than two heading levels (just keep adding number signs to represent deeper levels). I find I rarely go deeper than two sub-headings, and the underlining method looks neater as you’re writing.

Write a paragraph

First thing to note about writing text paragraphs is that R Markdown does not honour your newline characters as you might imagine it would. Because scripting windows differ in size, R Studio will auto-wrap long lines for you, and if you choose to manually break your lines it will ignore that when rendering the final document. To instruct R Markdown to honour a line break, either place two spaces at the end of the preceding line, or add a full empty line between the old and new paragraphs. Using more than one empty line to divide paragraphs when writing is perfectly acceptable and R Markdown will ignore additional newlines. This behaviour stems from markdown’s original mission: to produce documents that are readable as plain-text and that produce well formatted documents when converted to other document types, so facilities exist to make plain-text look nicer (lots of stylistic white space).

So go ahead and write a little bit about what you’ve learned about R Markdown so far (or whatever you want), then hit the knit HTML button at the top of the scripting window. Now have a look at your beautiful handiwork, pretty snazzy right? Once you’ve patted yourself on the back let’s move on to the next step, adding some code!

Add some code

R Markdown has two primary modes of displaying code: inline code and code chunks. Inline code is primarily for small snippets, like variable names and parameter assignments. To denote an inline code segment, surround it with backticks (`, not ’).

The more interesting way to display code is with code chunks, which allow code, results, and figures to be presented together in text. In all honesty, if R Markdown was just a tool for making writing HTML a lot easier it probably wouldn’t have earned this post. Of course the markdown family is incredibly useful if you publish a lot of web content that you want to make available quickly, but the real beauty of R Markdown is its ability to display and execute code from a variety of different languages. Just last week I posted my adventure into adding ipython as an interpreter (engine) for Knitr, but you can include code from python, haskell, bash, c, and fortran just to name a few [2, 3].

To add a code “chunk” into a document, you create a fenced code block like so

```{r}
# My fantastic R code!
```

What R Markdown (and eventually Knitr) does with your code chunk is highly customizable. You can have code that is shown but isn’t executed, code that is executed but isn’t shown, code that has its results saved so that it doesn’t have to be recalculated if you change the document, and a myriad of other potential customizations. So I think now is as good a time as any to introduce chunk options.

Customize your code chunks

One of the cryptic things I had to figure out when starting R Markdown was what all of these chunk options do, and how to use them. I’m going to cover five of the basic ones that will get a lot of use, and a few that may be useful more occasionally.

  • eval: This option controls whether your carefully crafted code is executed while the document is being knit. Often I’m including demonstration code that doesn’t need to be run each time (or may not work out of context), so I’ll often set eval = FALSE. eval is TRUE by default.
  • echo: This option controls whether to display the code. Often I’ll have a set-up chunk at the beginning of the document to do things like set my working directory, load packages, and import data. These aren’t interesting to a reader so I set echo = FALSE. echo is TRUE by default.
  • results: This option controls how the text output of running the code appears in the document. Sometimes hiding the results of executed code makes sense; I often use results = "hide" with echo = FALSE when my set-up chunk has output.
  • cache: This option allows Knitr to remember the results of executing a code chunk so that unless the code changes it doesn’t need to be re-run. This is useful when executing code that takes a while to run. R Markdown typically involves a lot of guess and check, so even a few minutes of execution time can make the authoring process unpleasant. For those code chunks set cache = TRUE; it is FALSE by default.
  • engine: This option allows you to choose another language engine to execute the code, perfect for presenting or comparing another language in your document. Just set engine = "name_of_engine" in the code chunk.
  • message, warning, and error: These options let you control whether Knitr should report messages, warnings, and errors in the document. Messages and warnings are shown by default (set the option to FALSE to suppress them). The error option works a little differently: with error = TRUE the error message is printed and knitting carries on, while with error = FALSE (the default in R Markdown) knitting stops at the first error.

To set these options on a chunk-by-chunk basis, just include all your options in the curly braces at the top of the chunk like so

```{r, eval = FALSE, echo = TRUE}
#Some code to be shown but not run
```
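If you want to see what a couple of these options do without creating a file, you can knit a snippet of text straight from the console. The sketch below assumes only that the knitr package is installed; the two toy chunks are made up for illustration. The first chunk hides its code but shows its result, and the second shows its code but never runs it.

```r
library(knitr)

# Two toy chunks written as lines of text (illustrative only)
src <- c(
  "```{r, echo = FALSE}",
  "1 + 1",
  "```",
  "",
  "```{r, eval = FALSE}",
  "stop('never run')",
  "```"
)

# knit() returns the knitted markdown when given text instead of a file
out <- knit(text = src, quiet = TRUE)
cat(out, sep = "\n")
```

In the printed markdown the result of the first chunk appears without the code that produced it, and the `stop()` call from the second chunk appears as code without ever raising its error.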

If you find yourself using the exact same option in all of your code chunks, you can set document-wide defaults. These defaults will be overridden by individual chunk options, so you are not locked in even if you do set a global option. As I mentioned above, I like to have a set-up chunk at the top of my document performing all the preparation necessary for my code to run smoothly; this is the ideal place to set global options. Global options are set with opts_chunk$set(option = value).

Sample set-up chunk

I highly recommend including a set-up chunk in all of your documents. This is an ideal place to put system specific code, because in most cases, no one but you needs to know where on your computer your files are located, no one but you needs to know what global knitr options you used, et cetera.

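```{r}
# System-specific set-up: working directory and data import
setwd("my/working/directory")
myData <- readRDS("somePreparedData.rds")

# Load knitr so opts_chunk is available, then hide code in later chunks
library(knitr)
opts_chunk$set(echo = FALSE)
```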

This will set your working directory, pull in some data, and tell Knitr not to bother showing the code (this is useful if you want to use R Markdown to keep figures with text, without the cumbersome code). Unfortunately you do have to tell R to load the Knitr package, because the code gets executed in its own environment and needs to be made aware of the opts_chunk object in Knitr.
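To convince yourself that a chunk-level option really does win over a global default, you can knit a small text snippet from the console. This is a sketch that assumes only that knitr is installed; the chunk contents and the variable names `x_hidden` and `y_shown` are invented for the example.

```r
library(knitr)

src <- c(
  "```{r setup, include = FALSE}",
  "knitr::opts_chunk$set(echo = FALSE)   # global default: hide code",
  "```",
  "",
  "```{r}",
  "x_hidden <- 1",
  "```",
  "",
  "```{r, echo = TRUE}",
  "y_shown <- 2",
  "```"
)

# Knit the text and print the resulting markdown
out <- knit(text = src, quiet = TRUE)
cat(out, sep = "\n")
```

Only the last chunk’s code should show up in the output: the set-up chunk is hidden entirely by include = FALSE, the middle chunk inherits the global echo = FALSE, and the final chunk overrides it with echo = TRUE.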

Did I mention figures?!

Another supremely useful feature of R Markdown is the ability to generate and keep your figures in the document with your code and writing. R Markdown is pretty smart about figuring out what to do with plots. The following graph was created inside a normal code chunk, with no specific options set:

x <- seq(-1,1, .01)
plot(x, x^2, type = "l", ylim = c(0,1.6), lwd = 3)
points(c(-.65, .65), c(1.5,1.5), cex = 3)
points(0, .9, pch = 2, cex = 2)

And now I’ll always remember how to draw a creepy smiley face using base R’s plotting functionality. Perhaps you can see why suppressing the code for an intricate figure may be useful.

Math

Often you’ll want to present equations alongside your code. To do this, R Markdown allows you to write LaTeX directly in your document. Knitr automatically renders these as either LaTeX or MathJax depending on your output format. LaTeX code can be specified as either inline or block, just like code. An inline math segment begins and ends with a dollar sign, $...$, and block LaTeX uses two dollar signs, $$...$$.
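As a small illustration, here is what the two forms might look like in a document. The equation is just one I picked to match the parabola from the smiley face plot; the derivative is thrown in to show a fraction.

```latex
The mouth of the smiley face is the curve $y = x^2$ for $-1 \le x \le 1$.

$$
y = x^2, \qquad \frac{dy}{dx} = 2x
$$
```

The first line renders the math in the flow of the sentence, while the double dollar signs set the equation off on its own centred line.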

Sharing your work (on the internet)

When I create an html report using R Markdown one of two things happens. The first and easiest is I scp the html file produced by knit HTML onto the server where my personal webpage lives. Or if I’m publishing to blogger (where this blog is hosted), I go through the rigmarole of copying the source code, removing extraneous tags, and pasting it into the html window of a post on blogger. Additionally you could share the file with the intended audience directly. But in any case, once you’ve created the html file, what you choose to do with it is up to you.

Take-Aways

So I hope this tutorial gave you a little bit of a primer on how to use R Markdown to keep your code, writing, and figures all together in one place, and I hope the benefit of doing so is apparent. I’d like to leave you with a few things:

  • Author documents with R Markdown. It’s easy, relatively painless, and helps keep scatter-brains like me organized. I report to my bosses for two of my jobs with a web page I make with R Markdown. It allows them to see relevant snippets of my code and my analysis of what’s happening.
  • Use a set-up chunk at the beginning of your document (have a look at the sample above). It will make your document more compact and cleaner if all the mechanical stuff happens in the beginning behind the scenes.
  • Know where to look for help, see the resources I posted at the top of this post, especially be aware that pandoc’s documentation has things that the R Markdown documentation missed.
  • If your header/table/etc isn’t rendering properly, make sure it isn’t just an issue with a space after a newline. That space can deactivate all kinds of special meaning, so look out.
  • Use R Markdown as an excuse to write often about things that interest you, it’s a great way to keep sharp.

If you have any questions please feel free to post them below, I’ll try my best to answer them. Also this document was made with R Markdown so I’ve made the .rmd that produced this post available for inspection.

Happy authoring!

Chris


  1. Here I fall victim to using the buzzword, but it’s easier than “code/document integration” or anything else I could think of. Let it be known that I think these tools are more than just for making research reproducible, but also to facilitate sharing ideas, and authoring documents in an age where code can be as important as words.

  2. You can see the full list of available engines with library(knitr); names(knit_engines$get())

  3. I just noticed there’s no lisp engine! That needs to change