
Wednesday, July 23, 2014

Advanced ggplot2:
How to add overlays, underlays, and make multi-axis plots

Intro

Hi Readers,

Previously I mentioned I’m working on a contract for the Newfoundland government. For this contract I’ve had to make a number of figures for a report that’s coming out in the near future on the state of Caribou in the province. When I began the contract I was new to ggplot2, a nice graphics package by R guru Hadley Wickham which has become de rigueur in the circles I find myself in. Now I use it all the time when I want to make high quality figures.

For all its virtues, there are drawbacks that come with producing figures with ggplot2. The primary difficulty is that the syntax is unintuitive without having read Wickham's book (or the Wilkinson book that inspired it). Additionally, there is an unfortunate dearth of documentation on the finer points of the package.
Like many attempts to simplify tricky topics, the simplification adds its own baggage that confounds its initial objective. However, ggplot2 is immensely powerful and for most applications it is the best option available. I recommend it, but be sure to devote some time to learning it.

In this post I want to walk you through how to make a watermarked plot. Sometimes a bit of background information can keep a viewer in tune with the message of the figure.

Cutting to the Chase


One key frustration I experienced was the overly pedantic omission of dual-axis plotting. Granted, there are many workarounds to this problem, and with a little digging you're likely to find Wickham's workaround, which exposes the tender underbelly of the package. This workaround showed me just enough of the inner workings of the package to be able to meet my client's demands for a three-panelled plot with a watermark (things that aren't available through the surface user interface of ggplot2). I'd like to show you how to do it.

The Setup


First, let's get the packages we need installed and loaded:

install.packages(c("ggplot2", "grid", "gtable"))
library(ggplot2)
library(grid)
library(gtable)
 
The second and third packages are probably new to you, so I'll spend a paragraph explaining what they're all about.

Wickham’s ggplot2 package was built on the pre-existing infrastructure of grid graphics. Grid is a low-level plotting interface that suspends the user above the gory details of specifying an image tuple by tuple¹, but not so far as to lose any power. Grid-like plotting may be familiar to you if you have ever programmed graphics from scratch in a lower-level language than R.
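To give you a taste of what raw grid looks like, here is my own toy example (not part of this post's workflow): it draws a rectangle and a diagonal line inside a viewport occupying the middle half of the device.

library(grid)

# Open a fresh page and push a viewport covering the central 50% of the device
grid.newpage()
pushViewport(viewport(width = 0.5, height = 0.5))

# Draw a rectangle around the viewport and a line from corner to corner
grid.rect(gp = gpar(col = "grey50"))
grid.lines(x = c(0, 1), y = c(0, 1), gp = gpar(lwd = 2))

# Pop back up to the root viewport
upViewport()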

The grid objects (grobs) produced by ggplot2 are hopelessly complicated and inaccessible for direct manipulation. However, the package gtable exists to help us bridge the divide. So we use gtable to convert a ggplot2 image into a table of grobs; from this table we extract just the pieces we want so that we can manually add them to our plots later using the grid package directly.

Back to the task at hand. We’re going to need some data for our plots. Let’s make a quartic and look at its first three derivatives. We’ll use the original function as a watermark to compare the derivatives against.

x <- -10:10

y4 <- x^4 + 3*x^3 + 5*x^2 - 10*x
y3 <- 4*x^3 + 9*x^2 + 10*x - 10
y2 <- 12*x^2 + 18*x + 10
y1 <- 24*x + 18

polyData <- data.frame(x, y4, y3, y2, y1)
 

Getting the Watermark


To pull out our watermark line, we first have to put it in a plot:

quartPlot <- ggplot(data = polyData, aes(x = x, y = y4)) + 
                    geom_line(colour = "grey80", alpha = .75, size = 1)

Now we build our gtable from quartPlot. The structure of a gtable is hopelessly complicated so ignore this if you’re not in the mood for some head scratching. What I didn’t borrow straight from Wickham I gleaned through agonizing trial and error.

By searching through the “grobs” element of your gtable for the entry whose “name” in the layout is “panel”, we can find the plotting panel where our line lives. The panel grob has 3 children (gridlines, a background, and our line); the line turned out to be child number 2.

grobTab <- ggplot_gtable(ggplot_build(quartPlot))
quartLine <- grobTab$grobs[[which(grobTab$layout$name == "panel")]]$children[[2]]

So now we have our watermark line ready to be added to our future plots.

The Other Plots


The other plots are going to need special themes to ensure the watermark is visible. Themes are how ggplot2 controls the final look of a plot. By specifying the background of both the panel and plot windows as “element_blank()” you ensure that your plots don’t cover your watermark (this took me far too long to figure out; for a while I had the watermark superimposed but highly transparent because I had given up on having it in the background). I also removed the grid lines because they’re obtrusive².

basePlot <- ggplot(data = polyData, aes(x = x)) +
  theme(panel.background = element_blank(),
        plot.background = element_blank(),
        panel.grid = element_blank())

cubicPlot <- basePlot + geom_line(aes(y = y3), size = 1.5) + ylab("Cubic") + 
  theme(axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank())

quadPlot <- basePlot + geom_line(aes(y = y2), size = 1.5) + ylab("Quadratic") + 
  theme(axis.title.x = element_blank(),
        axis.ticks.x = element_blank(),
        axis.text.x = element_blank())

linePlot <- basePlot + geom_line(aes(y = y1), size = 1.5) + ylab("Line")

Note that for the quadratic and cubic plots I also removed all traces of the x axis, so the plots can visually share the x-axis (although in reality they do not).

Setting up Viewports


A useful grid-ism to understand for working with ggplot2 is the concept of a viewport. Viewports are regions of the screen where things can be plotted. They’re specified as proportions of the screen for their width and height, with x and y coordinates for the centre of the viewport. I could explain further, but it will be quicker to show them in action:

vp1 <- viewport(width = 1, height = .33, y = 1/6, x = 1/2)
vp2 <- viewport(width = 1, height = .33, y = 3/6, x = 1/2)
vp3 <- viewport(width = 1, height = .33, y = 5/6, x = 1/2)
 
This code creates three viewports, each occupying a third of the screen vertically.
We also need a viewport for our watermark, but this one takes some trial and error to fit well:

quartLine$vp <- viewport(width = .8, height = .91, x = .55, y = .55)

The Final Product

grid.newpage()
grid.draw(quartLine)

print(linePlot, vp = vp1)
print(quadPlot, vp = vp2)
print(cubicPlot, vp = vp3)
 
And voila, there you have a plot with a watermark to show off. There are lots of ways to customize these plots; for my contract I included axis lines like so:

grid.newpage()
grid.draw(quartLine)

print(linePlot + theme(axis.line = element_line(colour = "black")), vp = vp1)
print(quadPlot + theme(axis.line = element_line(colour = "black")), vp = vp2)
print(cubicPlot + theme(axis.line = element_line(colour = "black")), vp = vp3)
 


For highly technical work it is important to note that the panels will not necessarily be in alignment. For example, because the longest axis label on the cubic plot is a digit longer than on the other plots, its drawing panel (where the plotting happens) is one digit narrower.

This can probably be circumvented (though I’ve never tried) by manually specifying the size of the label box in your themes.
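If you want to try aligning the panels, here is a rough, untested sketch of a different approach than the one just mentioned: build the gtables of the three plots and force their column widths to match before drawing them. unit.pmax and the gtable widths are standard grid/gtable features, but I haven't verified this on the plots above, so treat it as a sketch.

# Untested sketch: equalize the column widths of the three plots' gtables
g1 <- ggplot_gtable(ggplot_build(linePlot))
g2 <- ggplot_gtable(ggplot_build(quadPlot))
g3 <- ggplot_gtable(ggplot_build(cubicPlot))

maxWidths <- unit.pmax(g1$widths, g2$widths, g3$widths)
g1$widths <- maxWidths
g2$widths <- maxWidths
g3$widths <- maxWidths

# Draw the watermark, then draw each gtable in its viewport
grid.newpage()
grid.draw(quartLine)

pushViewport(vp1); grid.draw(g1); upViewport()
pushViewport(vp2); grid.draw(g2); upViewport()
pushViewport(vp3); grid.draw(g3); upViewport()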

Outro

So that is how you can get around some of the limitations of ggplot2 to make really nifty figures for your own work and analyses.

If you’re really intrigued, maybe investigate grid more deeply and shed the overhead of ggplot2, or you can delve deeper into how to combine the two to make a tricked-out personalized ggplot2 workflow to make some astonishing figures.

I hope you enjoyed this little journey through the innards of ggplot2.

-Chris

  1. If you’re not familiar with the term tuple, it’s a very useful term for vectors of coordinates or quantities; I wish I had learned it earlier in life. From high school onward we’re used to specifying Cartesian coordinates like (x, y). This bracketed construct of coordinates forms a couple that represents a location in 2D space. When drawing an image we might want to tell our computer to draw a point at (x, y) and give it colour c; we can specify this as a triple (x, y, c). Adding two more quantities would give us a quintuple. By now you’ve noticed the pattern. This extends to as many quantities as you’d like; the genotype of an individual at many loci could be treated as an n-tuple.
  2. I haven’t gone back to check yet, but I think if you set all the panel elements to “element_blank()” in the plot we used for the watermark you can ensure that you only have one child in your panel. This would make life easier because you wouldn’t have to search for which child is the line; you could just take element one every time.

Friday, July 18, 2014

Introducing FriendlyShiny:
A package to make interactivity with R easy


I am currently working on a contract for the Government of Newfoundland and Labrador’s Department of Environment & Conservation, and they have me making graphics for an upcoming report. While working on these graphics I repeatedly find myself tweaking graphical parameters to try to improve the aesthetic appeal. It’s a tedious process involving a lot more “guess-and-check” than I’d really like to admit.

If you’ve read a few of my other posts you’ll have caught that I recently learned shiny in order to bring you the Drink Name Generator. I thought to myself: surely there must be a way to leverage the power of shiny to perform parameter fine-tuning in a way that doesn’t create more overhead than it’s worth.

By the time I had that thought I was already lost. I spent most of the past two days putting together a small package that gives an (in my opinion) intuitive interface to shiny. The package allows applets to be created from pre-existing R code with only a few minor changes.

Without further ado, the README from my GitHub

FriendlyShiny is my attempt at making the wonderful reactive code abilities of R’s shiny package more accessible to novice users and folks who want interactive code quickly. FriendlyShiny provides a simple syntax for specifying reactive elements in a code chunk without the overhead of designing the user interface and coding the applet by hand.

To allow interactivity for a code chunk, it just needs to be wrapped in an interact function call.

Consider trying to teach a math class about line specifications. You’ve taught them a line can be specified by y = mx + b but you’d like to show them how the line changes with its parameters.

You could write demonstration code like this to plot a line, and then you could bring the plot up in front of your students:

  slope <- 1
  intercept <- 0
  
  plot(0,
       ylim = c(0, 15),
       xlim = c(0, 10),
       xlab = "x",
       ylab = "y",
       col = 0)
  
  abline(intercept, slope)
 
Now you could go back to your code and change the values of slope and intercept by hand, OR you can make it interactive very simply like so:

interact(
{
  slope <- sI("slope", 1, min = 1/3, max = 3)
  intercept <- nI("intercept", 0)

  plot(0,
       ylim = c(0, 15),
       xlim = c(0, 10),
       xlab = "x",
       ylab = "y",
       col = 0)
  
  abline(intercept, slope)
  
}, outputType = "plot")
 
All you need to do is wrap your code in curly braces and a call to interact, then specify that you want a slider to help you choose the slope and a numeric input box to specify the intercept (you probably want a slider for that too, but I’m sure you can see how to fix it).

The last argument after the curly braces tells the interact function whether to render your output as text or as a plot. Text output is useful for fine-tuning model parameters, and plot output is useful for getting graphing parameters juuust right.

Supported Widgets


Currently friendlyShiny supports:
  • SliderInput: sI(name, start, min, max, step = NULL)
  • NumericInput: nI(name, start)
  • TextInput: tI(name, start = "")
  • SelectInput (i.e. drop-down box): dI(name, type, start, ...) where ... are your choices (comma delimited) and type can be "character" or "numeric"
  • RadioButtons: rI(name, type, start, ...) exactly like dI but with a different choice mode
  • CheckboxInput: cI(name, start = FALSE) toggles a logical between TRUE and FALSE
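To tie a few of these together, here is a hypothetical snippet that follows the signatures listed above (I haven't run this exact example, and the widget names are my own invention, so treat it as a sketch):

interact(
{
  # Slider for how many points to draw, drop-down for colour, checkbox for a reference line
  n      <- sI("points", 50, min = 10, max = 200)
  colour <- dI("colour", "character", "red", "red", "blue", "black")
  zero   <- cI("zeroLine", FALSE)

  plot(rnorm(n), col = colour, ylab = "value")
  if(zero) abline(h = 0)

}, outputType = "plot")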

How to install


If you’re on here you probably know how to do this better than I do, but you can install this package to play with and improve (if you’re up to the task) by running the following commands in R:
install.packages("devtools") #if you don't already have devtools installed
library(devtools)

install_git("cfhammill/friendlyShiny")
 

Outro


I hope you like my package! I sincerely hope someone takes the reins on this project; anyone can take my code and turn it into something amazing. I don’t intend to polish it much more than this, I just thought shiny should be accessible to everyone. If you notice any glaring errors please let me know.

-Chris

Saturday, July 12, 2014

Build Your Bar Project:
What's In A Drink Name / Drink Name Generator

This post is a short-attention-span version of my previous post, plus a little gadget I wrote to generate drink names.

Recap


I have been looking at a data set containing names and recipes for ~17000 cocktails. The original goal of the project was to use publicly available recipe data to inform a decision about what drinks to buy for a small home bar. The project has since spun out of control because I realized the wealth of interesting trivia in the data.

How Do People Name Their Cocktails


Conceptually it might be worthwhile to separate out drinks into several naming classes based on their popularity. There are popular drinks that can be found at nearly any bar worth its salt, there are specialty drinks which are often variants of a popular drink with ingredient tweaks, and there are unique drinks that are unlikely to be found at any bar and will require a recipe if you want one.

Each class is generally named differently. Popular drinks have readily recognizable names. Specialty drinks often have parts of a popular drink name with additional words indicating the variations involved. The unique drinks do not follow a pattern as closely as the other two, reflecting the heterogeneity of people creating these unique drinks.

In order to learn a bit about how words are distributed within drink names I collated all the drink names in my data set and performed some simple analyses. After clean-up, I made a word cloud containing the top 100 words used in drink names, sized according to their frequency. This analysis does not consider the popularity of the drink itself, but the data does contain some indication of popularity (e.g. more popular drinks are more likely to have multiple recipes with the same name).

Booze Cloud!
 
Click to embignify

An interesting follow-up to this analysis would be to re-weight the word frequencies according to how popular the drinks they are found in are. This would make the data look more like what you might see on a cocktail menu.

Drink Name Generator


As part of the data exploration I calculated the frequencies of the top 500 words, as well as the distribution of the number of words in a cocktail name. The longest name in my data set was 8 words, with two-word names being the most frequent.

I constrained my drink name generator to produce names between 2 and 5 words. The generator draws a random number between 0 and 1 and compares it to a look-up table corresponding to the cumulative distribution function (cdf) of name lengths to decide how many words the name should have. Each of those words is then drawn using the same method, but from the cdf of word frequencies.
The generator then returns your new drink name. The app was written using shiny, a framework for writing small web apps in R, which I will probably talk about in a future post.
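If you're curious what that idea looks like in code, here is a minimal sketch (not the app's actual source; wordCountCdf and wordCdf are hypothetical named vectors of cumulative probabilities, with names giving the lengths and the words respectively):

# Pick the first entry whose cumulative probability exceeds a uniform draw
sampleFromCdf <- function(cdf){
  u <- runif(1)
  names(cdf)[which(cdf >= u)[1]]
}

generateDrinkName <- function(wordCountCdf, wordCdf){
  nWords <- as.numeric(sampleFromCdf(wordCountCdf))        # how many words?
  paste(replicate(nWords, sampleFromCdf(wordCdf)), collapse = " ")
}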

Here’s the generator!

Make sure to try a few; some are pretty funny. A few gems so far are "Monkey Nut Punch" and "Flaming Panty Sweat".
Post your best in the comments!

Click here to open the app in a new window

Chris

Build Your Bar Project:
Synthesis and Exploratory Data Analysis

Hello readers, this post will be a good one, I promise. I’ve started using R Markdown, which seems like it will greatly increase the speed with which I can give you analyses.

Diving Back In

We left off last time after having downloaded a ton of html files containing drink recipe data. The first thing to do is to have a look inside one of the html files. This step is critical for figuring out how to pull out the information that we want.

Excerpts

Found the drink name
<title> 73 Bus #2 recipe</title> 
Found some key words
<meta content="73 bus #2, 73, bus, #2, gin,<br/>
  triple sec, lime juice, cranberry juice, drink recipe, drink, recipe,<br/>
  alcoholic drink recipe, cocktail recipe, cocktail, mixed drink, martini"<br/> 
  name="keywords">
Found the hierarchy (drink class)
<div class="pm" style="margin-top:20px;"><a href="/cat/1/
  ">Cocktails</a>
  > <a href="/cat/14/">Short drinks</a> 
  > <a href="/cat/141/">by base-ingredient</a> 
  > <a href="/cat/40/">gin-based</a></div>

Pulling out the information

So at this point I’ve identified seven variables I’d like to track for each drink
  1. Drink name
  2. Drink hierarchy(class)
  3. Ingredients
  4. Keywords
  5. Number of ratings
  6. Average rating (out of 10)
  7. URL
In order to get at these values I need to design a regular expression that will only capture the tag of interest. I discussed regular expressions briefly in my previous post; we will rely heavily on the non-greedy quantifier “.+?” I discussed here.

Data Extraction

In order to pull the data out I used R (surprise surprise). I wrote two small accessory functions. One to make string manipulation easier, and one to remove html tags.
#Use either regexpr (the default) or gregexpr to match elements of interest
#Extract and return them using regmatches
matchPull <- function(pattern, text, invert = FALSE, global = FALSE, ...){
  if(global){
    match <- gregexpr(pattern, text, ...)
  } else {
    match <- regexpr(pattern, text, ...)
  }
  
  pulled <- regmatches(text, match, invert)
  if(length(pulled) == 0) pulled <- NA
  
  pulled
}

# Remove html tags. Note the use of the .*? quantifier,
# a cousin of .+? that can match 0 characters,
# where .+? matches 1 or more.
stripTags <- function(text){
  gsub("<.*?>", "", text, perl = TRUE)
}
As a first step, let's bring in one of the many html files we downloaded and try to extract all of the important data. We'll use the excerpts noted above to practice.
fileCon <- file(siteName, blocking = FALSE)
site <- paste0(readLines(con = fileCon), collapse = "\n")
close(fileCon)

# Pull out the whole title, remove the tags,
# and remove the word recipe which follows each recipe name
name <-  matchPull("<title>.*?</title>", site,
                   ignore.case = TRUE, perl = TRUE)
name <- stripTags(name)
name <- sub(" recipe$", "", name, perl = TRUE)
  
# Pull out just the meta tag with the name keywords,
# then pull out the contents, and remove the quotes
keywords <- matchPull("<meta content=.*?name=\"keywords\"", site, 
                      ignore.case = TRUE, perl = TRUE)
keywords <- matchPull("\".*?\"", keywords, perl = TRUE)
keywords <- gsub("\"", "", keywords)

# Pull out the division of class "pm" style "yadda-yadda"
# and remove all tags
hierarchy <- matchPull("<div class=\"pm\" style=\"margin-top:20px;\">.*?</div>", 
                       site, ignore.case = TRUE, perl = TRUE)
hierarchy <- stripTags(hierarchy)
Once we’ve figured out how to get all the useful data out of one file, we can encase it in a function that returns one row of data and apply that to all the files we downloaded (after testing it on a much smaller subset). After that we’ll have a data frame containing all the juicy data, which is much easier to work with. Suppose we’ve encased our processing in processSite <- function(siteName); we can then apply it to many sites at once by wrapping it in another function:
processSites <- function(siteList){
  frameSeed <- processSite(siteList[1])
  drinkFrame <- frameSeed[rep(1, length(siteList)),]
  
  #This sapply structure is basically just a for loop
  sapply(1:length(siteList), function(i){
    drinkFrame[i,] <<- processSite(siteList[i]) 
  })
  
  drinkFrame
}

fileNames <- list.files()#Make sure you've set your working directory before this
fileNames <- fileNames[grepl("\\.html", fileNames)] #grab just .html files

#Try it out on the first 6 files
practiceNames <- head(fileNames)
practiceData <- processSites(practiceNames)

#After inspecting practice data for quality, process them all
drinkData <- processSites(fileNames)
Processing ~17000 sites took R around 6 minutes (wow) on my computer, producing a 10 MB data.frame, which I saved so that I never have to run this code again. Now that the data is in, we can begin with the fun parts. I'll skip some of the quality control steps to get right to the meaty stuff.
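For reference, the save is just a one-liner along these lines (with the file name matching the load() call further down):

# Save the assembled data frame so the scraping step never needs re-running
save(drinkData, file = "drinkData.rda")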

Exploratory Data Analysis (EDA)

The cornerstone of any data-related project is poking and prodding the data to figure out what's in there. Make some histograms, correlation matrices, and any other simple data visualizations you think might be informative. This step is probably my favourite because it:
  1. Helps recognize general patterns
  2. Identifies issues with data quality
  3. Lets you watch your data set begin to tell its first story
I’ll present one EDA that I thought was fun. I began getting interested in what people liked to name their drinks; surely there would be some cool patterns in that. I decided I wanted to try my hand at making a word cloud with some of the most common words in drink names. To make the word cloud in R I used the wordcloud package and the tm (text mining) package.
library(wordcloud)
library(tm)

load("drinkData.rda") #Bring in our drinkData from the last step

# Filter out drinks that are neither cocktails nor shots 
# by looking in their hierarchy
drinksFrame <- drinkData[grepl("(cocktails)|(shots)", 
                               drinkData$hierarchy, 
                               ignore.case = TRUE, 
                               perl = TRUE),]

# I had a few cases of multiple duplicates, 
# this loop tacks "Alt" onto duplicated names
# Repeats until no duplicates are found
while(anyDuplicated(drinksFrame$name) != 0){
  dupeNamed <- duplicated(drinksFrame$name)
  drinksFrame$name[dupeNamed] <- 
    paste(drinksFrame$name[dupeNamed],"Alt",sep=" ")
}

#Clean up the dupeNamed vector, which is no longer needed
rm(dupeNamed)

#Convert from character vector to one long string of words
nameVector <- paste0(tolower(drinksFrame$name), collapse = " ") 

# Use tm's built in functions to remove stopwords, see below for a note,
# Also remove alt (because I put it there)
# As well as punctuation. I removed numbers because the site named 
# duplicates with sequential numbers
# And "2" was one of the most popular words 
nameVectorCleaned <- removeWords(nameVector, c(stopwords("english"), "alt"))
nameVectorCleaned <- removePunctuation(nameVectorCleaned)
nameVectorCleaned <- removeNumbers(nameVectorCleaned)

# Split the cleaned string back into a vector of words (tokens really)
# Separated by white space
nameVector <- unlist(strsplit(nameVectorCleaned, " "))

#Then use table to count instances of each word
#Remove the first entry, which was an empty string
#(an unfortunate consequence of our splitting algorithm)
nameFreqs <- table(nameVector)[-1]
nameWords <- as.character(names(nameFreqs))
namesFreqs <- as.numeric(nameFreqs)

freqOrder <- order(nameFreqs, decreasing = TRUE) #Create an ordering vector
top100 <- head(freqOrder, 100) #Indices of the 100 most popular words

#Make a 10 inch by 10 inch pdf to hold the wordcloud  
pdf("boozeNameCloud.pdf", width = 10, height = 10)

#Plot the words, ordered and sized by frequency
wordcloud(nameWords[top100], namesFreqs[top100], 
          scale = c(12,1), random.order = FALSE)

#Close up shop and admire our work
dev.off()
I re-ran the code to make the wordcloud pdf multiple times, because there is something stochastic in the word placement. After a few tries the words aligned and suited my aesthetic tastes. And so I give you

Booze Cloud!

Click to embignify

Wednesday, July 9, 2014

Build Your Bar Project:
Introduction and Data Acquisition

As promised, today I'm going to be talking about my "big data" project (though I'm beginning to find the term a bit cringe-inducing), looking at how best to build a bar given a bottle limit or price limit.

Inspiration

I was having lunch last Friday at the bar lounge of a local restaurant; my seat was situated with a nice view of the bar in all its glory: hundreds of bottles, with representatives of almost every kind of booze imaginable. I love cooking and, to a lesser degree, mixology, and I've always wanted to have a small home bar.

My dream bar would be stocked so that visitors would almost always be able to truly pick their poison. I began thinking: up on that wall were, at a minimum, 250 bottles; at an average price of at least $40 per bottle, I would need over $10,000 to replicate it, which is vastly out of my price range. In the spirit of making do, I began to wonder: how many bottles would I really need to provide suitable coverage of all possible drink recipes? Or, if I could only afford so many bottles, how could I maximize the number of possibilities?

The Plan

With the wheels set in motion I began wondering how best to answer these questions. I have seen drinking magazine articles offer nice heuristics on how to choose an ideal micro-bar selection, but being a datahead I decided there must be a data-oriented solution.

I knew there were numerous online recipe catalogs that I could mine data from, so when I got home I got to googling. Within a couple of minutes I had chosen a website boasting close to 20,000 recipes and decided it would be my data deposit, keeping with the mining metaphor.

Data Harvesting

In order to get my hands on that pile of data I needed to make sure the website was suitable to mine. I began by exploring the directory structure of the website through browsing. I also began getting ideas of which recipes I wanted to include in the analysis: punches generally require planning, so I didn't want those, and non-alcoholic beverages are a niche my bar isn't meant for, so I was sure I was going to exclude those too. I decided I was going to spider the website I found and pull out the recipe page URLs with some regular expressions.

Get Those Spiders Crawling

This was one of my first real experiences attempting to download tens of thousands of web pages, so if you have any suggestions for how I could have simplified this stage please leave a comment below. My tool of choice for this step was wget, an incredible free tool for download automation. I used the following code to spider the site:

$ wget -U mozilla -o logArhythm.txt --spider --force-html URL

This step could take a very long time depending on the size of the website. Mine took 4-5 hours.

Flags and Arguments

  • -U (takes an identity for the spider): the server drops connections with spiders, so masquerade as a Firefox user
  • -o (takes a text file of your choice): where to put a log file containing the output
  • --spider (no argument): deactivates the default behaviour of downloading all files; instead wget recursively follows links within the site of interest, allowing the directory structure of the site to be discovered
  • --force-html (no argument): treat all found files as html to facilitate crawling
  • URL: the location of the site to download

Process the log file


So now we have a completed log file containing wget's default output. The next step was to load the log file into R, my favourite scripting language, and extract just the URLs, ignoring all the extra output wget provides.

#Read in text from a file, split into character (strings) by whitespace
logText <- scan("logArhythm.txt", what = "character")


Scan returns a vector of strings delimited by white space. For each page visited by my spider the log file contains several lines of output. Conveniently, the log only contains the full URL once for each visited page, which makes it easy to find the URLs with minimal redundancy. The spider can visit the same page multiple times, so duplicate URLs are removed with a call to unique.

URLs <- logText[grepl("http", logText)]
URLs <- unique(URLs)

The variable URLs now contains the location of each page on the site, but I just want drink recipes. All drink recipe pages have a terminal path of the form "/drink[code].html", with [code] being anywhere from 1 to 6 alphanumeric characters. To identify URLs that correspond to recipe pages I used a simple regular expression. See the footnotes for a description of the perl-style regular expression ".+?"¹

isDrinkURL <- grepl("drink.+?", URLs, perl = TRUE)
drinkURLs <- URLs[isDrinkURL]
write(drinkURLs, "drinkSiteList.txt")


Now we have a text file containing every page within the domain I examined that ends in "/drink[some code].html"; this file is perfect for the next step. Wget can take an input file containing a list of pages to download. Make sure you run wget in a folder that you don't mind filling with a ton of files.

$ wget -i drinkSiteList.txt -U mozilla

I left this download running overnight and woke up to the wonderful present of 60 megs of raw html data. Lo and behold, only the drink recipe pages had been downloaded; the steps up to this stage had been successful. Please be conscientious about executing code like this: unchecked use of wget can be very taxing on a server and on your own bandwidth. Wget comes with many safeguards, including download size limits, download rate limiting, and other tools to avoid damaging a site that you like enough to want to borrow data from.

As I am still new to blogging, this post took an excruciating amount of time to compose, so I'm going to end it here. In the coming weeks I hope to master html and perhaps some productivity tools to reduce the production time of a piece like this so I can provide you with lengthier analyses.

Please stay tuned for more on this project. I already have some exploratory analyses completed, including a word cloud, and if I can make it happen there will be a random drink name generator to give you ideas for your next famous party cocktail.

Chris


1 R supports several types of regular expressions, one of the most powerful being Perl-like. The regular expression ".+?" means: give me any character {.}, I want at least one but possibly more {+}, and choose the fewest number of characters that match {?}. In more technical terms, plus is an indefinite quantifier, meaning it can capture different numbers of characters; the default behaviour is greedy, trying to match as many characters as possible, and the question mark changes the behaviour to non-greedy. In this case it isn't necessary to specify non-greedy matching, but when matching tags later in the analysis it is integral. For example, "<p>.+</p>" when searching an html document would match from the beginning of the first paragraph to the end of the last one, whereas "<p>.+?</p>" would only match the first paragraph.
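If you want to see the difference for yourself, here is a quick toy example in R (my own, not from the analysis above):

html <- "<p>first</p><p>second</p>"
regmatches(html, regexpr("<p>.+</p>",  html, perl = TRUE))  # greedy: matches the whole string
regmatches(html, regexpr("<p>.+?</p>", html, perl = TRUE))  # non-greedy: matches just "<p>first</p>"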

Monday, July 7, 2014

Hello World!
Introductions and Caveats

Today marks the start of my datamancy blog. The purpose of this blog is to serve as an academic jot-pad/notebook to contain my thoughts primarily on data science, bioinformatics, technology, and a healthy serving of other trivia that I encounter in my work and studies. I will try to avoid complete self-indulgence, but at the present moment the content presented here is primarily for myself. However¹, I hope anyone who stumbles upon this site will find something interesting².

Through this blog I will hopefully present a number of hobby analyses I have been and will be conducting, with code and instructions for interested readers. I also plan to present lay-person explanations of "big-data" techniques I have been and will be using. I find that simplifying and writing out algorithms has always helped me to understand them, and my explanations may help you too.

I have chosen the term datamancy for my blog, which requires some explanation. The suffix -mancy indicates divination by means of whatever precedes it, in this case data. I believe that one of the most important concerns for anyone involved with data is the ability to accurately predict the behaviour of a system in the future or under some potential conditions; in other words, divination through data.

Well, that concludes my introduction and caveats. I hope you'll stay tuned for more content, or check out newer posts if you are reading this in the future. The next post will be an introduction to my data-mining and analysis project looking at how best to stock a bar given a bottle limit!

Chris


1 So I just checked to make sure it's generally appropriate to use "however" to start a sentence; according to "The Elements of Style" it is not, but more modern style guides say it's OK. Check out this link for more details.

2 For example, check out footnote 1 for a neat discussion of when and how to use the word "however". The more you know!