
Wednesday, July 9, 2014

Build Your Bar Project:
Introduction and Data Acquisition

As promised, today I'm going to be talking about my "big data" project (although I'm beginning to find the term a bit cringe-inducing), looking at how best to build a bar given a limit on the number of bottles or on the total price.

Inspiration

I was having lunch last Friday at the bar lounge of a local restaurant. My seat gave me a nice view of the bar in all its glory: hundreds of bottles, with representatives of almost every kind of booze imaginable. I love cooking and, to a lesser degree, mixology, and I've always wanted to have a small home bar.

My dream bar would be stocked so that visitors could almost always pick their poison. I began thinking: up on that wall were at least 250 bottles, and at an average price of at least $40 per bottle, I would need over $10,000 to replicate it, which is vastly out of my price range. In the spirit of making do, I began to wonder: how many bottles would I really need to provide suitable coverage of all possible drink recipes? Or, if I could only afford so many bottles, how could I maximize the number of possibilities?

The Plan

With the wheels set in motion, I began wondering how best to answer these questions. I have seen drinks magazine articles offer nice heuristics for choosing an ideal micro-bar selection, but being a datahead I decided there must be a data-oriented solution.

I knew that there were numerous recipe catalogs on the internet that I could mine data from, so when I got home, I got to googling. Within a couple of minutes I had chosen a website boasting close to 20,000 recipes and decided it would be my data deposit, to keep with the mining metaphor.

Data Harvesting

In order to get my hands on that pile of data, I needed to make sure the website was suitable to mine. I began by exploring the directory structure of the site through ordinary browsing. I also started getting ideas of which recipes I wanted to include in the analysis: punches generally require planning, so I didn't want those, and non-alcoholic beverages are a niche my bar isn't meant for, so I was sure I was going to exclude those as well. I decided I would spider the website I had found and pull out the recipe page URLs with some regular expressions.

Get Those Spiders Crawling

This was one of my first real experiences attempting to download tens of thousands of web pages, so if you have any suggestions for how I could have simplified this stage, please leave a comment below. My tool of choice for this step was wget, an incredible free tool for download automation. I used the following command to spider the site:

$ wget -r -U mozilla -o logArhythm.txt --spider --force-html URL

This step could take a very long time depending on the size of the website. Mine took 4-5 hours.

Flags and Arguments

Flag           Argument(s)                   Notes
-r             None                          Recursively follow links within the site of interest, allowing its directory structure to be discovered
-U             An identity for the spider    The server drops connections from obvious spiders, so masquerade as a Firefox user
-o             A text file of your choice    Where to put a log file containing wget's output
--spider       None                          Deactivates the default behaviour of downloading every file; pages are visited but not saved
--force-html   None                          Treat all found files as HTML to facilitate crawling
URL            (required)                    Location of the site to spider

Process the Log File

So now we have a completed log file containing wget's default output. The next step was to load the log file into R, my favourite scripting language, and extract just the URLs, ignoring all the extra output wget provides.

# Read in text from a file, split into character strings by whitespace
logText <- scan("logArhythm.txt", what = "character")


The scan function returns a vector of strings split on whitespace. For each page visited by my spider, the log file contains several lines of output. Conveniently, the log only contains the full URL once for each visited page, which makes it easy to find the URLs with minimal redundancy. The spider can visit the same page multiple times, so duplicate URLs are removed with a call to unique.

URLs <- logText[grepl("http", logText)]
URLs <- unique(URLs)
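
Out of habit I like to glance at the result before going further; a quick sanity check along these lines (not a required step) shows how many unique URLs the spider found and what the first few look like:

# How many unique URLs did the spider find, and do the first few look sensible?
length(URLs)
head(URLs)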

The variable URLs now contains the location of each page on the site, but I just want drink recipes. All drink recipe pages have a path ending in "/drink[code].html", with [code] being anywhere from 1 to 6 alphanumeric characters. To identify the URLs that correspond to recipe pages I used a simple regular expression. See the footnote for a description of the Perl-style regular expression ".+?".[1]

# Keep only the URLs that point at drink recipe pages
isDrinkURL <- grepl("drink.+?", URLs, perl = TRUE)
drinkURLs <- URLs[isDrinkURL]
write(drinkURLs, "drinkSiteList.txt")
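
In hindsight, the pattern "drink.+?" will match any URL that contains "drink" anywhere in it. If you wanted to encode the "/drink[code].html" structure described above more strictly, something like the following sketch would do it (the exact character class is an assumption about what codes the site actually uses):

# Stricter alternative (sketch): anchor to a "/drink[code].html" ending,
# where [code] is 1 to 6 alphanumeric characters
isDrinkURL <- grepl("/drink[[:alnum:]]{1,6}\\.html$", URLs, perl = TRUE)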


Now we have a text file containing every page within the domain I examined whose URL ends in "/drink[some code].html", and this file is perfect for the next step: wget can take an input file containing a list of pages to download. Make sure you run wget in a folder that you don't mind filling with a ton of files.

$ wget -i drinkSiteList.txt -U mozilla

I left this download running overnight and woke up to the wonderful present of 60 megabytes of raw HTML data. Lo and behold, only the drink recipe pages had been downloaded; the steps up to this stage had been successful. Please be conscientious about executing code like this: unchecked use of wget can be very taxing on a server and on your own bandwidth. Wget comes with many safeguards, including download size limits (--quota), download rate limiting (--limit-rate), pauses between requests (--wait), and other tools to avoid damaging a site that you like enough to want to borrow data from.
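
Before moving on it's worth a quick check in R that the download looks sane. This is just a sketch of my own, and it assumes wget saved the recipe pages into the working directory under their original "drink[code].html" names:

# Count the downloaded recipe pages and estimate how much disk they occupy
drinkFiles <- list.files(pattern = "^drink.+\\.html$")
length(drinkFiles)                        # number of recipe pages downloaded
sum(file.info(drinkFiles)$size) / 1024^2  # total size in megabytes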

As I am still new to blogging, this post took an excruciating amount of time to compose, so I'm going to end it here. In the coming weeks I hope to master HTML and perhaps some productivity tools to reduce the production time of a post like this, so I can provide you with lengthier analyses.

Please stay tuned for more on this project. I already have some exploratory analyses completed, including a word cloud, and if I can make it happen there will be a random drink name generator to give you ideas for your next famous party cocktail.

Chris


[1] R supports several types of regular expressions, one of the most powerful being Perl-like. The regular expression ".+?" means: give me any character {.}, I want at least one but possibly more {+}, and choose the fewest number of characters that match {?}. In more technical terms, the plus is an indefinite quantifier, meaning it can capture different numbers of characters; its default behaviour is greedy, trying to match as many characters as possible, and the question mark changes the behaviour to non-greedy. In this case it isn't necessary to specify non-greedy matching, but when matching tags later in the analysis it is integral. For example, "<p>.+</p>" applied to an HTML document would match from the beginning of the first paragraph to the end of the last one, whereas "<p>.+?</p>" would only match the first paragraph.
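
If you want to see the difference for yourself, here is a small R demonstration using a made-up snippet of HTML (the example string is mine, not taken from the site):

html <- "<p>first paragraph</p><p>second paragraph</p>"

# Greedy: matches from the first <p> all the way to the last </p>
regmatches(html, regexpr("<p>.+</p>", html, perl = TRUE))
# [1] "<p>first paragraph</p><p>second paragraph</p>"

# Non-greedy: stops at the first </p>
regmatches(html, regexpr("<p>.+?</p>", html, perl = TRUE))
# [1] "<p>first paragraph</p>"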
