About a week ago my friend Garrett showed me this wonderful art tumblr called Movie Barcode. In it, an anonymous curator posts anamorphic stills generated by stitching together single-pixel-wide bars captured from a film's frames. I spent a few hours putting together The Movie Barcode Quiz. If you want to look at the code, here is the GitHub project.
I wanted to share a few details about the tools I used to build this project.
Getting the Data
I’m a data hoarder, and have been building web scrapers for about as long as I’ve been programming. I think writing a scraping script is a good way to dive into a new programming language because it involves a healthy amount of library and paradigm coverage. For web scraping, you’ll need to figure out how to fetch the URL, traverse the data to get what you want, model and sanitize it, then store it.
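Those four steps can be sketched in a few lines. In the sketch below the "fetch" is stubbed with a hardcoded HTML string so it runs without a network connection, and the "store" step just serializes to JSON; it's a minimal illustration of the pipeline, not a production scraper.

```javascript
// A minimal sketch of the scrape pipeline: fetch → traverse → model/sanitize → store.
// The fetch is stubbed with a hardcoded HTML string so the example is self-contained;
// a real scraper would request the page and use a proper HTML parser, not a regex.

const html = `
  <ul class="films">
    <li><a href="/img/blade-runner.png">Blade Runner  (1982)</a></li>
    <li><a href="/img/alien.png">Alien (1979)</a></li>
  </ul>`;

// Traverse: pull out each link's href and text with a (deliberately fragile) regex.
const linkRe = /<a href="([^"]+)">([^<]+)<\/a>/g;

// Model + sanitize: trim whitespace, split the title from the year.
const records = [];
let m;
while ((m = linkRe.exec(html)) !== null) {
  const [, src, label] = m;
  const parsed = label.trim().match(/^(.*?)\s*\((\d{4})\)$/);
  records.push({
    title: parsed[1],
    year: Number(parsed[2]),
    image: src,
  });
}

// Store: serialize to JSON (a real scraper would write to disk or a database).
const json = JSON.stringify(records, null, 2);
console.log(json);
```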
In the past, scraping has been painful.
I love data, but I do not like writing data scrapers. The process is painful and time-consuming, and requires a lot of maintenance. Ain’t nobody got time.
There are a lot of frameworks in a lot of languages which, depending on your needs, may ameliorate some of the pain of building and maintaining a scraper. At PyCon 2014 I got to see a demo of Portia, which is an intelligent GUI built atop scrapy. However, like many other projects in the Python world, scrapy and Portia feel heavy-handed for what they actually accomplish. I still use them and have a dedicated vagrant sandbox for projects requiring more granular scraping tools, but the overhead involved in setting up a new scraper has dissuaded me from starting many new data projects. I wanted to try something lighter.
Web Scraper (Chrome plugin)
It makes sense to me that your web scraper live in the browser along with the data, and I’ve wanted to try one of the browser plugins I’ve been hearing about. The impeccably named Chrome plugin, Web Scraper, is a perfect tool for getting data out of the browser and into a CSV file. For my purposes I had an excellently annotated list of film titles + years, along with links to their movie barcode images on the movie barcode index page. Using Web Scraper’s hierarchy tool, it was intuitive to describe to the crawler that it should open each of the links and save the image src. After about an hour and a half I had a nice big CSV with over 16k film titles and image URLs.
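Once the plugin hands you that CSV, turning it into usable records is a few lines of code. The sketch below uses hypothetical column names (`title`, `image-src`) and a naive split-based parser that assumes no commas or quotes inside fields; a real CSV library should be used if titles can contain commas.

```javascript
// Sketch of turning a Web Scraper CSV export into an array of records.
// The column names here are hypothetical — check the actual export header.
// Naive parsing: assumes fields contain no embedded commas or quotes.

const csv = [
  "title,image-src",
  "Alien (1979),http://example.com/alien.png",
  "Blade Runner (1982),http://example.com/blade-runner.png",
].join("\n");

const [headerLine, ...rows] = csv.split("\n");
const headers = headerLine.split(",");

// Zip each row's fields with the header names into an object.
const films = rows.map((row) => {
  const fields = row.split(",");
  return Object.fromEntries(headers.map((h, i) => [h, fields[i].trim()]));
});

console.log(films);
```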
Parse as a backend
Since the end result is going to be a (literally) Single Page App running on Angular, I don’t need a complex back end. In fact, the only data I’m transmitting are image URLs and movie titles. Theoretically, I could just put all of this data in a static file and download it with the rest of the assets, and there are some cases where that might make sense.
For this app I’m using Parse as my back end, which has good and bad features. Among the good for this project are its high availability, its APIs, and its price (free-ish).
Randomness with Parse
Parse has a limited query language. It’s built on MongoDB but only exposes a subset of Mongo’s capabilities. The most glaring shortcoming as far as this app is concerned is the complete lack of a “find random” function. Getting random rows is basically all I need this backend to do. Oy.
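A common workaround for backends with no "find random" is to store a random float on each row at write time, then fetch the first row at or above a random pivot, wrapping around if none match. In Parse terms that maps to `query.greaterThanOrEqualTo` plus `query.first()`; the sketch below demonstrates the same idea against a plain array standing in for the table, so it runs anywhere. This is just one possible workaround, not necessarily what the quiz app ships.

```javascript
// Workaround sketch for "find random" on a backend that lacks it:
// store a random float on each row at write time, then fetch the first
// row whose value is >= a random pivot (wrapping to the smallest if none).
// A plain array stands in for the Parse class so the example is self-contained.

const table = [
  { title: "Alien", random: Math.random() },
  { title: "Blade Runner", random: Math.random() },
  { title: "Stalker", random: Math.random() },
].sort((a, b) => a.random - b.random); // backend would sort ascending on "random"

function findRandom(rows) {
  const pivot = Math.random();
  // First row at or above the pivot, or wrap around to the smallest.
  return rows.find((row) => row.random >= pivot) || rows[0];
}

const pick = findRandom(table);
console.log(pick.title);
```

One caveat worth knowing: rows are not picked uniformly with this scheme, since each row's chance of selection is proportional to the gap between its random value and its predecessor's. For a quiz pulling from 16k rows that bias is usually acceptable.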
The Web App (Angular.js)
To set up this project I used the Yeoman angular generator, and its awesome bower + grunt integration for live reloading and building.