In February, I went to the Twin Cities Hadoop User Group meetup to see Donald Miner speak. Donald used the session to go over some examples from his book MapReduce Design Patterns. The examples in the book are written for Hadoop but are directly usable with other databases that support MapReduce. I wanted to try out some MapReduce patterns but didn’t want to install a Hadoop cluster at home. MongoDB, by contrast, is very easy to download and install on a dev machine.
The next morning I had this thought…
Goal: dump Stack Exchange data into MongoDB so I can mess around with MapReduce.
This is a simple task for an experienced dev on their chosen stack. My main back-end stack is .NET, but I’ve been doing a lot of iOS work and don’t boot into Windows that often. I figured this was a good chance to start exploring Ruby.
Step 1: Download Stack Exchange Data Dump. Get a torrent.
Step 2: Unpack the archive with 7zX
Step 3: Learn Ruby
Step 5: Put it in MongoDB
Step 6: There is no step 4?
Anyway, I enjoyed messing around with Ruby and thought others might want to do the same thing. Bitbucket link here…
Usage:
ruby stack-mongo.rb directoryName mongoDatabaseName
For other Ruby newbs: you will need to install RubyGems and then install the mongo and crack gems. This is tested on Ruby version 1.8.7 (since that’s what came installed on my MacBook). I’d recommend starting with one of the smaller Stack Exchange sites to keep load and batch times reasonable. I used English. This script assumes you have a running MongoDB instance on localhost. Modification to the connection string is left as an exercise for the reader.
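If you just want the general shape of the load step, here is a minimal sketch. It is not the stack-mongo.rb script itself, and it rests on a few assumptions: a current Ruby rather than 1.8.7, REXML instead of the crack gem, and dump files that look like a root element wrapping a series of <row .../> elements with the data in attributes. The helper names rows_to_documents and load_documents are made up for illustration, and every value is kept as a string with no type conversion.

```ruby
require 'rexml/document'

# Turn every <row .../> element under the root into a plain Hash
# keyed by attribute name. All values stay as strings.
# (Hypothetical helper, not taken from stack-mongo.rb.)
def rows_to_documents(xml)
  doc = REXML::Document.new(xml)
  documents = []
  doc.root.elements.each('row') do |row|
    document = {}
    row.attributes.each { |name, value| document[name] = value }
    documents << document
  end
  documents
end

# Insert documents in batches so a single call never carries the whole file.
# `collection` is anything that responds to insert_many, which is the
# batch insert method on a collection in the 2.x mongo gem.
# Returns the number of documents handed to the collection.
def load_documents(collection, documents, batch_size = 1000)
  inserted = 0
  documents.each_slice(batch_size) do |batch|
    collection.insert_many(batch)
    inserted += batch.length
  end
  inserted
end

# Usage against a local MongoDB (assumes the 2.x mongo gem; the database
# name and file name below are placeholders):
#
#   require 'mongo'
#   client = Mongo::Client.new(['127.0.0.1:27017'], database: 'stackexchange')
#   load_documents(client[:posts], rows_to_documents(File.read('Posts.xml')))
```

Reading a whole file into memory like this is fine for the smaller sites. For the big ones a streaming XML parser would be the safer choice.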
tl;dr Here’s how to load the Stack Exchange data dump into MongoDB. I just spent hours doing this in Ruby because I was too lazy to boot into Windows.