MapReduce is one of those simple ideas that you look back on and say, well damn, I could have thought of that. It lets programmers write two small functions, a mapper and a reducer, and have them scale to attack very large problems. I recently needed to run some Monte Carlo experiments involving various graph algorithms over random graphs, and I was lucky enough to have access to compute time on a very large MapReduce system. I haven't seen much discussion of combining the two, so this article describes how to use MapReduce as a tool for Monte Carlo simulations.

MapReduce

Most of you already know a bit about MapReduce, but just as a reminder: MapReduce jobs consist of two main phases, Map and Reduce. To begin, some large body of data is broken into manageable pieces (say, one line of text from a huge file, or a single HTML file from the entire internet). Each of these pieces is handed to a Mapper. The Mapper takes this piece of the larger data and emits key-value pairs.

Before these key-value pairs are passed to the Reducers, they are sorted and grouped by key. A Reducer instance gets all of the values associated with a single key, computes some new value from them, and emits an answer (also as a key-value pair).

One of the canonical examples of how this works is to calculate word-frequencies from some very large body of text. What follows is some Python-esque pseudo code for doing just this.

def Mapper(line):
    # emit each word as a key with a count of 1
    for word in line.split(" "):
        emit(word, 1)

This takes each word and emits it as the key with a value of one. These pairs are then grouped by key and passed to Reducer instances. So, for example, the following Reducer might be called with the list of values associated with the key “the”:

def Reducer(key, values):
    # sum all of the 1s emitted for this word
    count = 0
    for v in values:
        count += v
    emit(key, count)

Now, the output after the Reducers have run is the frequency of every word occurring in the dataset.
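To see the whole flow end to end, here is a minimal, self-contained sketch that fakes the shuffle-and-sort step in plain Python. The emit calls above are replaced with returned lists; this is purely illustrative, not any particular framework's API:

from collections import defaultdict

def mapper(line):
    # emit a (word, 1) pair for each word on the line
    return [(word, 1) for word in line.split(" ")]

def reducer(key, values):
    # sum the counts for a single word
    return (key, sum(values))

lines = ["the cat sat on the mat", "the dog sat"]

# shuffle/sort: group every emitted value by its key, as the framework would
grouped = defaultdict(list)
for line in lines:
    for key, value in mapper(line):
        grouped[key].append(value)

# one reduce call per key
counts = dict(reducer(k, vs) for k, vs in grouped.items())
print(counts)  # e.g. {'the': 3, 'cat': 1, 'sat': 2, ...}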

Monte Carlo

The basic idea of a Monte Carlo experiment is that you're trying to calculate the value of some underlying function or variable, and you do so by repeated random experiments.

For example, you can calculate pi easily using a Monte Carlo experiment. Imagine you have a dartboard with radius 1 meter (a big dartboard). It's contained within a perfectly fitting square, which is thus 2m on a side. If we throw a dart, what's the probability that we hit the dartboard, given that we hit within the square? It's the area of the board divided by the total area of the square, so \(\pi \cdot 1^2 / 2^2 = \pi/4\).

To make the math easier, let's just look at the upper right quadrant, so \(x\in[0,1]\) and \(y\in[0,1]\). The area of the square in this quadrant is 1, and the area of the quarter circle is \(\pi/4\), so the probability of hitting the dartboard, given that we land in this quadrant, is still \(\pi/4\). The quarter circle is just the set of points with \(x^2 + y^2 \leq 1\). So, let's set up our experiment like this:

\[ hit(x,y) = \begin{cases} 1 & \text{if } x^2 + y^2 \leq 1 \\ 0 & \text{otherwise} \end{cases} \]

Now, \(hit(x,y)\) is an indicator random variable. Thus, \(\text{E}(hit(x,y)) = p\), where \(p\) is the probability that \(hit(x,y) = 1\), which we already stated is \(\pi/4\). So what? The expected value of this function is \(\pi/4\), who cares? Well, this means that if we uniformly choose \(x\) and \(y\) between 0 and 1, pass them through \(hit(x,y)\), and compute the experimental average number of hits, that average should converge to the expected value. In other words, throw thousands of darts and find the ratio of hits to total throws; that ratio tends toward \(\pi/4\).
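Before bringing MapReduce into it, here is a minimal single-machine sketch of that experiment (plain Python, nothing MapReduce-specific assumed):

import random

def estimate_pi(throws):
    hits = 0
    for _ in range(throws):
        # throw a dart uniformly at the unit square
        x = random.random()
        y = random.random()
        if x*x + y*y <= 1:
            hits += 1
    # hits/throws estimates pi/4, so multiply by 4
    return 4.0 * hits / throws

print(estimate_pi(100000))  # prints something close to 3.14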

Bringing it together

Because our random numbers are drawn independently (with replacement), there is no real difference between running 10 experiments with 100 darts each and 1 experiment with 1000 darts. But what do we pass to the Mapper?

When we teach undergraduates about random number generators, we show them the pitfalls of not seeding the generator, and then the easy fix of using the current time as the seed. Here, though, we also want our experiment to be repeatable.
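As a quick illustration of why the seed matters for repeatability (a sketch assuming Python's standard random module, which the Mapper below also uses):

import random

random.seed(42)
first_run = [random.random() for _ in range(3)]

random.seed(42)
second_run = [random.random() for _ in range(3)]

# re-seeding with the same value reproduces exactly the same "darts"
assert first_run == second_run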

So, instead of using the current machine time, let's seed each experiment with a different number. The input to our MapReduce job will just be the seeds. For ours, let's make the input \([1, \ldots, 100]\), and say that each Mapper gets only a single input. The Mapper is where the Monte Carlo experiment actually happens. Let each Mapper run 100 simulations (not necessary, but why not?):

import random

def Mapper(s):
    # seed the generator with the input so the run is repeatable
    random.seed(s)
    hitCounter = 0
    for i in range(100):
        # throw a dart uniformly at the unit square
        x = random.random()
        y = random.random()
        if x*x + y*y <= 1:
            hitCounter += 1
    # a single shared key so every count ends up at one Reducer
    emit(1, hitCounter)

Of note here is that the key being emitted is 1 for all of the Mappers. This means that only a single Reducer will run. The Reducer need only aggregate these counts:

def Reducer(key, values):
    totalHits = 0
    for v in values:
        totalHits += v
    # each value is out of 100 throws, so the total is 100 * len(values)
    emit(key, totalHits / (100. * len(values)))

Because there's only one Reducer, there's only one answer from the whole experiment: the experimental ratio of hits to throws. This should converge to \(\pi/4\) as the number of experiments increases.
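If you want to sanity-check the pieces before submitting a real job, here is a sketch that drives the same logic locally, with emit replaced by return values and seeds 1 through 100 as the inputs; multiplying the Reducer's output by 4 recovers the estimate of \(\pi\):

import random

def mapper(seed):
    # same logic as the Mapper above, but returning instead of emitting
    random.seed(seed)
    hits = 0
    for _ in range(100):
        x = random.random()
        y = random.random()
        if x*x + y*y <= 1:
            hits += 1
    return (1, hits)

def reducer(key, values):
    # each value is out of 100 throws, so the total is 100 * len(values)
    return (key, sum(values) / (100.0 * len(values)))

pairs = [mapper(seed) for seed in range(1, 101)]
_, ratio = reducer(1, [hits for _, hits in pairs])
print(4 * ratio)  # should be close to pi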

Conclusion

Sure, calculating \(\pi\) like this is serious overkill, and yes, there are much more efficient methods. But the point is that this is a fairly easy technique.

I used this with Erdős–Rényi random graphs with varying edge probabilities, so in addition to the seed, the input to my Mappers also included the edge probability. The idea is that you design a single Monte Carlo experiment with parameterized variables and pass those parameters in to the Mappers.
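For example (and this is only a hypothetical sketch, not the actual experiment I ran), a Mapper in the same emit-style pseudocode as above might take an input line carrying both the seed and the edge probability, and measure some property such as connectivity; networkx is used here only as a stand-in graph library:

import random
import networkx as nx  # stand-in graph library for this sketch

def Mapper(line):
    # hypothetical input format: "seed edge_probability", e.g. "17 0.05"
    seed_str, p_str = line.split()
    seed, p = int(seed_str), float(p_str)
    random.seed(seed)
    connected = 0
    for _ in range(100):
        # generate an Erdos-Renyi graph on 100 nodes and test a property of it
        g = nx.gnp_random_graph(100, p, seed=random.randrange(2**31))
        if nx.is_connected(g):
            connected += 1
    # key on the edge probability so each p is aggregated by its own Reducer
    emit(p, connected)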

This made my life much, much easier. I was able to use several hundred CPU hours of experimentation in only a few hours of wall time. With Amazon's Elastic MapReduce instances costing only $0.015 per CPU hour, this is definitely worth putting some thought into.