March 12, 2011

Khaaaaaaaaaan! The TED Talk

Filed under: blog — Nathan @ 8:31 pm



This video really makes me miss teaching. I think maybe I’ll start the
“Wiegand Academy” of Computer Science classes. Any opinions one way or the other?

May 5, 2010

Taking MapReduce to Monte Carlo

Filed under: Programming — Tags: , , — Nathan @ 4:20 pm

MapReduce is one of those simple ideas that you look back on and say, well damn, I could have thought of that. MapReduce is a simple technology that allows programmers to write two simple functions, a mapper and a reducer, and have them scale to attack very large problems. I was recently needing to run some MonteCarlo experiments involving various graph algorithms over random graphs. Luckily I had access to some compute time on a very large MapReduce system. I haven’t seen much discussion on putting them together. So, this article describes how to use MapReduce as a tool for MonteCarlo simulations.

MapReduce

Most of you already know a bit about them, but just a reminder. MapReduce jobs consist of two main phases: Map and Reduce. To begin, some large body of data is broken into manageable pieces (say, one line of text from a huge file, or a single HTML file from the entire internet). Each of these pieces is handed to a Mapper. The Mapper takes this piece of the larger data and emits key-value pairs.

Before theses key-value pairs are passed to the Reducers, they are sorted and the keys are aggregated. A reduce instance will get all of the values associated with a single key. It will then compute some new value from this, emitting an answer (also as a key-value pair).

One of the canonical examples of how this works is to calculate word-frequencies from some very large body of text. What follows is some Python-esque pseudo code for doing just this.

def Mapper(line):
    for x in line.split(" "):
        emit(x, 1)

This takes each word and emits it as the key with a value of one. Now, each of these are aggregated and passed to Reducer instances. So, for example, the following Reducer might be called with a list of values associated with the key “the”:

def Reducer(key, values):
    count = 0
    for v in values:
        count += v
    emit(key, count)

Now, the output after the Reducer has run is the frequency for every word occurring in the data-set.

MonteCarlo

Monte Carlo
The basic idea of a MonteCarlo experiment is that you’re trying to calculate the value of some underlying function or variable, and you do so by repeated random experiment.

For example, you can calculate pi easily using a MonteCarlo experiment. Imagine you have a dart board with radius 1 meter (big dartboard). It’s contained within a perfectly fitting square, thus it’s 2m on a side. If we throw a dart, what’s the probability that we hit the dart board, given that we hit within the square? It’s the area of the board divided by the total area of the square, so $\pi/4$.

To make the math easier, let’s just look at the upper right quadrant. Thus, $x\in[0,1], y\in[0,1]$. The area within the square in this quadrant is 1, and the area of the circle is $\pi/4$. Note that the probability of hitting the dart board is the same. We can parameterize the circle as $x^2 + y^2 \leq 1$. So, let’s set up our experiment like this.


$hit(x,y) = \left\{\begin{aligned} 1 & & \text{if }x^2 + y^2 \leq 1 \\ 0 & & otherwise \end{aligned} \right.$

Now, $hit(x,y)$ is an indicator random variable. Thus, $\text{E}(hit(x,y)) = p$ where $p$ is the probability of $hit(x,y) = 1$, which we already stated is $\pi/4$. So what? The expected value of this function is $\pi/4$, who cares? Well, this means that if we uniformly choose numbers between 0 and 1 and pass them through $hit(x,y)$, and figure out what the experimental average number of hits is, this should converge to the expected value. In other words. Throw thousands of darts and find the ratio of hits to total throws. This will tend toward $\pi/4$.

Bringing it together

Because our random numbers are drawn with replacement, there is no real difference in running 10 experiments with 100 darts and 1 experiment with 1000 darts. But, what do we pass to the Mapper?

When we teach undergraduates about random number generators, we show them the pitfalls of not seeding a random number generator. Then, we show them the easy way to correct this, which is to just use the current time as the seed. Also here, we also want our experiment to be repeatable.

So, instead of using current machine time, let’s seed each experiment with a different number. Thus, the input to our MapReduce job is just going to be the seeds. For ours, lets just make the input be $[1, \ldots, 100]$. And, say that each Mapper only gets a single input. The Mapper is where the MonteCarlo experiment actually happens. Let each mapper run 100 sims (not necessary, but why not?):

import random
def Mapper(s):
    random.seed([s])
    hitCounter = 0
    for i in range(100):
        x = random.random()
        y = random.random()
        if x*x + y*y <= 1:
            hitCounter += 1
    emit(1, hitCounter)

Of note here is that the key being emitted is 1 for all of the Mappers. This means that only a single reducer will run. The reducer need just aggregate these counts:

def Reducer(key, values):
    totalHits = 0
    for v in values:
        totalHits += v
    # each value is out of 100, to total is 100*|values|
    emit(key, totalHits / (100. * len(values))

Because there's only only one Reducer, there's only one answer from the whole experiment, the experimental average of the number of hits. This should converge to $\pi/4$ as the number of experiments increases.

Conclusion

Sure calculating $\pi $ like this is serious overkill. Yes, there are much more efficient methods. But the point is that this is a fairly easy technique.

I used this with Erdős–Rényi random graphs with varying edge-probabilities. So, in addition to the seed, the input to my Mappers also included the edge probability. So, the idea is that you design a single Monte Carlo experiment with paramerized variables and pass those in to the Mappers.

This made my life much, much easier. I was able to use several hundred CPU hours of experimentation but with only a few hours of wall time. With Amazon's Elastic MapReduce instances only being $0.015 per CPU hour, this is definitely worth putting some thought into.

April 21, 2010

Lessons I should have learned, Episode 4: Coroutines in C

Filed under: Lessons I should have learned,Programming — Tags: , , — Nathan @ 2:19 pm

A while ago, I read this article on coroutines in C. One of the main downfalls of this technique is that since it uses the static keyword, it makes it much harder to write recursive coroutines, and nigh-impossible to have multiple coroutines using the same function.

Examples

Let’s first look at an example of how it can work. One way of efficiently computing Fibonacci numbers in a language like Haskell is to construct an infinite list of the Fibonacci values.

fibs a b = a : fibs b (a+b)

Since Haskell is lazily evaluated, such an infinite list is fine until you attempt to compute the whole thing. So, as long as you just say something like take 10 $ fibs 1 1 it will only compute how many you need.

One important use of Coroutines is as iterators. Consider the Haskell example from above. If we assume that we have such an infinite list of values, we may want to iterate over the list, inspecting the values as we go. The following C code achieves this using some helper macros we’ll describe below.

int fibs(int a, int b, CoroutineState *s) {
  initializeCoroutineState(s, FibState, mystate);

  mystate->a = a;
  mystate->b = b;
  yield(s,mystate->a);
  recur (s,fibs(mystate->b, mystate->a + mystate->b, s->next));

  finalizeCoroutine(s);
  return 0;
}

Now, if we wanted to print out the first 10 of these values:

s = createCoroutineState();
for(i = 0; i < 10; i++) {
  printf("%d ", fibs(1,1,s));
}

The yield method is similar to Python's yield key word. It returns a value, but also records where the yield was called so that the next time the function is executed it can begin again from that point.

Another example of how coroutines can be used as iterators is as iterators over a binary tree. Consider the following tree:

       4
  2         6
1   3     5    7 

A standard way to traverse this tree is a post-order traversal. Consider the following code which traverses this tree and prints it out:

void print(Tree *t) {
  if(t->left)
    print(t->left);
  if(t->right)
    print(t->right);
  printf("%d ", t->value);
}

Building a post-order iterator is slightly more complex. It involves creating a stack and pushing wrapped tree-nodes onto the stack. When one is popped, it is marked as 'visited' and then pushed onto the stack, followed by its right child then left child. However, we can use coroutines to achieve the same thing while staying true to the original traversal code:

int treePostOrderIterator(Tree *t, CoroutineState *s) {
  initializeCoroutineState(s,TreeIteratorState, mystate);
  mystate->tree = t;

  if(mystate->tree->left)
    recur (s, treePostOrderIterator(mystate->tree->left, s->next));

  if(mystate->tree->right)
    recur (s, treePostOrderIterator(mystate->tree->right, s->next));

  yield (s,mystate->tree->value);

  finalizeCoroutine(s);
  return 0;
}

How it works

When I was learning to program, I learned that switch statements were just conditional structures that were equivalent to a series of if statements. While you can almost always get away with thinking about them this way, they aren't actually. Instead, think of switch statements as parameterized gotos. One of the less objectionable uses of goto is as a means of breaking out of nested loops. So, obviously goto can jump out of scopes. In fact, it can be used to jump into and out of any scope inside of a single function. The same is true of the cases of a switch statement. This is what makes Duff's device possible.

In order to facilitate recursion and having multiple coroutines running at once, we need to pass a mutable state parameter. This is an object of type struct CoroutineState.

typedef struct CoroutineState CoroutineState;
struct CoroutineState {
  void *state;            /* Pointer to a struct representing the internal state
                             of the coroutine.  We need this since we have to
                             reinstate the scope whenever the function is
                             re-intered.                                      */
  long current;           /* The line number of where the coroutine should
                             return.  Used in 'yield' and 'recur'             */
  unsigned char done;     /* A marker to show that a recurrence is completed  */
  CoroutineState *next;   /* The next in the stack of states                  */
  CoroutineState *parent; /* NULL if actually the parent, pointer otherwise   */
  CoroutineState *tail;   /* Only the parent is guaranteed to know the tail   */
};

Basically this struct stores everything needed to reinstate the coroutine when it is next called. This includes the line number it should begin executing on, a pointer to the user's saved values, and a stack containing the recursive calls.

Every coroutine begins with an invocation of the initializeCoroutineState macro. This is a macro that takes three parameters. The first is a CoroutineState object, the second is the type defining the user's saved values, and the last is the variable name to use to refer to this state.

#define initializeCoroutineState(coroutineState, UserStateType, userStateVar) \
  UserStateType *userStateVar = (UserStateType*) coroutineState->state;       \
  switch(coroutineState->current) {                                           \
  case 0:                                                                     \
  if(coroutineState->state == 0) {                                            \
    coroutineState->state = (void*) malloc(sizeof (UserStateType));           \
    bzero(coroutineState->state, sizeof(UserStateType));                      \
  }                                                                           \
  userStateVar = (UserStateType*) coroutineState->state;                      \

Note that there is a switch statement embedded here. This switch statement jumps to the appropriate line in the execution.

Every coroutine ends with finalizeCoroutine(s) which indicates that after this point the execution of the function should proceed normally. It also says that when this function returns that the coroutine is complete and should be removed from the stack. This is indicated internally with s->done =1;.

#define finalizeCoroutine(s)                                                  \
  }                                                                           \
  s->done = 1;                                                                \
  free(s->next);                                                              \
  s->next = 0;

Now, to the meat of the matter. yield needs to return the value the user wants to yield, but also record where this occurred so that the switch statement from above will jump to here. It also needs to reinstate any variables that the user had in scope. Unfortunately there's no good way of doing that in C, so we have to do it manually. Thus the need for the helper struct. Before the return in the following macro, we're just manipulating the stack so that we can record where we are. Notice coroutineState->current = __LINE__;. __LINE__ is a macro that evaluates to the current line number in the program. So, here we're recording the line number and then after the return, we have case __LINE__:. This will cause the coroutine to begin executing after the return the next time it's called.

#define yield(coroutineState, value)                                          \
  do{                                                                         \
    CoroutineState *parent = coroutineState ? coroutineState                  \
        : coroutineState->parent;                                             \
    coroutineState->current = __LINE__;                                       \
    parent->tail = coroutineState;                                            \
    free(coroutineState->next);                                               \
    coroutineState->next = 0;                                                 \
    return value;                                                             \
    case __LINE__: 1;                                                         \
  } while(0)                                                             

The last macro is perhaps the most interesting. It's the one that allows things like the infinite list iterator and the binary tree iterator. Basically, there are times when you want to have a recursive call and have the coroutine state stored along with having a return value bubble up.

The while loop is responsible for executing until the coroutine called below it is completed.

#define recur(coroutineState,func)                                            \
  do{                                                                         \
    CoroutineState *parent = coroutineState ? coroutineState                  \
        : coroutineState->parent;                                             \
    coroutineState->next = createCoroutineState();                            \
    coroutineState->next->parent = parent;                                    \
    parent->tail = coroutineState->next;                                      \
    while(!coroutineState->next->done) {                                      \
      CoroutineState *parent = coroutineState ? coroutineState                \
          : coroutineState->parent;                                           \
      typeof(func) tmp =  func;                                               \
      coroutineState->current = __LINE__;                                     \
      if(!parent->tail->done)                                                 \
        return tmp;                                                           \
      case __LINE__: 1;                                                       \
    }                                                                         \
  } while(0)                                                                  \

Using it

In order to get this to work, you have to do one extra thing. You need to create a struct which you use to store your state rather than use local-variables in your functions. This is necessary since we'll be re-entering the functions later and that state would be lost. So, just put all the variables you would want to use into the struct. Then use this struct in the initializeCoroutineState and then whenever you use a local variable.

If you would normally return a computed value (not the result of a recursive call) then just say yield (value); If you are doing a recursive call and intend for the coroutine to proceed down the recurrence, just wrap it in a recur(s, yourRecursiveCall()).

Get it

I've posted the implementation as a library on GitHub: http://github.com/nathanwiegand/coroutines.

April 11, 2010

The ugliest code I ever wrote: Parser generator written in XSLT

Filed under: Programming — Tags: , , , , — Nathan @ 4:41 pm

I took a compiler design course in the Spring semester of 2004. I cannot recommend enough that you should take a course like this at some point in your academic career. I hear more and more about how departments are removing this as a requirement, or making the course easier. I think this is a real shame, but that’s a debate for another day. This post is about my semester project for Compilers.

The assignment was to write a compiler for a subset of Pascal that generated bytecode for a VM that we were provided. Our compiler was to be written in C/C++, and we weren’t to use Lex/Yacc. So, my teammate Charles and I set upon breaking down the tasks. He wrote the lexer, and I wrote the parser.

First, a quick note. It was 2004. Things were weird in 2004. We were in the midst of several wars. Facebook was founded. Cassini reached Saturn. And, XML was big. At least in my group of friends it was BIG. So, when it came time to build the the parse tree, the obvious choice was to just use LibXML and use an XML tree. This let me trivially dump the entire tree as an XML doc. I then wrote an XSLT (even remember it?) to transform the XML to Graphviz Dot so I could visualize the parse tree. This was my version of testing.

After quite a few iterations, I had plenty of time until the deadline so I set out working on the obvious: a parser generator. We weren’t allowed to use Yacc, but no one said anything about writing our own.

Luckily the language we were parsing was quite simple so my parser was just recursive-descent. Thus, it was fairly easy to generalize it. So, what language does one write a parser generator in? XSLT of course. The choice was motivated by #1 someone told me I couldn’t do it and #2 there were serious wins to representing the grammar itself in XML.

So, how did it work? It took a grammar represented in XML. Then an XSLT transformed that grammar into a C parser for that language which then produced the parse tree as an XML document. But, a compiler doesn’t end at parsing. The type-checking step was also done in XSLT, but over a decorated AST instead. Unfortunately we were running low on time (down to mere hours) in the end so we ended up writing the code generator in C instead of XSLT.

Here’s some gems from the parser generator:

This bit of code is near the top of the file. Notice what it does.



    
    true
    
    
    

It calls another template, much like a function that can’t return anything since it can only emit text/elements, with the parameter “prototype = true”. So, looking at the beginning of functions:


    
    
    
int
parse
(xmlNode* par)
    ;
    
{
    int didSomething=0;
    
    ...

So, functions is responsible for both emitting the code for each parse function and also for making the prototypes.

Sadly, it’s been 6 years so I don’t remember enough about the code to do anything with it these days. And like all school projects, documentation was considered a sign of weakness or some such. So, the only artifacts are the source files themselves.

Without further ado, several of the files from our compiler:
parser-generator.xsl
grammar.xml
type-checker.xml

April 7, 2010

Introducing Funion

Filed under: Programming — Tags: , — Nathan @ 1:41 pm

I defended my dissertation on March 26. So, I guess that means I’m Dr. Nathan now. I’m not sure what normal people do the week after their defense, but I’m doing sprint after sprint on personal projects. I start work in the middle of July and I hear that this company is known as a black hole. There’s just too many cool things internally to learn so I probably won’t have much time for personal projects for a good while. Thus, the first of my projects that’s ready for the world: Funion.

Funion

Funion (Fuse Union /fʌn:njən/) is a Haskell program that uses the HFuse binding to the FUSE library. It presents a filesystem which shows a unioned view of multiple directories.

Example

Consider the case where you have the following two directories, A and B.




But, A and B represent two instances of some common hierarchical scheme. So, unioning A and B, we get:

Get Funion

You can get Funion two ways: GitHub and Hackage

If you have cabal, you can just do:


  cabal install funion


Installation

Funion requires: *nix, FUSE, and GHC (preferably the Haskell Platform).

If you have Ubuntu, install the Haskell Platform. Then, to install FUSE:


  sudo apt-get install libfuse-dev

Now, you can just cabal install funion.

Usage

Create a mountpoint. Then use funion to union several dirs into that mountpoint:


  mkdir mountpoint
  funion mountpoint /A /B

where /A and /B are teh directories to union.

Upcoming features

Currently Funion is readonly. The next version will allow file editing, and possibly file creation. If a file is created it will have a specific file system to be placed in. Or perhaps, there could be a round-robin file-writing scheme.

There is no logging as of now. The next version will have a logging mode.

FunionD: A daemon which will allow you to mount your funioned directories upon boot.

March 16, 2010

Profiling C with Haskell

Filed under: Programming — Tags: , , , , — Nathan @ 11:33 pm

I had the great pleasure of attending the ICFP in 2009. I’m pretty new to the functional programming game, so it was really cool to meet the movers and shakers. I got to see Bryan O’Sullivan give this presentation on Criterion — a “robust, reliable performance measurement and analysis” package for analyzing Haskell programs. You can read Bryan’s description here.

When I first saw the Criterion presentation, I immediately thought about porting this to C. Unfortunately, I’ve been writing a dissertation and I haven’t gotten around to it yet. However, I did decide to do something that some might consider unholy. I’ve used Criterion to profile my C functions.

How to profile C with Criterion

For this example, I’m going to profile a Fibonacci heap that I’ve been working on.

We’re going to use Haskell’s Foreign Function Interface (FFI). So, let’s start with a simple header file for our test function:

/* FILE: test-fibheap.h */
#ifndef __TESTFIBHEAP__
#define __TESTFIBHEAP__
int test(int, int);
#endif

and now our test function:

/* FILE: test-fibheap.c */
#include "test-fibheap.h"
#include "fibheap.h"
#include < math.h>

int test(int n, int seed) {

  FibHeap *heap = FibHeapMakeHeap();

  int *arr = (int*) malloc(sizeof(int) * n);
  int i; 

  srand(seed);

  for(i = 0; i < n; i++) {
    arr[i] = i;
  }
  /* Insert the numbers [0, n) into the heap in a random order */
  for(i = 0; i < n-1; i++) {
    int r = rand() % (n - i) + i;
    int t = arr[i];
    arr[i] = arr[r];
    arr[r] = t;
    FibHeapInsert(heap, (double) i, 0);
  }
  /* remove all of them one at a time.  Same as sorting the array*/
  for(i = 0; i < n; i++) {
    FibHeapExtractMin(heap);
  }
  return 0;
}

This function takes two parameters, n which is how many elements to insert into the heap, and seed which is the seed to the random number generator. I don't want to use the current time as the seed because I want to be able to compare the efficiency on exactly the same structure.

Now, how to use the FFI:

{-# INCLUDE "test-fibheap.h" #-}
{-# LANGUAGE ForeignFunctionInterface #-}
module Main where
import Foreign.C -- get the C types

foreign import ccall unsafe "test" testFib ::
                  CInt -> CInt -> CInt

Of note are the first and last lines. The first line is just like #include. Then, the last line binds our C function int test(int,int) to the name testFib.

So, bringing it all together with Criterion:

{-# INCLUDE "test-fibheap.h" #-}
{-# LANGUAGE ForeignFunctionInterface #-}
module Main where
import Foreign.C -- get the C types
import Criterion.Main

-- "unsafe" means it's slightly faster but can't callback to haskell
foreign import ccall unsafe "test" testFib ::
               CInt -> CInt -> CInt

main = defaultMain [
          bgroup "fib" [bench title $ whnf (testFib n) seed]
          ]
  where
    n = 1000000
    seed = 7
    title = "testFibHeap " ++ (show n) ++ " " ++ (show seed)

Compile this with ghc -O --make profile.hs fibheap.o test-fibheap.o.
Then, run it ./profile -t png:500x175 -k png:500x175

I've run this twice. First with my fibheap code compiled without optimizations and then again with -O3

Results

Unoptimized

./profile -t png:500x175 -k png:500x175
warming up
estimating clock resolution...
mean is 9.843398 us (80001 iterations)
found 1259 outliers among 79999 samples (1.6%)
  1157 (1.4%) high severe
estimating cost of a clock call...
mean is 522.7480 ns (95 iterations)
found 3 outliers among 95 samples (3.2%)
  1 (1.1%) high mild
  2 (2.1%) high severe

benchmarking fib/testFibHeap 1000000 7
collecting 100 samples, 1 iterations each, in estimated 189.5377 s
bootstrapping with 100000 resamples
mean: 1.916353 s, lb 1.909884 s, ub 1.932393 s, ci 0.950
std dev: 48.79063 ms, lb 23.67417 ms, ub 99.28092 ms, ci 0.950
found 9 outliers among 100 samples (9.0%)
  4 (4.0%) high mild
  5 (5.0%) high severe
variance introduced by outliers: 0.998%
variance is unaffected by outliers



Optimized with -O3

./profile -t png:500x175 -k png:500x175
warming up
estimating clock resolution...
mean is 9.928848 us (80001 iterations)
found 1264 outliers among 79999 samples (1.6%)
  1150 (1.4%) high severe
estimating cost of a clock call...
mean is 522.4059 ns (95 iterations)
found 3 outliers among 95 samples (3.2%)
  1 (1.1%) high mild
  2 (2.1%) high severe

benchmarking fib/testFibHeapOptimized 1000000 7
collecting 100 samples, 1 iterations each, in estimated 67.12110 s
bootstrapping with 100000 resamples
mean: 670.1145 ms, lb 669.3443 ms, ub 671.1168 ms, ci 0.950
std dev: 4.445731 ms, lb 3.547659 ms, ub 5.887752 ms, ci 0.950
found 7 outliers among 100 samples (7.0%)
  4 (4.0%) high mild
  3 (3.0%) high severe
variance introduced by outliers: 0.990%
variance is unaffected by outliers



Post Mortem

Running the optimized test above yields:

real    0m0.722s
user    0m0.690s
sys     0m0.030s

which is comparable to the Criterion results. There's certainly an overhead involved because of the Haskell run-time, but for profiling relatively long running functions I don't think this will be that much of an issue.

I think it's worth adding to the arsenal, don't you?

February 25, 2010

The problem is timid edits.

Filed under: Programming,VCS — Tags: , , — Nathan @ 11:46 pm

I recently posted about Git and how it has changed my development style in only a week of using it. After discussing this with a couple of people, I believe I have a better way of articulating the underlying problem.

Fundamentally, it comes down to what I have been calling “timid edits”.

We’ve all been there. There’s some nasty piece of code that you just can’t get to work right. There’s some edge-case that you realize you forgot to deal with and you’re trying to wedge in the fix. You change this, you change that. You exert tremendous effort trying to bend your existing code to the right shape, but to no avail. In the end, after god knows how long, you just say screw it and rewrite the whole function.

Why did it take so long to reach that point? Why didn’t you just rewrite the offending code-block when you realized there was some flaw? I think sometimes we mislead ourselves into thinking that code that ‘almost-works’ is somehow actually ‘close’ to the actual solution when we are in fact at a dead end. Similarly, I think that we feel like deleting code will take us further away from our goal and perhaps we feel as if we can’t get back.

These timid edits don’t only affect debugging, but also get in the way of refactoring. “Our API works, why rewrite it?” “My class hierarchy works, why change it?” Because of our timidity when it comes to rewriting code we fail to reap the possible benefits like efficiency, clarity, or maintainability.

In order to deal with this issue—timid edits—we have to have confidence in our understanding of the problem itself, and have confidence in our ability to recreate its solution.

We have tools that can help us find our way back, when we go wandering away from the solution. These tools can embolden us. I have recently begun using Git for exactly this purpose. I keep very frequent way-points and also create lots of small, short-lived branches.

At the end of the day we need to stop floundering around, worrying with our code as it is. We need to have a good understanding of the tools at hand so that they can instill in us the confidence to stop timid edits.

February 22, 2010

Lessons I should have learned, Episode 3: Hot swapping binaries

About a year ago I was having a discussion with my friend Crutcher when he suggested that one could hot-swap versions of a running program. This post describes my implementation of just such a thing.

Why would you hot-swap? One of the major benefits of hotswapping is that the new version of the program will have access to all of the old version’s file-descriptors. This means that any files, sockets, or pipes that the previous version currently had open can still be open. For example, if one was careful, you could hotswap an application that was in the middle of serving a very large file to a user without him being aware that anything happened.

Before we begin discussing how this should work, let’s look at some of the problems. Sure we get file descriptors, but what about all of our state? Well, this is a problem. We cannot easily take our state with us. Since we’re updating versions here, you want to be particularly sure you only take state that you need. To do this, I would recommend using a serialization library for C. A bit of Googling showed me this one, TPL, though I haven’t tried it yet. For our example here, we’ll just manually move the only pieces of state we care about: a counter and a file descriptor to a file we’re currently writing to.

The basic idea of what’s going to happen is that we will create a pair of pipes and then fork(). The child process will hold the pipe that does the writing and the parent the one that does the reading. Now, the parent will exec. This is a bit odd. Normally when you fork, then exec, it’s the child process which does the exec. However, here we really want the new version of the program to have access to all of the old file descriptors. Luckily, execl preserves these. As an added benefit, the program gets the exact same process ID.

So, let’s look at the important bits of the hot-swap (reader and writer are the file descriptors for the pipe):

unsigned int outputFD = fileno(outputFile);

if(fork()) {
  /* I am the parent. */
  char readBuf[20] = {0};

  close(writer);
  sprintf(readBuf, "%d", reader);

  execl("./newbinary", "--hotswapping", readBuf, (char*)0);
  exit(0);
} else {
  /* I am the child.*/
  FILE *outputStream = fdopen(writer, "w");
  close(reader);

  fprintf(outputStream, "%d\n", i);
  fprintf(outputStream, "%u\n", (unsigned int) outputFD);
  fclose(outputStream);
  exit(0);
}

First, let’s look at what the parent process does. It simply closes its “writer” since it will never need to write to the pipe then it execs “newbinary” which is the new version of the program. It does so with a flag “–hotswapping”. This flag indicates another parameter will follow which is the file descriptor for the “read” end of the pipe we created. We do this so that the new binary can then get the state serialized across the pipe from the old binary.

Now, onto the child process. Line 32 creates a file handler from the file descriptor which is the “write” end of the pipe. Why? Because I’m lazy and I would prefer to work with fprintf() to write(). Now that we have this file handler, we can fprintf() directly to it and serialize the state we want. In this contrived case the only state I care about is my counter variable and the file descriptor of my output file. Line 19 gives me the descriptor from the handler using int fileno(FILE *).

So, to recap, we fork() then the parent exec’s to the new version of the binary and the child writes any relevant state to a pipe which the new binary is listening to.

Now, let’s look at what has to exist in the new binary. The new binary must recognize the argument “–hotswapping” which passes along the file descriptor of the “read” pipe. The following, in newversion.c does just this:

for(i = 0; i < argc; i++) {
  if(!strcmp(argv[i], "--hotswapping")) {
    int reader = atoi(argv[++i]);
    inputStream = fdopen(reader, "r");
  }
}

Notice that Line 17 does something interesting. It converts the file descriptor back to a file handler using fdopen(int, char*). Thus we can use this pipe just like we were reading from a file. So, now we can use fscanf to read from the pipe instead of having to worry about read() and buffers. This is done starting at line 21:

fscanf(inputStream, "%d", &i);
fscanf(inputStream, "%u", &outputFD);
fclose(inputStream);
outputFile = fdopen(outputFD, "w");

Once again, at line 24, we turn the file descriptor we read from the pipe back into a file handler. Now, we can resume writing to it, just as we did before. It will continue to append to the end of the file.

The files which implement this are available as gists here:

The output when run for 11 seconds is:

gcc -Wall -pedantic -o newbinary newversion.c
gcc -Wall -pedantic -o example original.c
./example
Original:       1 My PID=27272
Original:       2 My PID=27272
Original:       3 My PID=27272
Original:       4 My PID=27272
Original:       5 My PID=27272
New Binary:     6 My PID=27272
New Binary:     7 My PID=27272
New Binary:     8 My PID=27272
New Binary:     9 My PID=27272
New Binary:     10 My PID=27272
New Binary:     11 My PID=27272

February 19, 2010

Git it on

Filed under: Programming,VCS — Tags: , — Nathan @ 11:29 am

Subversion was my first. I didn’t know what I was doing. I fumbled around mostly. Our relationship was very vanilla. We had our everyday routine and very rarely did we deviate whatsoever. Mostly I would just checkout, only then to commit again later. Then, one day when I was out and about I met someone new. Git. That’s when everything changed. Things started being much faster and wilder. I was doing it in ways I never had before. Forks! Wow! The relationship is hot, and new, so perhaps it’s that the endorphines haven’t worn off yet, but I’m enamored. Yeah, I know that there are others just like Git out there; just as fast and fun, will do the same things if you whisper slightly different phrases. But, for now, Git is all I need.

Enough with the double entendre

OK, the back story. I have used Subversion for my personal projects for years. I even have my dissertation committed to a svn repo (which itself is backed up all over the planet!). Even though I’ve used Subversion for years, I do very little with it. svn up; svn add AFILE; svn commit pretty much summarizes most of my usage. A couple of times I did attempt to do things better by creating a branch to add a new feature, but that branch ended up taking a life of its own, leaving the main trunk an orphan and taking up all of my attention.

My usage of subversion was little more than a backed-up version of the tricks that I used to use before I had any revision control system. Let’s say I was about to do some major change to the file AFILE, then I would create AFILE.bak before editing, so that if I hosed something in a big way, I could at least put it back or diff it against the backup.

Sure, there are ways of doing all of this better with Subversion, but the system doesn’t really afford them. Thus my current infatuation with Git. I’ve only been using it a week, but my development style has already changed.

How do I use git? The main tool in my arsenal is git checkout. Let’s look at this example:

nathan @ macbook ~/repo/Example $ ls
Makefile hello.c  world.c

nathan @ macbook ~/repo/Example $ git branch
* master

As you can see, at the moment there’s only one branch: master. Let’s say that world.c is something that I spent a lot of time on, but I’m about to make a major revision to it. So, I would do this:

nathan @ macbook ~/repo/Example $ git checkout -b adding.new.feature
Switched to a new branch 'adding.new.feature'

nathan @ macbook ~/repo/Example $ ls
Makefile hello.c  world.c

nathan @ macbook ~/repo/Example $ git branch
* adding.new.feature
  master

I’m still in my same directory, but now any changes I make will happen only to this branch. When I’m done working on whatever minor revision I’m working with, making sure of course that I’ve tested everything thoroughly, I will commit the change to this branch, then switch back to “master”:

nathan @ macbook ~/repo/Example $ git add world.c 

nathan @ macbook ~/repo/Example $ git commit -m "Made revision \
to world.c"
[adding.new.feature b00248b] Made revision to world.c
 1 files changed, 1 insertions(+), 0 deletions(-)

nathan @ macbook ~/repo/Example $ git checkout master
Switched to branch 'master'

Since I’m now back at the master branch, even though I’m in the exact same directory, my view will be that of the master. The edits I made to “world.c” will not be visible. If I’m confident with those edits, I can merge them in.

nathan @ macbook ~/repo/Example $ git branch
  adding.new.feature
* master

nathan @ macbook ~/repo/Example $ git merge adding.new.feature
Updating b3d7029..b00248b
Fast-forward
 world.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

The fact that I can have a DAG (directed acyclic graph) of branches makes me happy. One for the love of graph theory, and two because I’m much bolder with my revisions and refactorings. Beyond all of the features that it offers, this confidence alone is worth the switch to Git, or any other DVCS.

The coup de grâce is that I use this on top of an existing Subversion repository. So I can collaborate with others just like I always did, with my post-commit scripts in place. Of course, this is unnecessary since I can just git push remote.repo and send it to any repo I’d like.

It’s only been a week, but I’m in love.

February 17, 2010

Lessons I should have learned, Episode 2: hiding your data

Filed under: Lessons I should have learned — Tags: , , — Nathan @ 12:49 am

Episode 2: hiding your data

Let’s consider a contrived example where you are representing people and their relationships. We want to represent a person’s name, social security number, address, and have pointers to mother and father.

typedef struct Person {
  char *name;
  char *address;
  int ssn;
  Person *mother;
  Person *father;
} Person;

Now, let’s say that we have some library which will create a queriable tree for us.

Person *mandy = insertPerson("Mandy", \
                  123331234, "Hut #13" 0, 0);
Person *naughtius = insertPerson("Naughtius", \
                  349830123, "Centurianville #2", 0, 0);
Person *brian = insertPerson("Brian", \
                  593013297, "Hut #13", mandy, naughtius);

This is all well and good. But there’s a problem. The user of this library can do something like this:

Person *mary = insertPerson("Mary", \
                  012309821, "Bethleham", 0, 0);
brian->mother = mary;

There might possibly be a use-case where we want to allow this to happen, but in general this isn’t desirable whatsoever. We really want to keep the fields mother, father, and ssn immutable. We can do this in the following way:

/* FILE: library.h */
typedef struct Person {
  char *name;
  char *address;
} Person;

int getSSN(Person *);
Person *getMother(Person *);
Person *getFather(Person *);
Person *insertPerson(char *, int, char *, Person*, Person*);

Now we have a struct Person which exposes only information that we will allow to be mutable. How can we use this cleanly and effectively? We define the following:

/* FILE: library.c */
#include "library.h"
typedef struct privatePerson {
  Person p;
  int ssn;
  privatePerson *mother;
  privatePerson *father;
} privatePerson;
...

How does this help us? First, consider the following:

privatePerson *mary = ...;
Person *mom = (Person *) mary;

What will happen here? First, recall how struct layout works. Since Person p is the first entry in struct privatePerson then the offset in the privatePerson struct is 0. This means that the memory location of Person p is the same as the struct privatePerson which contains it. So, when we cast to the Person pointer we are, in fact pointing at a Person object. So, insertPerson(...) will use a struct privatePerson internally but will return a struct Person. As will getMother(...).

/* FILE: library.c */
...
Person *getMother(Person *p) {
  privatePerson *who = (privatePerson *) p;
  return (Person *) who->mother;
}

This example illustrates how we can also cast back to a privatePerson from a Person. Note that this REQUIRES that the pointer to the struct Person also be a pointer to a struct privatePerson, i.e. you’ve already cast it once. We can’t ensure that our user will never malloc his own struct Person. The best we can do is provide factories for construction (insertPerson above) and strongly document that doing so is unsupported and will lead to undefined behavior.

Because struct privatePerson is defined within library.c, it won’t be visible to anyone who includes library.h. Just like with the previous post on opaque pointers, this isn’t magic. The user can still opt-in to violate our model by including his own definition of a privatePerson struct and casting back to it. Though hopefully by doing so the user will be perfectly aware that he’s breaking our API and thus is writing code that could potentially break in the next minor point release of the library.

Older Posts »

  • I'm a software engineer at Google
  • I'm from Alabama
  • I live in San Francisco
  • I like to work on ridiculous things
  • I'm currently learning German, Scala, and Computer Vision
  • This book referred to JavaScript I wrote when I was 15.