February 25, 2010

The problem is timid edits.

Filed under: Programming,VCS — Tags: , , — Nathan @ 11:46 pm

I recently posted about Git and how it has changed my development style in only a week of using it. After discussing this with a couple of people, I believe I have a better way of articulating the underlying problem.

Fundamentally, it comes down to what I have been calling “timid edits”.

We’ve all been there. There’s some nasty piece of code that you just can’t get to work right. There’s some edge-case that you realize you forgot to deal with and you’re trying to wedge in the fix. You change this, you change that. You exert tremendous effort trying to bend your existing code to the right shape, but to no avail. In the end, after god knows how long, you just say screw it and rewrite the whole function.

Why did it take so long to reach that point? Why didn’t you just rewrite the offending code-block when you realized there was some flaw? I think sometimes we mislead ourselves into thinking that code that ‘almost-works’ is somehow actually ‘close’ to the actual solution when we are in fact at a dead end. Similarly, I think that we feel like deleting code will take us further away from our goal and perhaps we feel as if we can’t get back.

These timid edits don’t only affect debugging, but also get in the way of refactoring. “Our API works, why rewrite it?” “My class hierarchy works, why change it?” Because of our timidity when it comes to rewriting code we fail to reap the possible benefits like efficiency, clarity, or maintainability.

In order to deal with this issue—timid edits—we have to have confidence in our understanding of the problem itself, and have confidence in our ability to recreate its solution.

We have tools that can help us find our way back, when we go wandering away from the solution. These tools can embolden us. I have recently begun using Git for exactly this purpose. I keep very frequent way-points and also create lots of small, short-lived branches.

At the end of the day we need to stop floundering around, worrying with our code as it is. We need to have a good understanding of the tools at hand so that they can instill in us the confidence to stop timid edits.

February 22, 2010

Lessons I should have learned, Episode 3: Hot swapping binaries

About a year ago I was having a discussion with my friend Crutcher when he suggested that one could hot-swap versions of a running program. This post describes my implementation of just such a thing.

Why would you hot-swap? One of the major benefits of hotswapping is that the new version of the program will have access to all of the old version’s file-descriptors. This means that any files, sockets, or pipes that the previous version currently had open can still be open. For example, if one was careful, you could hotswap an application that was in the middle of serving a very large file to a user without him being aware that anything happened.

Before we begin discussing how this should work, let’s look at some of the problems. Sure we get file descriptors, but what about all of our state? Well, this is a problem. We cannot easily take our state with us. Since we’re updating versions here, you want to be particularly sure you only take state that you need. To do this, I would recommend using a serialization library for C. A bit of Googling showed me this one, TPL, though I haven’t tried it yet. For our example here, we’ll just manually move the only pieces of state we care about: a counter and a file descriptor to a file we’re currently writing to.

The basic idea of what’s going to happen is that we will create a pair of pipes and then fork(). The child process will hold the pipe that does the writing and the parent the one that does the reading. Now, the parent will exec. This is a bit odd. Normally when you fork, then exec, it’s the child process which does the exec. However, here we really want the new version of the program to have access to all of the old file descriptors. Luckily, execl preserves these. As an added benefit, the program gets the exact same process ID.

So, let’s look at the important bits of the hot-swap (reader and writer are the file descriptors for the pipe):

unsigned int outputFD = fileno(outputFile);

if(fork()) {
  /* I am the parent. */
  char readBuf[20] = {0};

  close(writer);
  sprintf(readBuf, "%d", reader);

  execl("./newbinary", "--hotswapping", readBuf, (char*)0);
  exit(0);
} else {
  /* I am the child.*/
  FILE *outputStream = fdopen(writer, "w");
  close(reader);

  fprintf(outputStream, "%d\n", i);
  fprintf(outputStream, "%u\n", (unsigned int) outputFD);
  fclose(outputStream);
  exit(0);
}

First, let’s look at what the parent process does. It simply closes its “writer” since it will never need to write to the pipe then it execs “newbinary” which is the new version of the program. It does so with a flag “–hotswapping”. This flag indicates another parameter will follow which is the file descriptor for the “read” end of the pipe we created. We do this so that the new binary can then get the state serialized across the pipe from the old binary.

Now, onto the child process. Line 32 creates a file handler from the file descriptor which is the “write” end of the pipe. Why? Because I’m lazy and I would prefer to work with fprintf() to write(). Now that we have this file handler, we can fprintf() directly to it and serialize the state we want. In this contrived case the only state I care about is my counter variable and the file descriptor of my output file. Line 19 gives me the descriptor from the handler using int fileno(FILE *).

So, to recap, we fork() then the parent exec’s to the new version of the binary and the child writes any relevant state to a pipe which the new binary is listening to.

Now, let’s look at what has to exist in the new binary. The new binary must recognize the argument “–hotswapping” which passes along the file descriptor of the “read” pipe. The following, in newversion.c does just this:

for(i = 0; i < argc; i++) {
  if(!strcmp(argv[i], "--hotswapping")) {
    int reader = atoi(argv[++i]);
    inputStream = fdopen(reader, "r");
  }
}

Notice that Line 17 does something interesting. It converts the file descriptor back to a file handler using fdopen(int, char*). Thus we can use this pipe just like we were reading from a file. So, now we can use fscanf to read from the pipe instead of having to worry about read() and buffers. This is done starting at line 21:

fscanf(inputStream, "%d", &i);
fscanf(inputStream, "%u", &outputFD);
fclose(inputStream);
outputFile = fdopen(outputFD, "w");

Once again, at line 24, we turn the file descriptor we read from the pipe back into a file handler. Now, we can resume writing to it, just as we did before. It will continue to append to the end of the file.

The files which implement this are available as gists here:

The output when run for 11 seconds is:

gcc -Wall -pedantic -o newbinary newversion.c
gcc -Wall -pedantic -o example original.c
./example
Original:       1 My PID=27272
Original:       2 My PID=27272
Original:       3 My PID=27272
Original:       4 My PID=27272
Original:       5 My PID=27272
New Binary:     6 My PID=27272
New Binary:     7 My PID=27272
New Binary:     8 My PID=27272
New Binary:     9 My PID=27272
New Binary:     10 My PID=27272
New Binary:     11 My PID=27272

February 19, 2010

Git it on

Filed under: Programming,VCS — Tags: , — Nathan @ 11:29 am

Subversion was my first. I didn’t know what I was doing. I fumbled around mostly. Our relationship was very vanilla. We had our everyday routine and very rarely did we deviate whatsoever. Mostly I would just checkout, only then to commit again later. Then, one day when I was out and about I met someone new. Git. That’s when everything changed. Things started being much faster and wilder. I was doing it in ways I never had before. Forks! Wow! The relationship is hot, and new, so perhaps it’s that the endorphines haven’t worn off yet, but I’m enamored. Yeah, I know that there are others just like Git out there; just as fast and fun, will do the same things if you whisper slightly different phrases. But, for now, Git is all I need.

Enough with the double entendre

OK, the back story. I have used Subversion for my personal projects for years. I even have my dissertation committed to a svn repo (which itself is backed up all over the planet!). Even though I’ve used Subversion for years, I do very little with it. svn up; svn add AFILE; svn commit pretty much summarizes most of my usage. A couple of times I did attempt to do things better by creating a branch to add a new feature, but that branch ended up taking a life of its own, leaving the main trunk an orphan and taking up all of my attention.

My usage of subversion was little more than a backed-up version of the tricks that I used to use before I had any revision control system. Let’s say I was about to do some major change to the file AFILE, then I would create AFILE.bak before editing, so that if I hosed something in a big way, I could at least put it back or diff it against the backup.

Sure, there are ways of doing all of this better with Subversion, but the system doesn’t really afford them. Thus my current infatuation with Git. I’ve only been using it a week, but my development style has already changed.

How do I use git? The main tool in my arsenal is git checkout. Let’s look at this example:

nathan @ macbook ~/repo/Example $ ls
Makefile hello.c  world.c

nathan @ macbook ~/repo/Example $ git branch
* master

As you can see, at the moment there’s only one branch: master. Let’s say that world.c is something that I spent a lot of time on, but I’m about to make a major revision to it. So, I would do this:

nathan @ macbook ~/repo/Example $ git checkout -b adding.new.feature
Switched to a new branch 'adding.new.feature'

nathan @ macbook ~/repo/Example $ ls
Makefile hello.c  world.c

nathan @ macbook ~/repo/Example $ git branch
* adding.new.feature
  master

I’m still in my same directory, but now any changes I make will happen only to this branch. When I’m done working on whatever minor revision I’m working with, making sure of course that I’ve tested everything thoroughly, I will commit the change to this branch, then switch back to “master”:

nathan @ macbook ~/repo/Example $ git add world.c 

nathan @ macbook ~/repo/Example $ git commit -m "Made revision \
to world.c"
[adding.new.feature b00248b] Made revision to world.c
 1 files changed, 1 insertions(+), 0 deletions(-)

nathan @ macbook ~/repo/Example $ git checkout master
Switched to branch 'master'

Since I’m now back at the master branch, even though I’m in the exact same directory, my view will be that of the master. The edits I made to “world.c” will not be visible. If I’m confident with those edits, I can merge them in.

nathan @ macbook ~/repo/Example $ git branch
  adding.new.feature
* master

nathan @ macbook ~/repo/Example $ git merge adding.new.feature
Updating b3d7029..b00248b
Fast-forward
 world.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

The fact that I can have a DAG (directed acyclic graph) of branches makes me happy. One for the love of graph theory, and two because I’m much bolder with my revisions and refactorings. Beyond all of the features that it offers, this confidence alone is worth the switch to Git, or any other DVCS.

The coup de grâce is that I use this on top of an existing Subversion repository. So I can collaborate with others just like I always did, with my post-commit scripts in place. Of course, this is unnecessary since I can just git push remote.repo and send it to any repo I’d like.

It’s only been a week, but I’m in love.

February 17, 2010

Lessons I should have learned, Episode 2: hiding your data

Filed under: Lessons I should have learned — Tags: , , — Nathan @ 12:49 am

Episode 2: hiding your data

Let’s consider a contrived example where you are representing people and their relationships. We want to represent a person’s name, social security number, address, and have pointers to mother and father.

typedef struct Person {
  char *name;
  char *address;
  int ssn;
  Person *mother;
  Person *father;
} Person;

Now, let’s say that we have some library which will create a queriable tree for us.

Person *mandy = insertPerson("Mandy", \
                  123331234, "Hut #13" 0, 0);
Person *naughtius = insertPerson("Naughtius", \
                  349830123, "Centurianville #2", 0, 0);
Person *brian = insertPerson("Brian", \
                  593013297, "Hut #13", mandy, naughtius);

This is all well and good. But there’s a problem. The user of this library can do something like this:

Person *mary = insertPerson("Mary", \
                  012309821, "Bethleham", 0, 0);
brian->mother = mary;

There might possibly be a use-case where we want to allow this to happen, but in general this isn’t desirable whatsoever. We really want to keep the fields mother, father, and ssn immutable. We can do this in the following way:

/* FILE: library.h */
typedef struct Person {
  char *name;
  char *address;
} Person;

int getSSN(Person *);
Person *getMother(Person *);
Person *getFather(Person *);
Person *insertPerson(char *, int, char *, Person*, Person*);

Now we have a struct Person which exposes only information that we will allow to be mutable. How can we use this cleanly and effectively? We define the following:

/* FILE: library.c */
#include "library.h"
typedef struct privatePerson {
  Person p;
  int ssn;
  privatePerson *mother;
  privatePerson *father;
} privatePerson;
...

How does this help us? First, consider the following:

privatePerson *mary = ...;
Person *mom = (Person *) mary;

What will happen here? First, recall how struct layout works. Since Person p is the first entry in struct privatePerson then the offset in the privatePerson struct is 0. This means that the memory location of Person p is the same as the struct privatePerson which contains it. So, when we cast to the Person pointer we are, in fact pointing at a Person object. So, insertPerson(...) will use a struct privatePerson internally but will return a struct Person. As will getMother(...).

/* FILE: library.c */
...
Person *getMother(Person *p) {
  privatePerson *who = (privatePerson *) p;
  return (Person *) who->mother;
}

This example illustrates how we can also cast back to a privatePerson from a Person. Note that this REQUIRES that the pointer to the struct Person also be a pointer to a struct privatePerson, i.e. you’ve already cast it once. We can’t ensure that our user will never malloc his own struct Person. The best we can do is provide factories for construction (insertPerson above) and strongly document that doing so is unsupported and will lead to undefined behavior.

Because struct privatePerson is defined within library.c, it won’t be visible to anyone who includes library.h. Just like with the previous post on opaque pointers, this isn’t magic. The user can still opt-in to violate our model by including his own definition of a privatePerson struct and casting back to it. Though hopefully by doing so the user will be perfectly aware that he’s breaking our API and thus is writing code that could potentially break in the next minor point release of the library.

February 11, 2010

Lessons I should have learned, Episode 1: opaque pointers

Filed under: Lessons I should have learned — Tags: , , — Nathan @ 11:08 pm

I’ve spent my entire adult life in academia. For some reason academia, even in software engineering courses, doesn’t really harp on best practices or good patterns to know. It is possible that some do and that I’ve just missed it since I specialized in theory. So, this article will be the first in a larger category of articles which will describe patterns and techniques that I should have known but didn’t.

Episode 1: opaque pointers

I don’t care what anyone says. C is a great language. About a year ago I undertook the task of implementing a Fibonacci heap for use in an empirical analysis of two labeled graph algorithms. I put the code away months and months ago, but luckily I just found out that I got some CPU time on a very prestigious cluster so I’ve been reworking the implementations to collect better statistics. Also, since I’m finally going to do this analysis and it’s going to be published (at least in my dissertation) then I will most definitely release all of the code so that the experiments can be replayed and verified. So, since other people will look at this code I’ve spent several hours over the last few days cleaning it up and making the interfaces more reasonable. This lead me to actually pulling my fibheap out into its own stand alone project.

The fibheap isn’t ready for GitHub just yet (though look for it soon), but I’ve been working on the API and learning a lot about how this kind of thing should work. The user should be completely unaware of the internal structure of the data structure. He shouldn’t be able to mutate it in any way except through the accessors that the API provides. Though, sometimes he will need to hold pointers to particular elements in the heap (for DecreaseKey calls). This means that we need some way of providing access to elements without giving away any of our top-secret structure. So, the very simple concept that I should have known about: opaque pointers.

The crux of the matter is to separate declaration from definition just like we were always told (except in the case of templated classes… sigh). In the example of the fibheap, we want the user to have a handle to the fibheap but to otherwise not be able to affect the heap except through the API functions. So, we do this:

/* FILE: fibheap.h */
typedef struct FibHeap FibHeap;

FibHeapElement *fibHeapGetRoot(FibHeap*);
unsigned int fibHeapGetSize(FibHeap*);

And that’s it. There is no definition of the structure here, only a declaration of it. Then:

/* FILE: fibheap.c */
struct FibHeap {
  FibHeapElement *root;
  unsigned int size;
};

Because only the declaration is in the header, whenever the user does #include "fibheap.h" he only gets the name of the struct, not the layout. Thus the user cannot access either of the fields without the accessor functions fibHeapGetRoot(FibHeap*) and fibHeapGetSize(FibHeap*). No instantiation for you! (caveat: the user could easily define the struct himself and have access to the internals of the structure. This is the kind of hard opt-in that hopefully makes it clear to the user that he’s violating the spirit of the API and that this will not necessarily be stable with future versions.)

So, to summarize:

Opaque pointers allow you to provide handlers to structs to which your users have no idea the form or actual structure of.

The next installment will be of a couple of ways that you can provide partial information to the user instead of none. Stay tuned!

February 8, 2010

Google Videos

Filed under: blog — Tags: , — Nathan @ 5:27 pm

This is one of the new Google Search Stories. I must say, pretty beautiful. I think this is a pretty good answer to those silly Bing commercials.

February 6, 2010

Books and such.

Filed under: Programming,Theory,TheoryBook — Tags: , , — Nathan @ 12:31 am

For years I’ve wanted to write a book. My friend Charles and I thought about writing a book on probability back when I was an undergraduate. Most recently I’ve been thinking about writing a computer science book which is an introduction to algorithms. This was spurred by my teaching exactly such a class this semester.

CS260 at the University of Alabama is entitled “Foundations of Computer Science”. This course is intended to be an introduction to the theoretical aspects of the field, ranging from analysis of algorithms to simple datastructures. This course is for second semester students, so unfortunately they’ve only had one semester of programming and don’t have a very strong math background. This means that we can’t do the more rigorous analysis of algorithms and fun proofs that would need things like counting and inductive proofs. So, I’ve decided to take a different tact. I’m trying to help them develop an intuition of the concepts by starting with the implementation of some algorithm (sorting algorithm, actually) and walking them toward a function which is computes how much work is required without actually doing the work. By then plotting these function they can get a visual representation of how much work is actually being done and see its relation to other mathematical functions. This has actually proven pretty effective at teaching asymptotics without actually teaching asymptotics.

Therefore, the book that I want to write is an introduction to theoretical computer science via empirical analysis. What if someone scoops me because of this? I don’t care. I think that this would actually help a lot of students and if I don’t get enough time to do it, then someone else should. So, I’m going to try to start putting the sketches of sections here. Eventually I’d like to set up a system like Bryan O’Sullivan had for Real World Haskell with random internet users being able to comment on parts of the books with suggestions. Stay tuned!

February 4, 2010

Mantis

Filed under: SysAdmin — Tags: , , , — Nathan @ 5:44 pm

I just heard about Mantis a PHP based bug-tracker. I’ve never really had experience with any of these and it looks much lighter weight than Bugzilla or something like that. I haven’t even gotten to play with it yet because of installation issues. So, let me share some of them with you now:

Go ahead an temporarily create a muser user in my MySQL db:

GRANT ALL ON *.* TO muser@locahost IDENTIFIED BY 'somepassword';

You’ll take away global permissions in a minute.

Now, just unzip the Mantiz in some directory that your webserver has access to and then navigate to it in your browser. This was “/var/www/mantis” and http://localhost/mantis for me.

Following the instructions will get this all installed. Now, I don’t have a mail server, so I cannot create users. So, the default user is “administrator” and the default password is “root”. This took me some googling to figure out.

Create a new user for yourself. Note that if you don’t have a mail server running, you cannot successfully create an account because you don’t know what the temporary password is. There’s probably a better way to fix this, but here’s how I did it.

Create a new file in your webdirecotry named “w.php”:

< ?php
$mynewpassword='whatever';
echo md5($mynewpassword) . "
"; echo crypt($mynewpassword); ?>

When you run this, the output should be some very long string. For me, the first one did the trick. Copy this one to the clipboard. So, let’s get our hands dirty and fix this. Go into mysql as root (sudo mysql):

mysql> USE  bugtracker;
mysql> UPDATE mantis_user_table
             SET password='WhateverYouCopiedEarlier'
             WHERE username='yourusername';

Now, try to log in to your Mantis. If this didn’t work, try the second long string that “crypt” generated. If that didn’t work, um, sorry.

Ok, so putting things back the way you found them:

REVOKE ALL ON *.* FROM muser@localhost;
GRANT ALL ON bugtracker.* TO muser@localhost;

Finally, you should disable your “administrator” account.

That’s all I have for now. I might have a review or something later. This was just a brief writeup of what was necessary for me to get rolling.

February 3, 2010

Stupid mistakes.

Filed under: Programming — Tags: , , — Nathan @ 5:55 pm

I just saw a post on /r/programming asking people to describe their dumbest coding mistakes. Instead of just posting one of mine there, I’ll do it here.

Many, many years ago, back in 2005 when I was a green grad student and an all around n00b, I got the opportunity to work on the auto-discovery daemon for OpenMosix (may it rest in peace). OpenMosix was a self-assembling cluster system. You would just install it on several machines in the same subnet, and bam you’d have a cluster that automatically distributed jobs. On with the story.

I don’t remember what part of the daemon I was working on. It was already doing some auto-discovery and self assembly. I was pretty pleased. The basic idea was that it would fork off a listener and have another process that would chatter on broadcast announcing its presence. At some point, about 5 minutes into the run, the whole thing would segfault. More precisely, at 4:38 in, it would segfault. Wait, what? The same exact amount of time to crash? That’s just weird! Well, no kidding.

This killed me for several hours. I had no idea what was up. That’s the day I got familiar with using GDB to attach to running processes. And what did I discover? I out clevered myself. I had code like this:

fprintf(stdout, "This is some stupid debugging \
message that I'll eventually remove\n");

That looks all well and good. Why is a printf breaking? Well, it’s breaking because of this line:

fclose(stdout);

See above where I said that I out-clevered myself. Basically, I wanted to turn off all that annoying debugging chatter. So, what the hell, why not just close stdout? It worked! No more chatter. Well, that’s not all. See, whenever using printf, I was writing to a closed file handler… but it was buffered. So, my program would continue to run, dumping silly messages to this buffer until the buffer ran out of space and it smashed some import part of memory causing the segfault. Oops.

So, what did I learn? Debugging messages are important, but use some kind of clever logging system, even if it’s just a series of macros. Sometimes, if your idea seems really clever, take a step back because it might lead you to more trouble than it’s worth.

Testing

Filed under: Uncategorized — Tags: — Nathan @ 3:55 pm

I’m still familiarizing myself with WordPress and its features. This is a code test:

C:

int foo(double bar) {
  return bar ^ 0x1111000;
}

I’d like to be able to have source code highliting done staticly. Anyone know of a plugin for that? (ans: CodeHighlighter)

Haskell:

fac 0 = 1
fac n = (*) n $ fac $ n - 1

JavaScript:

function silly() {
  this.foo = "bar";
  document.body.appendChild(document.createElement("hr"));
}

I’ll be editing this post occasionally to try to test new features, like LaTeX support amongst others.

This is a
$latex i\hbar\frac{\partial}{\partial t}\left|\Psi(t)\right>=H\left|\Psi(t)\right>$
test.

Older Posts »

  • I'm a software engineer at Google
  • I'm from Alabama
  • I live in San Francisco
  • I like to work on ridiculous things
  • I'm currently learning German, Scala, and Computer Vision
  • This book referred to JavaScript I wrote when I was 15.