Hi! Welcome...

This is the blog of Jake Boxer, a software engineer in Cambridge, Massachusetts.

13 December 2009 ~ 15 Comments

The Knuth-Morris-Pratt Algorithm in my own words

For the past few days, I’ve been reading various explanations of the Knuth-Morris-Pratt string searching algorithms. For some reason, none of the explanations were doing it for me. I kept banging my head against a brick wall once I started reading “the prefix of the suffix of the prefix of the…”.

Finally, after reading the same paragraph of CLRS over and over for about 30 minutes, I decided to sit down, do a bunch of examples, and diagram them out. I now understand the algorithm, and can explain it. For those who think like me, here it is in my own words. As a side note, I’m not going to explain why it’s more efficient than naïve string matching; that’s explained perfectly well in a multitude of places. I’m going to explain exactly how it works, as my brain understands it.

The Partial Match Table

The key to KMP, of course, is the partial match table. The main obstacle between me and understanding KMP was the fact that I didn’t quite fully grasp what the values in the partial match table really meant. I will now try to explain them in the simplest words possible.

Here’s the partial match table for the pattern “abababca”:

char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

If I have an eight-character pattern (let’s say “abababca” for the duration of this example), my partial match table will have eight cells. If I’m looking at the eighth and last cell in the table, I’m interested in the entire pattern (“abababca”). If I’m looking at the seventh cell in the table, I’m only interested in the first seven characters in the pattern (“abababc”); the eighth one (“a”) is irrelevant, and can go fall off a building or something. If I’m looking at the sixth cell of the in the table… you get the idea. Notice that I haven’t talked about what each cell means yet, but just what it’s referring to.

Now, in order to talk about the meaning, we need to know about proper prefixes and proper suffixes.

Proper prefix: All the characters in a string, with one or more cut off the end. “S”, “Sn”, “Sna”, and “Snap” are all the proper prefixes of “Snape”.

Proper suffix: All the characters in a string, with one or more cut off the beginning. “agrid”, “grid”, “rid”, “id”, and “d” are all proper suffixes of “Hagrid”.

With this in mind, I can now give the one-sentence meaning of the values in the partial match table:

The length of the longest proper prefix in the (sub)pattern that matches a proper suffix in the same (sub)pattern.

Let’s examine what I mean by that. Say we’re looking in the third cell. As you’ll remember from above, this means we’re only interested in the first three characters (“aba”). In “aba”, there are two proper prefixes (“a” and “ab”) and two proper suffixes (“a” and “ba”). The proper prefix “ab” does not match either of the two proper suffixes. However, the proper prefix “a” matches the proper suffix “a”. Thus, the length of the longest proper prefix that matches a proper suffix, in this case, is 1.

Let’s try it for cell four. Here, we’re interested in the first four characters (“abab”). We have three proper prefixes (“a”, “ab”, and “aba”) and three proper suffixes (“b”, “ab”, and “bab”). This time, “ab” is in both, and is two characters long, so cell four gets value 2.

Just because it’s an interesting example, let’s also try it for cell five, which concerns “ababa”. We have four proper prefixes (“a”, “ab”, “aba”, and “abab”) and four proper suffixes (“a”, “ba”, “aba”, and “baba”). Now, we have two matches: “a” and “aba” are both proper prefixes and proper suffixes. Since “aba” is longer than “a”, it wins, and cell five gets value 3.

Let’s skip ahead to cell seven (the second-to-last cell), which is concerned with the pattern “abababc”. Even without enumerating all the proper prefixes and suffixes, it should be obvious that there aren’t going to be any matches; all the suffixes will end with the letter “c”, and none of the prefixes will. Since there are no matches, cell seven gets 0.

Finally, let’s look at cell eight, which is concerned with the entire pattern (“abababca”). Since they both start and end with “a”, we know the value will be at least 1. However, that’s where it ends; at lengths two and up, all the suffixes contain a c, while only the last prefix (“abababc”) does. This seven-character prefix does not match the seven-character suffix (“bababca”), so cell eight gets 1.

How to use the Partial Match Table

We can use the values in the partial match table to skip ahead (rather than redoing unnecessary old comparisons) when we find partial matches. The formula works like this:

If a partial match of length partial_match_length is found and table[partial_match_length] > 1, we may skip ahead partial_match_length - table[partial_match_length - 1] characters.

Let’s say we’re matching the pattern “abababca” against the text “bacbababaabcbab”. Here’s our partial match table again for easy reference:

char:  | a | b | a | b | a | b | c | a |
index: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
value: | 0 | 0 | 1 | 2 | 3 | 4 | 0 | 1 |

The first time we get a partial match is here:

bacbababaabcbab
 |
 abababca

This is a partial_match_length of 1. The value at table[partial_match_length - 1] (or table[0]) is 0, so we don’t get to skip ahead any. The next partial match we get is here:

bacbababaabcbab
    |||||
    abababca

This is a partial_match_length of 5. The value at table[partial_match_length - 1] (or table[4]) is 3. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 5 - table[4] or 5 - 3 or 2) characters:

// x denotes a skip

bacbababaabcbab
    xx|||
      abababca

This is a partial_match_length of 3. The value at table[partial_match_length - 1] (or table[2]) is 1. That means we get to skip ahead partial_match_length - table[partial_match_length - 1] (or 3 - table[2] or 3 - 1 or 2) characters:

// x denotes a skip

bacbababaabcbab
      xx|
        abababca

At this point, our pattern is longer than the remaining characters in the text, so we know there’s no match.

Conclusion

So there you have it. Like I promised before, it’s no exhaustive explanation or formal proof of KMP; it’s a walk through my brain, with the parts I found confusing spelled out in extreme detail. If you have any questions or notice something I messed up, please leave a comment; maybe we’ll all learn something.

Tags:

11 December 2009 ~ 0 Comments

Unit tests need to be fast

If you have a unit test suite, you need to be able to run it fast. Like, in under a minute (if you’re on a huge system with tons of tests, then each individual component’s tests should run this fast).

Many programmers develop a reflex of hitting Command+S (or Ctrl+S) every few lines of written code. This is possible because saving a file takes under a second. Ideally, programmers should develop a similar reflex with unit tests. Maybe not every few lines, but after every “chunk” of work a programmer produces produce (a new or altered function, a new instance variable in a class, etc.), he or she should be running the test suite, even if they know it’s going to fail. This is a fantastic rhythm to get into; it provides instant concrete feedback, and a visible goal to strive for at all times.

Unfortunately, it’s pretty much impossible if your unit tests take minutes to run. No programmer is going to start a five-minute test run if they know in advance it’s going to fail, and without doing that, no programmer can really get into the full unit testing rhythm.

To put it in simple terms, unit tests are one of the most important tools in the “fail fast” toolbox, but if the unit tests themselves don’t “fail fast,” there’s no way they can do their job properly

24 November 2009 ~ 0 Comments

Losing all respect for Rex Ryan

Not that I ever had much respect for New York Jets coach Rex Ryan in the first place, but I now have zero.

In the Patriots-Jets game on Sunday, with the Patriots up 31-14 and 30 seconds left on the clock, Tom Brady tried one last deep throw. It fell incomplete, and the Patriots punted on the next play. Pulled from Yahoo! Sports:

New York Jets coach Rex Ryan felt “disrespected” by the New England Patriots for throwing a deep pass with the game well in hand Sunday.

[...]

“We need to stop them anyway, so it’s no biggie, but I was surprised, and I did feel a little bit disrespected,” Ryan said.

Ryan added that he didn’t know if Bill Belichick was behind the call, saying it might have been something Brady and Moss did on their own.

[...]

Ryan, who created controversy in the offseason when he said he didn’t come to New York to “kiss Bill Belichick’s rings,” said he called a timeout with 5 seconds left as a response to the Patriots’ play call.

If he wants to trash talk before every game against the Patriots, there’s nothing particularly horrible about that. But if he follows that up by crying like a little baby when the Patriots do something that he perceives as “disrespect,” he and the rest of his team should be embarrassed.

17 November 2009 ~ 0 Comments

A method with no unit tests is a broken method

If you write software, you need to write unit tests. If you’ve written a method/function, and you haven’t written a unit test for it, it’s safe to assume that it’s broken (even if it compiles and your other tests pass).

I’m not necessarily advocating full-fledged test-driven development. I’m just saying, if you release code into “the wild,” and there are methods you haven’t unit tested, your customers will almost certainly run into multiple bugs in each one of them.

That’s an atomic point. Separate from that, I’d like to mention that this isn’t always a bad thing. For a startup that wants to iterate as quickly as possible (and is writing non-life-critical software), writing the code with no unit tests, releasing it, and reproducing each customer-discovered bug with a unit test before fixing it is a totally reasonable model. These startups just shouldn’t operate under the illusion that their software “works”. In the hours after they make one of these releases, they should feel blessed if a single customer is able to register or log in.

16 November 2009 ~ 17 Comments

Using C, convert a dynamically-allocated int array to a comma-separated string as cleanly as possible

EDIT: There are no “dynamic arrays”, so to speak, in C. What I meant was “dynamically-allocated”. I’ve updated the wording to reflect this.

EDIT 2: Someone on Reddit pointed out that my Python example doesn’t actually work, since I have an array of ints rather than strings. I’ve updated the code example so it works.

I’m much less experienced in C than I am in higher-level languages. At Cisco, we use C, and I sometimes run into something that would be easy to do in Java or Python, but very difficult to do in C. Now is one of those times.

I have a dynamically-allocated array of unsigned integers which I need to convert to a comma-separated string for logging. While the integers are not likely to be very large, they could conceptually be anywhere from 0 to 4,294,967,295 In Python, that’s one short line.

my_str = ','.join([str(num) for num in my_list])

How elegantly can people do this in C? I came up with a way, but it’s gross. If anyone knows a nice way to do it, please enlighten me.

Tags: , , ,

10 November 2009 ~ 0 Comments

Any teachers or professors using classroom management software?

Every few months, I get an idea for a web app. My most recent one stems from hearing constant complaints from my UMass professors about the UMass classroom management software. It does a whole boatload of stuff, but apparently it’s extremely hard to use. Keep in mind, I’m hearing that criticism from Computer Science professors. I’d wager a bet that it’s much worse for professors whose subjects are unrelated to computers.

With that in mind, I’ve been working for the last month or so on my own classroom management web app. I doubt I’ll find many teachers or professors with this post, but if I do, could you please leave me a comment? Specifically, I’d like to know if you use classroom management software. If you do, what do you use, and do you have any complaints about it? If you don’t, what areas of your job would you like managed by software? Grading and student notifications (“You have a test in 2 days!”) are the obvious ones to me, but are there any good ones I’m not thinking of?

04 November 2009 ~ 0 Comments

Little Joys of Rails (Part 1) – Renaming Associations

I’ve spent the last month or so teaching myself the basics of Ruby on Rails. By far my favorite thing about it is that, at every turn, I find a little feature or helper that works really well and does exactly what I want without feeling messy or “magical”.

Unfortunately, I have no one to share these little joys with, as pretty much none of my programmer friends know Ruby or Ruby on Rails. So, I figured I’d start sharing them on here. Hopefully, Rails newbies (like myself) will learn something, and Rails veterans will be able to chuckle knowingly as they fondly look back on the glory days of learning Rails.

Renaming Associations

If you have one model that belongs_to another, the association (in both directions) is named after the models. For example, if I have Users and Courses, and I want a course to belong_to its teacher (who is a User), I would start with the following:

class User < ActiveRecord::Base
  has_many :courses
end

class Course < ActiveRecord::Base
  belongs_to :user
end

This is nice and simple, but if I want to access a course’s teacher, I have to do it like this:

my_course = Course.first
puts my_course.user # Print out the course's teacher

That’s a little confusing. What if my course also has_many students (also Users)?

my_course = Course.first
puts my_course.users.first # Print out the course's first student

So, my_course.user is the teacher, and my_course.users are the students. There’s nothing to indicate that, I just have to know it. Wouldn’t it be nicer if the associations could be named for what they represent?

class Course < ActiveRecord::Base
  belongs_to :teacher, :class_name => 'User'
end

To make this work properly, the foreign key needs to be teacher_id instead of user_id (that can be overridden too, but I won’t go into that right now). Then, you get to do this:

my_course = Course.first
puts my_course.teacher # Print out the course's teacher

Much better!

Now, I recognize that this is far from a revolutionary feature. I’m pretty sure it’s possible in Django and most other MVC frameworks. I was just very happy to bump into it.

13 October 2009 ~ 0 Comments

Twitter Weekly Updates for 2009-10-13

  • "isn't sour cream like, the mayonnaise of mexico?" – alicia #
  • dunno if obama deserves a nobel peace prize or no, but he does deserve recognition 4 peace achievements, and im not a judge, so im fine w/it #
  • anyone have any google wave invites? i'd really like to take a look at it #
  • don't need a google wave invite anymore! thanks a ton @MattMarzilli #
  • me: "it sux cuz sports stars peak at like 28. we wont peak till we retire."
    alex: "i dunno, i think i peaked in 7th grade" #
  • haha randy moss's first caught ball of the day was thrown by kyle orton #

Powered by Twitter Tools

06 October 2009 ~ 0 Comments

Twitter Weekly Updates for 2009-10-06

  • i wish the code examples in all the ruby on rails guides didnt omit the parentheses on methods with parameters. it always messes me up. #
  • Bueno y Sano 5/5 on #Yelp: Along with Anna's Taqueria (which I believe is only in/near Boston), this is my favorite … http://bit.ly/2q1SB #
  • asked my man at the grill for "rice" and he heard "fries". +20g fat for the day :( #
  • "First I was aroused, then I was furious" – Jane Lynch, funniest line of all the episodes of Glee #
  • this looks like it could be everything Puzzle Quest Galactrix wasn't and more: http://bit.ly/8fuQU #
  • outlook is definitely microsoft's best product. it's their only one that gives me more "that's cool" moments than "omg i wanna die" moments #
  • 11 yd WR screen = edelman's least bad play of the year #
  • this is the longest last 2 minutes of the first half of any NFL game i've ever seen #
  • weird time to call timeout by john harbaugh. either call it right away or wait till after the play, not with 3 sec left on the clock. #
  • haha i dunno if i've ever seen a penalty against a team's bench before. ravens need to chill. #
  • wtf you can't say "injury timeout on the field" and then cut to commercial before you say who's injured #
  • im glad we won, but i feel bad for the ravens. he absolutely should've made that catch. #
  • testing and debugging in a compiled language really makes me appreciate interpreted languages #
  • wooo pats bringin back junior seau!! #

Powered by Twitter Tools

29 September 2009 ~ 0 Comments

Twitter Weekly Updates for 2009-09-29

  • can anyone with ruby on rails experience recommend a good book? im about to start on it. #
  • Anna's Taqueria 5/5 on #Yelp: Anna's Taqueria is one of my two favorite fast Mexican restaurants (the other is Bueno… http://bit.ly/Xr2rt #
  • got ruby 1.9 + ruby on rails 2.3.4 installed on my mac (snow leopard) last night. rvm caused more trouble than it was worth, so i removed it #
  • "hey can we find a Family Guy to sleep in?" – Alicia #
  • 1 play left. go lions!!! #

Powered by Twitter Tools