LeChimp vs. Dr. Chaos

It’s no secret that I’m a big fan of unit tests. They provide a huge safety net for refactorings, double-check the code logic, and prevent code rot. In addition, unit tests written through Test-Driven Development help define the architecture and keep programmers happy [1]. They’ll even catch a bug or two along the way, but if you rely on them as your only way to catch bugs, you’re in for a surprise.

Bug Hunting

Unit tests, by their very nature, are limited to a single class or function at a time. There are all sorts of complex interactions between objects and systems that they simply can’t test. Even if you use mock objects and are extremely careful to test all your object interactions, there will be lots of unexpected cases and bugs that crawl out while running the game under real-world conditions. Dr. Chaos is alive and well.

Besides, unit tests just test that the code does what you think it should do. So if the algorithm you have in your head is completely wrong, they’ll just make sure that the implementation is as broken as you imagined. Bugs like that might only surface once the code is put in the context of the whole game and interacts with other systems.

Traditionally, game companies rely on QA to uncover and squash most of the bugs created by complex interactions. You know, the type of bug that only gets triggered when the player enters the cave while there’s a blue moon and he has spun around in place seven times. Not only is this an expensive process, but it’s not even very good at uncovering all the bugs. Most games have millions and millions of combinations of possibilities and interactions, and it’s completely impractical to try to run through all of them by hand.

Our Hero: LeChimp

At Power of Two Games it’s just the two of us, so there’s no QA, or even interns, to play the game endlessly. But, even in preproduction, we can’t afford to ignore those types of bugs. Instead, we enlisted the help of our hero: LeChimp.

LeChimp is our functional test server. It tirelessly runs the game every couple of hours and makes sure it loads and runs without any problem. Sure, ideally it should run more frequently, but LeChimp doubles up as our build server, and we don’t have another good computer to spare (maybe if y’all bought more t-shirts we could afford to buy another cheapo Dell).

Running the game is a good start. It checks that it’s possible to load every level and that nothing crashes. Frankly, that’s a good percentage of what QA does a lot of the time, and a lot of teams would really benefit from having such a simple test and knowing as soon as a level stops loading. Still, we can do much better than that.

Monkey business

LeChimp runs the game for a fixed number of frames and makes sure the game doesn’t crash or hit any asserts. But running the game without any action going on is not very useful, so it runs it with the -monkey switch, which feeds pseudo-random input to the game, as if a monkey were playing it.

Actually, the input from -monkey is not random at all. I first made it truly random by generating new inputs through std::rand() every frame, but it looked more like a monkey on crack was playing the game: the characters shook violently back and forth and never managed to do anything interesting. Instead, the monkey input now holds the controller sticks and buttons for some varying time interval, and it looks more like a monkey without its ADD medicine, which is clearly a step up.
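As a rough sketch of that idea (all the names here are invented, not Power of Two’s actual code), the monkey input source boils down to rolling a new controller state and then holding it for a random number of frames:

```cpp
#include <cstdlib>
#include <cassert>

// Hypothetical controller snapshot; the real game's input structure
// would obviously carry more than this.
struct GamepadState
{
    float stickX;      // -1..1
    float stickY;      // -1..1
    unsigned buttons;  // bitmask of pressed buttons
};

class MonkeyInput
{
public:
    explicit MonkeyInput(unsigned seed) : m_framesLeft(0) { std::srand(seed); }

    // Called once per frame in place of reading the real controller.
    const GamepadState& Update()
    {
        if (m_framesLeft == 0)
        {
            // Roll a new input and hold it for 10-70 frames so the
            // character actually walks somewhere instead of twitching.
            m_framesLeft = 10 + std::rand() % 60;
            m_state.stickX = (std::rand() % 2001 - 1000) / 1000.0f;
            m_state.stickY = (std::rand() % 2001 - 1000) / 1000.0f;
            m_state.buttons = std::rand() % 16;
        }
        --m_framesLeft;
        return m_state;
    }

private:
    GamepadState m_state;
    int m_framesLeft;
};
```

Seeding explicitly matters more than it looks: it’s what lets the same monkey run be reproduced later.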

When I first heard of this technique at a GDC tutorial a few years ago, I thought it would be pretty useless. How many bugs could feeding random input to the game really uncover? Shouldn’t we try to be doing something more intelligent to mimic how players interact with the game? But since it was really easy to implement, I decided to give it a try. Boy, was I in for a treat! Within a few runs, it uncovered several major bugs that nobody had seen during their runs of the game. Since then, there isn’t a game I work on that doesn’t get treated with some monkey input love.

It’s so simple to implement that if you haven’t done it already, I really encourage you to go and do it right away. Do it over your next lunch break, even. I guarantee you’ll be amazed at what it uncovers (or I’ll refund your money for this article :-)).

Recording for posterity

In addition to just trying to crash the game, LeChimp records all the inputs and the state of the game at every frame. Then, after running the game for a while, it runs it again, feeding it the recorded input, and verifies that the game is in the same state as it was during the recording session (every enemy is in the same place, every prop has the same orientation, every player has the same health, etc). This verifies that the game is fully deterministic, that is, given the same inputs, it always produces the same output.
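A toy version of that record-and-verify loop, with a single float standing in for the whole world state (everything here is hypothetical except the record-then-replay structure):

```cpp
#include <vector>
#include <cassert>

// One frame's worth of recorded input: the frame's delta time plus a
// stand-in for the controller data.
struct FrameInput { float dt; int pad; };

struct GameState
{
    float x;
    // A deterministic update: the same input always yields the same state.
    void Step(const FrameInput& in) { x += in.dt * static_cast<float>(in.pad); }
};

// Run once while recording input and state, then replay the recorded
// input against a fresh world and check that every frame matches.
bool RecordAndVerify(int frameCount)
{
    std::vector<FrameInput> inputs;
    std::vector<float> recordedX;

    GameState game = {0.0f};
    for (int frame = 0; frame < frameCount; ++frame)
    {
        FrameInput in = {1.0f / 60.0f, frame % 3};  // canned "controller" data
        game.Step(in);
        inputs.push_back(in);
        recordedX.push_back(game.x);
    }

    GameState replay = {0.0f};
    for (int frame = 0; frame < frameCount; ++frame)
    {
        replay.Step(inputs[frame]);
        if (replay.x != recordedX[frame])  // exact match, not an epsilon
            return false;                  // world out of sync
    }
    return true;
}
```

Note the exact floating-point comparison: since the replay performs the identical sequence of operations, a deterministic game should match bit for bit, and any epsilon would just hide the first point of divergence.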

We’re planning on some cool features that rely on the game being fully deterministic, so this is something very important to us. But even if we weren’t planning on doing anything with it, determinism is an extremely useful feature to have when it comes time to track down bugs. Instead of getting lengthy descriptions of a crash from testers (which somehow always seem impossible to reproduce), they can include the playback file that led to the crash along with the bug report, allowing us to catch it in the debugger in no time. Of course, it’s not like we have actual testers, but we affectionately think of LeChimp as one, and he’s always very careful to save all his playback files with every functional test run.

Right now we’re just recording the inputs to the game: delta time every frame and controller inputs. Just with that, the game runs exactly the same time in and time out (thanks to Havok for being deterministic!). Things get more complicated as soon as multiple threads are involved, since the exact timing of context switches between threads can affect the output. Some tools out there, like ReplayDirector, claim to address this, but I haven’t looked into it very much.

Both the input file and the game state file are opened, written to, and closed every frame. That way there is no data loss if the game crashes unexpectedly, and you get all the input leading up to that frame.
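A minimal sketch of that open-write-close pattern (the function name and file layout are made up):

```cpp
#include <cstdio>

// Append one frame of recorded input to the file, closing it again right
// away: slow, but if the game crashes on a later frame, every frame
// written so far is already safely on disk.
bool AppendFrame(const char* path, float dt, int pad)
{
    std::FILE* file = std::fopen(path, "ab");  // reopen in append mode
    if (!file)
        return false;
    bool ok = std::fwrite(&dt,  sizeof dt,  1, file) == 1 &&
              std::fwrite(&pad, sizeof pad, 1, file) == 1;
    std::fclose(file);                         // close again every frame
    return ok;
}
```

Opening and closing a file 60 times a second sounds expensive, but for a test server it’s a perfectly acceptable price for never losing the frames that led up to a crash.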

As far as checking that the world state is the same, it’s a totally ad-hoc process. We simply pick some of the obvious state and save it to a file: player positions, enemy positions, prop transforms, etc. If we ever see something get out of sync, we add it to the game state that gets saved and compared, so it doesn’t happen in the future.
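One way to sketch that ad-hoc comparison (hypothetical names, not the real implementation) is a list of named values that reports the first field to diverge, which tells you exactly what to add to the watched set next time:

```cpp
#include <string>
#include <vector>
#include <cassert>

// One watched piece of world state, tagged with a name so a mismatch
// can be reported meaningfully.
struct StateField
{
    std::string name;
    float value;
};

typedef std::vector<StateField> Snapshot;

// Returns the name of the first mismatched field, or "" if in sync.
std::string FirstDivergence(const Snapshot& recorded, const Snapshot& live)
{
    assert(recorded.size() == live.size());
    for (size_t i = 0; i < recorded.size(); ++i)
        if (recorded[i].value != live[i].value)
            return recorded[i].name;
    return std::string();
}
```

Growing the snapshot is then just a matter of pushing another named field into the list whenever something new gets out of sync.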

No waiting around

So far we have LeChimp running the following with every functional test:

sweetpea -frames=10000 -record=functional_test -monkey -level=level_name

followed by

sweetpea -playback=functional_test -level=level_name

At 60 Hz, running 10,000 frames is almost three minutes. 10,000 is just a number we pulled out of a hat. The longer you let the monkey loose with the game, the better the chance of uncovering something. Multiply that by the number of levels and sandboxes, and the functional test now takes quite a while to complete.

A good chunk of the time of the functional test is spent rendering each frame and waiting for the vertical sync signal. But nobody is actually looking at the output [2], so why bother?

We added a couple more command line switches to make the game run without rendering or waiting on vsync. To make that really useful, we also added the ability to force the frame time to be a fixed timestep. So we can run the game like this:

sweetpea -frames=10000 -record=functional_test -monkey -level=level_name -render=no -vsync=no -timestep=0.01566

The game will cruise through the simulation as fast as possible, often cramming all three minutes of gameplay into 10 seconds or so. Perfect for poor monitor-less LeChimp.
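Under those flags (the switch names come from the command lines above; RunConfig and the callbacks are invented stand-ins), the main loop’s fast path might be sketched like this:

```cpp
#include <cassert>

// Hypothetical distillation of the command-line switches: -frames,
// -render=no, and -timestep map onto these fields.
struct RunConfig
{
    int frames;
    bool render;
    float fixedTimestep;  // 0 means "use the real frame clock"
};

template <typename SimFn, typename RenderFn, typename ClockFn>
void RunGame(const RunConfig& cfg, SimFn simulate, RenderFn render, ClockFn frameTime)
{
    for (int frame = 0; frame < cfg.frames; ++frame)
    {
        // A forced fixed dt makes the run reproducible and independent
        // of how fast the machine happens to crunch through frames.
        float dt = (cfg.fixedTimestep > 0.0f) ? cfg.fixedTimestep : frameTime();
        simulate(dt);
        if (cfg.render)
            render();  // in the real game this also waits on vsync
    }
}
```

With rendering and the vsync wait gone, the loop is bounded only by simulation cost, which is why three minutes of gameplay collapses into ten seconds.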

Testing, testing

There’s even more to LeChimp than just monkeying around. It also runs several other functional tests, checking some high-level functionality:

  • Player attacks. The player character attacks enemies using each different type of attack, and the test verifies that each attack is successful. This is particularly useful when there are only a few types of attacks (which are themselves unit tested) but many combinations that can be created between attacks and targets.

  • Level restart. LeChimp loads a level and then restarts it hundreds of times, checking for crashes and memory leaks.

  • Torture chamber. A tiny level in which the hero (in god mode) frantically mows down hundreds of enemies that get immediately respawned. This is a perfect stress test for performance and hardcoded limits.
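The level-restart check in that list could be sketched like this, with a stand-in allocation counter in place of the real engine’s memory tracking (all names invented):

```cpp
#include <cassert>

// Stand-in for the engine's allocation tracking.
static int g_allocations = 0;

struct Level
{
    Level()  { ++g_allocations; }  // stands in for all level-load allocations
    ~Level() { --g_allocations; }  // freed on unload
};

// Load and unload the level over and over; any drift in the allocation
// count after hundreds of restarts means something is leaking.
bool LevelRestartTest(int restarts)
{
    const int before = g_allocations;
    for (int i = 0; i < restarts; ++i)
    {
        Level* level = new Level;  // load
        delete level;              // restart: unload before reloading
    }
    return g_allocations == before;
}
```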

We’re writing these tests as we go along. Whenever there’s a feature that seems complex enough, relies on other systems, or seems to break repeatedly, we take a few minutes, write a new functional test, and throw it at LeChimp to run with all the others.

Functional tests like these are about as high level as it gets. They deal with actions such as “move the player to the right”, “spawn an enemy here”, or “perform special attack XXX”, so it would make sense to implement them in the same way you implement game logic (which in our case is still C++, although we’re considering a switch to Lua in some not very distant future).
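A sketch of what such a frame-persistent test might look like, with an invented World interface exposing those high-level verbs (none of this is the actual Power of Two code):

```cpp
#include <cassert>

// Invented high-level game interface for the sake of the example.
struct World
{
    float playerX;
    int enemyHealth;
    void MovePlayerRight(float dt) { playerX += 5.0f * dt; }
    void PlayerAttack()            { if (enemyHealth > 0) --enemyHealth; }
};

// A functional test that persists across frames, like the game logic it
// drives. Update() returns true while it wants more frames and asserts
// if the behavior under test ever fails.
class PlayerAttackTest
{
public:
    bool Update(World& world, float dt)
    {
        if (world.playerX < 2.0f)   // phase 1: walk up to the enemy
        {
            world.MovePlayerRight(dt);
            return true;
        }
        if (world.enemyHealth > 0)  // phase 2: attack until it's dead
        {
            world.PlayerAttack();
            return true;
        }
        return false;               // done: the attacks all landed
    }
};
```

Because the test is just another per-frame update, the same harness that steps the game can step the test, and a frame budget turns a hung test into a failure instead of a stuck build server.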

Long Live LeChimp

LeChimp has been invaluable battling against Dr. Chaos. Several times I’ve made a refactoring or introduced a new feature, all the unit tests passed, I checked it in, and a little while later LeChimp screamed at us that something was wrong. Once we see the functional test fail, the fix is usually pretty obvious: a memory pool is too small, or a combination of events causes the player to enter some unexpected state. Fixing it is a matter of writing a unit test, fixing the logic, and checking it in, all in a few minutes.

A few times, however, the problem hasn’t been that obvious. The world gets out of sync, but only in release mode. Running it from the debugger often results in yet another different state. That sounds like some annoying memory overwrite, or perhaps some uninitialized memory. Any programmer who’s had to deal with this before probably has shivers running down his back. Fortunately, LeChimp has a secret weapon in its arsenal to deal with that. But that’s another story, and it shall be told another time.

Until then, happy holidays, everybody!

[1] Don’t underestimate the power of keeping programmers happy! At a previous company I used to work for, a manager rewarded one of the programmers by buying him a DVD set of a TV show he was really into. That was only about $40, but it had a huge effect on the programmer’s morale and productivity. Talk about well-spent money.

[2] LeChimp doesn’t even have a monitor, although that sucks sometimes. RemoteDesktop is pretty cool, but it locks up the DirectX surfaces in some weird way, and then the graphics renderer refuses to initialize correctly. So we’re forced to use… get this… NetMeeting! With fake phone rings and all! Ring, ring, calling LeChimp…

  • http://www.mach8.nl Lucas Meijer

    Hey Noel, you might want to look at RealVNC (you can use the free version).
    I use it to log into my functional test server, and it’s able to work with DirectX
    surfaces just fine (although it’s always slower than on your local machine, of course).

    Bye, Lucas

  • http://sirfire.hopto.org/ SirFire

    Try Radmin for remote desktop; I’ve found it works very well with DirectX surfaces.

  • http://phil.freehackers.org Bluebird

    One of the consequences of using a lot of unit tests is that it usually becomes very easy to build functional tests on top of your architecture. I very often end up running a lot of functional tests with exactly the same unit test library.

  • noel

    That’s interesting, because we’re using unit tests very, very aggressively, and we have thousands of unit tests, but we don’t use the same framework for functional tests at all. For unit tests we’re using UnitTest++ (surprise!), but for functional tests we found the need to run the whole game and control it in a different way. So I ended up using our "action" system, which is a way of running bits of code that can persist across frames. They were originally set up so that any part of the game could kick off these actions through an enum and/or an action info structure (so, all plain data without any dependencies). But now it’s really handy to run them directly from the command line. So we run functional tests this way: sweetpea -action=TortureChamberFunctionalTest, which starts up the game and immediately puts an action of that type in the action list.

    Do you find that you run functional tests on the whole game or more like full subsystems? What kind of unit test library are you using?


  • eugene

    That reminds me, noel, when are you guys going to release UnitTest++ 1.4? :)

  • http://phil.freehackers.org Bluebird

    Well, my programs are not games and are much, much smaller than what you guys are doing. That probably explains why I can cope with using unit tests for functional tests.

    To answer your question, it’s more like a subsystem functional test: less than a full functional test and more than a simple unit test. It may combine 2 to 4 classes together to represent realistic usage. It is a reasonably easy way to trigger fancy states in a few classes that are difficult to reproduce with pure unit testing or pure functional testing. And thanks to the influence of unit tests, like I said, it is very easy to assemble a few classes into a functional subsystem test.

    For really high level functional tests, a new way of running the program is necessary.

    As for a testing library, I used to use CppUnit and QtUnit but I was disappointed by both. I can’t wait to use UnitTest++ for my next C++ program, but I’m doing either C or Python these days.