A few days ago, Tom Whittaker (a friend who is currently working at Firaxis) emailed me with some surprising facts about the behavior of the #include directive in the C++ compiler of Visual Studio .NET. In my book, I hinted that compilers often optimize the include step for header files so that they don’t pay the cost of opening and loading a file again. It turns out that this is not always the case: when Tom turned on the compiler’s /showIncludes flag, he saw that the same file was repeatedly being included by the compiler, even if it had internal include guards. He ran several interesting tests and I’ll add a link from here whenever he gets around to putting them up on his web site (hint, hint, Tom).
All of that made me think again about the physical structure of a C++ program, how important it is, and how often it is overlooked. This two-part article will attempt to explain why the physical structure of a program is so important, present some useful guidelines, and show its effect on compile times.
We are all familiar with the logical structure of a program. It deals with classes and functions and namespaces and templates. That’s what you learn about in school, and what you read about in most C++ books. Design patterns, object-oriented programming, and so on all deal with the logical structure of a program. And, truth be told, it really is the most interesting part.
The physical structure of a program deals with the files that make up its source code. The .cpp and .h files, how they include each other, how they’re subdivided into directories, etc. While not as interesting and sexy as the logical structure, it is crucial to understand the consequences of the physical layout of a program for any real-world project of any significant size. That includes just about every modern PC and console game, but probably not handheld devices because of their small code size.
Because it’s not a particularly hot topic, not many books talk about the physical structure of a program. The best book on the subject is Large-Scale C++ Software Design by John Lakos. No technical lead should work on a game without at least having read parts of that book. Yes, some of the advice is a little outdated by today’s standards (at almost 10 years old, it’s nearly an antique in the computer world), and some parts are a bit long-winded with detailed measurements. Skip those in your first read and you’ll still get a lot of gems along the way. Every time I come back to that book I end up getting something new out of it. I haven’t read it in about 3-4 years, so it’s just about due for another read.
A good physical structure will result in files without many dependencies on other files, and with files clearly grouped in cohesive modules or libraries. On the other hand, a bad physical structure is one where files are related to other files from all over the project, without clear delimitations or boundaries. This is a clear example of the “blob” antipattern. Unfortunately, if left unchecked, this is the type of physical structure that develops over time.
Benefits of a Good Physical Structure
Why would anybody care about the physical structure? After all, the things we care about when we’re writing a program are that it does what it’s supposed to do, and that it does it fast. That might be true for a demo, or a throwaway project, but for a large project, maintainability is also a very important requirement. It’s no good to have a very efficient class if we can’t change it, and it’s also no good if iteration to test a change takes a long time.
The benefits of a good physical structure of a program are:
- Better logical structure. Usually, keeping an eye on the physical structure of a program results in a better logical structure. By reducing the dependencies between files, we will probably reduce dependencies between classes. It is often the case that unexpected connections will grow between classes and even different libraries if their header files are already included. Programmers won’t have anything to remind them that they’re adding a new dependency when they decide to call some global function. One of my pet peeves when programming for Windows is having windows.h included in every single .cpp file, either directly or indirectly. In the type of programs I write, most classes don’t need anything in windows.h, yet it is forced down their throats. At one point I attempted to remove a global windows.h include in a library since it was supposedly not needed anywhere, only to give up because of the many unnecessary DWORD, BOOL, and screwed up min and max calls scattered everywhere in the source code.
- Easier to refactor. If you think of each file as a little box literally connected with a string to all the other files it depends on or that depend on it, the more of those strings there are, the harder it is to untangle it from the overall mess and separate it. It is the same thing with refactoring: the worse the physical structure, the harder it is to make any refactoring changes that involve separating or isolating sections (which, in my experience, is one of the most crucial refactorings you have to do to prevent programs from growing into the “blob” antipattern).
- Easier to test. Not surprisingly, the more modular and independent the project is, the easier it becomes to write unit tests for it since each piece can be tested separately from the rest. Unit tests really benefit from having very few dependencies, so one of the many benefits reaped by doing test-first development is a very modular design with a logical and physical structure that has very few dependencies.
- Faster compile times. This might come as a surprise to some people, but it is usually the most tangible and objective result of having a good (or bad!) physical structure. Usually, the worse the physical structure, the longer the compile times will be. Compile times are an issue for large projects. Even with the fastest machines today, full builds on large projects can easily take hours. More importantly, changing just a file or two can trigger builds that last almost that long. Needless to say, having such a delay every time a change is made does not exactly encourage programmers to test their work and iterate on it to make it better. Part two of this article will look exclusively at compile times.
Here’s a distilled set of guidelines from Lakos’ book that minimize the number of physical dependencies between files. I’ve been using them for years and I’ve always been really happy with the results.
- Every cpp file includes its own header file first. This is the most important guideline; everything else follows from here. The only exception to this rule is the precompiled header include in Visual Studio; that always has to be the first include in the file. More about precompiled headers in part two of this article.
- A header file must include all the header files necessary to parse it. This goes hand in hand with the first guideline. I know some people try to never include header files within header files claiming efficiency or something along those lines. However, if a file must be included before a header file can be parsed, it has to be included somewhere. The advantage of including it directly in the header file is that we can always decide to pull in a header file we’re interested in and we’re guaranteed that it’ll work as is. We don’t have to play the “guess what other headers you need” game.
- A header file should have the bare minimum number of header files necessary to parse it. The previous rule said you should have all the includes you need in a header file. This rule says you shouldn’t have any more than you have to. Clearly, start by removing (or not adding in the first place) useless include statements. Then, use as many forward declarations as you can instead of includes. If all you have are references or pointers to a class, you don’t need to include that class’ header file; a forward declaration will do nicely and much more efficiently.
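To make the three guidelines concrete, here is a minimal sketch. The `Sprite` and `Texture` classes are hypothetical names invented for illustration, and the three files are collapsed into a single listing with comments marking where each one would start (which is why the includes in the .cpp part are shown commented out). `Sprite` only holds a pointer to a `Texture` and takes it by reference, so its header gets away with a forward declaration; `std::string` is held by value, so `<string>` has to be included.

```cpp
// ---- Sprite.h (hypothetical) ----
#include <string>   // required to parse this header: std::string is a by-value member

class Texture;      // forward declaration suffices: only pointers and references below

class Sprite {
public:
    Sprite(const std::string& name, const Texture& texture);
    const std::string& name() const { return m_name; }
    int width() const;   // defined in Sprite.cpp, where Texture is a complete type
private:
    std::string m_name;
    const Texture* m_texture;
};

// ---- Texture.h (hypothetical) ----
class Texture {
public:
    explicit Texture(int width) : m_width(width) {}
    int width() const { return m_width; }
private:
    int m_width;
};

// ---- Sprite.cpp (hypothetical) ----
// #include "Sprite.h"    // own header first (guideline 1)
// #include "Texture.h"   // the full definition is only needed here

Sprite::Sprite(const std::string& name, const Texture& texture)
    : m_name(name), m_texture(&texture) {}

int Sprite::width() const { return m_texture->width(); }
```

With this layout, a change to Texture.h forces Sprite.cpp to recompile, but not the files that merely include Sprite.h.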
One unfortunate aspect of those guidelines is that the compiler doesn’t really care one way or another. As long as you provide enough includes, the compiler will happily churn away at the source code and come up with the desired object file. It is up to us to minimize the number of includes and to follow those rules. While it seems fairly straightforward at first (after all, it is only three rules), things get more complicated as soon as heavy refactoring starts. As you split classes, move functions, and consolidate functionality, there might be several unnecessary headers. The only quick way to verify whether they’re needed or not is to, gulp, comment them out and try to compile the file.
This situation is more common in large files, which are the ones you’re most likely going to be refactoring. If left unchecked, you’ll soon be left with a myriad of little tendrils connecting your file to the rest of the code without getting any benefit from them. Wouldn’t it be great if there were an automated tool that would check that?
Automating the Guidelines
I did a quick search and I couldn’t find any tools or scripts that did exactly what I wanted. I suppose that a massive C/C++ style-checker tool might look for some of those things (like redundant includes), but nothing jumped out. Most programs are more concerned with checking logical errors and constructs than looking at the physical structure. I started from Scott Meyers’ summary of major C++ checkers. In particular I looked at CodeCheck, PC-Lint, and CodeWizard. I admit that I didn’t look too deep into them, so maybe they also check for some of these guidelines, but it’s certainly not their biggest selling point from reading their web sites.
As a quick challenge, I decided to try and write a quick script to check against those guidelines. Soon I realized that you can’t really have the word “quick” in anything related to parsing C++. I was quickly reminded how ugly the language is, how many quirks it has, how much baggage it carries around. Java looks mighty tempting sometimes.
I decided that if I was going to have a chance to do this, I would need to leverage other software to parse the source code for me and the script could work directly on the abstract representation of the program. Not exactly what I had in mind, but I stumbled on the XML and perlmod output generated by Doxygen. I have been using Doxygen for years, but I always thought of it as a pretty documentation generator. I never realized what a powerful and robust C++ parser it was until now.
In a few hours I was able to put together a quick Perl script that hooks up to the perlmod output of Doxygen and checks against those rules. Guideline #1 was the easiest one to check against. Really, you don’t even need a fancy parser for that. Guideline #2 is already checked by the compiler (if you don’t have enough includes, the program won’t compile). Guideline #3 is the trickiest one. Still, the script makes a valiant effort and checks for the most common cases. It’ll try to detect whether an include is not needed at all, or whether it can be replaced with a forward declaration.
However, the script is a conservative one and will not always detect that a header is necessary. It deals fine with simple constructs such as member variables, enums, references, and pointers. However, it seems that Doxygen has very little knowledge of preprocessor directives, so the script won’t catch any #defines brought in from header files. Doxygen also seems very limited in how it deals with inline functions (probably because Doxygen is looking mostly at the logical rather than the physical structure of the program). It would probably be possible to deal with templates, but I didn’t bother making it that far. The script also detects duplicate includes, both in header files and cpp files.
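Duplicate includes are another check that needs no real parser. The following is a sketch in C++ of one way to do it, not the Perl script itself; the function name `duplicateIncludes` is made up for this example. It compares include names as plain text, so "Sprite.h" and "gfx/Sprite.h" count as different files, and it ignores conditional compilation, where a repeated include can be legitimate.

```cpp
#include <istream>
#include <regex>
#include <set>
#include <string>
#include <vector>

// Returns every include name that appears again after its first occurrence,
// in the order the repeats are found. A file included three times is
// therefore reported twice.
std::vector<std::string> duplicateIncludes(std::istream& source) {
    // Matches lines such as:  #include <vector>   or   # include "Sprite.h"
    static const std::regex includeLine(
        R"re(^\s*#\s*include\s*[<"]([^>"]+)[>"])re");

    std::set<std::string> seen;
    std::vector<std::string> duplicates;
    std::string line;
    while (std::getline(source, line)) {
        std::smatch match;
        if (!std::regex_search(line, match, includeLine)) continue;
        const std::string file = match[1].str();
        // insert() reports whether the name was new; if not, it is a repeat.
        if (!seen.insert(file).second) duplicates.push_back(file);
    }
    return duplicates;
}
```

As with the previous check, the caller would feed it a `std::ifstream` for each header or .cpp file and report any names that come back.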
Even with all those limitations, the script can easily deal with about 80% of the cases in most of the code I work with on a regular basis. It was certainly insightful running it on the source code for some of the libraries I work with. I saw plenty of instances where headers could in fact be removed and the physical structure of the program improved. I would love to see a robust tool along these lines that could be run quickly enough after each compile, or at least once a night on our automated build machine.
Part two of this article will look at the consequences of physical structure on compile times, and what we can do to improve that.