Data-Oriented Design Now And In The Future

There has been a lot of recent discussion (and criticism) on Data Oriented Design recently. I want to address some of the issues that have been raised, but before that, I’ll start with this reprint from my most recent Game Developer Magazine. If you have any questions you’d like addressed, add write a comment and I’ll try to answer everything I can.

Last year I wrote about the basics of Data-Oriented Design (see the September 2009 issue of Game Developer). In the time since that article, Data-Oriented Design has gained a lot of traction in game development and many of teams are thinking in terms of data for some of the more performance-critical systems.

As a quick recap, the main goal of Data-Oriented Design is achieving high-performance on modern hardware platforms. Specifically, that means making good use of memory accesses, multiple cores, and removing any unnecessary code. A side effect of Data-Oriented Design is that code becomes more modular and easier to test.

Data-Oriented Design concentrates on the input data available and the output data that needs to be generated. Code is not something to focus on (like traditional Computer Science), but is something that is written to transform the input data into the output data in an efficient way. In modern hardware, that often means applying the same code to large, contiguous blocks of homogeneous memory.

Applying Data-Oriented Design

Itâ€™s pretty easy to apply these ideas to a self-contained system that already works over mostly-homogeneous data. Most particle systems in games are probably designed that way because one of their main goals is to be very efficient and handle a large number of particles at high framerates. Sound processing is another system that is naturally implemented thinking about data first and foremost.

So, whatâ€™s stopping us from applying it to all the performance-sensitive systems in a game code base? Mostly just the way we think about the code. We need to be ready to really look at the data and be willing to split up the code into different phases. Letâ€™s take a high-level example and see how the code would have to be restructured when optimizing for data access.

Listing 1 shows pseudocode for what could be a typical update function for a generic game AI. To make things worse, that function might even be virtual and different types of entities might implement it in different ways. Letâ€™s ignore that for now and concentrate on what it does. In particular, the pseudocode highlights that, as part of the entity update, it does many conditional ray casting queries, and also updates some state based on the results of those queries. In other words, weâ€™re confronted with the typical tree-traversal code structure so common in Object-Oriented Programming.

void AIEntity::Update(float dt)
{
    DoSomeProcessing();
    if (someCondition && Raycast(world))
       DoSomething();
    if (someOtherCondition && BunchOfRayCasts(world))
       DoSomethingElse();
    UpdateSomeOtherStuff();
}

Listing 1

Ray casts against the world are a very common operation for game entities. Thatâ€™s how they â€œseeâ€ whatâ€™s around them and thatâ€™s what allows them to react correctly to their surroundings. Unfortunately, ray casting is a very heavy weight operation, and it involves potentially accessing many different areas in memory: a spatial data structure, other entity representations, polygons in a collision mesh, etc.

Additionally, the entity update function would be very hard to paralellize on multiple cores. Itâ€™s unclear how much data is read or written in that function, and some of that data (like the world data structure) might be particularly hard and expensive to protect from updates from multiple threads.

If we re-organize things a bit, we can significantly improve performance and paralellization.

Break Up And Batch

Without seeing any of the details of whatâ€™s going on inside the entity update, we can see the raycasts sticking out like a sore thumb in the middle. A raycast operation is fairly independent of anything else related to the entity, itâ€™s heavyweight, and there could be many of them, so itâ€™s a perfect candidate to break up into a separate step.

Listing 2 shows how the broken up entity update code would look like. The update is now split in two different passes: The first pass does some of the updating that can be done independently of any ray casts, and decides which, if any, raycasts need to be performed sometime this frame.

void AIEntity::InitialUpdate(float dt, RayCastQueries& queries)
{
    DoSomeProcessing();
    if (someCondition)
        AddRayCastQuery(queries);
    if (someOtherCondition)
        AddBunchOfRayCasts(queries);
}

void AIEntity::FinalUpdate(const RayCastResults& results)
{
    UpdateSomeOtherStuff(results);
}

Listing 2

The game code in charge of updating the game processes all AI entities in batches (Listing 3). So instead of calling InitialUpdate(), solve ray casts, and FinalUpdate() for each entity, it iterates over all the AI entities calling InitialUpdate() and adds all the raycast query requests to the output data. Once it has collected all the raycast queries, it can process them all at once and store their results. Finally, it does one last pass and calls FinalUpdate() with the raycast results on each entity.

RayCastQueries queries;
for (int i=0; i<entityCount; ++i)
    entities[i].InitialUpdate(dt, queries);

// Other update that might need raycasts

RayCastResults results;
for (int i=0; i<queries.count; ++i)
    PerformRayCast(queries[i], results);

for (int i=0; i<entityCount; ++i)
    entities[i].FinalUpdate(results);

Listing 3

By removing the raycast calls from within the entity update function, weâ€™ve shortened the call tree significantly. The functions are more self-contained, easier to understand, and probably much more efficient because of better cache utilization. You can also see how it would be a lot easier to parallelize things now by sending all raycasts to one core while another core is busy updating something unrelated (or maybe by spreading all raycasts across multiple cores, depending on your level of granularity).

Note that after calling InitialUpdate() on all entities, we could do some processing on other game objects that might also need raycast queries and collect them all. That way, we can batch all the raycasts at compute them all at once. For years, weâ€™ve been drilled by graphics hardware manufacturers how we should batch our render calls and avoid drawing individual polygons. This is the same way: By batching all raycasts in a single call, we have the potential to achieve much higher performance.

Splitting Things Up

Have we really gained much by reorganizing the code this way? Weâ€™re doing two full passes over the AI entities, so wouldnâ€™t that be worse from a memory point of view? Ultimately you need to measure it and compare the two. In modern hardware platforms, I would expect performance to be better because, even though weâ€™re traversing through the entities twice, weâ€™re using the code cache much better and weâ€™re accessing them sequentially (which allows us to pre-fetch the next one too).

If this is the only change we make to the entity update, and the rest of the code is the usual deep tree traversal code, we might not have gained much because weâ€™re still blowing the cache limits with every update. We might need to apply the same design principles to the rest of the update function to start seeing performance improvements. But at the very least, even with this small change, we have made it easier to parallelize.

One thing weâ€™ve gained now is the ability to modify our data to fit the way weâ€™re using it, and thatâ€™s the key to big performance gains. For example, after seeing how the entity is updated in two separate passes, you might notice that only some of the data that was stored in the entity object is touched from the first update, and the second pass accesses more specific data.

At that point we can split up the entity class into two different sets of data. One of the most difficult things at this point is naming these sets data in some meaningful way. Theyâ€™re not representing real objects or real-world concepts anymore, but different aspects of a concept, broken down purely by how the data is processed. So what before was an AIEntity, can now become a EntityInfo (containing things like position, orientation, and some high-level data) and AIState (with the current goals, orders, paths to follow, enemies targeted, etc).

The overall update function now deals with EntityInfo structures in the first pass, and AIState structures in the second pass, making it much more cache friendly and efficient.

Realistically, both the first and second passes will have to access some common data (for example the entity current state: fleeing, engaged, exploring, idle, etc). If itâ€™s only a small amount of data, the best solution might be to simply duplicate that data on both structures (going against all â€œcommon wisdomâ€ in Computer Science). If the common data is larger or is read-write, it might make more sense to give it separate data structure of its own.

At this point, a different kind of complexity is introduced: Keeping track of all the relationships from the different structures. This can be particularly challenging while debugging because some of the data belonging to the same logical entity isnâ€™t stored in the same structure and itâ€™s harder to explore in a debugger. Even so, making good use of indices and handles makes this problem much more manageable (see Managing Data Relationships in the September 2008 issue of Game Developer).

Conditional Execution

So far things are pretty simple because weâ€™re assuming that every AI entity needs both updates and some ray casts. Thatâ€™s not very realistic because entities are probably very bursty: sometimes they need a lot of ray casts, and sometimes theyâ€™re idle or following orders and donâ€™t need any for a while. We can deal with this situation by adding a conditional execution to the second update function.

The easiest way to conditionally execute the update would be to add an extra output parameter to the FirstUpdate() function indicating whether the entity needs a second update or not. The same information could be derived form the calling code depending on whether there were any raycasts queries added. Then, in the second pass, we only update those entities that appear in the list of entities requiring a second update.

The biggest drawback of this approach is that the second update went from traversing memory linearly to skipping over entities, potentially affecting cache performance. So what we thought was going to be a performance optimization ended up making things slower. Unless weâ€™re gaining a significant performance improvement, itâ€™s often better to simply do the work for all entities whether they need it or not. However, if on average less than 10 or 20 percent of the entities need a ray cast, then it might be worth avoiding doing the second update on all the other entities and paying the conditional execution penalty.

If the number of entities to be updated in the second pass is fairly small, another approach would be to copy all necessary data from the first pass into a new temporary buffer. The second pass can then process that data sequentially without any performance penalties and it would completely offset the performance hit of copying the data.

Finally, another alternative, especially if the conditional execution remains fairly similar from frame to frame, is to relocate entities that need raycasting together. That way the copying is minimal (swapping an entity to a new location in the array whenever it needs a raycast), and we still get the benefit of the sequential second update. For this to work all your entities need to be fully relocatable, which means working with handles or some other indirection, or updating all the references to the entities that swapped places.

Different Modes

What if the entity can be in several, totally different modes of execution? Even if itâ€™s the same type of entity, traversing through them linearly calling the update function could end up using completely different code for each of them, so it will result in poor code cache performance.

There are several approached we can take in a situation like that:

If the different execution modes also are tied to different parts of the entity data, we could treat them as if they were completely different entities and break each of their data components apart. That way, we can iterate through each type separately and get all the performance benefits.
If the data is mostly the same, and itâ€™s just the code that changes, we could keep all the entities in the same memory block, but rearrange them so that entities in the same mode are next to each other. Again, if you can relocate your data, this is very easy and efficient (it only requires swapping a few entities whenever the state changes).
Leave it alone! Ultimately, Data-Oriented Design is about thinking about the data and how it affects your program. It doesnâ€™t mean you always have to optimize every aspect of it, especially if the gains arenâ€™t significant enough to warrant the added complexity.

The Future

Is thinking about a program in terms of data and doing these kind of optimizations a good use of our time? Is this all going to go away in the near future as hardware improves? As far as we can tell right now, the answer is a definite no. Efficient memory access with a single CPU is a very complicated problem, and matters get much worse as we add more cores. Also, the amount of transistors in CPUs (which is a rough measure of power) continues to increase much faster than memory access time. That tells us that, barring new technological breakthroughs, weâ€™re going to be dealing with this problem for a long time. This is a problem we need to deal with right now and build our technology around it.

There are some things Iâ€™d like to see in the future to make Data-Oriented Design easier. We can all dream up of a new language that will magically allow for great memory access and easy paralellization, but replacing C/C++ and all existing libraries is always going to be a really hard sell. Historically, the best advances in game technology have been incremental, not throwing away existing languages, tools, and libraries (thatâ€™s why weâ€™re still stuck with C++ today).

Here are two things that could be done right now and work with our existing codebases. I know a lot of developers are working on similar systems in their projects, but it would be great to have a common implementation released publicly so we can all build on top of them.

Language

Even though a functional language might be ideal, either created from scratch or reusing an existing one, we could temporarily extend C to fit our needs. I would like to see a set of C extensions where functions have clearly defined inputs and outputs, and code inside a function is not allowed to access any global state or call any code outside that function (other than local helper functions defined in the same scope). This could be done as a preprocessor or a modified C compiler, so it remains very compatible with existing libraries and code.

void FirstEntityUpdate(input Entities* entities, input int entityCount, output RayCastQueries* queries, output int queryCount);

Dependencies between functions would be expressed by tying the outputs of some functions to the input of other functions. This could be done in code or through the use of GUI tools that help developers manage data relationships visually. That way we can construct a dependency diagram of all the functions involved in every frame.

Scheduler

Once we have the dependencies for every function, we can create a directed acyclic graph (DAG) from it, which would give us a global view of how data is processed every frame. At that point, instead of running functions manually, we can leave that job in the hands of a scheduler.

The scheduler has full information about all the functions as well as the number of available cores (and information from the previous frame execution if we want to use that as well). It can determine the critical path through the DAG and optimize the scheduling of the tasks so the critical path is always being worked on. If temporary memory buffers are a limitation for our platform, the scheduler can take that into account and trade some performance time for a reduced memory footprint.

Just like the language, the scheduler would be a very generic component, and could be made public. Developers could use it as a starting point, build on top of it, and add their own rules for their specific games and platforms.

Even if weâ€™re not ready to create those reusable components, every developer involved in creating high-performance games should be thinking about data in their games right now. Data is only going to get more important in the future as the next generation of consoles and computers rolls in.

This article was originally printed in the September 2010 issue of Game Developer.

Applying Data-Oriented Design#

Break Up And Batch#

Splitting Things Up#

Conditional Execution#

Different Modes#

The Future#

Language#

Scheduler#