Sunday, July 20, 2008

CPU Cache Optimization

In modern processors, memory access is a big problem since the data has a long way to travel from system RAM all the way to a register inside the CPU before an operation can be performed on it. And what is a CPU, after all, but a machine that reads data, performs an operation on it, and writes the result back out?

When you ask for the contents of a memory location, most processors will actually read a chunk: the address you requested, plus the bytes surrounding it. This is because transfers between the cache and system memory happen in fixed-size blocks, called cache lines, rather than one byte or word at a time.

When the CPU executes a memory access instruction, it first checks whether that region of memory is already in the cache. If it isn't, a cache miss occurs and the line has to be fetched from the memory controller.

So, for example, let's say a particular processor uses 128-byte cache lines. If you have a big object or structure and only need to access a particular member, the CPU still reads the entire 128-byte line containing it (according to our example). So if you only read an int from that structure, you're wasting 124 of the 128 bytes transferred.

Writes behave similarly. Let's say you write an int into that structure: the CPU will read the 128-byte line into the cache, execute your write operation there, and eventually write the whole line back to system memory when it gets evicted.
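
To make the waste concrete, here's a minimal sketch (the struct and names below are invented for illustration, assuming the 128-byte line from our example):

struct SBigNode
{
  int m_nValue;         // the only member the loop touches
  char m_Padding[124];  // pads the struct to exactly one 128-byte line
};

int SumValues( SBigNode const* pNodes, int nNodes )
{
  int nSum = 0;
  for( int i = 0; i < nNodes; i++ )
  {
    // Each iteration pulls a fresh 128-byte line into the cache
    // just to use 4 bytes of it
    nSum += pNodes[i].m_nValue;
  }
  return nSum;
}

If the values were instead packed contiguously in a plain int array, each cache line would serve 32 iterations instead of one.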


Example Problem

struct SMyData
{
  char const* m_pDebugInfo;
  bool m_bReadData;
  int m_nWriteData;
  float m_fSomeDataNotNeededCurrently;
};

void UpdateDataList( SMyData* pList, int nNodes )
{
  for( int i = 0; i < nNodes; i++, pList++ )
  {
    // Some condition that reads a member of the list
    if( pList->m_bReadData )
    {
      // Mark as dirty
      pList->m_nWriteData = 1;
    }
  }
}


Seems like a pretty innocent function, right? The problem is when it runs too many times; this basically hammers the processor caches! Note that the loop only touches five bytes of each node (the bool it reads and the int it writes), yet every byte of the struct travels through the cache. If other threads are performing memory accesses at the same time, the CPU(s) will have to stall until the memory controller can fill the L2 cache. And this happens quite a lot in games, especially with the current generation of CPUs. We're all moving to multi-core architectures, so it's almost guaranteed the CPU(s) will stall waiting on L2 cache misses.

So how do we fix it?

We know the data format, right? We can see that out of the four member variables, only two are used, and one is read-only while the other is write-only. We can help the CPU process this data in a cache-friendly fashion: basically, we want to feed it the read-only data and the write-only data as separate, tightly packed streams. So if we could somehow split the data...

So we could change it to something like:


void UpdateDataList( bool const* pReadData, int* pWriteData, int nNodes )
{
  for( int i = 0; i < nNodes; i++, pReadData++, pWriteData++ )
  {
    // Some condition that reads a member of the list
    if( *pReadData )
    {
      // Mark as dirty
      *pWriteData = 1;
    }
  }
}


By making this change, you're very likely to get a performance boost in your app. Of course, the readability of the data will be compromised, since you no longer have Arrays of Structures, but rather Structures of Arrays; then again, you can't have everything in the world, can you? And typically the need for this arises in tight loops inside the game.
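
For completeness, here's one way the caller might hold the split data (a sketch; the SMyDataSoA name and layout are mine, not part of the original code):

struct SMyDataSoA
{
  bool* m_pReadData;  // one flag per node
  int* m_pWriteData;  // one dirty marker per node
  int m_nNodes;
};

void UpdateAll( SMyDataSoA& data )
{
  UpdateDataList( data.m_pReadData, data.m_pWriteData, data.m_nNodes );
}

Now the function streams through two tightly packed arrays: a 128-byte line holds 128 bool flags or 32 int markers, instead of just a handful of whole structs.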

There's one extra optimization for the cache that we'll delve into in the next installment...
