Now, we can go further and take advantage of the cache instructions. We potentially know in advance how many nodes we'll need to read and how many we'll need to write, right? So we can avoid the cache miss (or hide it) by prefetching the data before it is needed:
void UpdateDataList( bool const* pReadData, int* pWriteData, int nNodes )
{
    for( int i = 0; i < nNodes; i++, pReadData++, pWriteData++ )
    {
        CPU_Prefetch( pReadData + 1 );

        // Some condition that reads a member of the list
        if( *pReadData )
        {
            // Mark as dirty
            *pWriteData = 1;
        }
        else
        {
            // Mark as not-changed
            *pWriteData = 0;
        }
    }
}
CPU_Prefetch is a wrapper for an intrinsic function (or plain assembly) that tells the CPU to start fetching the data into the cache, since it will be used in a later step. On the first iteration the prefetch will not have completed by the time execution reaches the if() statement; for large nNodes, however, each node's data is already being read into the cache one step ahead of its use, which improves performance.
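On GCC and Clang, for example, such a wrapper can be built on the __builtin_prefetch intrinsic, which lowers to the target's prefetch instruction (or to nothing on targets without one). This is a hedged sketch of the wrapper plus the loop above in compilable form, not the only possible implementation:

```cpp
#include <cassert>

// Sketch of the wrapper, assuming GCC/Clang: hint that the cache line
// containing p will be read soon. Prefetch hints never fault, so touching
// one element past the end of the array on the last iteration is safe.
static inline void CPU_Prefetch( const void* p )
{
    __builtin_prefetch( p );
}

// The loop from the text: request the next node while testing the current one.
void UpdateDataList( bool const* pReadData, int* pWriteData, int nNodes )
{
    for( int i = 0; i < nNodes; i++, pReadData++, pWriteData++ )
    {
        CPU_Prefetch( pReadData + 1 );

        if( *pReadData )
            *pWriteData = 1;   // Mark as dirty
        else
            *pWriteData = 0;   // Mark as not-changed
    }
}
```

The third argument of __builtin_prefetch (temporal locality) is left at its default here; tuning it, or prefetching more than one node ahead, depends on the node size and the target's cache-line length.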
Note that on PowerPC architectures, CPU_Prefetch can be implemented using the dcbt instruction.
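As a hedged sketch, the PowerPC variant could wrap dcbt (Data Cache Block Touch) in inline assembly, with a guard so the same header still compiles on other targets (the fallback assumes a GCC/Clang-style compiler):

```cpp
#if defined( __powerpc__ ) || defined( __PPC__ )
// dcbt RA,RB hints that the cache block at address (0 + p) will be read.
static inline void CPU_Prefetch( const void* p )
{
    __asm__ volatile( "dcbt 0,%0" : : "r"( p ) );
}
#else
// Fallback assumption for non-PowerPC targets so the sketch stays portable.
static inline void CPU_Prefetch( const void* p )
{
    __builtin_prefetch( p );
}
#endif
```

Because dcbt is only a hint, it has no architecturally visible effect; the only way to observe it is through timing, so it cannot break correctness even if the address is bogus.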