Constant memory cache performance test: constant memory bandwidth anomalies

Constant memory access does not seem to behave as expected, in that constant cache hits do not seem to improve performance. Suppose I have the following kernel:

[codebox]

#define ARRAY_SZ 1024

__constant__ float s_rFloat_C[ ARRAY_SZ ];

__global__ void read( float* poDst_D )
{
    float rSum = 0;

    // Keep the total number of reads fixed at 2^18, regardless of ARRAY_SZ.
    for( unsigned int j = 0; j < (1 << 18) / ARRAY_SZ; ++j )
    {
        for( unsigned int i = 0; i < ARRAY_SZ; ++i )
        {
            rSum += s_rFloat_C[ i ];  // All threads access the same location.
        }
    }

    // Need to use rSum to affect the output, otherwise the compiler optimises
    // away the loops above and the time measured is just the output write.
    if( threadIdx.x + threadIdx.y + blockIdx.x + blockIdx.y == 0 )
    {
        poDst_D[ 0 ] = rSum;
    }
}

[/codebox]
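For completeness, the host side of the test is driven along these lines (a sketch rather than my exact code; the event-based timing and buffer names are illustrative, and this assumes it is compiled in the same translation unit as the kernel above):

[codebox]

// Illustrative host harness: upload the constant array, launch, time with events.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    float* poDst_D = 0;
    cudaMalloc( &poDst_D, sizeof( float ) );

    // Upload the constant array once before launching.
    float rSrc_H[ ARRAY_SZ ] = { 0 };
    cudaMemcpyToSymbol( s_rFloat_C, rSrc_H, sizeof( rSrc_H ) );

    dim3 blockSize( 32, 16 );
    dim3 gridSize( 10, 10 );

    cudaEvent_t start, stop;
    cudaEventCreate( &start );
    cudaEventCreate( &stop );

    cudaEventRecord( start );
    read<<< gridSize, blockSize >>>( poDst_D );
    cudaEventRecord( stop );
    cudaEventSynchronize( stop );

    float ms = 0;
    cudaEventElapsedTime( &ms, start, stop );
    printf( "Time ( ms ): %f\n", ms );

    cudaFree( poDst_D );
    return 0;
}

[/codebox]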

I would expect performance to drop dramatically once ARRAY_SZ grows larger than the constant cache. But the results are not what I expected:

[codebox]

ARRAY_SZ   Size in bytes   Time ( ms )
2^8        1 KB            370.243011
2^9        2 KB            369.931000
2^10       4 KB            364.967987
2^11       8 KB            357.802002   // the constant data no longer fits entirely into cache, yet no performance drop
2^12       16 KB           353.917999
2^13       32 KB           352.893005
2^14       64 KB           367.192993

[/codebox]

I also did a sanity check reading from global memory instead, and it was about 10 times slower than using constant memory. So I'm thinking that the constant cache is not in use at all, and that the speed-up comes purely from predictive loading of constant memory.

Hardware: 9800GT

Launch config:

dim3  blockSize( 32, 16 );

dim3  gridSize( 10, 10 );

Thanks for reading this far, and double thanks in advance for opinions. :)

For a block size of 512:

Say you added the threadIdx to the index you are reading from and advanced the index by 512 every iteration. This, I believe, would make the constant cache work 16 times harder and fail miserably in the end, rather than amortising each cache miss over 512 threads.
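To sketch what I mean (my illustration, not code from the original post; the kernel name and the flattened thread index are my own, and BLOCK_SIZE is assumed to match the 32x16 launch configuration):

[codebox]

#define ARRAY_SZ   ( 1 << 14 )
#define BLOCK_SIZE 512  // assumed: blockDim.x * blockDim.y from the original launch config

__constant__ float s_rFloat_C[ ARRAY_SZ ];

__global__ void readStrided( float* poDst_D )
{
    float rSum = 0;

    // Each thread starts at its own offset and strides by the block size, so
    // the 512 threads touch 512 different constant-memory locations per
    // iteration instead of all broadcasting from a single one.
    const unsigned int tid = threadIdx.y * blockDim.x + threadIdx.x;
    for( unsigned int i = tid; i < ARRAY_SZ; i += BLOCK_SIZE )
    {
        rSum += s_rFloat_C[ i ];
    }

    if( threadIdx.x + threadIdx.y + blockIdx.x + blockIdx.y == 0 )
    {
        poDst_D[ 0 ] = rSum;
    }
}

[/codebox]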

Since all your threads in all blocks are accessing exactly the same addresses in constant memory at the same time, you are effectively just reading one megabyte times the number of block waves necessary to execute 100 blocks, or five megabytes for your device. With your measured execution times, this equates to a constant memory bandwidth of roughly 15 MB/s that is required, out of the 57.6 GB/s offered by the device. I.e., your kernel is not at all bandwidth bound, even if the array overflows the constant cache and has to be re-read on each iteration.

Thanks tera and jma,

You are both quite right: the total number of floating-point operations across all threads is on the order of 10^10, so it looks like the kernel is definitely compute bound rather than bandwidth bound.
