Performance question, need some explanation

v1vas · February 2, 2012, 2:44pm

Card: GTX 580

Grid: 2x2x256

Can somebody please explain why the second version of the kernel runs twice as fast as the first one?

Version 1:

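__global__ void Kernel(CUDAThreadData* threadData)
{
    // Flatten the 2D grid into a linear block number, then a global thread index.
    int nBlock = blockIdx.y * gridDim.x + blockIdx.x;
    int index = nBlock * blockDim.x + threadIdx.x;

    // Work directly on this thread's table in global memory.
    DWORD* dataTable = threadData[index].dataTable;
    <… mathematical operations with dataTable with random memory access …>
}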

Version 2:

__global__ void Kernel(CUDAThreadData* threadData)
{
    // Flatten the 2D grid into a linear block number, then a global thread index.
    int nBlock = blockIdx.y * gridDim.x + blockIdx.x;
    int index = nBlock * blockDim.x + threadIdx.x;

    // Private per-thread copy of the table.
    DWORD dataTable[TABLE_SIZE_DWORDS]; // TABLE_SIZE_DWORDS = 0x310
    memcpy(dataTable, threadData[index].dataTable, TABLE_SIZE_DWORDS * sizeof(DWORD));
    <… mathematical operations with dataTable with random memory access …>
}

As far as I understand, both dataTables should reside in global memory. What has changed?

tera · February 2, 2012, 4:44pm

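The copy to local memory optimizes the data layout. Local memory is interleaved across threads, so a given word of one thread's dataTable sits right next to the same word of the neighboring threads' tables. In the first variant, reading any word from dataTable requires a full cache line to be read from memory for each thread. In the second variant (neglecting the initial copy), a single cache line is sufficient for the whole warp as long as its threads read the same index, reducing the required bandwidth to 1/32.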
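
To make the 1/32 figure concrete, here is a rough host-side C++ model of the two layouts. It is only a sketch, not code from the kernels above. It assumes 32 threads per warp, 128-byte cache lines, 4-byte words, and tables packed back to back starting at a 128-byte-aligned address 0. It also simplifies the interleaving to a single warp. The names contiguousAddress, interleavedAddress, and linesTouched are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Assumed hardware parameters for this model.
constexpr std::uint64_t kWarpSize   = 32;    // threads per warp
constexpr std::uint64_t kLineBytes  = 128;   // cache line size in bytes
constexpr std::uint64_t kWordBytes  = 4;     // sizeof(DWORD)
constexpr std::uint64_t kTableWords = 0x310; // TABLE_SIZE_DWORDS from the question

// Variant 1 (modeled): each thread's table is one contiguous block,
// so thread t's word i lives at (t * tableSize + i).
std::uint64_t contiguousAddress(std::uint64_t thread, std::uint64_t word) {
    return (thread * kTableWords + word) * kWordBytes;
}

// Variant 2 (modeled): local memory is interleaved, so word i of
// consecutive threads occupies consecutive 32-bit slots.
std::uint64_t interleavedAddress(std::uint64_t thread, std::uint64_t word) {
    return (word * kWarpSize + thread) * kWordBytes;
}

// Count the distinct cache lines a warp touches when every thread
// reads the same word index of its own table.
template <typename AddressFn>
std::size_t linesTouched(AddressFn address, std::uint64_t word) {
    assert(word < kTableWords);
    std::set<std::uint64_t> lines;
    for (std::uint64_t t = 0; t < kWarpSize; ++t) {
        lines.insert(address(t, word) / kLineBytes);
    }
    return lines.size();
}
```

Under these assumptions, linesTouched(contiguousAddress, 5) gives 32 and linesTouched(interleavedAddress, 5) gives 1. If the threads of a warp index their tables with different values, the interleaved layout loses part of that advantage.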

v1vas · February 3, 2012, 12:04am

Thank you.