Low global load efficiency, why? I think this access pattern should be coalesced

Hi,

I have problem with loads from global memory. This are fragments of my code:

__global__ 

void kernel (const float* srcData,

	(...)

{

const int tid = threadIdx.x;

float4 tmp;

(...)

for(j = 0; tid + blockDim.x*j < size; ++j)

{

	tmp.x = srcData[blockOffset + tid+blockDim.x*j];

	tmp.y = srcData[blockOffset + tid+blockDim.x*j + size];

	tmp.z = srcData[blockOffset + tid+blockDim.x*j + size*2];

	tmp.w = srcData[blockOffset + tid+blockDim.x*j + size*3];

	//some computations using local and shared mem

}

(...)

This is the only fragment of kernel, where I read from global mem.

I think, that load should be coalesced, because of use of tid variable all threads in warp reads floats, that are continuous in global mem.

But profiler shows me, that Global Load Efficiency is 2.3%. God why?

This is output from Tesla C2075 profiling, on my laptop on 540GT M I have about 5% efficiency of global load.

The srcData pieces of data per block aren’t 128 bit aligned, but that means, that every warp will have max two L1-cache-line-size transactions.

So why I’m having so poor global load efficiency?

Any ideas where I am wrong?

Try using float4 structure this way you only need one call tmp=srcData[ind]; Try to rewrite the loop. Your code is for a general case, but you cuold unroll the loop for a few particular cases.

Try to rearrange the computation so the indexing is of the form

array[constant_value+tid]

Then it should be a lot better.

Hi,

I think that Visual Profiler is giving me false feedback.

I make test and adjust input data size, so portions for blocks are 128-bit aligned (and I did nothing to kernel code). In that case Profiler shows 100% Load Efficiency as expected.

However, this was change from 1532 floats to 1536 floats per portion, and I haven’t observe any speed difference (in both cases application run time was 31 seconds, but Profiler shows 100% vs 2.3% Load Efficiency in memory bound kernel when kernel is about 90% of application run time).

Strange to me, but it seems, that Visual Profiler is giving me wrong feedback.

This happens very often to me. The bandwidth seems to have the same problem. Sometimes I have very low bandwith which shouldn’t be the case and sometimes it’s over the bandwith which is possible with that hardware.

Very frustrating…