Coalescing access

Hi,

I’m quite new to CUDA (one month now) and I thought I understood coalesced access, but it seems in fact I don’t.

Here are my two simple kernels, supposed to show the benefit of coalesced access:

__global__ void coaTest(CUDA_PICTURE_PLANAR_FLOAT cudaPic, int size) {

    int x = threadIdx.x + blockIdx.x * blockDim.x;

    if (x < size / 3) {

        float pixp0 = cudaPic.plan0_cuda[x * 3];
        float pixp1 = cudaPic.plan1_cuda[x * 3];
        float pixp2 = cudaPic.plan2_cuda[x * 3];

        /*cudaPic.plan0_cuda[x * 3] = pixp0;
        cudaPic.plan1_cuda[x * 3] = pixp1;
        cudaPic.plan2_cuda[x * 3] = pixp2;*/
    }
}

That one above is not supposed to be coalesced (from what I know): it accesses non-consecutive memory addresses (one element in three).

__global__ void coaTest(CUDA_PICTURE_PLANAR_FLOAT cudaPic, int size) {

    int x = threadIdx.x + blockIdx.x * blockDim.x;

    if (x < size / 3) {

        float pixp0 = cudaPic.plan0_cuda[x];
        float pixp1 = cudaPic.plan1_cuda[x];
        float pixp2 = cudaPic.plan2_cuda[x];

        /*cudaPic.plan0_cuda[x] = pixp0;
        cudaPic.plan1_cuda[x] = pixp1;
        cudaPic.plan2_cuda[x] = pixp2;*/
    }
}

This one should be coalesced because it accesses consecutive memory addresses.
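For a single warp of 32 threads, the two index expressions produce very different address patterns. This is just an illustration (the byte offsets assume 4-byte floats and 128-byte memory segments, which are my assumptions, not something stated in this thread):

// Stride-1 (coalesced): threads 0..31 of a warp read
//   plan0_cuda[0..31]         -> bytes 0..127, one 128-byte segment
//
// Stride-3 (non-coalesced): the same warp reads
//   plan0_cuda[0, 3, 6, ..., 93] -> bytes 0..375, spread across three
//   128-byte segments, so roughly 3x the memory traffic per warp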

Yet when I ran them through the Nsight performance analyzer, it didn’t show any time difference between the two executions: both took 30 microseconds.

I did see a noticeable difference when the kernels write to memory, i.e. when I uncomment:

    cudaPic.plan0_cuda[x] = pixp0;
    cudaPic.plan1_cuda[x] = pixp1;
    cudaPic.plan2_cuda[x] = pixp2;

It then showed 230 microseconds for the non-coalesced kernel and 99 for the coalesced one.

Still, I can’t explain why the difference only appears when the kernels write to memory.

Sorry for my English.

Testi

Hi,

I strongly suspect that since your code doesn’t do anything, the compiler simply optimised it away… ;)
That explains why you only see a difference when you actually write to memory: in that case the compiler probably kept the code (although it might have been able to discard it too, since the result isn’t used for anything useful).
This is just a guess of course.

Gilles

…and a correct guess, at that. :) The CUDA compiler aggressively eliminates dead code (important, since it inlines everything by default), so you cannot test performance by removing the code that writes to global memory.
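One common workaround is to make the kernel’s result observable so the loads cannot be eliminated, while keeping the write pattern identical in both versions so that only the read pattern differs. A sketch (the `out` buffer and the accumulation are my additions, not Testi’s actual code):

__global__ void coaTestStrided(CUDA_PICTURE_PLANAR_FLOAT cudaPic,
                               float *out, int size) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    if (x < size / 3) {
        // Strided (non-coalesced) reads: one element in three.
        float s = cudaPic.plan0_cuda[x * 3]
                + cudaPic.plan1_cuda[x * 3]
                + cudaPic.plan2_cuda[x * 3];
        out[x] = s;  // coalesced write keeps the loads alive
    }
}

__global__ void coaTestCoalesced(CUDA_PICTURE_PLANAR_FLOAT cudaPic,
                                 float *out, int size) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    if (x < size / 3) {
        // Consecutive (coalesced) reads.
        float s = cudaPic.plan0_cuda[x]
                + cudaPic.plan1_cuda[x]
                + cudaPic.plan2_cuda[x];
        out[x] = s;  // identical write in both kernels
    }
}

Since both kernels perform the same coalesced write, any timing difference between them should come from the reads alone.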

I would suggest looking at documentation on the CUDA profiler, as there are counters which will help you understand how many memory transactions are used to fetch your data.
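For instance, with the command-line profiler you can compare global-load transaction counts between the two kernels (the metric names below are standard nvprof metrics, not something taken from this thread, and availability depends on your GPU and toolkit version):

nvprof --metrics gld_transactions,gld_transactions_per_request ./coaTest

A higher transactions-per-request value for the strided kernel would confirm that its loads are being split across more memory segments.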

Yep, I ran some tests and it seems you were correct.

Thank you

Testi