Hi,
I'm quite new to CUDA (one month now) and I thought I understood what coalesced access was, but it seems I don't.
Here are my two simple kernels, which are supposed to show the benefit of coalesced access:
__global__ void coaTest(CUDA_PICTURE_PLANAR_FLOAT cudaPic, int size){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    if (x < size/3){
        // strided reads: neighbouring threads are 3 floats apart
        float pixp0 = cudaPic.plan0_cuda[x*3];
        float pixp1 = cudaPic.plan1_cuda[x*3];
        float pixp2 = cudaPic.plan2_cuda[x*3];
        /*cudaPic.plan0_cuda[x*3] = pixp0;
        cudaPic.plan1_cuda[x*3] = pixp1;
        cudaPic.plan2_cuda[x*3] = pixp2;*/
    }
}
The one above is not supposed to be coalesced (from what I know): I access non-consecutive memory addresses (1 in 3).
__global__ void coaTest(CUDA_PICTURE_PLANAR_FLOAT cudaPic, int size){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    if (x < size/3){
        // contiguous reads: neighbouring threads read neighbouring floats
        float pixp0 = cudaPic.plan0_cuda[x];
        float pixp1 = cudaPic.plan1_cuda[x];
        float pixp2 = cudaPic.plan2_cuda[x];
        /*cudaPic.plan0_cuda[x] = pixp0;
        cudaPic.plan1_cuda[x] = pixp1;
        cudaPic.plan2_cuda[x] = pixp2;*/
    }
}
This one should be coalesced because the accesses are consecutive in memory.
Yet when I used the Nsight performance analyzer, it didn't show any difference in execution time: both kernels took 30 microseconds.
I did see a noticeable difference once the kernels write to memory, i.e. when I uncomment the writes (shown here for the second kernel; the first one uses [x*3]):
cudaPic.plan0_cuda[x]= pixp0;
cudaPic.plan1_cuda[x]= pixp1;
cudaPic.plan2_cuda[x]= pixp2;
It showed 230 for the non-coalesced kernel and 99 for the coalesced one.
Still, I can't explain why the difference only shows up when the kernels write to memory?
Sorry for my English.
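One possible check, offered only as a guess: if the compiler removes loads whose values are never used, the read-only kernels would do almost no memory work, which would explain the identical timings. A minimal sketch of a read test that keeps the loads live by storing their sum to a separate buffer is below. The names PlanarPicture, readTest, runReadTest and the out buffer are hypothetical and are not part of my real code; PlanarPicture only stands in for CUDA_PICTURE_PLANAR_FLOAT.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for CUDA_PICTURE_PLANAR_FLOAT: three device pointers.
struct PlanarPicture {
    float *plan0_cuda;
    float *plan1_cuda;
    float *plan2_cuda;
};

// stride = 3 reproduces the strided reads, stride = 1 the contiguous ones.
// The write to out[x] is contiguous in both cases, so any timing difference
// should come from the reads.
__global__ void readTest(PlanarPicture pic, float *out, int size, int stride) {
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    if (x < size / 3) {
        float p0 = pic.plan0_cuda[x * stride];
        float p1 = pic.plan1_cuda[x * stride];
        float p2 = pic.plan2_cuda[x * stride];
        // Using the loaded values gives the compiler a reason to keep the loads.
        out[x] = p0 + p1 + p2;
    }
}

// Runs one test on planes filled with 1.0f, returns the kernel time in
// milliseconds and stores the sum of all outputs in *checksum.
float runReadTest(int size, int stride, float *checksum) {
    int n = size / 3;
    size_t planeBytes = size * sizeof(float);
    std::vector<float> host(size, 1.0f);

    PlanarPicture pic;
    float *out = nullptr;
    cudaMalloc(&pic.plan0_cuda, planeBytes);
    cudaMalloc(&pic.plan1_cuda, planeBytes);
    cudaMalloc(&pic.plan2_cuda, planeBytes);
    cudaMalloc(&out, n * sizeof(float));
    cudaMemcpy(pic.plan0_cuda, host.data(), planeBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(pic.plan1_cuda, host.data(), planeBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(pic.plan2_cuda, host.data(), planeBytes, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    cudaEventRecord(start);
    readTest<<<blocks, threads>>>(pic, out, size, stride);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Copy the results back and sum them on the host.
    std::vector<float> result(n);
    cudaMemcpy(result.data(), out, n * sizeof(float), cudaMemcpyDeviceToHost);
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += result[i];
    *checksum = sum;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(pic.plan0_cuda);
    cudaFree(pic.plan1_cuda);
    cudaFree(pic.plan2_cuda);
    cudaFree(out);
    return ms;
}
```

Calling runReadTest(size, 3, &sum) and then runReadTest(size, 1, &sum) with the same size would give two timings to compare. If they now differ, the unused loads in my original kernels were probably being dropped.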
Testi