Need unroll advice


Previously, I used this code in my kernel:

for (int k = 0; k < BLOCK_DIM; ++k) {
    tmp = shared_A[k][ty] - shared_B[k][tx];
    ssd += tmp * tmp;
}


Because I know that BLOCK_DIM = 16, I unrolled the loop by hand like this:

tmp = shared_A[ 0][ty] - shared_B[ 0][tx]; ssd += tmp*tmp;
tmp = shared_A[ 1][ty] - shared_B[ 1][tx]; ssd += tmp*tmp;
tmp = shared_A[ 2][ty] - shared_B[ 2][tx]; ssd += tmp*tmp;
tmp = shared_A[ 3][ty] - shared_B[ 3][tx]; ssd += tmp*tmp;
tmp = shared_A[ 4][ty] - shared_B[ 4][tx]; ssd += tmp*tmp;
tmp = shared_A[ 5][ty] - shared_B[ 5][tx]; ssd += tmp*tmp;
tmp = shared_A[ 6][ty] - shared_B[ 6][tx]; ssd += tmp*tmp;
tmp = shared_A[ 7][ty] - shared_B[ 7][tx]; ssd += tmp*tmp;
tmp = shared_A[ 8][ty] - shared_B[ 8][tx]; ssd += tmp*tmp;
tmp = shared_A[ 9][ty] - shared_B[ 9][tx]; ssd += tmp*tmp;
tmp = shared_A[10][ty] - shared_B[10][tx]; ssd += tmp*tmp;
tmp = shared_A[11][ty] - shared_B[11][tx]; ssd += tmp*tmp;
tmp = shared_A[12][ty] - shared_B[12][tx]; ssd += tmp*tmp;
tmp = shared_A[13][ty] - shared_B[13][tx]; ssd += tmp*tmp;
tmp = shared_A[14][ty] - shared_B[14][tx]; ssd += tmp*tmp;
tmp = shared_A[15][ty] - shared_B[15][tx]; ssd += tmp*tmp;

My problem is that this does not speed up my kernel at all… I thought that unrolling a loop improved performance, but maybe I am doing something wrong!

What do you think about that?



Well, bank conflicts may prevent you from getting any speedup in this case. What are all the threads doing? Are tx and ty threadIdx.x and threadIdx.y?

We have

tx = threadIdx.x
ty = threadIdx.y

The principle is roughly the same as matrix multiplication (except that I compute a difference instead of a product).
I use shared memory to avoid non-coalesced reads and writes.

See, blockDim is just 16 (you should have at least 32 to fill a warp)… What speedup are you expecting?

You are just avoiding 16 branch statements. Do you expect 16 branch statements to cause a “millisecond” delay? You might save 1 or 2 microseconds at most, and that's about it.

OK, I understand. But I have seen some unrolling examples (such as the reduction example) where the unrolled code was no longer than mine, yet the gain was a factor of approximately 1.5.

Can you explain when we need to use unrolling?


My guess would be that if BLOCK_DIM is a hard-coded constant, as it probably is, the compiler knows its value as well and unrolls the loop anyway, so you cannot gain any performance by doing it manually.
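If you would rather not depend on the compiler deciding this by itself, CUDA also has a `#pragma unroll` directive that is placed directly before the loop. Below is a minimal sketch, assuming a single 16x16 thread block and one pair of 16x16 tiles. The kernel name `ssd_tile`, the host helper `ssd_tile_max_error` and the test data are made up for illustration and are not the original poster's code. Whether the pragma changes the generated code at all depends on the compiler version, since the trip count is already a constant.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <vector>

#define BLOCK_DIM 16  // compile-time constant, as in the thread

// Thread (tx, ty) sums the squared differences between column ty of
// tile A and column tx of tile B.
__global__ void ssd_tile(const float* A, const float* B, float* out)
{
    __shared__ float shared_A[BLOCK_DIM][BLOCK_DIM];
    __shared__ float shared_B[BLOCK_DIM][BLOCK_DIM];
    const int tx = threadIdx.x;
    const int ty = threadIdx.y;

    // Stage both tiles in shared memory.
    shared_A[ty][tx] = A[ty * BLOCK_DIM + tx];
    shared_B[ty][tx] = B[ty * BLOCK_DIM + tx];
    __syncthreads();

    float ssd = 0.0f;
    #pragma unroll  // explicit unroll request; the trip count is a constant
    for (int k = 0; k < BLOCK_DIM; ++k) {
        const float tmp = shared_A[k][ty] - shared_B[k][tx];
        ssd += tmp * tmp;
    }
    out[ty * BLOCK_DIM + tx] = ssd;
}

// Hypothetical host helper: runs the kernel on one tile pair and returns the
// largest absolute deviation from a CPU reference, or -1 on a CUDA error.
float ssd_tile_max_error()
{
    const int n = BLOCK_DIM * BLOCK_DIM;
    const size_t bytes = n * sizeof(float);
    std::vector<float> hA(n), hB(n), hOut(n);
    for (int i = 0; i < n; ++i) {   // arbitrary illustrative data
        hA[i] = 0.5f * (i % 7);
        hB[i] = 0.25f * (i % 5);
    }

    float *dA = nullptr, *dB = nullptr, *dOut = nullptr;
    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dOut, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        ssd_tile<<<1, dim3(BLOCK_DIM, BLOCK_DIM)>>>(dA, dB, dOut);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dOut);
    if (err != cudaSuccess) return -1.0f;

    // CPU reference using the same indexing as the kernel.
    float maxErr = 0.0f;
    for (int ty = 0; ty < BLOCK_DIM; ++ty)
        for (int tx = 0; tx < BLOCK_DIM; ++tx) {
            float ref = 0.0f;
            for (int k = 0; k < BLOCK_DIM; ++k) {
                const float d = hA[k * BLOCK_DIM + ty] - hB[k * BLOCK_DIM + tx];
                ref += d * d;
            }
            maxErr = std::max(maxErr, std::fabs(ref - hOut[ty * BLOCK_DIM + tx]));
        }
    return maxErr;
}
```

Calling `ssd_tile_max_error()` from a `main` only confirms that the result is correct; it says nothing about speed. For timing, run the kernel with and without the pragma and compare.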

qUiXui’s guess can be checked by passing the -ptx option to nvcc. The PTX assembly for both versions, with useful comments and human-readable labels, will then be available in <your_source_code>.ptx.
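One way to set up that comparison is to put both variants side by side in a single file and compile it with `nvcc -ptx`. The sketch below assumes a hypothetical file name `unroll_test.cu` and made-up kernel names `ssd_loop` and `ssd_manual`. The host helper `ssd_variants_match` is also an illustration, there only to confirm that the two kernels compute the same thing before you read the PTX.

```cuda
// Hypothetical file unroll_test.cu; generate PTX with:
//   nvcc -ptx unroll_test.cu     (writes unroll_test.ptx)
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_DIM 16

// Variant 1: plain loop, trip count known at compile time.
__global__ void ssd_loop(const float* A, const float* B, float* out)
{
    __shared__ float shared_A[BLOCK_DIM][BLOCK_DIM];
    __shared__ float shared_B[BLOCK_DIM][BLOCK_DIM];
    const int tx = threadIdx.x, ty = threadIdx.y;
    shared_A[ty][tx] = A[ty * BLOCK_DIM + tx];
    shared_B[ty][tx] = B[ty * BLOCK_DIM + tx];
    __syncthreads();

    float ssd = 0.0f, tmp;
    for (int k = 0; k < BLOCK_DIM; ++k) {
        tmp = shared_A[k][ty] - shared_B[k][tx];
        ssd += tmp * tmp;
    }
    out[ty * BLOCK_DIM + tx] = ssd;
}

// One step of the hand-unrolled body.
#define SSD_STEP(k) tmp = shared_A[k][ty] - shared_B[k][tx]; ssd += tmp * tmp;

// Variant 2: the same work, unrolled by hand.
__global__ void ssd_manual(const float* A, const float* B, float* out)
{
    __shared__ float shared_A[BLOCK_DIM][BLOCK_DIM];
    __shared__ float shared_B[BLOCK_DIM][BLOCK_DIM];
    const int tx = threadIdx.x, ty = threadIdx.y;
    shared_A[ty][tx] = A[ty * BLOCK_DIM + tx];
    shared_B[ty][tx] = B[ty * BLOCK_DIM + tx];
    __syncthreads();

    float ssd = 0.0f, tmp;
    SSD_STEP(0)  SSD_STEP(1)  SSD_STEP(2)  SSD_STEP(3)
    SSD_STEP(4)  SSD_STEP(5)  SSD_STEP(6)  SSD_STEP(7)
    SSD_STEP(8)  SSD_STEP(9)  SSD_STEP(10) SSD_STEP(11)
    SSD_STEP(12) SSD_STEP(13) SSD_STEP(14) SSD_STEP(15)
    out[ty * BLOCK_DIM + tx] = ssd;
}

// Hypothetical host helper: 1 if both variants agree on one tile pair,
// 0 if they differ, -1 on a CUDA error.
int ssd_variants_match()
{
    const int n = BLOCK_DIM * BLOCK_DIM;
    const size_t bytes = n * sizeof(float);
    std::vector<float> hA(n), hB(n), out1(n), out2(n);
    for (int i = 0; i < n; ++i) {   // arbitrary illustrative data
        hA[i] = 0.5f * (i % 7);
        hB[i] = 0.25f * (i % 5);
    }

    float *dA = nullptr, *dB = nullptr, *dOut = nullptr;
    const dim3 block(BLOCK_DIM, BLOCK_DIM);
    cudaError_t err = cudaMalloc(&dA, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dB, bytes);
    if (err == cudaSuccess) err = cudaMalloc(&dOut, bytes);
    if (err == cudaSuccess) err = cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) err = cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        ssd_loop<<<1, block>>>(dA, dB, dOut);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(out1.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) {
        ssd_manual<<<1, block>>>(dA, dB, dOut);
        err = cudaGetLastError();
    }
    if (err == cudaSuccess) err = cudaMemcpy(out2.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dOut);
    if (err != cudaSuccess) return -1;
    return out1 == out2 ? 1 : 0;
}
```

In the resulting .ptx file, look for the two kernel entry points and check whether the body of the loop variant still contains a backward branch. If it does not, the compiler has already unrolled it, which would explain why the manual version is no faster.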