bug in loop?

#define BDY 32

24 float C[BDY] = {0};
25
35 #pragma unroll 1
36 for(int by=0; by<BDY; by++)
37 {
38     float b = inputB[…];
39
40     #pragma unroll 1
41     for(int row=0; row<BDY; ++row)
42     {
43         int col = by;
44         C[row] += shared[row*A_BLOCK_DIM + col] * b;
45     }
46 }

I’ve been battling this for a couple of days. I really can’t think of a reason other than a bug in the OpenCL compiler.

The problem is line 44 (C[row] += …). If I compile the kernel as shown, it returns almost immediately, producing neither correct results nor errors.

If I change it to C[by], the kernel returns fine, with correct timing and everything.

If I do

44 C[row] += shared[row*A_BLOCK_DIM + col] * b;
45 C[row] += 1.0;
46 C[row] -= 1.0;

the kernel produces the correct result, but the timing is off because I’m doing extra work in my innermost loop.
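To pin down which variants really produce the correct result, the loop nest above can be reproduced on the host and compared against what the kernel writes back. This is only a rough sketch under assumptions: `reference_accumulate`, `count_mismatches`, `shared_tile`, `a_block_dim` and `b_vals` are made-up names, and because the index into `inputB` is elided in the post, the caller is assumed to supply the value loaded at each `by` iteration.

```c
#include <assert.h>
#include <stddef.h>

#define BDY 32

/* Host-side reference for the kernel's loop nest (illustrative only).
 * shared_tile : row-major copy of the kernel's `shared` tile,
 *               with row stride a_block_dim (stand-in for A_BLOCK_DIM).
 * b_vals[by]  : the value the kernel loads into `b` at iteration `by`
 *               (hypothetical: the real inputB index is not shown). */
void reference_accumulate(const float *shared_tile, size_t a_block_dim,
                          const float *b_vals, float C[BDY])
{
    assert(a_block_dim >= BDY);

    for (int row = 0; row < BDY; ++row)
        C[row] = 0.0f;

    for (int by = 0; by < BDY; ++by) {
        float b = b_vals[by];
        for (int row = 0; row < BDY; ++row) {
            int col = by;
            C[row] += shared_tile[row * a_block_dim + col] * b;
        }
    }
}

/* Count entries where |got - want| exceeds tol * max(1, |want|).
 * A NaN in `got` fails the comparison and is counted as a mismatch. */
int count_mismatches(const float *got, const float *want, int n, float tol)
{
    int bad = 0;
    for (int i = 0; i < n; ++i) {
        float d     = got[i] - want[i];
        float diff  = d < 0.0f ? -d : d;
        float mag   = want[i] < 0.0f ? -want[i] : want[i];
        float scale = mag > 1.0f ? mag : 1.0f;
        if (!(diff <= tol * scale))
            ++bad;
    }
    return bad;
}
```

Running each kernel variant (`C[row]`, `C[by]`, and the `+= 1.0` / `-= 1.0` version) against the same reference would show whether the fast-returning build is skipping the accumulation entirely or only computing something different.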

Is anyone seeing a similar issue? I have the CUDA SDK 4.0, and the same thing happened on CUDA 3.2.

Thanks…

I can't say why this is broken, but you could try saving the compiled kernel as a PTX assembly file and looking at what the code is actually doing.
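A minimal host-side sketch of how that could be done, assuming the program has already been built with `clBuildProgram`. It uses the standard `clGetProgramInfo` queries (`CL_PROGRAM_NUM_DEVICES`, `CL_PROGRAM_BINARY_SIZES`, `CL_PROGRAM_BINARIES`). On NVIDIA's OpenCL implementation the returned "binary" is, as far as I know, PTX text, but treat that as an assumption to check on your setup. The names `write_blob`, `dump_program_binary` and the `HAVE_OPENCL` guard are invented for this sketch; the guard only lets the file-writing part compile on a machine without the OpenCL headers.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Write `size` bytes to `path`. Returns 0 on success, -1 on failure. */
int write_blob(const char *path, const unsigned char *data, size_t size)
{
    assert(path != NULL);
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    size_t written = fwrite(data, 1, size, f);
    int closed_ok = (fclose(f) == 0);
    return (written == size && closed_ok) ? 0 : -1;
}

#ifdef HAVE_OPENCL   /* hypothetical guard: define it when building against OpenCL */
#include <CL/cl.h>

/* Save the binary built for the program's first device to `path`.
 * Returns 0 on success, -1 on any failure. */
int dump_program_binary(cl_program program, const char *path)
{
    cl_uint num_devices = 0;
    if (clGetProgramInfo(program, CL_PROGRAM_NUM_DEVICES,
                         sizeof(num_devices), &num_devices, NULL) != CL_SUCCESS
        || num_devices == 0)
        return -1;

    size_t *sizes = calloc(num_devices, sizeof(*sizes));
    unsigned char **bins = calloc(num_devices, sizeof(*bins));
    int rc = -1;

    if (sizes && bins &&
        clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                         num_devices * sizeof(*sizes), sizes, NULL) == CL_SUCCESS) {
        int ok = 1;
        for (cl_uint i = 0; i < num_devices; ++i) {
            bins[i] = malloc(sizes[i] ? sizes[i] : 1);
            if (!bins[i])
                ok = 0;
        }
        /* CL_PROGRAM_BINARIES fills one caller-allocated buffer per device. */
        if (ok &&
            clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                             num_devices * sizeof(*bins), bins, NULL) == CL_SUCCESS)
            rc = write_blob(path, bins[0], sizes[0]);
    }

    if (bins)
        for (cl_uint i = 0; i < num_devices; ++i)
            free(bins[i]);
    free(bins);
    free(sizes);
    return rc;
}
#endif /* HAVE_OPENCL */
```

Dumping the `C[row]` build and the `C[by]` build to two files and diffing them should show whether the inner loop was removed or altered in the failing version.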