#define BDY 32
24 float C[BDY] = {0};
25
…
35 #pragma unroll 1
36 for(int by=0; by<BDY; by++)
37 {
38     float b = inputB[…];
39
40     #pragma unroll 1
41     for(int row=0; row<BDY; ++row)
42     {
43         int col = by;
44         C[row] += shared[row*A_BLOCK_DIM + col] * b;
45     }
46 }
…
I’ve been battling this for a couple of days. I really can’t think of a reason other than the OpenCL compiler being bugged.
Take line 44, C[row] +…: if I compile it as written, the kernel returns almost immediately, producing neither correct results nor errors.
If I change it to C[by], the kernel runs fine, with correct timing and everything.
If I do
44 C[row] += shared[row*A_BLOCK_DIM + col] * b;
45 C[row] += 1.0;
46 C[row] -= 1.0;
the kernel produces correct results, but the timing is off because I’m doing extra work in my innermost loop.
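For anyone who wants to check the kernel output against a known-good answer, a minimal host-side C sketch of what this loop is meant to compute could look like the following. It is only a sketch under assumptions: `reference_accumulate`, `tile` and `b_vals` are hypothetical names, `tile` stands in for the kernel's `shared` array, `b_vals[by]` stands in for whatever `inputB[…]` loads on each iteration (the real index is not shown above), and the value 32 for `A_BLOCK_DIM` is a placeholder.

```c
#include <assert.h>

#define BDY 32
/* Assumption: the real A_BLOCK_DIM is not shown in the post; 32 is a placeholder. */
#define A_BLOCK_DIM 32

/* CPU reference for the accumulation in lines 35-46.
 * tile   : stand-in for `shared`, at least (BDY-1)*A_BLOCK_DIM + BDY floats
 * b_vals : stand-in for the value loaded from inputB on each `by` iteration
 * C      : output accumulator of BDY floats, zeroed here like `float C[BDY] = {0};`
 */
void reference_accumulate(const float *tile, const float *b_vals, float *C)
{
    assert(tile && b_vals && C);

    for (int row = 0; row < BDY; ++row)
        C[row] = 0.0f;

    for (int by = 0; by < BDY; ++by) {
        float b = b_vals[by];
        for (int row = 0; row < BDY; ++row) {
            int col = by;
            C[row] += tile[row * A_BLOCK_DIM + col] * b;
        }
    }
}
```

Comparing the kernel's C values against this with a small tolerance (GPU and CPU float rounding can differ slightly) should show whether the broken build skips the accumulation entirely or only gets part of it wrong.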
Is anyone seeing a similar issue? I have the CUDA SDK 4.0, and the same thing happened on CUDA 3.2.
Thanks…