This is part of a kernel that NVIDIA wrote (in CUDA) for sparse matrix-vector multiplication in its SpMV library (the CSR vector kernel):
for (IndexType jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE)
    sdata[threadIdx.x] += Ax[jj] * fetch_x<UseCache>(Aj[jj], x);

// reduce local sums to row sum (ASSUMES: warp size 32)
sdata[threadIdx.x] += sdata[threadIdx.x + 16];
sdata[threadIdx.x] += sdata[threadIdx.x +  8];
sdata[threadIdx.x] += sdata[threadIdx.x +  4];
sdata[threadIdx.x] += sdata[threadIdx.x +  2];
sdata[threadIdx.x] += sdata[threadIdx.x +  1];

// first thread writes the warp result
if (thread_lane == 0)
    y[row] += sdata[threadIdx.x];
Here sdata is a shared-memory array of BLOCK_SIZE + 16 floats.
It works fine (sure, NVIDIA wrote it). At first sight it may seem that there is a race condition and that synchronization barriers would be needed, but all of the threads that participate in the reduction belong to the same warp, so they execute in lockstep, and that eliminates the possibility of a race condition.
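One detail worth noting: I believe recent versions of this kernel also declare the shared array volatile, so that the compiler re-reads shared memory on every access instead of caching partial sums in registers. A minimal sketch of what I mean (the exact declaration is my assumption, not copied from the library):

// assumption: 'volatile' forces every read/write of sdata to actually touch
// shared memory, which the barrier-free warp-synchronous reduction depends on
__shared__ volatile float sdata[BLOCK_SIZE + 16];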
However, the same code translated to OpenCL:
for (unsigned int jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE)
    sum += V[jj] * x[J[jj]];
sdata[local_id] = sum;

// reduce local sums to row sum
sdata[local_id] += sdata[local_id + 16];
sdata[local_id] += sdata[local_id +  8];
sdata[local_id] += sdata[local_id +  4];
sdata[local_id] += sdata[local_id +  2];
sdata[local_id] += sdata[local_id +  1];

// first thread writes warp result
if (thread_lane == 0)
    y[row] += sdata[local_id];
And this code does not run correctly. Even worse, the reduction does not seem to run at all: if I comment out the five lines that do the reduction, the result “y” is exactly the same!
Stranger still, I can make the code work with this change in each line of the reduction:
sdata[local_id] += sdata[(local_id + 16)];   --->   sdata[local_id] += sdata[(local_id + 16) % WORKGROUP_SIZE];
But:

- I lose some performance.
- I can’t be sure that it is a “robust” solution (for what I would expect a robust version to look like, see the sketch after this list).
- I don’t understand why it works!
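For comparison, this is what I would expect the portable, barrier-based version of the reduction to look like in OpenCL (a sketch I have not benchmarked; it assumes WARP_SIZE == 32 and that thread_lane is the lane index within the warp, as above):

sdata[local_id] = sum;
barrier(CLK_LOCAL_MEM_FENCE);

// halve the number of active lanes each step; the barrier sits outside the
// 'if' so that every work-item in the work-group reaches it
for (unsigned int s = 16; s > 0; s >>= 1) {
    if (thread_lane < s)
        sdata[local_id] += sdata[local_id + s];
    barrier(CLK_LOCAL_MEM_FENCE);
}

if (thread_lane == 0)
    y[row] += sdata[local_id];

This trades the implicit warp-level synchronization of the CUDA version for explicit work-group barriers, which is exactly the cost the warp-synchronous trick is supposed to avoid.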
It may seem that, the modulus being a slow operation, it might “help” threads finish their work before the values are read. To check this, I changed one of the reduction lines this way:
sdata[local_id] += sdata[(local_id + 1)];   --->   sdata[local_id] += sdata[1];
It obviously does not give the correct result, but I could check that the instruction really gets executed (the result is not the same as when commenting the line out, as happened before), so the modulus operation is not needed just to slow things down.
Another possibility I checked (this is why I tried the modulus in the first place) is that the modulus operation might be preventing an access beyond the bounds of the shared-memory array. So I changed WORKGROUP_SIZE to a much greater value, 512 or 1024, and it still gives the correct result.
So I don’t understand anything!
In case it helps, perhaps you noticed a difference between the two codes:
CUDA:

for (IndexType jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE)
    sdata[threadIdx.x] += Ax[jj] * fetch_x<UseCache>(Aj[jj], x);

OpenCL:

for (unsigned int jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE)
    sum += V[jj] * x[J[jj]];
sdata[local_id] = sum;
The thing is that if, in the OpenCL code, I accumulate directly into sdata[local_id] inside the for loop (as in the CUDA code), it does not work correctly. Of course, sdata[local_id] is initialized to 0 before the loop, as it is in the CUDA code.
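To be explicit, this is the variant that fails for me:

// accumulating directly in local memory, as the CUDA kernel does
sdata[local_id] = 0.0f;
for (unsigned int jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE)
    sdata[local_id] += V[jj] * x[J[jj]];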
More insight: I have also tested this OpenCL code on some AMD GPUs.

- On an AMD Fusion GPU with SDK 2.5, it worked fine without the modulus operation.
- On the same GPU, after upgrading to SDK 2.6 and new drivers, it stopped working; with SDK 2.7 and even newer drivers it still fails. (It works with the modulus op.)
- On an ATI V7800 with SDK 2.6 and its drivers (which are different from those of the AMD Fusion), it works fine.
I’m very confused and a bit desperate.
Any idea of what could be happening?
Thanks