Same kernel in CUDA and OpenCL: one works, the other doesn't.

This is part of the kernel that Nvidia wrote (in CUDA) for sparse matrix–vector multiplication in the SpMV library (the CSR vector kernel):

for (IndexType jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE)
    sdata[threadIdx.x] += Ax[jj] * fetch_x<UseCache>(Aj[jj], x);

// reduce local sums to row sum (ASSUME: warpsize 32)
sdata[threadIdx.x] += sdata[threadIdx.x + 16];
sdata[threadIdx.x] += sdata[threadIdx.x +  8];
sdata[threadIdx.x] += sdata[threadIdx.x +  4];
sdata[threadIdx.x] += sdata[threadIdx.x +  2];
sdata[threadIdx.x] += sdata[threadIdx.x +  1];

// first thread writes warp result
if (thread_lane == 0)
    y[row] += sdata[threadIdx.x];

sdata is a shared-memory array of (BLOCK_SIZE + 16) floats.

It works fine (sure, Nvidia wrote it). At first sight it may look like there is a race condition and synchronization barriers are needed, but all of the threads that participate in the reduction belong to the same warp, so they execute in lockstep, which eliminates the possibility of a race condition.

However, the same code translated to OpenCL:

for (unsigned int jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE)
    sum += V[jj] * x[J[jj]];

sdata[local_id] = sum;

// reduce local sums to row sum
sdata[local_id] += sdata[local_id + 16];
sdata[local_id] += sdata[local_id +  8];
sdata[local_id] += sdata[local_id +  4];
sdata[local_id] += sdata[local_id +  2];
sdata[local_id] += sdata[local_id +  1];

// first thread writes warp result
if (thread_lane == 0)
    y[row] += sdata[local_id];

And this code does not run correctly. Even stranger, the reduction does not seem to run at all: if I comment out the 5 lines that do the reduction, the result “y” is exactly the same!!

Stranger still, I can make the code work with this change (in each line of the reduction):

sdata[local_id] += sdata[local_id + 16];  --->  sdata[local_id] += sdata[(local_id + 16) % WORKGROUP_SIZE];

But:

  1. I lose some performance

  2. I can’t be sure that it is a robust solution

  3. I don’t understand why it works!!!

It may seem that, the modulus being a slow operation, it “helps” threads finish their work before the values are read. To check this I changed one of the reduction lines like this:

sdata[local_id] += sdata[local_id + 1];  --->  sdata[local_id] += sdata[1];

It obviously does not produce the correct result, but I could verify that the instruction really is executed (the result is not the same as when commenting the line out, as it was before), so the modulus operation is not needed just to slow things down.

Another possibility I checked (this is actually why I tried the modulus op) is that the modulus might be preventing an out-of-bounds access to shared memory. So I changed WORKGROUP_SIZE to a much greater value, 512 or 1024, and it still gives the correct result.

So I don’t understand anything!

In case it helps, you may have noticed a difference between the two codes:

CUDA

for (IndexType jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE)
    sdata[threadIdx.x] += Ax[jj] * fetch_x<UseCache>(Aj[jj], x);

OpenCL

for (unsigned int jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE)
    sum += V[jj] * x[J[jj]];

sdata[local_id] = sum;

The thing is that if, in the OpenCL code, I accumulate directly into sdata[local_id] inside the for loop (as the CUDA code does), it does not work correctly either. And of course, sdata[local_id] is initialized to 0 before the loop, just as in the CUDA code.

More insight: I have also tested this OpenCL code on some AMD GPUs.

  • On an AMD Fusion GPU with SDK 2.5, it worked fine without the modulus operation.

  • On the same GPU, after upgrading to SDK 2.6 and new drivers, it stopped working. It keeps failing with SDK 2.7 and even newer drivers. (It works with the modulus op.)

  • On an ATI V7800 it works fine with SDK 2.6 and its drivers (which differ from those of the AMD Fusion).

I’m very confused and a bit desperate.

Any idea what could be happening?

Thanks

Hi,

Any chance that there is a “volatile” pointer declaration missing on the OpenCL version?

Doh!! A thousand thanks!! :)

Yes, a “volatile” declaration for the shared array “sdata” solves the problem, but performance degrades by about 30%. However, just a “mem_fence” before the reduction also produces the correct result, and its performance is only a little worse than with the modulus operation. I’ll keep refining it.
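For reference, the volatile fix might look like this in the OpenCL kernel (a sketch; everything except the names taken from the snippets above is an assumption):

```c
// Assumed earlier declaration: __local float sdata[WORKGROUP_SIZE + 16];
// Making the accesses volatile forbids the compiler from caching sdata
// values in registers between the warp-synchronous reduction steps.
volatile __local float *vdata = sdata;

vdata[local_id] += vdata[local_id + 16];
vdata[local_id] += vdata[local_id +  8];
vdata[local_id] += vdata[local_id +  4];
vdata[local_id] += vdata[local_id +  2];
vdata[local_id] += vdata[local_id +  1];
```

The single mem_fence(CLK_LOCAL_MEM_FENCE) before the reduction, reported above as faster, apparently works on these compilers, but volatile is the option that directly expresses the requirement that every read and write of sdata actually reaches local memory.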

Thanks for giving me the clue!