Matrix-vector reduction tree bug


I am trying to implement a matrix-vector kernel with a reduction tree, as in the code snippet below. The implementation is based on the spmv_csr_vector_kernel described in this paper.

But it does NOT work unless I add a __syncthreads() after each if condition of the reduction tree.
If I remove those __syncthreads() calls, the CPU and GPU results don't match and the test reports failure.

Based on my understanding these __syncthreads() calls should not be necessary, so I would appreciate help figuring out whether this implementation has a bug.

Thank you,

#include <stdio.h>
#include <stdlib.h>

#define THREADS_PER_ROW 32
#define BLOCK_SIZE 256
#define NUM_ROWS 256
#define NUM_COLS 256

#define cudaErrorCheck(err) \
  do { \
    if((err) != cudaSuccess) { \
      printf("##### error in cuda call at %s: %d. %s: %s #####\n", __FILE__, __LINE__, cudaGetErrorName(err), cudaGetErrorString(err)); \
      exit(1); \
    } \
  } while(0)

__global__ void matrix_vector_kernel(const int num_rows, int num_cols, const double *data, const double *x, double *y)
{
    __shared__ double vals[BLOCK_SIZE];

    int thread_id = blockDim.x * blockIdx.x + threadIdx.x; // global thread index
    int warp_id = thread_id / THREADS_PER_ROW;             // global warp index
    int lane = thread_id & (THREADS_PER_ROW-1);            // thread index within the warp
    int row = warp_id;

    if(row < num_rows){
        vals[threadIdx.x] = 0;
        for(int jj = lane; jj < num_cols; jj += THREADS_PER_ROW)
            vals[threadIdx.x] += data[(row*num_cols)+jj] * x[jj];

        // parallel reduction in shared memory
        if (lane < 16) vals[threadIdx.x] += vals[threadIdx.x + 16];
        if (lane <  8) vals[threadIdx.x] += vals[threadIdx.x +  8];
        if (lane <  4) vals[threadIdx.x] += vals[threadIdx.x +  4];
        if (lane <  2) vals[threadIdx.x] += vals[threadIdx.x +  2];
        if (lane <  1) vals[threadIdx.x] += vals[threadIdx.x +  1];
        // first thread of each warp writes out the result
        if(lane == 0)
            y[row] = vals[threadIdx.x];
    }
}

int main(){

    // host matrix, input vector and result vector
    static double M[NUM_ROWS*NUM_COLS], N[NUM_COLS], P[NUM_ROWS];

    for(int i=0; i<NUM_ROWS; i++){
        for(int j=0; j<NUM_COLS; j++)
            M[i*NUM_COLS + j] = i*NUM_COLS + j;
        P[i] = 0;
    }
    for(int j=0; j<NUM_COLS; j++)
        N[j] = j;

    //GPU calc
    double *m, *n, *p;

    cudaErrorCheck(cudaMalloc((void **)&m, NUM_ROWS*NUM_COLS*sizeof(double)));
    cudaErrorCheck(cudaMalloc((void **)&n, NUM_COLS*sizeof(double)));
    cudaErrorCheck(cudaMalloc((void **)&p, NUM_ROWS*sizeof(double)));

    cudaErrorCheck(cudaMemcpy(m, M, NUM_ROWS*NUM_COLS*sizeof(double), cudaMemcpyHostToDevice));
    cudaErrorCheck(cudaMemcpy(n, N, NUM_COLS*sizeof(double), cudaMemcpyHostToDevice));
    cudaErrorCheck(cudaMemcpy(p, P, NUM_ROWS*sizeof(double), cudaMemcpyHostToDevice));

    int rows_per_block = BLOCK_SIZE/THREADS_PER_ROW;
    int num_of_blocks = (NUM_ROWS + rows_per_block - 1) / rows_per_block;
    matrix_vector_kernel<<<num_of_blocks, BLOCK_SIZE>>>(NUM_ROWS, NUM_COLS, m, n, p);

    cudaErrorCheck(cudaMemcpy(P, p, NUM_ROWS*sizeof(double), cudaMemcpyDeviceToHost));

    //cpu check
    for(int i=0; i<NUM_ROWS; i++){
        double temp = 0;
        for(int j=0; j<NUM_COLS; j++)
            temp += M[i*NUM_COLS + j] * N[j];
        if(temp != P[i]){
            printf("error for element %d. CPU: %f, GPU: %f\n", i, temp, P[i]);
            return 0;
        }
    }

    printf("Test PASSED\n");
    return 0;
}

__syncthreads() or something like it is required. The code you are working with was apparently written in 2008. A lot of evolution has occurred in CUDA since then.

First of all, if you are on a Volta or Turing processor, there is no real guarantee that threads in a warp execute in lockstep, although I don't think that is the issue here. However, for correctness, the code requires synchronization (__syncwarp() might also be sufficient in this particular case).

My guess is that the issue is one of compiler optimization. __syncthreads() is both an execution barrier and a memory barrier: it forces shared memory to be updated, preventing the compiler from optimizing a location into a register. Early CUDA compilers generally did not do this; later ones do, and both behaviors are legitimate from the compiler's perspective. You, as the programmer, must ensure load/store ordering of data in a shared environment. A non-recommended alternative here would be to eliminate the __syncthreads() but mark the shared memory with the volatile keyword.
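For reference, a minimal sketch of what the reduction body could look like with __syncwarp() (CUDA 9 and later) instead of full __syncthreads(). This assumes THREADS_PER_ROW == 32, so each row is handled by exactly one warp and the whole warp takes the same branch on row < num_rows, meaning every active thread reaches each __syncwarp():

```
// Parallel reduction in shared memory with explicit warp-level sync.
// Each __syncwarp() is both an execution and memory barrier for the warp,
// so the stores from the previous step are visible before the next step reads.
if (lane < 16) vals[threadIdx.x] += vals[threadIdx.x + 16];
__syncwarp();
if (lane <  8) vals[threadIdx.x] += vals[threadIdx.x +  8];
__syncwarp();
if (lane <  4) vals[threadIdx.x] += vals[threadIdx.x +  4];
__syncwarp();
if (lane <  2) vals[threadIdx.x] += vals[threadIdx.x +  2];
__syncwarp();
if (lane <  1) vals[threadIdx.x] += vals[threadIdx.x +  1];
__syncwarp();

// The volatile workaround mentioned above would instead read/write through
// a volatile pointer, e.g. `volatile double *v = vals;`, so the compiler
// cannot cache vals[] entries in registers. It does not, however, guarantee
// warp convergence on Volta/Turing, which is why it is not recommended.
```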

Got it! Thanks for the workaround and the detailed explanation.