Detailed Documentation for Mapping Threads in a Block to Warps (CUDA 10.2.89, Compute Capability 7.0)

Hello, I am writing a CUDA kernel in which I specify a 3D block size (16x16x4). My kernel relies on a parallel reduction that I am trying to optimize. I am using CUDA version 10.2.89 and building for a Tesla V100, which has compute capability 7.0 (and I am building with the appropriate NVCC flags to target it). Below is the relevant part of my CUDA code:

  __global__ void cuSpMV_2(unsigned int *__restrict__ numCoarsestRows_gpu, unsigned int *__restrict__ mapCoarseToFinerRows_gpu,
                        unsigned int *__restrict__ r_vec_gpu, unsigned int *__restrict__ c_vec_gpu,
                        float *__restrict__ val_gpu, float *__restrict__ x_test_gpu, float *__restrict__ y_gpu) {
        int block = blockIdx.x;
        int bstride = gridDim.x;
        int othread = threadIdx.x * blockDim.y + threadIdx.y;
        int otstride = blockDim.x * blockDim.y;
        int ithread = threadIdx.z;
        int itstride = blockDim.z;
        // NB: threadIdx.z varies fastest in this linearization
        int tid = threadIdx.x * blockDim.y * blockDim.z +
                  threadIdx.y * blockDim.z +
                  threadIdx.z;

                        /* inside several nested loops with varying parallelism;
                           temp, rowIndex, nnz_index_start, and nnz_index_end are
                           defined in the elided code, and values is a __shared__
                           float array */

                        for(int nnz_index = nnz_index_start + ithread; nnz_index < nnz_index_end; nnz_index += itstride) {
                                temp = __fmaf_rn(val_gpu[nnz_index], x_test_gpu[ c_vec_gpu[nnz_index] ], temp);
                        }

                        values[tid] = temp;

                        __syncthreads();

                        #if BLOCKDIMZ >= 32
                        if (ithread < 16) values[tid] += values[tid + 16];
                        __syncthreads();
                        #endif
                        #if BLOCKDIMZ >= 16
                        if (ithread < 8) values[tid] += values[tid + 8];
                        __syncthreads();
                        #endif
                        #if BLOCKDIMZ >= 8
                        if (ithread < 4) values[tid] += values[tid + 4];
                        __syncthreads();
                        #endif
                        #if BLOCKDIMZ >= 4
                        if (ithread < 2) values[tid] += values[tid + 2];
                        __syncthreads();
                        #endif
                        #if BLOCKDIMZ >= 2
                        if (ithread < 1) values[tid] += values[tid + 1];
                        __syncthreads();
                        #endif

                        if (ithread == 0) y_gpu[rowIndex] = values[tid];

                        /* ... */

This runs correctly but is slow, both because ithread has to be checked at each step of the reduction and because the whole block has to be synchronized after each step. I have to do this to ensure correctness, because the threads that execute that innermost loop in parallel are not in the same warp.

Is there a way to check how the threads of a block are mapped to warps, so that I can rewrite the relevant parts of this code and eliminate the synchronizations that are slowing it down? I found some old questions about this from 2008-2009, but given that this post is being written in 2021, I am concerned that things may not work quite the same way 12+ years later.

Thank you!

Have you started looking into Cooperative Groups? They offer synchronization primitives at different levels of granularity, including tiles the size of a single warp.
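
Something along these lines could replace the shared-memory tree once the reduction partners share a warp (an untested sketch; tile_reduce_sum is just a name I made up):

        #include <cooperative_groups.h>
        namespace cg = cooperative_groups;

        // Sum one float per thread across a 32-thread tile. The tile's
        // shfl_down carries its own warp-level synchronization, so no
        // __syncthreads() is needed between steps.
        __device__ float tile_reduce_sum(cg::thread_block_tile<32> tile, float val) {
                for (int offset = tile.size() / 2; offset > 0; offset /= 2)
                        val += tile.shfl_down(val, offset);
                return val;   // thread rank 0 of the tile holds the total
        }

        // usage inside the kernel body:
        //     cg::thread_block_tile<32> tile =
        //             cg::tiled_partition<32>(cg::this_thread_block());
        //     float sum = tile_reduce_sum(tile, temp);
        //     if (tile.thread_rank() == 0) y_gpu[rowIndex] = sum;

This only works if the 32 threads of the tile are contiguous within a warp, which brings up your indexing.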

As far as the mapping of threads to warps goes, I believe this has not changed between CUDA releases.

Consecutive threads along the threadIdx.x axis will be located consecutively within warps.
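
Concretely, the block is linearized with threadIdx.x varying fastest, then threadIdx.y, then threadIdx.z, and warps are carved out of consecutive linear indices:

        // Per the CUDA Programming Guide: x varies fastest, then y, then z.
        int linear = threadIdx.x
                   + threadIdx.y * blockDim.x
                   + threadIdx.z * blockDim.x * blockDim.y;
        int warpId = linear / warpSize;   // which warp within the block
        int lane   = linear % warpSize;   // position within that warp

With your 16x16x4 block, two threads that differ only in threadIdx.z are 16 * 16 = 256 linear positions apart, i.e. 8 warps apart, which is exactly why your current layout forces the block-wide __syncthreads().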

I am actually surprised by your expression

        int tid = threadIdx.x * blockDim.y * blockDim.z +
                  threadIdx.y * blockDim.z +
                  threadIdx.z;

because it is the opposite of what I would expect.
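
That is, I would have expected threadIdx.x to be the fastest-varying term:

        int tid = threadIdx.z * blockDim.y * blockDim.x +
                  threadIdx.y * blockDim.x +
                  threadIdx.x;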

Yeah, I should have had the inner loop run along the threadIdx.x axis, but before this revision the code only ran in 2 dimensions (dim3(16,16)) with the inner loop running sequentially, so when I modified it to parallelize that inner loop it was most natural to add it along the threadIdx.z axis. I'll look at Cooperative Groups and report back on whether it worked. Thank you very much for your help!

Yes, this worked. Aligning the reduction along the threadIdx.x axis allowed me to omit those synchronization directives and retain program correctness. Unfortunately it performs worse than running the inner loop serially, which makes sense: I'm doing sparse matrix-vector multiplication on very sparse matrices, where there may be only 2-3 nonzeros per row, so that inner loop may only run for 2-3 iterations. I'll keep running that inner loop serially for now.
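
(For anyone who finds this thread later: a minimal sketch of that per-warp pattern, assuming blockDim.x == 32 so that one warp covers the inner loop, and reusing rowIndex, nnz_index_start, and nnz_index_end from the elided outer loops:)

        float temp = 0.0f;
        for (int nnz_index = nnz_index_start + threadIdx.x;
             nnz_index < nnz_index_end;
             nnz_index += blockDim.x) {
                temp = __fmaf_rn(val_gpu[nnz_index], x_test_gpu[ c_vec_gpu[nnz_index] ], temp);
        }
        // Every lane reaches this loop, so the full mask is safe;
        // __shfl_down_sync supplies the warp-level synchronization.
        for (int offset = 16; offset > 0; offset >>= 1)
                temp += __shfl_down_sync(0xffffffffu, temp, offset);
        if (threadIdx.x == 0) y_gpu[rowIndex] = temp;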