Hello, I am writing a CUDA kernel in which I specify a 3D block size (16x16x4). My kernel relies on a parallel reduction that I am trying to optimize. I am using CUDA version 10.2.89 and building for a Tesla V100, which has compute capability 7.0 (and I am building with the appropriate NVCC flags to target it). Attached is the relevant part of my CUDA code:

```
__global__ void cuSpMV_2(unsigned int *__restrict__ numCoarsestRows_gpu,
                         unsigned int *__restrict__ mapCoarseToFinerRows_gpu,
                         unsigned int *__restrict__ r_vec_gpu,
                         unsigned int *__restrict__ c_vec_gpu,
                         float *__restrict__ val_gpu,
                         float *__restrict__ x_test_gpu,
                         float *__restrict__ y_gpu) {
    int block    = blockIdx.x;                              // grid-stride pair over blocks
    int bstride  = gridDim.x;
    int othread  = threadIdx.x * blockDim.y + threadIdx.y;  // "outer" index over the x/y plane
    int otstride = blockDim.x * blockDim.y;
    int ithread  = threadIdx.z;                             // "inner" index over z
    int itstride = blockDim.z;
    // Flattened index into the reduction scratch array `values`
    // (z varies fastest in this layout).
    int tid = threadIdx.x * blockDim.y * blockDim.z +
              threadIdx.y * blockDim.z +
              threadIdx.z;
    /* inside several nested loops with varying parallelism; `temp`, `rowIndex`,
       and the __shared__ float array `values` are set up in the omitted code */
    for (int nnz_index = nnz_index_start + ithread; nnz_index < nnz_index_end; nnz_index += itstride) {
        temp = __fmaf_rn(val_gpu[nnz_index], x_test_gpu[c_vec_gpu[nnz_index]], temp);
    }
    values[tid] = temp;
    __syncthreads();
#if BLOCKDIMZ >= 32
    if (ithread < 16) values[tid] += values[tid + 16];
    __syncthreads();
#endif
#if BLOCKDIMZ >= 16
    if (ithread < 8) values[tid] += values[tid + 8];
    __syncthreads();
#endif
#if BLOCKDIMZ >= 8
    if (ithread < 4) values[tid] += values[tid + 4];
    __syncthreads();
#endif
#if BLOCKDIMZ >= 4
    if (ithread < 2) values[tid] += values[tid + 2];
    __syncthreads();
#endif
#if BLOCKDIMZ >= 2
    if (ithread < 1) values[tid] += values[tid + 1];
    __syncthreads();
#endif
    if (ithread == 0) y_gpu[rowIndex] = values[tid];
/* ... */
```

This runs correctly, but it is slow because I have to check `ithread` at each step of the reduction and synchronize the whole block after each step. The synchronization is needed for correctness because the threads that execute that innermost loop in parallel are not in the same warp.

Is there a way I can check how the threads of a block are mapped to warps, so that I can rewrite the relevant parts of this code to eliminate the synchronizations that are slowing it down? I found some old questions about this from 2008–2009, but since this post is being written in 2021, I am concerned that the mapping no longer works quite the same way 12+ years later.

Thank you!