Relationship between Thread Block dimension and warps

I’ve been studying CUDA recently and have a question about the relationship between Thread Block Dimension and warp performance.

I wrote a kernel for naive matrix multiplication (without using shared memory) and executed it with varying thread block dimensions totaling 512 threads: (512, 1, 1), (256, 2, 1), (128, 4, 1), …, (1, 512, 1). Surprisingly, I found that the dimension (32, 16, 1) yielded the best performance, even though every configuration showed 100% warp occupancy.

Now, my question is: why did performance change with different thread block dimensions? I initially thought that thread indexing was abstract and internally flattened to 1D, similar to how memory is addressed, so changing the block dimensions shouldn't affect performance. Is it related to global memory access patterns, or perhaps to caching?

Quite possibly. Since you seem to already be using the profiler (Nsight Compute), the profiler can tell you about memory access efficiency. The general idea is that adjacent threads in a warp are most efficient at global memory access if they are accessing adjacent locations in memory (and for naive matrix multiplication, this should be readily possible when reading values from the B matrix and writing values to the C matrix, i.e. for 2 of the 3 global access patterns). You should be able to study your indexing calculation, and then ask "what will thread 0's index be?", "what will thread 1's index be?", and so on, in order to statically analyze this.

Ultimately, the answer will depend on how you did your indexing calculations in the kernel. If you use canonical 2D index creation for row and column, then my expectation is that the performance difference between e.g. the (512,1,1) case and the (32,16,1) case should be relatively small, whereas performance will get substantially worse as the first (x) block dimension drops below 32, down to (1,512,1), which I think would be the worst.
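For illustration (a sketch only, since I haven't seen your kernel), this is the canonical 2D index creation I have in mind, with the per-lane static analysis written out in comments for a (32,16,1) block:

__global__ void mm_sketch(const float * __restrict__ A, const float * __restrict__ B, float * __restrict__ C, const int m, const int n, const int k)
{
    // canonical 2D index creation: x maps to column, y maps to row
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    if ((row < m) && (col < n)){
      float sum = 0;
      for (int e = 0; e < k; ++e)
        // static analysis, warp 0 of block (0,0), block shape (32,16,1):
        //   lane 0 -> row 0, col 0; lane 1 -> row 0, col 1; ... lane 31 -> row 0, col 31
        //   A[row * k + e] : identical address in all 32 lanes (a broadcast - efficient)
        //   B[e * n + col] : 32 adjacent floats (fully coalesced)
        sum += A[row * k + e] * B[e * n + col];
      C[row * n + col] = sum;  // 32 adjacent floats (fully coalesced)
    }
}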

It should certainly be possible (if that were your goal) to write a kernel whose performance is invariant across block dimensions, given that there will always be 512 threads per block.
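For example, here is a sketch of one way to do that (untested, just to illustrate the idea): derive row and column from a flattened global thread ID, so that the address pattern of each warp depends only on that flat ID and not on how the 512 threads per block are shaped:

// sketch: block-shape-invariant indexing. (512,1,1) and (1,512,1) produce the same
// warp address patterns, provided the grid covers all m*n output elements.
template <typename T>
__global__ void mm_flat(const T * __restrict__ A, const T * __restrict__ B, T * __restrict__ C, const int m, const int n, const int k)
{
    const int tpb = blockDim.x * blockDim.y * blockDim.z;  // threads per block (512 here)
    const int tid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    const int bid = (blockIdx.z * gridDim.y + blockIdx.y) * gridDim.x + blockIdx.x;
    const int gid = bid * tpb + tid;            // flattened global thread ID
    if (gid < m * n){
      const int row = gid / n;
      const int col = gid % n;                  // consecutive lanes -> consecutive columns
      T Cvalue = 0;
      for (int e = 0; e < k; ++e)
        Cvalue += A[row * k + e] * B[e * n + col];
      C[row * n + col] = Cvalue;}
}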

Thank you, Robert.
I used the traditional 2D index generation as below:
tx = blockIdx.x * blockDim.x + threadIdx.x
ty = blockIdx.y * blockDim.y + threadIdx.y

And matrix multiplication is done like this:
for (int k = 0; k < 2048; k++)
  C[ty * 4096 + tx] += A[ty * 2048 + k] * B[k * 4096 + tx];

Whether I set the thread block to (512, 1, 1) or (32, 16, 1), I expected the composition of each warp to be the same, since warps are formed in row-major order (x varies fastest). Also, each row of the block is a multiple of 32 threads (the warp size) wide, so I thought that adjacent threads within a warp would have adjacent indices, and therefore that the threads of a warp would access adjacent memory locations. Am I oversimplifying things? Is there something else to consider in how warps are formed?
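For reference, this is the warp formation I was assuming (a sketch of the linearization, per my understanding of the programming model):

// linear thread ID within the block; consecutive IDs are grouped 32 at a time into warps
__device__ int lane_id()
{
    int tid_in_block = threadIdx.z * blockDim.y * blockDim.x
                     + threadIdx.y * blockDim.x
                     + threadIdx.x;
    return tid_in_block % 32;   // position within the warp
}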

According to my testing, there is not much performance difference in these cases:

(256,2), (128,4), (64, 8), (32,16)

and that is as I would expect. Let’s call these cases 1.0x performance.

According to my testing, these cases get progressively substantially worse:

(16,32), (8,64), (4,128), (2,256), (1,512)

and that is as I would expect - eventually getting to about 7x worse than the first set of cases. This is due to the arrangement of addresses in each warp: once the x block dimension drops below 32, the 32 threads of a warp span multiple rows, so the loads from A fan out from one broadcast address to several widely separated addresses, and the stores to C are no longer a single contiguous segment.
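To make that concrete, here is a small host-side sketch (my own helper, not part of the test case below) that prints the element offsets into C touched by the 32 lanes of the first warp, assuming warps are formed from the x-fastest linearization of the block:

#include <cstdio>

// prints C offsets (row * n + col) for lanes 0..31 of warp 0, block (0,0)
void warp_addresses(int bdimx, int bdimy, int n)
{
    printf("(%d,%d): ", bdimx, bdimy);
    for (int lane = 0; lane < 32; ++lane) {
        int tx = lane % bdimx;       // threadIdx.x of this lane
        int ty = lane / bdimx;       // threadIdx.y of this lane
        printf("%d ", ty * n + tx);  // element offset into C for row=ty, col=tx
    }
    printf("\n");
}

int main()
{
    const int n = 2048;
    warp_addresses(32,  16, n);  // 0 1 2 ... 31          -> one contiguous segment
    warp_addresses( 4, 128, n);  // 0 1 2 3 2048 2049 ... -> 8 chunks of 4, in 8 rows
    warp_addresses( 1, 512, n);  // 0 2048 4096 ...       -> 32 scattered elements
}

The load of A[row * k + e] degrades the same way: one broadcast address per warp becomes 32/blockDim.x widely separated addresses.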

I agree that the (512,1) case is noticeably worse (about 1.5x) than the first set of cases I listed above, and I don't have an immediate explanation. I suspect a caching issue, but I can't explain that in any detail either.

Here is my test case, CUDA 12.2, L4 GPU:

# cat t180.cu
#include <iostream>

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

template <typename T>
__global__ void mm(const T * __restrict__ A, const T * __restrict__ B, T * __restrict__ C, const int m, const int n, const int k)
{
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    if ((row < m) && (col < n)){
      T Cvalue = 0;
      for (int e = 0; e < k; ++e)
        Cvalue += A[row * k + e]
                * B[e * n + col];
      C[row * n + col] = Cvalue;}
}


int main(){
  using mt = float;
  const int m = 2048;
  const int n = 2048;
  const int k = 2048;
  const int bdimlimit = 512;
  const int iter = 10;
  mt *A, *B, *C;
  cudaMalloc(&A, sizeof(*A)*m*k);
  cudaMalloc(&B, sizeof(*B)*k*n);
  cudaMalloc(&C, sizeof(*C)*m*n);
  cudaMemset(A, 0, sizeof(*A)*m*k);
  cudaMemset(B, 0, sizeof(*B)*k*n);
  cudaMemset(C, 0, sizeof(*C)*m*n);
  for (int i = 0; i < iter; i++)
    mm<<<dim3(n/32,m/32,1), dim3(32,32,1)>>>(A, B, C, m, n, k); // warm-up
  cudaDeviceSynchronize();
  for (int h = 1; h < 2*bdimlimit; h*=2){
    int w = bdimlimit/h;
    dim3 block = dim3(w,h,1);
    dim3 grid  = dim3(n/w, m/h, 1);
    unsigned long long dt = dtime_usec(0);
    for (int i = 0; i < iter; i++)
      mm<<<grid, block>>>(A, B, C, m, n, k);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    std::cout << "(" << w << "," << h << "): " << dt/iter << "us"  << std::endl;
  }
}


# nvcc -o t180 t180.cu -arch=sm_89
# ./t180
(512,1): 16393us
(256,2): 11058us
(128,4): 11062us
(64,8): 10521us
(32,16): 10323us
(16,32): 10292us
(8,64): 13919us
(4,128): 21805us
(2,256): 39455us
(1,512): 77805us
#

To go further in the perf analysis would, for me anyway, probably require spending some time with the profiler, to see what Nsight Compute thinks the significant differences are between the (512,1) and (256,2) cases. I haven't done that yet.
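For anyone who wants to try, the starting point would be something like this (section names are from memory; ncu --list-sections will show what a given version provides):

# ncu --section SpeedOfLight --section MemoryWorkloadAnalysis ./t180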

Thank you for the detailed explanation!