Relationship between Thread Block dimension and warps

I’ve been studying CUDA recently and have a question about the relationship between Thread Block Dimension and warp performance.

I wrote a kernel for naive matrix multiplication (without using shared memory) and executed it with varying thread block dimensions totaling 512 threads: (512, 1, 1), (256, 2, 1), (128, 4, 1), …, (1, 512, 1). Surprisingly, I found that the dimension (32, 16, 1) yielded the best performance, even though every configuration showed 100% warp occupancy.

Now, my question is: why did performance change with different thread block dimensions? I initially thought that thread indexing was abstract and internally flattened to 1D, similar to how memory is addressed, so changing the block dimensions shouldn't affect performance. Is it related to global memory access patterns, or perhaps to caching?

Quite possibly. Since you seem to already be using the profiler (Nsight Compute), the profiler can tell you about memory access efficiency. The general idea is that adjacent threads in a warp are most efficient at global memory access if they are accessing adjacent locations in memory (and for naive matrix multiplication, this should be readily possible when reading values from the B matrix and writing values to the C matrix, i.e. for 2 of the 3 global access patterns). You should be able to study your indexing calculation, and then ask "what will thread 0's index be?", "what will thread 1's index be?", and so on, in order to statically analyze this.

Ultimately, the answer will depend on how you did your indexing calculations in the kernel. If you use canonical 2D index creation for row and column, then my expectation is that the performance difference between e.g. the (512,1,1) case and the (32,16,1) case should be relatively small, whereas performance will get substantially worse as the first (x) block dimension drops below 32, down to (1,512,1), which I think would be the worst.
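For illustration (a sketch only, since I haven't seen your kernel), this is the canonical 2D index creation I have in mind, with the per-lane static analysis written out in comments for a (32,16,1) block:

__global__ void mm_sketch(const float * __restrict__ A, const float * __restrict__ B, float * __restrict__ C, const int m, const int n, const int k)
{
    // canonical 2D index creation: x maps to column, y maps to row
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    if ((row < m) && (col < n)){
      float sum = 0;
      for (int e = 0; e < k; ++e)
        // static analysis, warp 0 of block (0,0), block shape (32,16,1):
        //   lane 0 -> row 0, col 0; lane 1 -> row 0, col 1; ... lane 31 -> row 0, col 31
        //   A[row * k + e] : identical address in all 32 lanes (a broadcast - efficient)
        //   B[e * n + col] : 32 adjacent floats (fully coalesced)
        sum += A[row * k + e] * B[e * n + col];
      C[row * n + col] = sum;  // 32 adjacent floats (fully coalesced)
    }
}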

It should certainly be possible (if that were your goal) to write a kernel whose performance is invariant across block dimensions, given that there will always be 512 threads per block.
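For example, here is a sketch of one way to do that (untested, just to illustrate the idea): derive row and column from a flattened global thread ID, so that the address pattern of each warp depends only on that flat ID and not on how the 512 threads per block are shaped:

// sketch: block-shape-invariant indexing. (512,1,1) and (1,512,1) produce the same
// warp address patterns, provided the grid covers all m*n output elements.
template <typename T>
__global__ void mm_flat(const T * __restrict__ A, const T * __restrict__ B, T * __restrict__ C, const int m, const int n, const int k)
{
    const int tpb = blockDim.x * blockDim.y * blockDim.z;  // threads per block (512 here)
    const int tid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    const int bid = (blockIdx.z * gridDim.y + blockIdx.y) * gridDim.x + blockIdx.x;
    const int gid = bid * tpb + tid;            // flattened global thread ID
    if (gid < m * n){
      const int row = gid / n;
      const int col = gid % n;                  // consecutive lanes -> consecutive columns
      T Cvalue = 0;
      for (int e = 0; e < k; ++e)
        Cvalue += A[row * k + e] * B[e * n + col];
      C[row * n + col] = Cvalue;}
}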

Thank you, Robert.
I used the traditional 2D index generation as below:
tx = blockIdx.x * blockDim.x + threadIdx.x
ty = blockIdx.y * blockDim.y + threadIdx.y

And matrix multiplication is done like this:
for (int k = 0; k < 2048; k++)
  C[ty * 4096 + tx] += A[ty * 2048 + k] * B[k * 4096 + tx];

Whether I set the thread block to (512, 1, 1) or (32, 16, 1), I expected the composition of each warp to be the same, since warps are formed in row-major order (x varies fastest). Also, each row of the block is a multiple of 32 threads (the warp size) wide, so I thought that adjacent threads within a warp would have adjacent indices, and therefore that the threads of a warp would access adjacent memory locations. Am I oversimplifying things? Is there something else to consider in how warps are formed?
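For reference, this is the warp formation I was assuming (a sketch of the linearization, per my understanding of the programming model):

// linear thread ID within the block; consecutive IDs are grouped 32 at a time into warps
__device__ int lane_id()
{
    int tid_in_block = threadIdx.z * blockDim.y * blockDim.x
                     + threadIdx.y * blockDim.x
                     + threadIdx.x;
    return tid_in_block % 32;   // position within the warp
}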

According to my testing, there is not much performance difference in these cases:

(256,2), (128,4), (64, 8), (32,16)

and that is as I would expect. Let’s call these cases 1.0x performance.

According to my testing, these cases get progressively substantially worse:

(16,32), (8,64), (4,128), (2,256), (1,512)

and that is as I would expect - eventually getting to about 7x worse than the first set of cases. This is due to the arrangement of addresses in each warp: once the x block dimension drops below 32, the 32 threads of a warp span multiple rows, so the loads from A fan out from one broadcast address to several widely separated addresses, and the stores to C are no longer a single contiguous segment.
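To make that concrete, here is a small host-side sketch (my own helper, not part of the test case below) that prints the element offsets into C touched by the 32 lanes of the first warp, assuming warps are formed from the x-fastest linearization of the block:

#include <cstdio>

// prints C offsets (row * n + col) for lanes 0..31 of warp 0, block (0,0)
void warp_addresses(int bdimx, int bdimy, int n)
{
    printf("(%d,%d): ", bdimx, bdimy);
    for (int lane = 0; lane < 32; ++lane) {
        int tx = lane % bdimx;       // threadIdx.x of this lane
        int ty = lane / bdimx;       // threadIdx.y of this lane
        printf("%d ", ty * n + tx);  // element offset into C for row=ty, col=tx
    }
    printf("\n");
}

int main()
{
    const int n = 2048;
    warp_addresses(32,  16, n);  // 0 1 2 ... 31          -> one contiguous segment
    warp_addresses( 4, 128, n);  // 0 1 2 3 2048 2049 ... -> 8 chunks of 4, in 8 rows
    warp_addresses( 1, 512, n);  // 0 2048 4096 ...       -> 32 scattered elements
}

The load of A[row * k + e] degrades the same way: one broadcast address per warp becomes 32/blockDim.x widely separated addresses.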

I agree that the (512,1) case is noticeably worse (about 1.5x) than the first set of cases I listed above, and I don't have an immediate explanation. I suspect a caching issue, but I can't explain that in any detail either.

Here is my test case, CUDA 12.2, L4 GPU:

# cat t180.cu
#include <iostream>

#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

template <typename T>
__global__ void mm(const T * __restrict__ A, const T * __restrict__ B, T * __restrict__ C, const int m, const int n, const int k)
{
    const int row = blockIdx.y * blockDim.y + threadIdx.y;
    const int col = blockIdx.x * blockDim.x + threadIdx.x;
    if ((row < m) && (col < n)){
      T Cvalue = 0;
      for (int e = 0; e < k; ++e)
        Cvalue += A[row * k + e]
                * B[e * n + col];
      C[row * n + col] = Cvalue;}
}


int main(){
  using mt = float;
  const int m = 2048;
  const int n = 2048;
  const int k = 2048;
  const int bdimlimit = 512;
  const int iter = 10;
  mt *A, *B, *C;
  cudaMalloc(&A, sizeof(*A)*m*k);
  cudaMalloc(&B, sizeof(*B)*k*n);
  cudaMalloc(&C, sizeof(*C)*m*n);
  cudaMemset(A, 0, sizeof(*A)*m*k);
  cudaMemset(B, 0, sizeof(*B)*k*n);
  cudaMemset(C, 0, sizeof(*C)*m*n);
  for (int i = 0; i < iter; i++)
    mm<<<dim3(n/32,m/32,1), dim3(32,32,1)>>>(A, B, C, m, n, k); // warm-up
  cudaDeviceSynchronize();
  for (int h = 1; h < 2*bdimlimit; h*=2){
    int w = bdimlimit/h;
    dim3 block = dim3(w,h,1);
    dim3 grid  = dim3(n/w, m/h, 1);
    unsigned long long dt = dtime_usec(0);
    for (int i = 0; i < iter; i++)
      mm<<<grid, block>>>(A, B, C, m, n, k);
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    std::cout << "(" << w << "," << h << "): " << dt/iter << "us"  << std::endl;
  }
}


# nvcc -o t180 t180.cu -arch=sm_89
# ./t180
(512,1): 16393us
(256,2): 11058us
(128,4): 11062us
(64,8): 10521us
(32,16): 10323us
(16,32): 10292us
(8,64): 13919us
(4,128): 21805us
(2,256): 39455us
(1,512): 77805us
#

To go further in the perf analysis would, for me anyway, probably require spending some time with the profiler, to see what Nsight Compute thinks the significant differences are between the (512,1) and (256,2) cases. I haven't done that yet.
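For anyone who wants to try, the starting point would be something like this (section names are from memory; ncu --list-sections will show what a given version provides):

# ncu --section SpeedOfLight --section MemoryWorkloadAnalysis ./t180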

Thank you for the detailed explanation!