According to my testing, there is not much performance difference in these cases:

(256,2), (128,4), (64, 8), (32,16)

and that is as I would expect. Let’s call these cases 1.0x performance.

According to my testing, these cases get progressively substantially worse:

(16,32), (8,64), (4,128), (2,256), (1,512)

and that is as I would expect - eventually getting to about 7x worse than the first set of cases. This is due to the arrangement of addresses in each warp.

I agree that the (512,1) case is noticeably worse (1.5x) than the first set of cases I listed above, and I don’t have an immediate explanation. I suspect a caching issue, but I can’t explain that in any detail either.

Here is my test case, CUDA 12.2, L4 GPU:

```
# cat t180.cu
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start=0){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
template <typename T>
__global__ void mm(const T * __restrict__ A, const T * __restrict__ B, T * __restrict__ C, const int m, const int n, const int k)
{
T Cvalue = 0;
const int row = blockIdx.y * blockDim.y + threadIdx.y;
const int col = blockIdx.x * blockDim.x + threadIdx.x;
if ((row < m) && (col < n)){
T Cvalue = 0;
for (int e = 0; e < k; ++e)
Cvalue += A[row * k + e]
* B[e * n + col];
C[row * n + col] = Cvalue;}
}
int main(){
using mt = float;
const int m = 2048;
const int n = 2048;
const int k = 2048;
const int bdimlimit = 512;
const int iter = 10;
mt *A, *B, *C;
cudaMalloc(&A, sizeof(*A)*m*k);
cudaMalloc(&B, sizeof(*B)*k*n);
cudaMalloc(&C, sizeof(*C)*m*n);
cudaMemset(A, 0, sizeof(*A)*m*k);
cudaMemset(B, 0, sizeof(*B)*k*n);
cudaMemset(C, 0, sizeof(*C)*m*n);
for (int i = 0; i < iter; i++)
mm<<<dim3(n/32,m/32,1), dim3(32,32,1)>>>(A, B, C, m, n, k); // warm-up
cudaDeviceSynchronize();
for (int h = 1; h < 2*bdimlimit; h*=2){
int w = bdimlimit/h;
dim3 block = dim3(w,h,1);
dim3 grid = dim3(n/w, m/h, 1);
unsigned long long dt = dtime_usec(0);
for (int i = 0; i < iter; i++)
mm<<<grid, block>>>(A, B, C, m, n, k);
cudaDeviceSynchronize();
dt = dtime_usec(dt);
std::cout << "(" << w << "," << h << "): " << dt/iter << "us" << std::endl;
}
}
# nvcc -o t180 t180.cu -arch=sm_89
# ./t180
(512,1): 16393us
(256,2): 11058us
(128,4): 11062us
(64,8): 10521us
(32,16): 10323us
(16,32): 10292us
(8,64): 13919us
(4,128): 21805us
(2,256): 39455us
(1,512): 77805us
#
```

To go further in perf analysis for me anyway would require probably spending some time with the profiler, to see what nsight compute thinks are the significant differences in the (512,1) vs. (256,2) cases. I haven’t done that yet.