Naive Matrix Multiplication: Confused about Optimal Threadblock Size

Hey,

I’m learning CUDA and was implementing a naive matrix multiplication along the lines of:

__global__ void naiveMM(double *__restrict__ a,
                         double *__restrict__ b,
                         double *__restrict__ c,
                         int N)
{
    int    row = blockIdx.y * blockDim.y + threadIdx.y;
    int    col = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = 0.0;

    if (col < N && row < N) {
            for (int i = 0; i < N; i++) {
                sum += a[row * N + i] * b[i * N + col];
            }
            c[row * N + col] = sum;
     }
}

Let's say I run this with matrices of size N = 3200, once with grid(100,100,1), block(32,32,1) and once with grid(200,200,1), block(16,16,1).
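
For reference, the host-side launches for the two configurations would look roughly like this (a minimal sketch; d_a, d_b, d_c stand for device buffers allocated and filled elsewhere):

int N = 3200;   // 3200 is a multiple of both 32 and 16, so both grids cover the matrix exactly

// Variant 1: 32x32 threads per block, 100x100 blocks
dim3 block32(32, 32, 1);
dim3 grid32(N / 32, N / 32, 1);   // (100, 100, 1)
naiveMM<<<grid32, block32>>>(d_a, d_b, d_c, N);

// Variant 2: 16x16 threads per block, 200x200 blocks
dim3 block16(16, 16, 1);
dim3 grid16(N / 16, N / 16, 1);   // (200, 200, 1)
naiveMM<<<grid16, block16>>>(d_a, d_b, d_c, N);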

Both configurations launch the same total number of threads, in both cases the threads per block are divisible by 32 so they fit evenly into warps, and after dividing the threads per block by 32 both results are divisors of 64, so they can perfectly fill an SM that holds 64 warps. So if I understand the post “Question about threads per block and warps per SM” correctly, the two should be equivalent in terms of scheduling.

I get around 1.5 times the performance using the block(16,16,1) approach though.

When I looked at how the threads within a block are grouped into warps, I found this section (1. Introduction — CUDA C Programming Guide), which indicates that they are grouped first along the x dimension and then along y.

If I’m understanding this correctly, that means for the block(32,32,1) version all threads within a warp have the same row and sequential cols, so the row value can be broadcast and the cols can be accessed in a nicely coalesced way.
For block(16,16,1) each warp covers 2 rows and 16 columns, with the 2 row values being somewhat distant from each other in memory.
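
To make that concrete, the mapping from 2D thread coordinates to warps follows the linearized thread index; warpAndLane below is just an illustrative helper, not something the kernel needs:

__device__ void warpAndLane(int *warp, int *lane)
{
    // Threads are linearized x-first, then y (then z), and consecutive
    // groups of 32 along that linear index form a warp.
    int tid = threadIdx.z * blockDim.y * blockDim.x
            + threadIdx.y * blockDim.x
            + threadIdx.x;
    *warp = tid / 32;   // which warp within the block
    *lane = tid % 32;   // position within that warp
}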

Intuitively that suggests the block(32,32,1) version should perform better, but across multiple runs and matrix sizes this is not the case.

What effect am I missing here, and when investigating things like this, is there a profiler or profiler metric that could indicate what specifically is better about the block(16,16,1) version?

One possibility: occupancy (yes, Nsight Compute can report occupancy).

What GPU are you running these experiments on?

I see, Thanks!

I’m on an A100:

Device:                         NVIDIA A100 80GB PCIe
CUDA driver version:            12.40
GPU clock rate:                 1410 MHz
Memory clock rate:              756 MHz
Compute Capability:             8.0

With that I get for the 32x32 block version:

Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           32
    Block Limit Registers                 block            1
    Block Limit Shared Mem                block            8
    Block Limit Warps                     block            2
    Theoretical Active Warps per SM        warp           32
    Theoretical Occupancy                     %           50
    Achieved Occupancy                        %        49.82
    Achieved Active Warps Per SM           warp        31.88
    ------------------------------- ----------- ------------

For the 16x16 version:

Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           32
    Block Limit Registers                 block            6
    Block Limit Shared Mem                block           32
    Block Limit Warps                     block            8
    Theoretical Active Warps per SM        warp           48
    Theoretical Occupancy                     %           75
    Achieved Occupancy                        %        74.64
    Achieved Active Warps Per SM           warp        47.77
    ------------------------------- ----------- ------------

Am I reading this correctly: the issue for the 32x32 version is that each block puts too much pressure on the register file, so the registers can only support one 32x32 block at a time, while for the 16x16 version they can support 6 blocks at a time? Or what exactly does Block Limit Registers refer to?

What’s also still a bit confusing is that for an 8x8 block version I get:

Section: Occupancy
    ------------------------------- ----------- ------------
    Metric Name                     Metric Unit Metric Value
    ------------------------------- ----------- ------------
    Block Limit SM                        block           32
    Block Limit Registers                 block           24
    Block Limit Shared Mem                block           32
    Block Limit Warps                     block           32
    Theoretical Active Warps per SM        warp           48
    Theoretical Occupancy                     %           75
    Achieved Occupancy                        %        73.23
    Achieved Active Warps Per SM           warp        46.87
    ------------------------------- ----------- ------------

So roughly the same occupancy as for 16x16, but I consistently get around 15% longer runtime.

Yes. That is what it means. The registers per thread for the 32x32 threadblock kernel, combined with the SM limit of 64K registers, yield a situation where the SM can support 1 but not 2 threadblocks, i.e. the SM can support only 1024 threads.

The registers per thread for the 16x16 kernel, combined with the SM limit, indicate that a maximum of 6 of those threadblocks can be resident on the SM at once: 6 × 256 = 1536 threads. So the occupancy (theoretical or achieved) is basically 1.5x higher, and this leads to roughly 1.5x higher perf/throughput.
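
If you want to double-check this without the profiler, the occupancy API reports the per-SM block limit (the minimum over all the limiters in that table) directly. A minimal sketch, using the kernel from the first post:

#include <cstdio>
#include <cuda_runtime.h>

// Same kernel as in the first post.
__global__ void naiveMM(double *__restrict__ a, double *__restrict__ b,
                        double *__restrict__ c, int N)
{
    int    row = blockIdx.y * blockDim.y + threadIdx.y;
    int    col = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = 0.0;
    if (col < N && row < N) {
        for (int i = 0; i < N; i++)
            sum += a[row * N + i] * b[i * N + col];
        c[row * N + col] = sum;
    }
}

// Query how many blocks of the given size can be resident per SM and
// convert that into active warps per SM.
void reportOccupancy(int threadsPerBlock)
{
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, naiveMM,
                                                  threadsPerBlock, 0);
    printf("%4d threads/block -> %2d blocks/SM, %2d active warps/SM\n",
           threadsPerBlock, blocksPerSM, blocksPerSM * threadsPerBlock / 32);
}

int main()
{
    reportOccupancy(32 * 32);   // the (32,32,1) configuration
    reportOccupancy(16 * 16);   // the (16,16,1) configuration
    reportOccupancy(8 * 8);     // the (8,8,1) configuration
    return 0;
}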

I don’t know what that means. Do you mean launching a threadblock with 8x8 block size? If so, as you get to smaller block sizes, the coalescing issue you pointed out earlier will come into play. From the perspective of global memory load efficiency (coalescing), the threadblock configuration with 32 threads in the x dimension should be best, followed by 16, then 8. But this effect is evidently less pronounced than the occupancy issue.

You could try a 32x8 threadblock configuration (or perhaps 32x16). Might perform better than any of the ones you have tried so far.
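
For example, something along these lines (d_a, d_b, d_c again stand for the device buffers; the grid is rounded up so the bounds check in the kernel still covers sizes that aren't multiples of the block shape):

dim3 block(32, 8, 1);   // 256 threads: full 32-wide rows in x for coalescing, and small
                        // enough per block to avoid the 1-block-per-SM register limit
dim3 grid((N + block.x - 1) / block.x,
          (N + block.y - 1) / block.y,
          1);
naiveMM<<<grid, block>>>(d_a, d_b, d_c, N);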

I don’t know what that means. Do you mean launching a threadblock with 8x8 block size?

Yeah, that’s what I meant.

You could try a 32x8 threadblock configuration (or perhaps 32x16). Might perform better than any of the ones you have tried so far.

Alright, thank you so much!

It’s weird (to me): for the code you have shown, I’m surprised it’s using more than 32 registers per thread. When I compile it on CUDA 12.2 for sm_80, I get 32 registers per thread. For sm_89 I do get 40 registers per thread, so I guess on CUDA 12.4 that code must compile to something larger than 32 registers per thread.

At 32 registers per thread or less, you should be able to get two 32x32 blocks resident on an A100 SM.
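
If a newer toolchain does push it above 32 registers per thread, one way to pull it back down for this launch shape is a launch-bounds hint. A sketch based on the originally posted kernel (the second argument asks the compiler to target at least 2 resident blocks of up to 1024 threads per SM, which may cost some spilling):

__global__ void
__launch_bounds__(1024, 2)   // up to 1024 threads/block, aim for >= 2 blocks/SM
naiveMM(double *__restrict__ a,
        double *__restrict__ b,
        double *__restrict__ c,
        int N)
{
    int    row = blockIdx.y * blockDim.y + threadIdx.y;
    int    col = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = 0.0;

    if (col < N && row < N) {
        for (int i = 0; i < N; i++) {
            sum += a[row * N + i] * b[i * N + col];
        }
        c[row * N + col] = sum;
    }
}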

The code I posted was missing a repetition loop, whose count scales with the matrix size and is 45 for matrix_size = 2000:

__global__ void naiveMM(double *__restrict__ a,
                        double *__restrict__ b,
                        double *__restrict__ c,
                        int N,
                        int REP)
{
    int    row = blockIdx.y * blockDim.y + threadIdx.y;
    int    col = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = 0.0;

    // repetition loop for benchmarking; REP scales with the matrix size
    for (int r = 0; r < REP; ++r) {
        if (col < N && row < N) {
            for (int i = 0; i < N; i++) {
                sum += a[row * N + i] * b[i * N + col];
            }
            c[row * N + col] = sum;
        }
    }
}

So I’m assuming the additional loop counter r that the kernel has to keep around is the reason why I’m hitting the Block Limit Registers limit earlier than expected.

Interestingly, when I replace REP with a hardcoded constant, I get 100% occupancy for (32,32,1) blocks with r < 3, but with r < 4 I’m back to 50%. I’m guessing the loop gets completely unrolled for r < 3 and not for r < 4?

Perhaps. One would need to look at the SASS to be sure of what is going on. There are numerous questions about how to study SASS on various forums if you want to explore.
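
If you just want to test the unrolling hypothesis without reading SASS, you could also pin the unroll factor of the repetition loop and see whether the register count and occupancy change. A sketch using your REP kernel; the same pragma applies to the hardcoded-constant variants:

__global__ void naiveMM(double *__restrict__ a,
                        double *__restrict__ b,
                        double *__restrict__ c,
                        int N,
                        int REP)
{
    int    row = blockIdx.y * blockDim.y + threadIdx.y;
    int    col = blockIdx.x * blockDim.x + threadIdx.x;
    double sum = 0.0;

    #pragma unroll 1                      // keep the repetition loop rolled
    for (int r = 0; r < REP; ++r) {
        if (col < N && row < N) {
            for (int i = 0; i < N; i++) {
                sum += a[row * N + i] * b[i * N + col];
            }
            c[row * N + col] = sum;
        }
    }
}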

