The larger the block, the better?

I have a question regarding the execution of GEMM (General Matrix Multiply) operations on a GPU’s SM (Streaming Multiprocessor). Suppose an SM can run two blocks of a GEMM kernel. If I instead use a single block that is large enough, with the number of threads and other resources doubled compared to the original two blocks, would there be a performance difference? This approach might even save some shared memory used for reading matrix A. So, my question is: under circumstances where the computational resources are fully utilized, is a bigger block size always better?

When it comes to performance, any claim that some particular approach is “always better” is unlikely to hold true, especially when one considers multiple processor and tool-chain generations. Complex interactions in complex machinery rarely lend themselves to simple absolute pronouncements of any kind.

I would recommend benchmarking multiple design variants and drawing conclusions from that. Then revisit the issue as new generations of hardware and software become available.

For GEMM in particular, unless you have some secret sauce that NVIDIA does not know about, the practical approach is to stick with CUBLAS rather than roll your own.


More important than the number of blocks and the block size is the number of warps running at the same time on an SM. If the overall number of warps is the same with both approaches, you have to compare whether, with a single block, the warps have to wait for each other (synchronize); here two blocks would have an advantage. If that is not the case, there could be a caching advantage to using a single block, depending on the algorithm. GEMM kernels would probably use shared memory rather than relying on the L1 cache, though.
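One way to check this concretely is to ask the runtime how many blocks of each shape actually co-reside on an SM. A minimal sketch using the CUDA occupancy API; `gemm_kernel` and the shared-memory sizes are placeholders standing in for a real GEMM configuration:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder standing in for a real GEMM kernel.
__global__ void gemm_kernel(const float *A, const float *B, float *C) { /* ... */ }

int main() {
    for (int blockSize : {256, 512}) {
        // Hypothetical shared-memory budgets: 24 KB for the 256-thread
        // variant, 48 KB for the 512-thread variant.
        size_t smemPerBlock = (blockSize == 256) ? 24 * 1024 : 48 * 1024;
        int blocksPerSM = 0;
        // Asks the runtime how many blocks of this shape fit on one SM,
        // given the kernel's register usage and the shared-memory request.
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, gemm_kernel, blockSize, smemPerBlock);
        printf("blockSize=%d smem=%zu KB -> %d block(s)/SM, %d warps/SM\n",
               blockSize, smemPerBlock / 1024, blocksPerSM,
               blocksPerSM * blockSize / 32);
    }
    return 0;
}
```

If both variants report the same warps/SM, the remaining difference comes down to synchronization granularity and cache behavior, as described above.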

Haha, I am a student in a research field, so I cannot use the closed-source cuBLAS, and I have to find some creative scientific ideas~

For example, a normal GEMM would use 256 threads for a 128*128 tile, with each thread using 128 registers and the block using 24 KB of shared memory. Now suppose we have a 512-thread block: each thread still uses 128 registers, and the block now uses 48 KB of shared memory. We can use cooperative groups to synchronize the two halves separately. I guess this is beneficial? Or, more simply, use one whole group in a block of 512 threads and synchronize across the whole block.
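A minimal sketch of the "two groups inside one 512-thread block" idea, assuming CUDA 11+ where cooperative-groups tiles larger than a warp are supported. The buffer layout and the elided load/compute steps are placeholders, not a working GEMM:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void gemm_512(const float *A, const float *B, float *C) {
    // 48 KB total, split as 24 KB per half (hypothetical layout).
    __shared__ float smem[2][24 * 1024 / sizeof(float)];

    cg::thread_block block = cg::this_thread_block();
    // Partition the 512-thread block into two 256-thread tiles; each tile
    // can synchronize independently of the other half.
    cg::thread_block_tile<256> half = cg::tiled_partition<256>(block);
    int which = block.thread_rank() / 256;  // 0 or 1: which half we are in
    float *buf = smem[which];

    // ... each half stages its own tiles of A/B into buf ...
    half.sync();  // barrier over 256 threads only, not the whole block
    // ... each half computes on its own buf ...

    // If the two halves ever share data (e.g. a common tile of A),
    // a full-block barrier is still required:
    block.sync();
}
```

Whether the per-half barriers actually win over a single `__syncthreads()` over 512 threads depends on how unevenly the two halves progress, which is worth benchmarking rather than assuming.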

I am curious what field of research requires that no closed-source libraries be used? Are you also mandated to abstain from other closed-source software like Mathematica or MATLAB? Does the open-source requirement extend to hardware? I am guessing not, as GPUs would be off the table otherwise.

Just like these:

Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion

Bolt: Bridging the Gap between Auto-tuners and Hardware-native Performance

At least I will use CUTLASS. Sadly, it is very hard to understand…

It is the HPC field; we are writing kernels and doing software-hardware co-design.

Chimera and Bolt don’t sound familiar. Are those modern cousins of ATLAS?


Seems yes!
Actually my goal is not to compete with cuBLAS, but to make better use of compute resources and cache resources; maybe kernel fusion, maybe writing kernels for other platforms, maybe designing new architectures, maybe using GEMM to accelerate other operations like CONV. So an in-depth understanding of GEMM and the hardware is important for me~