I would like to explore a method for conducting quantitative analysis on whether two CUDA kernels can be executed simultaneously without compromising performance.
I have developed a 384x384 GEMM kernel for testing purposes. The kernel is launched as <<<(3,3,1), (16,16,1)>>>, i.e., 9 thread blocks in total with 256 threads per block. Each block allocates 21504 bytes of shared memory, and each thread uses 128 registers. The physical limits of the GPU are detailed in the picture below.
From the kernel’s configuration and the physical limits, we can derive the number of allocatable thread blocks per SM, shown in the picture below. Due to the registers-per-SM limit, only two thread blocks can be resident on a single SM.
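The occupancy arithmetic behind that conclusion can be sketched as follows. The per-SM hardware limits below are assumed placeholder values (typical of several recent architectures); substitute the actual limits of your device from the picture:

```python
# Hypothetical per-SM limits -- replace with your GPU's actual values.
REGS_PER_SM        = 65536   # 32-bit registers per SM (assumed)
SMEM_PER_SM        = 65536   # bytes of shared memory per SM (assumed)
MAX_THREADS_PER_SM = 1536    # resident threads per SM (assumed)

# Kernel configuration from the post.
threads_per_block = 16 * 16          # 256, from <<<(3,3,1), (16,16,1)>>>
regs_per_thread   = 128
smem_per_block    = 21504            # bytes

# Resident blocks allowed by each resource; the minimum wins.
blocks_by_regs    = REGS_PER_SM // (regs_per_thread * threads_per_block)  # 65536 // 32768 = 2
blocks_by_smem    = SMEM_PER_SM // smem_per_block                         # 65536 // 21504 = 3
blocks_by_threads = MAX_THREADS_PER_SM // threads_per_block               # 1536 // 256   = 6

blocks_per_sm = min(blocks_by_regs, blocks_by_smem, blocks_by_threads)
print(blocks_per_sm)  # 2 -> registers are the limiting resource
```

With these assumed limits, registers cap occupancy at two blocks per SM, matching the value above.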
The GPU has 16 streaming multiprocessors (SMs). With two blocks per SM, the kernel’s 9 thread blocks would be packed onto 5 of the 16 SMs, leaving 11 SMs idle. Consequently, in theory, two GEMM kernels could run concurrently without affecting each other’s performance.
In reality, running the two kernels in parallel increases latency by 42% to 72% compared to running a single GEMM kernel alone, as shown in the Nsight Systems screenshot below.
To rule out memory bandwidth as a factor: Nsight Compute reports a max bandwidth utilization of 30.95%, so bandwidth should not be the performance bottleneck.
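That reasoning can be made explicit with a quick back-of-the-envelope check (the 30.95% figure is from the Nsight Compute report; doubling it assumes the second kernel’s memory traffic is identical and simply additive):

```python
# "Max Bandwidth" utilization reported by Nsight Compute for one kernel.
single_kernel_bw_util = 0.3095

# If a second identical kernel runs concurrently and its traffic simply adds,
# the combined demand stays well below saturation:
combined = 2 * single_kernel_bw_util
print(f"combined utilization ~ {combined:.0%}")  # ~62%, below 100%
```

So even two concurrent copies would leave DRAM bandwidth headroom, pointing the investigation toward other shared resources.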
Why is there such a significant impact when the two kernels, running on different SMs, interact with each other? Could you provide some guidance on the best approach to quantitatively analyzing whether two CUDA kernels can run concurrently without sacrificing performance?
Attached are the Nsight Systems and Nsight Compute reports for your reference. I would greatly appreciate your response. Thank you!
Suppose your GEMM kernels are well written and issue a lot of FMA operations, and suppose the SM’s FMA throughput is the limiting factor. Then your theory would probably be disproven. You could say the same about throughput to practically any other shared resource, such as the LD/ST unit, shared memory, L2 cache, etc.
I personally wouldn’t undertake an exercise like that, or set about trying to answer that question, without full access to the code, full access to a profiler, and sufficient time to dedicate to it.
It strikes me as a very difficult problem to perform static analysis on CUDA C++ kernel code to estimate resource impacts under concurrency. I’m fairly certain Nsight Compute will give you clues as to performance limiters.
You seem to be imagining that after launching the first kernel, the GPU block scheduler/CWD will distribute two threadblocks to some SMs while other SMs are empty. Although I don’t know that the CWD behavior is specified, I’m quite certain it does not work that way. If you have 9 threadblocks, the most likely scenario for a fresh launch on an empty GPU is that they will occupy 9 SMs. Then when you launch another 9 threadblocks, they will distribute themselves in some fashion.

Therefore the idea that the first kernel launch occupies only 5 SMs and the second kernel launch occupies a separate set of 5 SMs (which would certainly help to support your theory that the kernels should be able to run concurrently with no performance impact on each other) is almost certainly not the case, in my opinion. Based on what you have described, I would suspect that some of the blocks of the 2nd kernel launch will find themselves deposited on SMs that are already running a threadblock.

Then you have to do a fairly complex resource analysis, perhaps instruction-by-instruction, to estimate impacts based on throughputs of various shared resources, including all the functional units in the SM touched by any instruction in the kernel, as well as all the pathways in the memory hierarchy, and probably other shared resources as well.
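The spread-first placement described above can be illustrated with a toy model. To be clear, the least-loaded-SM policy here is an assumption for illustration only; the real CWD/block-scheduler policy is not documented:

```python
# Toy model of spread-first block placement (illustrative assumption only).
# The per-SM capacity of 2 blocks comes from the register-limited occupancy.
NUM_SMS, BLOCKS_PER_SM = 16, 2
load = [0] * NUM_SMS  # resident blocks currently on each SM

def launch(n_blocks):
    """Place each block on the least-loaded SM that still has room."""
    placed = []
    for _ in range(n_blocks):
        candidates = [i for i in range(NUM_SMS) if load[i] < BLOCKS_PER_SM]
        sm = min(candidates, key=lambda i: (load[i], i))
        load[sm] += 1
        placed.append(sm)
    return placed

first  = launch(9)   # kernel 1: one block on each of SMs 0..8
second = launch(9)   # kernel 2: fills the 7 empty SMs first, then doubles up
overlap = sum(1 for sm in second if sm in set(first))
print(overlap)  # 2 -> two blocks of kernel 2 share an SM with kernel 1
```

Even under this idealized model, the second launch cannot avoid co-residing with the first on some SMs, so contention for intra-SM resources is expected.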
Yes, you are right: only one thread block is allocated to an SM in each scheduling round, and the next thread block is allocated to the SM with the most available resources. I was misled by the "Active Thread Blocks per Multiprocessor" value reported by Nsight Compute. By the way, in my demo there are 9 thread blocks and 16 SMs, so why is the "Active Thread Blocks per Multiprocessor" value 2?