I am writing a benchmarking program to automate the collection of timing data for individual kernels in a larger library / software suite I am building. Writing this additional test program has had the side benefit of pushing me to re-think the setup and come at things from another perspective, which has already uncovered a bug or two.
But, on to the question… I designed my kernels around “work units” such that each thread block takes a work unit, does it, then reaches for the next one in an asynchronous fashion. Over a large workload, the blocks should therefore all finish in about as short a time as possible. This approach has worked well in another program I worked on, and I am adding some polish to the system. One would hypothesize that, so long as the card is evenly filled by X work units, providing 2X work units should take twice as long to compute, plus some launch latency. The card is being bombarded with one workload after another, but overall (so far) the workloads are not driving up the temperature of the card, likely because of the length of the longer-running kernel and the fact that its work units are somewhat sparse in this test.
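For context, the work-unit acquisition in each kernel looks roughly like this simplified sketch (the counter name and the loop body are illustrative placeholders, not the actual code; the counter is zeroed on the host before each launch):

```
__device__ int gbl_work_counter;  // reset to 0 before each launch, e.g. via cudaMemcpyToSymbol()

__global__ void kernelWorkUnits(int n_work_units) {
  __shared__ int unit_idx;

  while (true) {
    // One thread per block grabs the index of the next unclaimed work unit.
    if (threadIdx.x == 0) {
      unit_idx = atomicAdd(&gbl_work_counter, 1);
    }
    __syncthreads();
    if (unit_idx >= n_work_units) {
      break;
    }
    // ... all threads of the block cooperate on work unit unit_idx ...
    __syncthreads();  // everyone is done with unit_idx before it is overwritten
  }
}
```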
When I do this experiment with each of two kernels, I find that the timings are indeed linear in the workload, but the trendlines are striking. I am running these experiments with a number of work units that is a direct multiple of the number of streaming multiprocessors, and thus of the number of blocks in the launch grid, and in fact each work unit calls for a replica of the same evaluations, so the load is very well balanced. With the faster kernel, the trendline suggests a launch latency of about 11 microseconds. With the slower kernel, the trendline suggests a launch latency of 0 or perhaps even -1 microseconds (so, effectively zero). The first result seems within reason, though perhaps a bit higher than the kernel launch latency I usually hear quoted, but the second result seems odd: I'd expect at least 5 microseconds, especially given that the slower kernel involves much more detailed code (though the reason it is slower is not that it does more math; it actually does much less math but involves a lot more atomicAdd() operations).
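The host side of the test amounts to something like the sketch below (the kernel body here is a trivial stand-in, and the SM count, repetition count, and workload multiples are only illustrative): time a batch of back-to-back launches at several workload multiples, fit time = slope * multiple + intercept, and read the intercept off as the apparent launch latency.

```
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void kWork(int n_units, float *out) {
  // Stand-in body: work proportional to n_units, spread over a fixed grid.
  float acc = 0.0f;
  for (int u = blockIdx.x; u < n_units; u += gridDim.x) {
    acc += sinf((float)(u + threadIdx.x));
  }
  out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Average wall time (microseconds) per launch for a workload of n_units.
static float timeLaunchUs(int n_units, int n_blocks, float *dev_out, int n_reps) {
  cudaEvent_t t0, t1;
  cudaEventCreate(&t0);
  cudaEventCreate(&t1);
  cudaEventRecord(t0);
  for (int i = 0; i < n_reps; i++) {
    kWork<<<n_blocks, 256>>>(n_units, dev_out);
  }
  cudaEventRecord(t1);
  cudaEventSynchronize(t1);
  float ms;
  cudaEventElapsedTime(&ms, t0, t1);
  cudaEventDestroy(t0);
  cudaEventDestroy(t1);
  return 1000.0f * ms / (float)n_reps;
}

int main() {
  const int n_smp = 84;            // SM count on an A40
  const int n_blocks = 5 * n_smp;  // five 256-thread blocks per SM
  float *dev_out;
  cudaMalloc(&dev_out, (size_t)n_blocks * 256 * sizeof(float));

  // Time workloads at direct multiples of the grid size, then fit t = slope*x + b.
  std::vector<int> loads = {1, 2, 4, 8, 16};
  double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
  for (int x : loads) {
    const float t = timeLaunchUs(x * n_blocks, n_blocks, dev_out, 100);
    sx += x;  sy += t;  sxx += (double)x * x;  sxy += (double)x * t;
  }
  const double n = (double)loads.size();
  const double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
  const double b = (sy - slope * sx) / n;
  printf("Cost per unit workload %9.4f us, apparent launch latency %9.4f us\n", slope, b);
  cudaFree(dev_out);
  return 0;
}
```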
I am running these calculations on an A40, the top-end visualization card, but I believe I am getting MAD returns off of the two-for-one FP32 arithmetic that the GA102 series offers.
Can anyone comment here? Does kernel launch latency indeed scale with the complexity of the kernel? Is 11 microseconds a launch latency one might expect for a relatively simple kernel requiring 40 registers per thread and launched with five 256-thread blocks per SM? And is it plausible that a very complex kernel (about 1600 lines in all) with 57-60 registers per thread, launched with one 1024-thread block per SM, might have a very low launch latency?
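In case it matters, the blocks-per-SM figures above can be confirmed at run time with the occupancy API; a minimal sketch of that check (with empty stand-in kernels here, the real kernels substituted in practice) would be:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kFast() { }  // stand-in for the simpler, 40-register kernel
__global__ void kSlow() { }  // stand-in for the complex, 57-60 register kernel

int main() {
  int blocks_fast = 0, blocks_slow = 0;
  // Maximum resident blocks per SM at the block sizes used in the launches above.
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_fast, kFast, 256, 0);
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_slow, kSlow, 1024, 0);
  printf("Fast kernel: %d blocks of 256 threads resident per SM\n", blocks_fast);
  printf("Slow kernel: %d blocks of 1024 threads resident per SM\n", blocks_slow);
  return 0;
}
```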
Cheers!
(Edit: I am looking at several other systems of similar sizes, and the results remain more or less the same: about 7 microseconds’ estimated latency for the fast kernel in the other systems, and 0 to “negative 1” microseconds for the slower kernel.)