I have an application where multiple ranks access a single GPU, and each rank launches a kernel on that same GPU. Each kernel launch performs a type of string comparison, where each CUDA block compares two strings. In a typical problem I perform several million such comparisons, so the number of blocks equals the number of comparisons. I just noticed that when I fix the number of CUDA blocks to about 100,000 and run multiple iterations of the kernel to complete all the string comparisons, I get much better performance (10x faster). I can’t seem to figure out the reason. Can anyone please shed some light on this?
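For reference, the launch pattern is essentially one block per comparison, along these lines (heavily simplified; the real comparison logic is more involved and all names are placeholders):

```cuda
#include <cstddef>

// Placeholder device-side comparison; the real comparison logic is more involved.
__device__ int device_compare(const char* a, const char* b)
{
    while (*a && *a == *b) { ++a; ++b; }
    return (unsigned char)*a - (unsigned char)*b;
}

// One comparison per block: the grid has as many blocks as string pairs,
// i.e. several million blocks for a typical problem size.
__global__ void compare_pairs(const char* const* a, const char* const* b,
                              int* result, size_t num_pairs)
{
    size_t pair = blockIdx.x;                    // one block <-> one comparison
    if (pair < num_pairs && threadIdx.x == 0)    // simplified: one thread per block does the work
        result[pair] = device_compare(a[pair], b[pair]);
}

// Launched once with one block per pair:
// compare_pairs<<<num_pairs, 32>>>(d_a, d_b, d_result, num_pairs);
```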
Scheduling blocks on a GPU takes time; there is block-scheduling overhead. New blocks may not be scheduled the instant an individual block finishes on a multiprocessor; the scheduler may wait until enough multiprocessors are idle before springing into action.
When the blocks take vastly different amounts of time to complete their work (e.g. due to very different string lengths), some multiprocessors can sit idle until the block scheduler activates again and provides more blocks to execute.
By introducing a number of iterations per block, you effectively cause the block runtimes to converge, because the runtime of the individual iterations gets averaged out. The individual blocks end up taking about the same amount of execution time. Hence the time span during which individual multiprocessors may sit idle becomes very short compared to the block execution times, and the hardware is utilized better on average.
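To illustrate, here is a sketch of the fixed-grid variant I have in mind, assuming each block walks over several pairs with a block-stride loop (reusing the placeholder device_compare and the simplified one-thread-per-block comparison from the sketch in the question):

```cuda
// Fixed grid (~100,000 blocks): each block handles many comparisons in a loop,
// so its total runtime is an average over several string pairs rather than the
// runtime of a single, possibly very short or very long, comparison.
__global__ void compare_pairs_fixed_grid(const char* const* a, const char* const* b,
                                         int* result, size_t num_pairs)
{
    for (size_t pair = blockIdx.x; pair < num_pairs; pair += gridDim.x)
        if (threadIdx.x == 0)                    // simplified as in the question's sketch
            result[pair] = device_compare(a[pair], b[pair]);
}

// Launched with a fixed block count regardless of num_pairs:
// compare_pairs_fixed_grid<<<100000, 32>>>(d_a, d_b, d_result, num_pairs);
```

With millions of pairs per launch, each block then averages over dozens of comparisons, so per-block runtimes cluster tightly and multiprocessors are rarely left starved for work.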
Some experiments to try:
Have you tried launching a no-op kernel with a lot of blocks? Does it still have significant runtime? (A minimal timing sketch follows after this list.)
Do run times improve if you force all those string lengths to be identical?
Have you queried the NVIDIA profiler for detailed stats on hardware utilization?
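For the no-op experiment, a minimal timing sketch with CUDA events could look like this (block and thread counts are arbitrary; error checking omitted for brevity):

```cuda
#include <cstdio>

// Empty kernel: whatever time this launch takes is pure launch and
// block-scheduling overhead, with no useful work at all.
__global__ void noop_kernel() {}

int main()
{
    const int num_blocks = 5000000;   // on the order of your real comparison count

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    noop_kernel<<<num_blocks, 32>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d empty blocks took %.3f ms\n", num_blocks, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

If that alone already takes a noticeable fraction of your kernel time, block-scheduling overhead is a plausible part of the explanation.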