How to check the workload distribution on a GPU


I am working on a low-power GPU project. Does anyone know how to measure how the workload is distributed across the multiprocessors of a GPU? By default the work should be distributed evenly, right? I just want to check whether that holds at run time.

Another question: as far as I know, the CUDA runtime only automatically synchronizes the instructions within one block. Does that mean different multiprocessors can finish at different times?

Any input is appreciated!


There are no settings that allow you to control the distribution of blocks to multiprocessors. NVIDIA employees have mentioned that the scheduling algorithm works best with blocks that do equal work, but performance should be acceptable even if this is not true.

There is no automatic synchronization of instructions at the block level. The instructions in a warp are run together, but different warps within a block can get out of sync until they reach a __syncthreads() call.
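To make the warp point concrete, here is a minimal sketch, assuming one block of 64 threads (two warps). Each thread reads a shared-memory slot written by a thread in the other warp, so the `__syncthreads()` call is what makes the read safe. The names `reverse_in_block` and `reverse_demo` are made up for this illustration.

```cuda
#include <cuda_runtime.h>

constexpr int kThreads = 64;  // assumption: one block of two 32-thread warps

// Reverses kThreads integers through shared memory. Thread t reads the slot
// written by thread (kThreads - 1 - t), which belongs to the other warp.
__global__ void reverse_in_block(const int *in, int *out)
{
    __shared__ int tile[kThreads];
    int t = threadIdx.x;
    tile[t] = in[t];
    __syncthreads();  // wait until every warp in the block has written its slots
    out[t] = tile[kThreads - 1 - t];
}

// Hypothetical host helper: runs a single block and verifies the reversal.
bool reverse_demo()
{
    int h_in[kThreads], h_out[kThreads];
    for (int i = 0; i < kThreads; ++i) h_in[i] = i;

    int *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, sizeof(h_in)) != cudaSuccess) return false;
    if (cudaMalloc(&d_out, sizeof(h_out)) != cudaSuccess) {
        cudaFree(d_in);
        return false;
    }

    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);
    reverse_in_block<<<1, kThreads>>>(d_in, d_out);
    // The blocking copy below also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int i = 0; i < kThreads; ++i)
        if (h_out[i] != kThreads - 1 - i) return false;
    return true;
}
```

Removing the `__syncthreads()` line turns this into a race between the two warps: the result may still look right on a given run, but nothing guarantees it.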

Blocks can finish at different times, and so multiprocessors can finish at different times. However, the next kernel in the same stream will not start until all blocks of the current kernel have finished.

Hi seibert,

Thank you so much! Does that mean I have no way to get the detailed workload distribution across the different cores by modifying the CUDA benchmark code?


The best you can do is microbenchmarks where you create kernels with blocks that have very asymmetric workloads and measure the time to completion of the entire kernel. There is no way to tell how busy a particular multiprocessor is, though.
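A microbenchmark along those lines might look like the sketch below. It assumes a simple spin loop as the "work" and times the whole kernel with CUDA events, once with every block doing the same number of iterations and once with all of the work piled onto block 0. The names `spin_kernel`, `time_spin_kernel`, `ImbalanceResult` and `run_imbalance_benchmark`, and the default block and iteration counts, are all invented for this example.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block spins for iters[blockIdx.x] iterations of a dependent
// floating-point chain, then stores the result so the loop is not optimized away.
__global__ void spin_kernel(const unsigned int *iters, float *sink)
{
    unsigned int n = iters[blockIdx.x];
    float x = static_cast<float>(threadIdx.x);
    for (unsigned int i = 0; i < n; ++i)
        x = x * 1.0000001f + 0.5f;
    sink[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

// Returns the kernel time in milliseconds, or -1.0f if something failed.
float time_spin_kernel(const std::vector<unsigned int> &iters_per_block,
                       int threads = 128)
{
    const int blocks = static_cast<int>(iters_per_block.size());
    float ms = -1.0f;
    if (blocks == 0) return ms;

    unsigned int *d_iters = nullptr;
    float *d_sink = nullptr;
    if (cudaMalloc(&d_iters, blocks * sizeof(unsigned int)) != cudaSuccess) return ms;
    if (cudaMalloc(&d_sink, size_t(blocks) * threads * sizeof(float)) != cudaSuccess) {
        cudaFree(d_iters);
        return ms;
    }
    cudaMemcpy(d_iters, iters_per_block.data(), blocks * sizeof(unsigned int),
               cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    spin_kernel<<<blocks, threads>>>(d_iters, d_sink);  // warm-up launch, not timed
    cudaEventRecord(start);
    spin_kernel<<<blocks, threads>>>(d_iters, d_sink);  // timed launch
    cudaEventRecord(stop);
    if (cudaEventSynchronize(stop) == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_iters);
    cudaFree(d_sink);
    return ms;
}

struct ImbalanceResult { float balanced_ms; float skewed_ms; };

// Same total iteration count in both runs; only the split across blocks differs.
ImbalanceResult run_imbalance_benchmark(int blocks = 64, unsigned int iters = 100000)
{
    std::vector<unsigned int> balanced(blocks, iters);
    std::vector<unsigned int> skewed(blocks, 0u);
    skewed[0] = iters * blocks;  // block 0 does everything, the rest exit at once
    ImbalanceResult r;
    r.balanced_ms = time_spin_kernel(balanced);
    r.skewed_ms = time_spin_kernel(skewed);
    return r;
}
```

Calling `run_imbalance_benchmark()` from `main` and printing both fields gives the completion time for each split. This only measures the kernel as a whole; it says nothing about which multiprocessor was busy when.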

You may not have seen this, but our AgPerfMon tool is capable of recording internal counters (including SM ID, warp ID and timestamps) and displaying them on a timeline, which can be useful for analysing workload distribution:

See page 21 in the documentation “CUDA Kernel internal events”.
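For anyone who wants to log similar information by hand, here is a rough sketch. It is not how AgPerfMon works; it just assumes the PTX special register `%smid` (read through inline assembly) and the `clock64()` device function, which needs compute capability 2.0 or later. `BlockTrace`, `read_smid`, `traced_kernel`, `trace_blocks` and `print_sm_histogram` are hypothetical names, and the per-block workload is made up so that blocks finish at different times.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <map>
#include <vector>

struct BlockTrace {
    unsigned int smid;  // multiprocessor the block was running on when it finished
    long long start;    // clock64() value at block entry (thread 0)
    long long stop;     // clock64() value after the block's work is done
};

// Reads the PTX special register %smid. The doubled % escapes it in inline asm.
__device__ unsigned int read_smid()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

__global__ void traced_kernel(BlockTrace *trace, float *sink)
{
    long long t0 = clock64();

    // Made-up workload: later blocks spin longer than earlier ones.
    float x = static_cast<float>(threadIdx.x);
    unsigned int n = 1000u * (blockIdx.x + 1);
    for (unsigned int i = 0; i < n; ++i)
        x = x * 1.0000001f + 0.5f;
    sink[blockIdx.x * blockDim.x + threadIdx.x] = x;

    __syncthreads();  // make sure every warp of the block has finished its work
    if (threadIdx.x == 0) {
        trace[blockIdx.x].smid = read_smid();
        trace[blockIdx.x].start = t0;
        trace[blockIdx.x].stop = clock64();
    }
}

// Runs the kernel and returns one record per block (empty on failure).
std::vector<BlockTrace> trace_blocks(int blocks = 32, int threads = 128)
{
    std::vector<BlockTrace> host(blocks);
    BlockTrace *d_trace = nullptr;
    float *d_sink = nullptr;
    if (cudaMalloc(&d_trace, blocks * sizeof(BlockTrace)) != cudaSuccess)
        return std::vector<BlockTrace>();
    if (cudaMalloc(&d_sink, size_t(blocks) * threads * sizeof(float)) != cudaSuccess) {
        cudaFree(d_trace);
        return std::vector<BlockTrace>();
    }

    traced_kernel<<<blocks, threads>>>(d_trace, d_sink);
    cudaError_t err = cudaMemcpy(host.data(), d_trace, blocks * sizeof(BlockTrace),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_trace);
    cudaFree(d_sink);
    if (err != cudaSuccess) host.clear();
    return host;
}

// Counts how many blocks ended up on each multiprocessor.
void print_sm_histogram(const std::vector<BlockTrace> &trace)
{
    std::map<unsigned int, int> blocks_per_sm;
    for (const BlockTrace &b : trace) ++blocks_per_sm[b.smid];
    for (const auto &kv : blocks_per_sm)
        std::printf("SM %u ran %d block(s)\n", kv.first, kv.second);
}
```

Two caveats: the clock counter is per multiprocessor, so start and stop values are only safely comparable between blocks that report the same `smid`, and the PTX documentation describes `%smid` as a value that can change during execution, so treat the result as a hint rather than a guarantee.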

Simon, thanks for that link.

Even working with CUDA every day, there are so many tools out there that it is still easy to miss some. I never realized the PhysX developer tools could give any CUDA info.

There wouldn’t be any similar Linux tool, would there?