I am working on a low-power GPU project. Does anyone have an idea how to test the workload distribution across the different multiprocessors on a GPU? The default behavior should be an even workload distribution, right? I just want to verify that this is true at run time.
Another question: as far as I know, the CUDA runtime only synchronizes instructions automatically within one block. Does that mean different multiprocessors can finish at different times?
There are no settings that allow you to control the distribution of blocks to multiprocessors. NVIDIA employees have mentioned that the scheduling algorithm works best with blocks that do equal work, but performance should be acceptable even if this is not true.
There is no automatic synchronization of instructions at the block level. The instructions in a warp are run together, but different warps within a block can get out of sync until they reach a __syncthreads() call.
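To make the warp-drift point concrete, here is a minimal sketch of a shared-memory reduction kernel (the kernel name and sizes are hypothetical, not from the thread); without the `__syncthreads()` calls, warps that have raced ahead could read shared memory before other warps have written it:

```cuda
// Hypothetical block-level sum kernel. Warps within the block may execute
// out of lockstep; __syncthreads() forces every thread in the block to
// arrive before any thread reads what the others wrote to shared memory.
__global__ void sumBlock(const float *in, float *out)
{
    __shared__ float buf[256];          // assumes blockDim.x == 256
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                    // all warps in the block rendezvous here

    // Tree reduction in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();                // needed after every step, not just once
    }
    if (tid == 0)
        out[blockIdx.x] = buf[0];
}
```

Note that `__syncthreads()` only synchronizes threads within one block; there is no equivalent barrier across blocks inside a kernel.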
Blocks can finish at different times, and so multiprocessors can finish at different times. However, the next kernel will not start until all multiprocessors are finished with the current kernel.
The best you can do is microbenchmarks where you create kernels with blocks that have very asymmetric workloads and measure the time to completion of the entire kernel. There is no way to tell how busy a particular multiprocessor is, though.
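A sketch of such a microbenchmark, under the assumption that you deliberately give each block a different amount of dummy work and time the whole kernel with CUDA events (all names and iteration counts here are made up for illustration):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical asymmetric-workload kernel: block N does (N+1) units of
// work, so blocks (and the multiprocessors running them) finish at very
// different times. The kernel as a whole ends only when the slowest
// block does, which is what the event timing below measures.
__global__ void asymmetric(float *out, int baseIters)
{
    int iters = (blockIdx.x + 1) * baseIters;
    float v = threadIdx.x * 0.001f;
    for (int i = 0; i < iters; ++i)
        v = v * 1.0000001f + 0.0001f;          // dummy arithmetic work
    out[blockIdx.x * blockDim.x + threadIdx.x] = v;
}

int main()
{
    const int blocks = 8, threads = 128;
    float *d_out;
    cudaMalloc(&d_out, blocks * threads * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    asymmetric<<<blocks, threads>>>(d_out, 1 << 20);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);                // wait for the entire kernel

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);      // dominated by the busiest block

    cudaFree(d_out);
    return 0;
}
```

Comparing this against a run where every block uses the same total iteration count gives a rough bound on how much the asymmetry costs, even though you still cannot observe any individual multiprocessor directly.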
Even working with CUDA every day, there are so many tools out there that are easy to miss. I never realized the PhysX developer tools could give any CUDA info.
There wouldn’t be any similar Linux tool, would there?