I am working on a GPU low-power project. Does anyone have an idea of how to measure the workload distribution across the different multiprocessors (SMs) on a GPU? By default the hardware scheduler should distribute blocks roughly evenly across SMs, right? I'd like to verify whether that actually holds at run time.
Another question: as far as I know, CUDA only provides synchronization among the threads within a single block; there is no global barrier across blocks during a kernel launch. Does that mean different multiprocessors can finish at different times?
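In case it helps frame the question, here is a minimal sketch of one way to check this myself: each block reads the `%smid` special register (via inline PTX) to learn which SM it ran on, and records its start/end cycle counts with `clock64()`. The kernel name, grid size, and "workload" placeholder are all illustrative, not from any real project:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the special register that identifies the SM executing this thread.
__device__ unsigned int get_smid() {
    unsigned int smid;
    asm("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}

// Each block logs its SM id and its start/end cycle counters.
__global__ void profileKernel(unsigned int *smids,
                              long long *starts, long long *ends) {
    long long t0 = clock64();
    // ... real workload would go here ...
    long long t1 = clock64();
    if (threadIdx.x == 0) {          // one record per block
        smids[blockIdx.x]  = get_smid();
        starts[blockIdx.x] = t0;
        ends[blockIdx.x]   = t1;
    }
}

int main() {
    const int nBlocks = 64;
    unsigned int *smids; long long *starts, *ends;
    cudaMallocManaged(&smids,  nBlocks * sizeof(unsigned int));
    cudaMallocManaged(&starts, nBlocks * sizeof(long long));
    cudaMallocManaged(&ends,   nBlocks * sizeof(long long));

    profileKernel<<<nBlocks, 128>>>(smids, starts, ends);
    cudaDeviceSynchronize();

    // Counting how many blocks landed on each SM shows the distribution;
    // comparing the end timestamps shows whether SMs finish at different times.
    for (int b = 0; b < nBlocks; ++b)
        printf("block %2d ran on SM %2u, cycles %lld..%lld\n",
               b, smids[b], starts[b], ends[b]);

    cudaFree(smids); cudaFree(starts); cudaFree(ends);
    return 0;
}
```

One caveat I'm aware of: `clock64()` counters are per-SM, so cycle values are only directly comparable between blocks that ran on the same SM.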
Any suggestions are appreciated!