I’m trying to find out how the NVIDIA Kepler (especially K80) thread block scheduler works, but I couldn’t find any documentation. I need this because I want to make my algorithm deadlock-free.
Can anyone help? Or how could I design an experiment to discover it?
The ideology of GPU computing is that you run a bunch of independent thread blocks. The only ideology-correct way for them to interact is through global atomics. The order in which thread blocks of one kernel execute is not specified. You may also find dynamic parallelism useful. There is also a “persistent threads” approach for true heroes :)
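To illustrate the global-atomics point, here is a minimal sketch (names and structure are ours, not from any NVIDIA sample): each block communicates its partial result to the rest of the grid only through `atomicAdd` on a global accumulator, so correctness does not depend on the unspecified block execution order.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block reduces its slice into shared memory, then publishes the
// partial sum to a single global accumulator via an atomic. No block
// ever waits on another, so no scheduling order can deadlock it.
__global__ void blockSum(const int *in, int *out, int n)
{
    __shared__ int partial;
    if (threadIdx.x == 0) partial = 0;
    __syncthreads();

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(&partial, in[i]);
    __syncthreads();

    if (threadIdx.x == 0)
        atomicAdd(out, partial);   // the only inter-block interaction
}

int main()
{
    const int n = 1 << 20;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, sizeof(int));
    for (int i = 0; i < n; ++i) in[i] = 1;
    *out = 0;

    blockSum<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("sum = %d\n", *out);    // expect 1048576
    cudaFree(in); cudaFree(out);
    return 0;
}
```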
Details of threadblock scheduling are not formally published by NVIDIA.
Therefore you would have to write your own experiments to try to discover such details empirically.
Do google searches on “CUDA microbenchmarking” to get some ideas for methods people use to discover unpublished technical details of CUDA devices.
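As one possible microbenchmark (an assumed approach, not an official method), each block can record which SM it landed on and its start time; dumping that table after the kernel lets you infer how the scheduler handed blocks out. The `%smid` special register is read via inline PTX; `clock64()` is a per-SM counter, so only compare timestamps from blocks on the same SM.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read the ID of the SM this thread is currently running on.
__device__ unsigned smid()
{
    unsigned id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One thread per block logs the block's SM and start timestamp.
__global__ void probe(unsigned *sm, long long *start)
{
    if (threadIdx.x == 0) {
        sm[blockIdx.x]    = smid();
        start[blockIdx.x] = clock64();  // per-SM clock
    }
}

int main()
{
    const int blocks = 64;
    unsigned *sm; long long *start;
    cudaMallocManaged(&sm, blocks * sizeof(unsigned));
    cudaMallocManaged(&start, blocks * sizeof(long long));

    probe<<<blocks, 128>>>(sm, start);
    cudaDeviceSynchronize();

    for (int b = 0; b < blocks; ++b)
        printf("block %2d -> SM %2u, start %lld\n", b, sm[b], start[b]);
    cudaFree(sm); cudaFree(start);
    return 0;
}
```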
Writing a global sync (other than the kernel launch itself) in CUDA is an exercise fraught with peril. A flexible global sync in CUDA generally requires that you launch a limited number of threadblocks (the number depends on your particular device), so as to not exceed the instantaneous carrying capacity of the device, allowing all threads in the grid to make forward progress. If you do google searches on “CUDA critical section” or “CUDA atomic lock” you will find various discussions of it.
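To make the peril concrete, here is a hand-rolled one-shot global barrier (a risky sketch shown only to illustrate the hazards; `g_arrived` and `twoPhase` are our names, not a CUDA API). It is correct ONLY if every block of the grid is simultaneously resident, which is why the launch below is sized conservatively at one block per SM; oversubscribe the device and the spin-wait deadlocks.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int g_arrived = 0;   // global arrival counter (one launch only)

__global__ void twoPhase(int *in, int *out, int numBlocks)
{
    int n = numBlocks * blockDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    in[i] = i * i;                 // phase 1: each thread writes its slot

    __threadfence();               // make phase-1 writes globally visible
    __syncthreads();
    if (threadIdx.x == 0) {
        atomicAdd(&g_arrived, 1);
        // Spin until every resident block has arrived. If any block is
        // NOT resident, it can never run and this loop never exits.
        while (atomicAdd(&g_arrived, 0) < numBlocks) { }
    }
    __syncthreads();

    // phase 2: now safe to read another block's phase-1 output
    out[i] = in[(i + 1) % n];
}

int main()
{
    int numSMs;
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, 0);
    int blocks = numSMs;           // conservative: one block per SM
    int threads = 128;

    int *in, *out;
    cudaMallocManaged(&in,  blocks * threads * sizeof(int));
    cudaMallocManaged(&out, blocks * threads * sizeof(int));
    twoPhase<<<blocks, threads>>>(in, out, blocks);
    cudaDeviceSynchronize();
    printf("out[0] = %d\n", out[0]);   // expect 1 (= 1*1)
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Note that this barrier is single-use; a reusable one needs a second phase (or sense reversal) so late blocks don't race past a counter reset. On newer architectures, cooperative groups grid synchronization is the supported replacement for this pattern.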