FIFO-based task scheduling for load balancing

Hi folks,

Say I have a thousand tasks and I happen to know the weight of each one. What’s the right approach to balance the load across my GPU cores on compute capability 1.2?

What I currently do is sort the tasks by weight and spawn as many threads (actually blocks) as I have cores: <<<96, 1>>>. Each thread then runs a while loop, doing tid = atomicSub(count, 1) until tid < 1 and processing the task with id = tid (the global variable ‘count’ acts as the scheduler). Does this FIFO-based approach sound about right?
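For concreteness, here is a minimal sketch of the scheme described above. The names scheduler_kernel, process_task and run_tasks are made up for illustration, the "work" is just a dummy loop whose length equals the task weight, and the 1-based tid is mapped to a 0-based array index; none of that comes from real code. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for real work: loop `weight` times and store a value
// that the host can verify (the number of odd i in [0, weight)).
__device__ void process_task(int task, const int *weights, int *results)
{
    int acc = 0;
    for (int i = 0; i < weights[task]; ++i)
        acc += i & 1;
    results[task] = acc;
}

// Each thread keeps pulling task ids from the shared counter until it runs out.
__global__ void scheduler_kernel(int *count, const int *weights, int *results)
{
    while (true) {
        int tid = atomicSub(count, 1);   // returns the value before the decrement
        if (tid < 1)
            break;                       // no tasks left
        process_task(tid - 1, weights, results);  // assumption: 0-based index
    }
}

// Host helper (hypothetical): runs all tasks with `num_blocks` single-thread
// blocks and returns how many tasks produced the expected result.
int run_tasks(const std::vector<int> &weights, int num_blocks)
{
    int n = static_cast<int>(weights.size());
    int *d_count, *d_weights, *d_results;
    cudaMalloc(&d_count, sizeof(int));
    cudaMalloc(&d_weights, n * sizeof(int));
    cudaMalloc(&d_results, n * sizeof(int));

    cudaMemcpy(d_count, &n, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_weights, weights.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_results, 0xFF, n * sizeof(int));  // -1 marks "not processed"

    scheduler_kernel<<<num_blocks, 1>>>(d_count, d_weights, d_results);

    std::vector<int> results(n);
    cudaMemcpy(results.data(), d_results, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    cudaFree(d_weights);
    cudaFree(d_results);

    int ok = 0;
    for (int i = 0; i < n; ++i)
        if (results[i] == weights[i] / 2)
            ++ok;
    return ok;
}
```

Calling run_tasks(weights, 96) corresponds to the <<<96, 1>>> launch. Since the counter counts down, the task at the highest index is handed out first, so in this sketch the heaviest tasks would have to sit at the end of the sorted array to be started first.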


In principle, this works. You probably need more threads (5,000-10,000) to fully utilize your hardware, with multiple threads per core to hide latencies.
And within most warps, all threads should execute the same code path (same task) for good efficiency.
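One possible way to combine both points is to let a whole warp fetch a task together, so that all 32 threads work on the same task. The sketch below is only an illustration under several assumptions: each task can be split across threads (here a dummy strided loop), each block is exactly one warp, and the names warp_scheduler_kernel and run_warp_tasks are hypothetical. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define THREADS_PER_BLOCK 32   // one warp per block (assumption of this sketch)

// Thread 0 of each block claims a task; all threads of the block then share
// the work of that task, so the warp follows a single code path.
__global__ void warp_scheduler_kernel(int *count, const int *weights, int *results)
{
    __shared__ int task;                        // task claimed by this block
    __shared__ int partial[THREADS_PER_BLOCK];  // per-thread partial results

    while (true) {
        if (threadIdx.x == 0)
            task = atomicSub(count, 1) - 1;     // 0-based index, < 0 when empty
        __syncthreads();
        int t = task;
        if (t < 0)
            break;                              // same decision for every thread

        // Dummy work split across the block: count the odd i in [0, weight).
        int acc = 0;
        for (int i = threadIdx.x; i < weights[t]; i += blockDim.x)
            acc += i & 1;
        partial[threadIdx.x] = acc;
        __syncthreads();

        if (threadIdx.x == 0) {
            int sum = 0;
            for (int k = 0; k < THREADS_PER_BLOCK; ++k)
                sum += partial[k];
            results[t] = sum;
        }
        // The barrier at the top of the next iteration keeps the other
        // threads from overwriting `partial` before thread 0 has read it.
    }
}

// Host helper (hypothetical): returns how many tasks gave the expected result.
int run_warp_tasks(const std::vector<int> &weights, int num_blocks)
{
    int n = static_cast<int>(weights.size());
    int *d_count, *d_weights, *d_results;
    cudaMalloc(&d_count, sizeof(int));
    cudaMalloc(&d_weights, n * sizeof(int));
    cudaMalloc(&d_results, n * sizeof(int));

    cudaMemcpy(d_count, &n, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_weights, weights.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_results, 0xFF, n * sizeof(int));  // -1 marks "not processed"

    warp_scheduler_kernel<<<num_blocks, THREADS_PER_BLOCK>>>(d_count, d_weights, d_results);

    std::vector<int> results(n);
    cudaMemcpy(results.data(), d_results, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    cudaFree(d_weights);
    cudaFree(d_results);

    int ok = 0;
    for (int i = 0; i < n; ++i)
        if (results[i] == weights[i] / 2)
            ++ok;
    return ok;
}
```

With 256 blocks, run_warp_tasks(weights, 256) launches 8,192 threads, which is in the range suggested above. Blocks that start after the counter is exhausted simply exit. If the tasks cannot be parallelized internally, this layout does not apply as written.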

Take a look at the “persistent threads” in the following paper:

Are multiple threads scheduled to the same core in compute capability 1.2?