Say I have a thousand tasks and I happen to know the weight of each one. What's the right approach to load balancing across my GPU cores on compute capability 1.2?
What I currently do is sort the tasks by weight and spawn as many threads (actually blocks) as I have cores: <<<96, 1>>>. Each thread then runs a while loop, doing tid = atomicSub(count, 1) until tid < 1, and processes the task with id = tid (the global variable 'count' acts as the scheduler). Does this FIFO-based approach sound about right?