Load-balance problem among blocks in an SM?

Guys,

I’ve run into a problem that seems to be caused by load imbalance among blocks resident on the same SM.

Here’s the scenario:

  1. We launch N * M blocks for N tasks; each task runs SPMD-style as M sub-tasks (one block per sub-task).

  2. Each of the N tasks processes a different data set. Within the same task, the M blocks all have the same execution time, but execution time differs between tasks. For example, the blocks of the first task take 2 s, while those of the second task take 0.05 s, so the total time should be about 2 s + 0.05 s.

  3. If we set N = 1 and invoke the kernel twice with a grid of (1, M), we get the expected result of 2.05 s.

  4. If we set N = 2 and invoke the kernel once with a grid of (2, M), we get 3 s instead (a minimal sketch of both cases follows this list). :unsure:
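To make the two cases concrete, here is a minimal sketch of what I mean. The kernel body, the iteration counts, and all names such as task_kernel are my own illustration, not the real code; the busy loop just makes task 0 roughly 40x slower than task 1.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: blockIdx.x selects the task, blockIdx.y the sub-task.
// The busy loop stands in for the real work; task 0 does ~40x more of it.
__global__ void task_kernel(float *out, int task_offset)
{
    int task    = blockIdx.x + task_offset;
    int subtask = blockIdx.y;
    int iters   = (task == 0) ? 4000000 : 100000;

    float acc = 0.0f;
    for (int i = 0; i < iters; ++i)
        acc += sinf(acc + threadIdx.x);

    out[(task * gridDim.y + subtask) * blockDim.x + threadIdx.x] = acc;
}

int main()
{
    const int M = 8, threads = 128;
    float *d_out;
    cudaMalloc(&d_out, 2 * M * threads * sizeof(float));

    // Case 3: two launches, one task each -> expected ~2.05 s total.
    task_kernel<<<dim3(1, M), threads>>>(d_out, 0);
    task_kernel<<<dim3(1, M), threads>>>(d_out, 1);
    cudaThreadSynchronize();

    // Case 4: one launch covering both tasks -> ~3 s observed.
    task_kernel<<<dim3(2, M), threads>>>(d_out, 0);
    cudaThreadSynchronize();

    cudaFree(d_out);
    return 0;
}
```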

Why’s that?

Here’s my understanding…

There’s a rumor about a two-level scheduler on the GPU: a global scheduler that dispatches blocks to SMs, and a per-SM scheduler that dispatches threads/warps. So, for the single-kernel-invocation case, if the global scheduler puts two blocks belonging to different tasks onto the same SM, AND A FINISHED BLOCK CANNOT BE SCHEDULED OUT UNTIL ALL BLOCKS ON THAT SM ARE FINISHED, then there will be a load-imbalance problem inside the SM. The attachment illustrates this.

However, if we invoke the kernel twice, once per task, then all blocks within the same SM are balanced, and we get the expected result.

Is my understanding right? I don’t have any other explanation for this perplexing phenomenon.

At least for pre-Fermi cards, your model seems to be the accepted one for how block scheduling works. An SM gets an allocation of blocks, which it runs until they are all completed, then it receives another allocation of blocks. If there is a big spread in the number of clock cycles required to retire all of the blocks in the SM, you get load imbalance.

On Fermi, I suspect it might be different, although that is still a suspicion at this stage.
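If you want to see what actually happens on your own card, one quick (unofficial) way is to have every block record which SM it ran on and its clock() timestamps. The code below is only a sketch of that idea; the names and the dummy work are mine, not from this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block records the SM it ran on plus start/stop clock() timestamps, and
// writes a dummy result so the busy loop is not optimized away.
__global__ void trace_blocks(unsigned int *sm_id, clock_t *t_start, clock_t *t_stop,
                             float *result, int base_iters)
{
    const int block = blockIdx.x + gridDim.x * blockIdx.y;
    clock_t t0 = clock();

    // Dummy work; scaling with blockIdx.x mimics the imbalance between the tasks.
    float acc = 0.0f;
    for (int i = 0; i < base_iters * (blockIdx.x * 39 + 1); ++i)
        acc += sinf(acc + threadIdx.x);

    if (threadIdx.x == 0) {
        unsigned int smid;
        asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));  // PTX special register
        sm_id[block]   = smid;
        t_start[block] = t0;
        t_stop[block]  = clock();
        result[block]  = acc;
    }
}

int main()
{
    const int N = 2, M = 8, blocks = N * M;
    unsigned int *d_sm;  clock_t *d_t0, *d_t1;  float *d_res;
    cudaMalloc(&d_sm,  blocks * sizeof(unsigned int));
    cudaMalloc(&d_t0,  blocks * sizeof(clock_t));
    cudaMalloc(&d_t1,  blocks * sizeof(clock_t));
    cudaMalloc(&d_res, blocks * sizeof(float));

    trace_blocks<<<dim3(N, M), 128>>>(d_sm, d_t0, d_t1, d_res, 100000);

    unsigned int h_sm[blocks]; clock_t h_t0[blocks], h_t1[blocks];
    cudaMemcpy(h_sm, d_sm, sizeof(h_sm), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_t0, d_t0, sizeof(h_t0), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_t1, d_t1, sizeof(h_t1), cudaMemcpyDeviceToHost);

    for (int b = 0; b < blocks; ++b)
        printf("block %2d: SM %2u, %ld cycles\n", b, h_sm[b], (long)(h_t1[b] - h_t0[b]));
    return 0;
}
```

Note that clock() counters are per-SM and not synchronized with each other, so only the per-block durations and the SM ids are meaningful, but that is enough to see which blocks shared an SM and for how long each one stayed resident.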

Thanks avidday. Seems I must split the total tasks into N groups… Sigh…

And this makes a global sync impossible to implement, right?

Pretty much, yes. There have been some somewhat successful attempts at implementing a global barrier, but it really isn’t part of the programming model and it should be considered the absolute last resort. Fermi brings the ability to run multiple kernels at the same time (currently 4 is the limit IIRC), so there is some more flexibility there. But this really isn’t a heterogeneous multithreading or MIMD architecture, and it is really pushing the boundaries to try and use it that way.
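For what it’s worth, concurrent kernels on Fermi are requested by launching into different non-default streams. The snippet below is just a sketch of that pattern (the kernel name and sizes are made up, and whether the launches actually overlap depends on the card and on resource usage):

```cuda
#include <cuda_runtime.h>

// Placeholder for the real per-task kernel.
__global__ void task_kernel(float *out, int task)
{
    out[blockIdx.y * blockDim.x + threadIdx.x] = (float)task;  // ... real work here ...
}

int main()
{
    const int N = 2, M = 8, threads = 128;
    float *d_out[N];
    cudaStream_t stream[N];

    for (int t = 0; t < N; ++t) {
        cudaMalloc(&d_out[t], M * threads * sizeof(float));
        cudaStreamCreate(&stream[t]);
        // One launch per task, each in its own stream.  On Fermi these launches
        // may execute concurrently, so the 0.05 s task no longer has to share
        // SMs with (or wait behind) the 2 s task inside a single grid.
        task_kernel<<<dim3(1, M), threads, 0, stream[t]>>>(d_out[t], t);
    }

    for (int t = 0; t < N; ++t) {
        cudaStreamSynchronize(stream[t]);
        cudaStreamDestroy(stream[t]);
        cudaFree(d_out[t]);
    }
    return 0;
}
```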