Inter-Block Dependency

vandop · January 2, 2011, 6:43pm

Hi,

I’m making kind of a simulation that simulate a system by parts, but I’m afraid that I’ve fallen into a deadlock…

Imagine this situation:

we have 3 System Blocks (S1,S2 and S3). (A block is a black box simulation that can be done without exchange any info with another systems)

They are connected in this fashion:

S1 ------|
…|-------S3 (i.e: S3 depends on S1 and S2 outputs)
S2-------|

So S1 and S2 can be parallelized but S3 only can be simulated after S1 and S2 finish their work, because it will use their output.

So what I’m doing is simulating S1 and S2 (on different Thread Blocks) and after that copying the results to Global memory.

My Problem now is signaling S3 to start when S1 and S2 finish their work… but as far as I know this is not an easy task on CUDA architecture because, thread blocks are supposed to be independent… In this case, they are kind of independent when simulating, but they need to be fed by prior simulation blocks that are simulated in parallel.

Could someone help me, giving some advices how can I solve this problem?

Please make me know, if my explanation was not clear enough, sometimes I feel some problems explaining myself in English:)

Thank you

tmurray · January 2, 2011, 8:10pm

Thread blocks have to be independent. Use multiple kernels.

Lev · January 2, 2011, 8:32pm

Hi,

I’m making kind of a simulation that simulate a system by parts, but I’m afraid that I’ve fallen into a deadlock…

Imagine this situation:

we have 3 System Blocks (S1,S2 and S3). (A block is a black box simulation that can be done without exchange any info with another systems)

They are connected in this fashion:

S1 ------|

…|-------S3 (i.e: S3 depends on S1 and S2 outputs)

S2-------|

So S1 and S2 can be parallelized but S3 only can be simulated after S1 and S2 finish their work, because it will use their output.

So what I’m doing is simulating S1 and S2 (on different Thread Blocks) and after that copying the results to Global memory.

My Problem now is signaling S3 to start when S1 and S2 finish their work… but as far as I know this is not an easy task on CUDA architecture because, thread blocks are supposed to be independent… In this case, they are kind of independent when simulating, but they need to be fed by prior simulation blocks that are simulated in parallel.

Could someone help me, giving some advices how can I solve this problem?

Please make me know, if my explanation was not clear enough, sometimes I feel some problems explaining myself in English:)

Thank you

If your time is not very short, multiple kernells is best decision in many ways.

vandop · January 8, 2011, 11:34am

But multiple kernels only can run in parallel on newer Fermi architectures right? I have no access to it, right now:\ a

I found this article about Inter-Block GPU Communication:

Once the sync points only happen between few thread blocks, may the cost is not to high… I hope:\

But I will think in your advice and see how much money I’ll need to spend to get a Fermi GPU.

Thank you

tera · January 8, 2011, 1:06pm

You can make a kernel that does both S1 and S2, and then a second kernel that does S3. This gives you maximal parallel execution using just standard CUDA idioms.

Lev · January 8, 2011, 1:33pm

How many threads have you in s1 and s2?

vandop · January 8, 2011, 9:51pm

The code for S1,S2 and S3 is identical, just the input change. But S3 will be fed by S1 and S2 output.

This will keep occupied 2 Thread Blocks, but I’m creating dozens to make identical parallel work on other Blocks.

Its possible that S1 and S2 are finished, but somewhere a Sx is doing work independent from this ones… So I’m not able to schedule another Kernel without waiting for all the simulations :\

You could imagine a complex system where Subsystems (S1,S2, Sx) are all processed in same manner (same code), but with different inputs. Some Subsystem depends on other Subsystems output and cannot start immediately, so I need to maximize the number of systems being processed in parallel, assuming that at each time I know those that are independent from each others…

I’ll try a better example:

S1

…|---- S3

S2

S4-------\

S5--------|–S7

S6------- /

S1 and S2 take 1 sec to finish… so S3 could start just after 1 sec

S4,S5 and S6 take 1 min to finish… so S7 could start after 1min

So I don’t want to wait to S4,S5 and S6 to finish before start S3, so can’t let Kernel return:\ I need to signal S3 and schedule it to one free Thread in some free threadblock.

For each Subsystem (Sx) I’m processing in parallel, using threads, the same subsystem architecture but with different input. So that I can take advantage from Shared Memory.

Each Sx process different Architecture but the code is exactly the same. Subsystems Architectures are described on Device Memory, but they are fetched to shared memory when a Subsystem start their work.

The number of threads used for each subsystem is only limited by the amount of shared memory, that to hold the output + system or the max number of threads that a Thread Block can hold.

Thank you

Lev · January 8, 2011, 10:24pm

try to use atomics, use one counter for starting blocks and another for completed block number.

tera · January 8, 2011, 10:29pm

If all the Sn correspond to the same program code, but need different time to execute, persistent threads seem to be the way to go.

You can keep (using atomic operations) a queue of simulations that are ready to run. Each block, instead of starting a fixed simulation, takes the first item from the queue of simulations whose dependencies are fulfilled. Then at the end of the simulation, it puts all other simulations onto the queue whose dependencies are now fulfilled.

Then, it takes the next simulation from the queue and starts over again (alternatively it could just finish leaving the rest of the work to the remaining blocks. However, on 1.x devices new blocks are started synchronously, which involves unnecessary waiting on all other currently executing blocks to finish).

The difference of this approach to what you are suggesting is that the actual item of work that each block does is not determined by its block number. That way, blocks do not wait for “their” simulation to become ready (introducing all kinds of deadlock-prone complications), but just start execution on the first item of work that is available.

Persistent threads are not an easy subject though, particularly for a beginner with CUDA. But if you feel up to it, google for “CUDA persistent threads” for more information.

Lev · January 8, 2011, 11:02pm

Tera, where do you obtain this information?
“However, on 1.x devices new blocks are started synchronously, which involves unnecessary waiting on all other currently executing blocks to finish).”

tera · January 8, 2011, 11:47pm

It’s quite easy to test: Have each block run for a predetermined time using clock(), and maybe also record the order of execution and %smid each block runs on in a list using atomic operations. You’ll see that the total runtime is determined by the maximum runtime in each wave of concurrently executed blocks (not the maximum of the total execution times of each SM).

There also was a very interesting thread in the forum on this a while back, I’ll see if I can dig it out again.

Lev · January 8, 2011, 11:51pm

Thanks, now I understand why in my program persistent threads give big impact on gt200 and much lesser on Fermi.

tera · January 9, 2011, 12:23am

The old thread is here: Scheduling block execution.

vandop · January 9, 2011, 11:37am

Thank you so much for showing the righ path :)

Once I don’t have a 2.x device, I’ll give a try to Persistent Threads, I can imagine why they can cause big headaches to beginners like me, but I need them:P

Thank you again, I’m sure that I’ll back here in a near future because of this:P