Sum up block results with the last thread block?

Hey everybody,

I use reduction code to sum a 216x216 array, so after the reduction I have 216 block results in global memory.
Now I added code like this:
if (bid == GRIDSIZE){
    // do reduction of the 216 global mem values
}

I tried it out and it worked in my tests.
But I am unsure whether this really works in every case (is the last thread block really always the last one to run?).
Can anyone tell me if this should work every time, or if I have to call another kernel to be sure every block of the previous kernel has finished?

Thx a lot!


It sounds like your suspicions are correct. This WON’T work every time.

There’s no guarantee that the block with the highest ID will be the last to run… and even if it is the last to start, other blocks may still be running in parallel on other multiprocessors, so their results may not be in global memory yet.

It WOULD work on the emulator, though, since that is simply single-threaded.

You’re right that calling a new kernel would work, since that gives you coarse grid-wide synchronization. Awkward, but maybe fine for your app.
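For illustration, here is a minimal sketch of the second-kernel approach. The names block_sum, sum_with_two_launches and BLOCK_SIZE are made up for this example (they are not from your code), and error checking is left out. It relies on the fact that two kernels launched into the same stream run in order, so the second launch only starts once every block of the first has finished.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256  // threads per block (assumed); must be a power of two

// Each block sums part of `in` and writes one partial result to out[blockIdx.x].
__global__ void block_sum(const float *in, float *out, int n)
{
    __shared__ float cache[BLOCK_SIZE];
    float acc = 0.0f;
    // Grid-stride loop, so any n works with any number of blocks.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        acc += in[i];
    cache[threadIdx.x] = acc;
    __syncthreads();
    // Tree reduction in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = cache[0];
}

// Hypothetical host helper: sums host_in using num_blocks partial results.
float sum_with_two_launches(const std::vector<float> &host_in, int num_blocks)
{
    int n = (int)host_in.size();
    float *d_in, *d_partial, *d_total;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, num_blocks * sizeof(float));
    cudaMalloc(&d_total, sizeof(float));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // First launch: one partial sum per block.
    block_sum<<<num_blocks, BLOCK_SIZE>>>(d_in, d_partial, n);
    // Second launch in the same (default) stream: it starts only after ALL
    // blocks of the first launch are done, so every partial sum is valid.
    block_sum<<<1, BLOCK_SIZE>>>(d_partial, d_total, num_blocks);

    float total = 0.0f;
    cudaMemcpy(&total, d_total, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    return total;
}
```

For the 216x216 case you would call it with the 46656 input values and num_blocks = 216; the second launch then reduces the 216 partial sums in a single block.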

You could also try using atomics… 216 atomic ops will likely be much faster than a whole kernel launch. Whether that is an option depends on your sum type (integer or float?) and your hardware (compute capability 1.1 is needed for atomics).
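A rough sketch of the atomics idea, assuming an integer sum: each block reduces its part in shared memory as before, and then thread 0 adds the block result to a single global total with atomicAdd instead of storing it in an array. The names block_sum_atomic and sum_with_atomics are invented for this example and error checking is omitted. Note that atomicAdd on float in global memory needs newer hardware (compute capability 2.0), so this version sticks to int.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 256  // threads per block (assumed); must be a power of two

// Each block reduces its part of `in`, then adds its result to *total atomically.
__global__ void block_sum_atomic(const int *in, int *total, int n)
{
    __shared__ int cache[BLOCK_SIZE];
    int acc = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        acc += in[i];
    cache[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    // One atomic op per block; the order in which blocks finish does not matter.
    if (threadIdx.x == 0)
        atomicAdd(total, cache[0]);
}

// Hypothetical host helper for the integer case.
int sum_with_atomics(const std::vector<int> &host_in, int num_blocks)
{
    int n = (int)host_in.size();
    int *d_in, *d_total;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_total, sizeof(int));
    cudaMemcpy(d_in, host_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(int));  // the total must start at zero

    block_sum_atomic<<<num_blocks, BLOCK_SIZE>>>(d_in, d_total, n);

    int total = 0;
    cudaMemcpy(&total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_total);
    return total;
}
```

With 216 blocks this is 216 atomic adds in total, and the result stays on the device in d_total if that is where you need it.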

Last, you could do the final sum on the CPU. It’s not a big deal to do a memcpy and loop over 200 elements. But of course this all depends: maybe you need the result on the device, not the host, in which case sending the data to the host and back is a waste.
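The CPU variant could look roughly like this. The helper name finish_sum_on_host and the pointer d_partial (the device array holding one result per block) are placeholders for whatever your code uses. A plain cudaMemcpy issued after the kernel launch waits for the kernel to finish, so all block results are complete by the time the host reads them.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: d_partial points to num_blocks per-block results in
// device global memory, written by the reduction kernel.
float finish_sum_on_host(const float *d_partial, int num_blocks)
{
    std::vector<float> partial(num_blocks);
    // Blocking copy: returns only after the preceding kernel in the default
    // stream has completed and the data has arrived on the host.
    cudaMemcpy(partial.data(), d_partial, num_blocks * sizeof(float),
               cudaMemcpyDeviceToHost);

    float total = 0.0f;
    for (int i = 0; i < num_blocks; ++i)
        total += partial[i];  // a couple of hundred additions at most
    return total;
}
```

You would call it right after the reduction kernel launch, for example finish_sum_on_host(d_partial, 216).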

A lot of your decisions depend on the size of that array. 200 elements is pretty cheap no matter where you do it, but if it gets larger, like 50000 or so, then a second kernel call is probably best.

Thx a lot for your answer!
I will sum up the block results on the CPU. That should give the best performance, since my program has at most 256 block results.