Any GPU experts out there? =( I need help. Does anyone know whether it is possible to synchronize inside a CUDA kernel, say for a reduction operation, and then pass the computed value to all threads of all blocks in one shot?


sync (reduction)
use the reduced value on all threads in all blocks


Any help would be appreciated. The obvious answer would be to break the kernel into two kernels, but I don't want to do that. Are there any other solutions?

Unfortunately, no. The only reliable and supported method for kernel-wide synchronization is another kernel launch.

@hamada: You can do it in a single kernel if you give up the idea of a kernel-wide reduction and use the last block to add the partial sums.
The threadFenceReduction sample in the SDK shows this.
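For illustration, here is a minimal sketch of the last-block idea applied to a dot product. It is not the SDK sample itself: the names dotLastBlock, blockSum, blocksDone and dotSingleKernel are made up for this post, and the block size (256) and grid size (64) are arbitrary assumptions.

```cuda
#include <cuda_runtime.h>

#define THREADS 256  // block size; must be a power of two for the tree reduction

__device__ unsigned int blocksDone = 0;  // how many blocks have published a partial sum

// Tree reduction in shared memory; the block's total ends up in sdata[0].
__device__ float blockSum(float v, float *sdata)
{
    unsigned int tid = threadIdx.x;
    sdata[tid] = v;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    return sdata[0];
}

// Hypothetical kernel: each block writes one partial sum of p[j]*q[j];
// whichever block finishes last adds the partial sums together.
__global__ void dotLastBlock(const float *p, const float *q, int n,
                             volatile float *partial, float *result)
{
    __shared__ float sdata[THREADS];
    __shared__ bool isLast;

    float sum = 0.0f;
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n; j += gridDim.x * blockDim.x)
        sum += p[j] * q[j];
    float blockTotal = blockSum(sum, sdata);

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = blockTotal;
        __threadfence();  // publish the partial sum before taking a ticket
        unsigned int ticket = atomicInc(&blocksDone, gridDim.x);
        isLast = (ticket == gridDim.x - 1);
    }
    __syncthreads();

    if (isLast) {  // uniform across the block, so __syncthreads() inside is safe
        float acc = 0.0f;
        for (unsigned int b = threadIdx.x; b < gridDim.x; b += blockDim.x)
            acc += partial[b];
        float total = blockSum(acc, sdata);
        if (threadIdx.x == 0) {
            *result = total;
            blocksDone = 0;  // reset so the kernel can be launched again
        }
    }
}

// Hypothetical host wrapper: copies the inputs, launches once, returns the dot product.
float dotSingleKernel(const float *h_p, const float *h_q, int n)
{
    const int blocks = 64;  // assumed grid size
    float *d_p, *d_q, *d_partial, *d_result, h_result = 0.0f;
    cudaMalloc(&d_p, n * sizeof(float));
    cudaMalloc(&d_q, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_result, sizeof(float));
    cudaMemcpy(d_p, h_p, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_q, h_q, n * sizeof(float), cudaMemcpyHostToDevice);
    dotLastBlock<<<blocks, THREADS>>>(d_p, d_q, n, d_partial, d_result);
    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_p); cudaFree(d_q); cudaFree(d_partial); cudaFree(d_result);
    return h_result;
}
```

Note that only the last block ever sees the final value. The other blocks may already have exited by then, so this gives you the reduced result in one launch but does not hand it back to all threads of all blocks.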

Thank you both,

The pseudo-code I posted was just a general example. What I actually want to do is something like this:

CUDA Kernel()
{
    // reduce d
    for (j = 1; j <= N; j++) {
        d = d + p[j]*q[j];
    }

    // on a single thread
    alpha = rho0 / d;

    // on all threads
    z[j] = z[j] + alpha*p[j];
    r[j] = r[j] - alpha*q[j];

    // reduce rho
    for (j = 1; j <= lastcol-firstcol+1; j++) {
        rho = rho + r[j]*r[j];
    }
}

… and rho is then used later; several functions with similar operations follow.


What do you think, @akavo?

This is impossible without a second kernel invocation.
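To make the two-kernel version concrete, here is a rough sketch of the step above split at the point where every thread needs the finished d. The launch boundary acts as the kernel-wide synchronization. All names (dotPQ, updateAndRho, blockAtomicSum, cgStep) are invented for this post, the code uses 0-based indexing, it assumes both loops run over the same n elements, and the float atomicAdd assumes compute capability 2.0 or newer.

```cuda
#include <cuda_runtime.h>

#define THREADS 256  // block size; must be a power of two for the tree reduction

// Sum v over the block in shared memory, then add the block total to *out.
__device__ void blockAtomicSum(float v, float *out)
{
    __shared__ float sdata[THREADS];
    unsigned int tid = threadIdx.x;
    sdata[tid] = v;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, sdata[0]);
}

// Kernel 1: d = sum over j of p[j]*q[j]
__global__ void dotPQ(const float *p, const float *q, int n, float *d)
{
    float sum = 0.0f;
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n; j += gridDim.x * blockDim.x)
        sum += p[j] * q[j];
    blockAtomicSum(sum, d);
}

// Kernel 2: d is complete here, so every thread can compute alpha itself,
// update z and r, and accumulate rho = sum over j of r[j]*r[j].
__global__ void updateAndRho(const float *p, const float *q, float *z, float *r,
                             int n, float rho0, const float *d, float *rho)
{
    float alpha = rho0 / *d;
    float sum = 0.0f;
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n; j += gridDim.x * blockDim.x) {
        z[j] += alpha * p[j];
        r[j] -= alpha * q[j];
        sum += r[j] * r[j];
    }
    blockAtomicSum(sum, rho);
}

// Hypothetical host-side step; all pointers are device pointers.
// Kernels launched into the same stream run in order, so the second
// launch only starts after the first has finished writing d.
void cgStep(const float *d_p, const float *d_q, float *d_z, float *d_r,
            int n, float rho0, float *d_d, float *d_rho)
{
    const int blocks = 64;  // assumed grid size
    cudaMemset(d_d, 0, sizeof(float));
    cudaMemset(d_rho, 0, sizeof(float));
    dotPQ<<<blocks, THREADS>>>(d_p, d_q, n, d_d);
    updateAndRho<<<blocks, THREADS>>>(d_p, d_q, d_z, d_r, n, rho0, d_d, d_rho);
}
```

Both d and rho stay in device memory, so nothing has to be copied back to the host between the launches, and rho is only complete once the second kernel has finished, which means whatever uses it next needs yet another launch.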