Advancing Computed Values… Help

Any GPU experts out there? =( … I need help … Does anyone know if it is possible to use a synchronization block inside a CUDA kernel for, say, a reduction operation, and then advance the computed value to all threads of all blocks in one shot??

CUDA kernel()
{
    sync (reduction)
    use the reduced value on all threads in all blocks
}

Any help would be appreciated. An obvious answer would be to break the kernel into two kernels… but I don’t want that… Any other solutions??

Unfortunately no, the only reliable and supported method for kernel-wide synchronization is another kernel launch.

@hamada: You can do it in a single kernel if you give up the idea of a kernel-wide synchronization and instead let the last block to finish add up the partial sums.
The threadFenceReduction sample in the SDK shows this.
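
Roughly, the pattern looks like this. This is only a sketch; the kernel and variable names are mine (not the sample's), and it assumes a power-of-two block size of 256:

#include <cuda_runtime.h>

// "Last block to finish adds the partial sums" pattern (same idea as the
// SDK threadFenceReduction sample). partialSums must hold gridDim.x floats.
__device__ unsigned int retirementCount = 0;

__global__ void dotReduce(const float *p, const float *q,
                          float *partialSums, float *result, int n)
{
    __shared__ float sdata[256];      // assumes blockDim.x == 256
    __shared__ bool amLast;

    // Grid-stride partial dot product for this thread.
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        sum += p[i] * q[i];

    // Standard block-level reduction in shared memory.
    sdata[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        partialSums[blockIdx.x] = sdata[0];
        __threadfence();                              // publish this block's partial sum
        unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
        amLast = (ticket == gridDim.x - 1);           // true only in the last block to retire
    }
    __syncthreads();

    // Only the last block is guaranteed to see every partial sum; note that the
    // other blocks have already passed this point, so only this block (or a later
    // kernel) can use the total inside this launch.
    if (amLast) {
        float total = 0.0f;
        for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            total += partialSums[i];
        sdata[threadIdx.x] = total;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                sdata[threadIdx.x] += sdata[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) {
            *result = sdata[0];                       // final value, produced in one kernel
            retirementCount = 0;                      // reset for the next launch
        }
    }
}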

Thank you both,

What I provided as pseudo-code was just a general example. What I actually want to do is something similar to the following:

CUDA Kernel()
{
    // reduce d
    for (j = 1; j <= N; j++) {
        d = d + p[j]*q[j];
    }

    // on a single thread
    alpha = rho0 / d;

    // on all threads
    z[j] = z[j] + alpha*p[j];
    r[j] = r[j] - alpha*q[j];

    // reduce rho
    for (j = 1; j <= lastcol-firstcol+1; j++) {
        rho = rho + r[j]*r[j];
    }
}

… and then rho is used later, and there are several functions with similar operations that follow.

What do you think, @akavo?

This is impossible without a second kernel invocation.
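
For illustration, here is one way the posted sequence could be split at its grid-wide dependencies. The kernel names, launch configuration and the cgStep wrapper are placeholders, not anything from your code, and reading the partial sums back to the host is just one option (a small second reduction kernel would do equally well):

#include <cuda_runtime.h>

// Partial dot product: each block writes one partial sum.
// Assumes 256 threads per block; partial must hold gridDim.x floats.
__global__ void dotKernel(const float *a, const float *b, float *partial, int n)
{
    __shared__ float sdata[256];
    float sum = 0.0f;
    for (int j = blockIdx.x * blockDim.x + threadIdx.x; j < n;
         j += gridDim.x * blockDim.x)
        sum += a[j] * b[j];
    sdata[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = sdata[0];
}

// z = z + alpha*p, r = r - alpha*q: purely element-wise, no cross-block dependency.
__global__ void updateKernel(float *z, float *r, const float *p, const float *q,
                             float alpha, int n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) {
        z[j] += alpha * p[j];
        r[j] -= alpha * q[j];
    }
}

// Host side: the launch boundaries are the points where the reduced value
// (first d, then rho) becomes visible to all blocks of the next kernel.
// d_partial is a device buffer with room for 64 floats.
float cgStep(const float *p, const float *q, float *z, float *r,
             float *d_partial, float rho0, int n)
{
    const int threads = 256, blocks = 64;
    float h_partial[64];

    // d = p . q
    dotKernel<<<blocks, threads>>>(p, q, d_partial, n);
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float d = 0.0f;
    for (int i = 0; i < blocks; ++i) d += h_partial[i];

    float alpha = rho0 / d;        // the single value every thread needs next

    // alpha reaches all blocks simply by being a kernel argument.
    updateKernel<<<(n + threads - 1) / threads, threads>>>(z, r, p, q, alpha, n);

    // rho = r . r
    dotKernel<<<blocks, threads>>>(r, r, d_partial, n);
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float rho = 0.0f;
    for (int i = 0; i < blocks; ++i) rho += h_partial[i];
    return rho;                    // used by whatever follows
}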