Any GPU experts out there? =( … I need help. Does anyone know whether it is possible to synchronize across blocks inside a single CUDA kernel, say after a reduction, and then broadcast the computed value to all threads of all blocks in one shot? Roughly this:
    kernel:
        block-level reduction
        <grid-wide sync here ??>
        use the reduced value on all threads in all blocks
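To make the question concrete, here is roughly the pattern I am after, sketched with cooperative groups' grid.sync(). As I understand it, this needs CUDA 9+, compute capability 6.0+, compilation with -rdc=true, and a cooperative launch; the names reduceAndUse / partial / launch are just illustrative, not real API:

    #include <cooperative_groups.h>

    namespace cg = cooperative_groups;

    // Sketch: sum-reduce in[0..n) and let every thread of every block see the total.
    // Assumes blockDim.x is a power of two (for the tree reduction below).
    __global__ void reduceAndUse(const float *in, float *partial, int n) {
        cg::grid_group grid = cg::this_grid();
        extern __shared__ float sdata[];

        // 1. Classic block-level tree reduction into shared memory.
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;
        sdata[tid] = (i < n) ? in[i] : 0.0f;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) sdata[tid] += sdata[tid + s];
            __syncthreads();
        }
        if (tid == 0) partial[blockIdx.x] = sdata[0];

        // 2. Grid-wide barrier: all per-block partials are now globally visible.
        grid.sync();

        // 3. One thread folds the per-block partials into a single total.
        if (grid.thread_rank() == 0) {
            float total = 0.0f;
            for (unsigned b = 0; b < gridDim.x; ++b) total += partial[b];
            partial[0] = total;
        }
        grid.sync();

        // 4. Every thread in every block can now use the reduced value.
        float total = partial[0];
        (void)total;  // ... continue the computation with `total` here ...
    }

    // Host side: grid.sync() is only legal under a cooperative launch,
    // so the kernel must go through cudaLaunchCooperativeKernel.
    void launch(const float *d_in, float *d_partial, int n, int blocks, int threads) {
        void *args[] = { (void *)&d_in, (void *)&d_partial, (void *)&n };
        cudaLaunchCooperativeKernel((void *)reduceAndUse, dim3(blocks), dim3(threads),
                                    args, threads * sizeof(float), 0);
    }

If I am reading the docs right, a cooperative launch only works when every block in the grid can be resident on the device at once (occupancy can be checked with cudaOccupancyMaxActiveBlocksPerMultiprocessor), so the grid size is capped. That restriction is part of what I am unsure about.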
Any help would be appreciated. The obvious answer is to break the kernel into two kernels, but I don't want to do that. Any other solutions??