Hi,
I am using CUDA C and I am really confused. The problem is this:
I need each block (the group of threads in the same block) to produce its own result, and I need to copy those results to CPU memory.
Can I store the result of each block in shared memory and then copy it to global memory?
How can I gather the results from all the blocks?
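
To make the question concrete, here is a minimal sketch of one common pattern, not a definitive implementation. It assumes the per-block result is a simple integer sum and that the block size is a power of two. The names `block_sum`, `sum_per_block`, and `THREADS_PER_BLOCK` are hypothetical and chosen only for illustration. Each block reduces its slice in shared memory, one thread writes the block's value to a global array at index `blockIdx.x`, and the host then copies that whole array back with a single `cudaMemcpy`.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>

// Assumed block size; the reduction below requires a power of two.
#define THREADS_PER_BLOCK 256

// Each block sums its slice of `in` and writes one value to
// block_results[blockIdx.x]. Must be launched with THREADS_PER_BLOCK threads.
__global__ void block_sum(const int *in, int *block_results, int n)
{
    // Shared memory is visible only to the threads of this block.
    __shared__ int partial[THREADS_PER_BLOCK];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Threads past the end of the input contribute 0.
    partial[tid] = (gid < n) ? in[gid] : 0;
    __syncthreads();

    // Tree reduction inside shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // Shared memory does not outlive the block, so one thread copies
    // the block's result into global memory.
    if (tid == 0)
        block_results[blockIdx.x] = partial[0];
}

// Hypothetical host helper: fills host_results[0..num_blocks-1].
// Returns 0 on success, -1 on any CUDA error.
int sum_per_block(const int *host_in, int n, int *host_results, int num_blocks)
{
    int *dev_in = NULL;
    int *dev_results = NULL;

    if (cudaMalloc(&dev_in, n * sizeof(int)) != cudaSuccess)
        return -1;
    if (cudaMalloc(&dev_results, num_blocks * sizeof(int)) != cudaSuccess) {
        cudaFree(dev_in);
        return -1;
    }

    cudaError_t err = cudaMemcpy(dev_in, host_in, n * sizeof(int),
                                 cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        block_sum<<<num_blocks, THREADS_PER_BLOCK>>>(dev_in, dev_results, n);
        // This copy waits for the kernel to finish, then gathers
        // every block's result in one transfer.
        err = cudaMemcpy(host_results, dev_results, num_blocks * sizeof(int),
                         cudaMemcpyDeviceToHost);
    }

    cudaFree(dev_in);
    cudaFree(dev_results);
    return (err == cudaSuccess) ? 0 : -1;
}

int main(void)
{
    const int n = 1024;
    const int num_blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    int host_in[n];
    int host_results[num_blocks];
    for (int i = 0; i < n; ++i)
        host_in[i] = 1;

    int status = sum_per_block(host_in, n, host_results, num_blocks);
    assert(status == 0);

    for (int b = 0; b < num_blocks; ++b)
        printf("block %d: %d\n", b, host_results[b]);
    return 0;
}
```

The point of the sketch is that shared memory only serves as per-block scratch space. The results become visible to the host only after they are written to global memory, which is why the output array has one slot per block.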