Get the result of each block

Hi,

I am using CUDA C and I am really confused. The problem is this:
I need each block (the group of threads in the same block) to produce its own result, and then I need to copy those results to CPU memory.

Can I store the result of each block in shared memory and then copy it to global memory?
How can I gather the results of all the blocks?
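
To make the question concrete, here is a minimal sketch of what I have in mind. It assumes that each block's "result" is simply the sum of its slice of an input array. The kernel name blockSum, the block size of 256, the input size and the sum itself are placeholders for illustration, not my real computation. Each block reduces in shared memory, thread 0 writes the block's value to one slot of a global array indexed by blockIdx.x, and the host copies that array back with cudaMemcpy.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256   // assumed block size (must be a power of two here)

// Hypothetical kernel: each block sums its own slice of `in` and
// writes a single value to blockResults[blockIdx.x].
__global__ void blockSum(const float *in, float *blockResults, int n)
{
    // Shared memory is private to the block and only lives during the kernel.
    __shared__ float partial[THREADS_PER_BLOCK];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread loads one element (0 if it falls past the end of the input).
    partial[tid] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    // Tree reduction inside the block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    // One thread per block publishes the block's result to global memory.
    if (tid == 0)
        blockResults[blockIdx.x] = partial[0];
}

int main()
{
    const int n = 1024;  // assumed input size
    const int numBlocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;

    float hIn[n];
    float hResults[numBlocks];
    for (int i = 0; i < n; ++i)
        hIn[i] = 1.0f;

    float *dIn = nullptr;
    float *dResults = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dResults, numBlocks * sizeof(float));
    cudaMemcpy(dIn, hIn, n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<numBlocks, THREADS_PER_BLOCK>>>(dIn, dResults, n);

    // Copy the per-block results back to the CPU (this also waits for the kernel).
    cudaError_t err = cudaMemcpy(hResults, dResults, numBlocks * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int b = 0; b < numBlocks; ++b)
        printf("block %d: %f\n", b, hResults[b]);

    cudaFree(dIn);
    cudaFree(dResults);
    return 0;
}
```

As far as I understand, shared memory cannot be copied to the host directly, which is why the sketch goes through a global array with one element per block. Is this the right way to do it?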

Cross-posted at:

https://stackoverflow.com/questions/56314756/gather-result-of-each-block-of-threads