In the guide one of the advice is to copy the least number of largest amount of data from memory to memory.
So I would like each of my threads to stock computed results in shared memory and copy the results from shared memory to global memory only when 1/2 of shared memory amount / nb threads / multiprocessor is filled.
At this time I’d like to do something like a cudaMemCpy of 64 bytes in the kernel, but nvcc seems not to be happy with this instruction.
I could define a structure of 64 bytes to do the transfert with an assignment statement it but that’s not a cute way …
How would you do it ??