Copying large amounts of data from shared to global? cudaMemcpy doesn't work in a kernel...

Hi everyone,

In the guide, one piece of advice is to do the fewest, largest transfers possible when copying data from memory to memory.

So I would like each of my threads to store its computed results in shared memory, and copy them from shared to global memory only once half of the shared memory per thread per multiprocessor is filled.

At that point I’d like to do something like a cudaMemcpy of 64 bytes inside the kernel, but nvcc doesn’t seem happy with that call.

I could define a 64-byte structure and do the transfer with a single assignment statement, but that’s not a very clean way to do it…

How would you do it ??


In my testing, it is always faster to do two 32-bit assignments (fully coalesced) than one 64- or 128-bit assignment (fully coalesced). The guide goes to great lengths about how to get fully coalesced reads/writes for large structures, but it doesn’t make it very obvious that doing so is a compromise. Basically, a “structure of arrays” should be preferred to an “array of structures” unless you absolutely need the array of structures and can’t get around it. Maybe the guide talks about this, I don’t recall.

By far, the absolute most important thing as far as mem transfer performance goes is to have fully coalesced memory reads/writes to global mem. Check out the guide for details.