cuMemcpyDtoD completes in 2.5ms for an 85MB → 85MB intra-device copy; counting both the read and the write, that is about 68GB/s, quite respectable for an 8800 GTX. My equivalent kernel:
__global__ void Copy(float3 *src, float3 *dst, int nElems) {
    /* grid-stride loop: the whole grid sweeps the array in steps of
       kBlockStride = kNThreads * kNBlocks */
    int i = blockIdx.x * kNThreads + threadIdx.x;
    for (; i < nElems; i += kBlockStride) {
        dst[i] = src[i];
    }
}
With kNThreads set to 512, kNBlocks to 128 and kBlockStride to 512 × 128 = 65536, this runs in 22.4ms, timed with cuEvents around the cuLaunchGrid call (sketched below); that is only 7.5GB/s.
Should I not be able to match cuMemcpyDtoD's performance, since this is surely how it is implemented internally? It should be a thoroughly memory-bound algorithm, so I am not sure how much influence I can have inside the kernel.
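The timing looks roughly like this (a sketch, not my exact code; func is the CUfunction handle for Copy with its parameters already bound, and nBytes is the 85MB buffer size):

    CUevent start, stop;
    float ms;
    cuEventCreate(&start, CU_EVENT_DEFAULT);
    cuEventCreate(&stop, CU_EVENT_DEFAULT);
    cuFuncSetBlockShape(func, 512, 1, 1);   /* kNThreads */
    cuEventRecord(start, 0);
    cuLaunchGrid(func, 128, 1);             /* kNBlocks */
    cuEventRecord(stop, 0);
    cuEventSynchronize(stop);
    cuEventElapsedTime(&ms, start, stop);
    /* a copy moves each byte twice: one read plus one write */
    double gbPerSec = 2.0 * nBytes / (ms * 1.0e6);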
float3 reads/writes are not coalesced: a 12-byte element does not fit the 4-, 8- or 16-byte per-thread access pattern that coalescing requires on this hardware.
You could use shared memory (as shown in the SC07 optimization talk available at www.gpgpu.org/sc2007) to improve performance.
Or, since you are just copying the data, you can simply treat it as an array of floats instead of float3s. With coalesced reads/writes you should achieve ~70 GiB/s; a sketch follows.
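Something along these lines (the kernel name and launch configuration are illustrative; nElems float3s are 3 * nElems floats):

    __global__ void CopyFloats(const float *src, float *dst, int nFloats) {
        /* grid-stride loop; consecutive threads touch consecutive 4-byte
           words, so both the reads and the writes coalesce */
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < nFloats;
             i += gridDim.x * blockDim.x) {
            dst[i] = src[i];
        }
    }

    /* illustrative launch:
       CopyFloats<<<kNBlocks, kNThreads>>>((const float *)src,
                                           (float *)dst, 3 * nElems); */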
OK, I’m nearly there. Coalescing brought my float3 kernel up to 52.8GB/s, and a simple float version of the kernel achieves 63.2GB/s, with the thread and block counts found by exhaustive search (512 threads, 80 blocks).
This is very close to cuMemcpyDtoD’s performance, though cuMemcpyDtoD still peaks a little higher at 67.1GB/s. Is there anything further I can do to optimise this kernel?
Thanks. That’s an optimisation worth having; I’ve dropped it into my code generator.
However, the execution time for this kernel remains unchanged, which I take to imply that it is completely memory-bound. I’d expect that for a copy; I wonder what more I can do to increase memory bus utilisation.
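One idea I may try is issuing fewer, wider transactions by copying through float4 (just a sketch; it assumes the buffers are 16-byte aligned and the size is a multiple of 16 bytes, and whether 128-bit accesses actually help here is something I would have to measure):

    __global__ void CopyFloat4(const float4 *src, float4 *dst, int nVecs) {
        /* each thread moves 16 bytes per iteration */
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < nVecs;
             i += gridDim.x * blockDim.x) {
            dst[i] = src[i];
        }
    }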
This looks rather low. I get ~74.7 GB/s with cudaMemcpy and ~76.8 GB/s in custom code on an 8800 GTX. Are you sure you are using standard units? Usually 1 GB/s means 10^9 B/s, not 2^30 B/s.
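For example, your 63.2 figure read as 63.2 × 10^9 B/s is only 63.2e9 / 2^30 ≈ 58.9 GiB/s, so the choice of convention alone shifts the number by about 7%.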
I’d like to know why copy bandwidth is about the same as read bandwidth (~74 GB/s) when write bandwidth is substantially lower (~55 GB/s).
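(For anyone reproducing those numbers: kernels along the following lines are the usual way to measure read-only and write-only bandwidth. The names are illustrative, and the per-thread sum is stored only so the compiler cannot discard the loads.)

    __global__ void ReadOnly(const float *src, float *sums, int n) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float acc = 0.0f;
        for (int i = tid; i < n; i += gridDim.x * blockDim.x)
            acc += src[i];      /* loads dominate the traffic */
        sums[tid] = acc;        /* one store per thread keeps the loads live */
    }

    __global__ void WriteOnly(float *dst, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)
            dst[i] = 1.0f;      /* stores only, no reads */
    }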