Replicating cuMemcpyDtoD performance

cuMemcpyDtoD completes in 2.5ms for an 85MB intra-device copy; counting both the read and the write, that is about 68GB/s, quite respectable for an 8800 GTX. My equivalent kernel:

__global__ void Copy(float3 *src, float3 *dst, int nElems) {
  int i = blockIdx.x * kNThreads + threadIdx.x;
  for (; i < nElems; i += kBlockStride) {
    dst[i] = src[i];
  }
}

With kNThreads set to 512, kNBlocks to 128 and kBlockStride set to 512 × 128 = 65536, this runs in 22.4ms (timed with cuEvents around cuLaunchGrid); only 7.5GB/s.

Should I not be able to equal cuMemcpyDtoD performance, as this is surely how it is implemented internally? It should be a very memory-bound algorithm, so I am not certain how much influence I can have inside the kernel.

float3 reads/writes are not coalesced.
You could use shared memory (as shown in the SC07 optimization talk) to improve performance.
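A sketch of that staging idea (untested, and the kernel name and structure are mine, not from the talk): global memory is touched as consecutive floats, which coalesces on G80, while each thread still sees whole float3 values in shared memory. Launch with 3 * blockDim.x * sizeof(float) bytes of dynamic shared memory.

```cuda
__global__ void CopyFloat3Staged(const float3 *src, float3 *dst, int nElems) {
  extern __shared__ float s[];               // 3 * blockDim.x floats
  int tid = threadIdx.x;
  int bd  = blockDim.x;
  for (int base = blockIdx.x * bd; base < nElems; base += gridDim.x * bd) {
    int n = min(bd, nElems - base);          // elements handled this pass
    const float *fsrc = (const float *)(src + base);
    float *fdst = (float *)(dst + base);
    // Stage in: consecutive threads load consecutive 4-byte words.
    for (int k = tid; k < 3 * n; k += bd)
      s[k] = fsrc[k];
    __syncthreads();
    // A real kernel would now read s[3*tid .. 3*tid+2] as its float3 and
    // compute with it; for a pure copy there is nothing to do here.
    // Stage out the same way, again coalesced.
    for (int k = tid; k < 3 * n; k += bd)
      fdst[k] = s[k];
    __syncthreads();                         // s is reused next pass
  }
}
```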

Or, since you are just copying the data, you can just treat the data as an array of floats instead of float3. With coalesced reads/writes, you should achieve ~70 GiB/s.
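That float-array suggestion as a minimal kernel (a sketch; the name and grid-stride structure are illustrative):

```cuda
// View the float3 buffers as plain float arrays; consecutive threads then
// touch consecutive 4-byte words, which coalesces on G80.
__global__ void CopyAsFloats(const float *src, float *dst, int nFloats) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  for (; i < nFloats; i += gridDim.x * blockDim.x)
    dst[i] = src[i];
}
```

Launch with nFloats = 3 * nElems when the underlying data is float3.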

Just out of curiosity, why not call cuMemcpyDtoD?

Thanks, I’m sure that’s what the problem will be. The documented magnitude of performance difference is identical to what I’m experiencing.

I used copy to simplify the example. However, I am trying to achieve similar read/write performance in more complex kernels.

OK, I’m nearly there. Coalescing brought my float3 kernel up to 52.8GB/s, and a simple float version of the kernel achieves 63.2GB/s, with thread count/block count found by exhaustive search (512 threads, 80 blocks).

This is very close to cuMemcpyDtoD’s performance, but it peaks a little higher at 67.1GB/s. Is there anything further I can do to optimise this kernel?

__global__ void Copy_Kernel(float *sIdxr, float *dIdxr, int __nElems) {
  int __i = blockIdx.x * blockDim.x + threadIdx.x;
  for (; __i < __nElems; __i += gridDim.x * blockDim.x) {
    dIdxr[__i] = sIdxr[__i];
  }
}

Try using the fast 24-bit integer multiply:

__global__ void Copy_Kernel(float *sIdxr, float *dIdxr, int __nElems) {
  int __i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
  for (; __i < __nElems; __i += __mul24(gridDim.x, blockDim.x)) {
    dIdxr[__i] = sIdxr[__i];
  }
}

Thanks. That’s an optimisation worth having, I’ve dropped it into my code generator.

However, the execution time for this kernel is unchanged, which suggests the kernel is completely memory bound. I’d expect that for a copy; I wonder what more I can do to increase memory bus utilisation.
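One further thing sometimes tried on G80-class parts (my suggestion, not something established in this thread, and the gain over plain float copies varies) is widening each access to 128 bits with float4:

```cuda
// Sketch: 128-bit loads/stores. Assumes the buffers are 16-byte aligned
// and the float count is divisible by 4 (nVecs = nFloats / 4).
__global__ void CopyAsFloat4(const float4 *src, float4 *dst, int nVecs) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  for (; i < nVecs; i += gridDim.x * blockDim.x)
    dst[i] = src[i];
}
```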

This looks rather low. I get ~74.7 GB/s in cudaMemcpy and ~76.8 GB/s in custom code on 8800 GTX. Are you sure that you use standard units? Usually, 1 GB/s is 10^9 B/s, not 2^30 B/s.

I’d like to know why copy bandwidth is about the same as read bandwidth (~74 GB/s) when write bandwidth is substantially lower (~55 GB/s).

Sorry, yes, I really meant GiB (2^30). That would put my cuMemcpy at 72GB/s and my implementation at 67.9GB/s.

Small differences can be accounted for by OEM clock tweaks.