Replicating cuMemcpyDtoD performance

cuMemcpyDtoD completes an 85 MB intra-device copy in 2.5 ms; counting both the read and the write, that's 170 MB moved in 2.5 ms, or about 68 GB/s, quite respectable for an 8800 GTX. My equivalent kernel:

const int kNThreads    = 512;
const int kNBlocks     = 128;
const int kBlockStride = kNThreads * kNBlocks;  // 512 * 128 = 65536

__global__ void Copy(float3 *src, float3 *dst, int nElems) {
  int i = blockIdx.x * kNThreads + threadIdx.x;
  // Grid-stride loop: each thread copies every kBlockStride-th element.
  for (; i < nElems; i += kBlockStride) {
    dst[i] = src[i];
  }
}

With kNThreads set to 512, kNBlocks to 128, and kBlockStride therefore 512 * 128 = 65536, this runs in 22.4 ms (timed with cuEvent calls surrounding cuLaunchGrid); only about 7.5 GB/s.
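For reference, a minimal launch-and-timing harness for the kernel above might look like the following. This is a sketch using the runtime API and cudaEvent timing; the measurement quoted above was taken with the driver API's cuEvent around cuLaunchGrid, but the structure is the same (error checking omitted for brevity):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  const int nElems = 85 * 1024 * 1024 / sizeof(float3);  // ~85 MB of float3 data
  float3 *src, *dst;
  cudaMalloc(&src, nElems * sizeof(float3));
  cudaMalloc(&dst, nElems * sizeof(float3));

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);
  Copy<<<kNBlocks, kNThreads>>>(src, dst, nElems);
  cudaEventRecord(stop, 0);
  cudaEventSynchronize(stop);

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);

  // Each element is read once and written once, so count its bytes twice.
  double gbps = 2.0 * nElems * sizeof(float3) / 1e9 / (ms / 1e3);
  printf("%.2f ms, %.2f GB/s\n", ms, gbps);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(src);
  cudaFree(dst);
  return 0;
}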

Shouldn't I be able to equal cuMemcpyDtoD's performance, given that this is surely how it is implemented internally? The algorithm should be entirely memory-bound, so I am not certain how much influence I can have inside the kernel.

float3 reads/writes are not coalesced.
You could use shared memory (as shown in the SC07 optimization talk available at www.gpgpu.org/sc2007) to improve performance.
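For concreteness, here is a sketch of that shared-memory staging pattern (kernel name and layout are mine; the talk's version differs in details). It assumes 256 threads per block and a grid that exactly covers nElems, so bounds checks and the grid-stride loop are elided:

__global__ void CopyFloat3Shared(float3 *src, float3 *dst, int nElems) {
  // Assumes blockDim.x == 256 and gridDim.x * 256 == nElems.
  __shared__ float s[256 * 3];
  int tid  = threadIdx.x;
  int base = blockIdx.x * 256;  // first float3 handled by this block

  // Read the block's 256 float3s as 768 consecutive floats: coalesced.
  float *src_f = (float *)(src + base);
  s[tid]       = src_f[tid];
  s[tid + 256] = src_f[tid + 256];
  s[tid + 512] = src_f[tid + 512];
  __syncthreads();

  // Each thread now owns one float3 in shared memory; a real kernel
  // would process it here. For a pure copy this stage is a no-op.
  float3 v = ((float3 *)s)[tid];
  ((float3 *)s)[tid] = v;
  __syncthreads();

  // Write back the same way: 768 consecutive, coalesced float stores.
  float *dst_f = (float *)(dst + base);
  dst_f[tid]       = s[tid];
  dst_f[tid + 256] = s[tid + 256];
  dst_f[tid + 512] = s[tid + 512];
}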

Or, since you are just copying the data, you can simply treat it as an array of floats instead of float3s. With coalesced reads/writes, you should achieve ~70 GiB/s.
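A sketch of that float-based variant (illustrative names; the grid-stride structure matches the original kernel). The only changes are the float* casts on the host side and tripling the element count:

__global__ void CopyAsFloats(float *src, float *dst, int n) {
  // Consecutive threads touch consecutive 4-byte words, so both the
  // reads and the writes coalesce on G80-class hardware.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += gridDim.x * blockDim.x) {
    dst[i] = src[i];
  }
}

// Host side: the same float3 buffers, viewed as 3 * nElems floats, e.g.
// CopyAsFloats<<<128, 512>>>((float *)src, (float *)dst, 3 * nElems);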

Just out of curiosity, why not call cuMemcpyDtoD?

Thanks, I'm sure that's the problem. The documented magnitude of the performance difference matches what I'm experiencing exactly.

I used a plain copy to simplify the example. However, what I am really after is similar read/write performance in more complex kernels.

OK, I'm nearly there. Coalescing brought my float3 kernel up to 52.8 GB/s, and a simple float version of the kernel achieves 63.2 GB/s, with the thread and block counts found by exhaustive search (512 threads, 80 blocks).

This is very close to cuMemcpyDtoD's performance, though cuMemcpyDtoD still peaks a little higher, at 67.1 GB/s. Is there anything further I can do to optimise this kernel?

__global__ void Copy_Kernel(float *sIdxr, float *dIdxr, int __nElems)
{
  int __i = blockIdx.x * blockDim.x + threadIdx.x;
  // Grid-stride loop over plain floats: consecutive threads access
  // consecutive addresses, so both reads and writes coalesce.
  for (; __i < __nElems; __i += gridDim.x * blockDim.x) {
    dIdxr[__i] = sIdxr[__i];
  }
}

Try using the fast 24-bit integer multiply:

__global__ void Copy_Kernel(float *sIdxr, float *dIdxr, int __nElems)
{
  // __mul24 is faster than a 32-bit multiply on G80-class hardware and
  // is exact as long as both operands fit in 24 bits.
  int __i = __mul24(blockIdx.x, blockDim.x) + threadIdx.x;
  for (; __i < __nElems; __i += __mul24(gridDim.x, blockDim.x)) {
    dIdxr[__i] = sIdxr[__i];
  }
}

Thanks. That's an optimisation worth having; I've dropped it into my code generator.

However, the execution time for this kernel is unchanged, which implies the kernel is completely memory-bound. I'd expect that for a copy; I wonder what more I can do to increase memory bus utilisation.
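One further lever worth trying (a sketch of a standard technique, not something from this thread): widen each access to 16 bytes with float4, so every thread moves more data per memory transaction. This assumes the element count is a multiple of 4; a scalar tail loop would handle any remainder:

__global__ void CopyFloat4(const float4 *src, float4 *dst, int nVec4) {
  // 16-byte accesses by consecutive threads remain coalesced and cut
  // the number of memory transactions per byte moved.
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < nVec4;
       i += gridDim.x * blockDim.x) {
    dst[i] = src[i];
  }
}

// Host side: view the float buffer as nElems / 4 float4s, e.g.
// CopyFloat4<<<80, 512>>>((const float4 *)src, (float4 *)dst, nElems / 4);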

This looks rather low. I get ~74.7 GB/s from cudaMemcpy and ~76.8 GB/s from custom code on an 8800 GTX. Are you sure you are using standard units? Usually 1 GB/s means 10^9 B/s, not 2^30 B/s.

I'd also like to know why copy bandwidth is about the same as read bandwidth (~74 GB/s) when write bandwidth is substantially lower (~55 GB/s).

Sorry, yes, I really meant GiB (2^30 bytes). Since 1 GiB/s ≈ 1.074 GB/s, that places my cuMemcpyDtoD at 72 GB/s and my implementation at 67.9 GB/s.

Small differences can be accounted for by OEM clock tweaks.