Hi all,
in my application I have to repeatedly copy large chunks of data device-to-device from a regular pitched array (allocated with cudaMalloc3D) to a 3D cudaArray, for access via a 3D texture reference. According to the CUDA visual profiler, the transfer of 512x256x512 single-precision floats (= 256 MB) takes 49.2 ms, which corresponds to a transfer rate of roughly 5 GB/s. This rate is also consistent with "wall clock" timing. I find this rather slow compared to the commonly cited > 70 GB/s memory bandwidth of the card, even if one halves that figure to account for the simultaneous read and write. I am running on a Tesla C1060. [topic="109721"]This post[/topic] describes a similar problem, but nobody has answered there yet.
Has anyone else observed similar behavior? Is there a workaround to speed up the transfer?
Thanks so much!
Code examples:
Destination cudaArray allocation:
[codebox]
cudaArray* imgArray;
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
// For cudaMalloc3DArray, the extent width is given in elements
cudaExtent imgExtent = make_cudaExtent(512, 256, 512);
// Note: cudaMalloc3DArray takes the extent by value, not by pointer
CUDA_SAFE_CALL(cudaMalloc3DArray(&imgArray, &channelDesc, imgExtent));
[/codebox]
Source array allocation:
[codebox]
cudaPitchedPtr ddImage;
// For cudaMalloc3D, the extent width is given in bytes
cudaExtent imgExtentByte = make_cudaExtent(512*sizeof(float), 256, 512);
CUDA_SAFE_CALL(cudaMalloc3D(&ddImage, imgExtentByte));
[/codebox]
Memcopy call:
[codebox]
cudaMemcpy3DParms aParms = {0};
// Destination is a cudaArray, so the copy extent is specified in elements
cudaExtent imgExtent = make_cudaExtent(512, 256, 512);
aParms.srcPtr = ddImage;
aParms.dstArray = imgArray;
aParms.extent = imgExtent;
aParms.kind = cudaMemcpyDeviceToDevice;
CUDA_SAFE_CALL(cudaMemcpy3D(&aParms));
[/codebox]
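For reference, this is roughly how I cross-check the profiler numbers with wall-clock timing (a minimal sketch using CUDA events; it assumes `imgArray` and `ddImage` have been allocated as in the snippets above):
[codebox]
#include <cstdio>

// Assumes imgArray (cudaArray*) and ddImage (cudaPitchedPtr)
// are already allocated as shown above.
cudaMemcpy3DParms aParms = {0};
aParms.srcPtr = ddImage;
aParms.dstArray = imgArray;
aParms.extent = make_cudaExtent(512, 256, 512);  // in elements
aParms.kind = cudaMemcpyDeviceToDevice;

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
cudaMemcpy3D(&aParms);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);   // wait until the copy has finished

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);

// 512*256*512 floats = 256 MB; report effective bandwidth
float gbPerSec = (256.0f / 1024.0f) / (ms / 1000.0f);
printf("copy took %.1f ms (%.1f GB/s)\n", ms, gbPerSec);

cudaEventDestroy(start);
cudaEventDestroy(stop);
[/codebox]
The event synchronization matters here: cudaMemcpy3D on device memory can return before the copy completes, so timing without cudaEventSynchronize (or cudaDeviceSynchronize) would underestimate the transfer time.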