3D device-to-device memcopy to cudaArray slow?

Hi all,

in my application I have to repeatedly copy large chunks of data device-to-device from a regular array (allocated with cudaMalloc) to a 3D cudaArray, for access via a 3D texture reference. According to the CUDA visual profiler, the transfer of 512x256x512 single-precision floats (= 256 MB) of data takes 49.2 msec, which corresponds to a transfer rate of roughly 5 GB/sec. This transfer rate is also consistent with “wall clock” timing. I find this rather slow compared to the usually claimed > 70 GB/s device memory bandwidth, even if one divides the latter number by two to account for the read plus the write. I am running on a Tesla C1060 card. [topic=“109721”]This post[/topic] describes a similar problem, but nobody has answered there yet.

Has anyone else observed a similar behavior? Is there a workaround to speed up the transfer?

Thanks so much!

Code examples:

Destination cudaArray allocation:

[codebox]
cudaArray* imgArray;
cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
cudaExtent imgExtent = make_cudaExtent(512, 256, 512);   // extent in elements, since this is a cudaArray
CUDA_SAFE_CALL (cudaMalloc3DArray (&imgArray, &channelDesc, imgExtent));
[/codebox]

Source array allocation:

[codebox]
cudaPitchedPtr ddImage;
cudaExtent imgExtentByte = make_cudaExtent(512*sizeof(float), 256, 512);   // width in bytes for linear memory
CUDA_SAFE_CALL(cudaMalloc3D (&ddImage, imgExtentByte));
[/codebox]

Memcopy call:

[codebox]
cudaMemcpy3DParms aParms = {0};
imgExtent = make_cudaExtent(512, 256, 512);   // in elements, because the destination is a cudaArray
aParms.srcPtr = ddImage;
aParms.dstArray = imgArray;
aParms.extent = imgExtent;
aParms.kind = cudaMemcpyDeviceToDevice;
CUDA_SAFE_CALL (cudaMemcpy3D (&aParms));
[/codebox]
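
For what it's worth, a minimal sketch of how the copy can be timed with CUDA events outside the profiler (the event variable names are just illustrative):

[codebox]
cudaEvent_t evStart, evStop;
float elapsedMs = 0.0f;

CUDA_SAFE_CALL (cudaEventCreate (&evStart));
CUDA_SAFE_CALL (cudaEventCreate (&evStop));

CUDA_SAFE_CALL (cudaEventRecord (evStart, 0));
CUDA_SAFE_CALL (cudaMemcpy3D (&aParms));              // the copy shown above
CUDA_SAFE_CALL (cudaEventRecord (evStop, 0));
CUDA_SAFE_CALL (cudaEventSynchronize (evStop));

CUDA_SAFE_CALL (cudaEventElapsedTime (&elapsedMs, evStart, evStop));
// 256 MB / (elapsedMs / 1000) gives the effective transfer rate

CUDA_SAFE_CALL (cudaEventDestroy (evStart));
CUDA_SAFE_CALL (cudaEventDestroy (evStop));
[/codebox]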

A cudaArray has a Z-order-like data arrangement, whereas the raw allocation from “cudaMalloc3D” is laid out like a CPU 3D array…

Instead, try array-to-array copies… That should work well.

Thanks, Sarnath, but I need to modify the data in a kernel (that can't be done in a cudaArray, so I need “raw” memory) and later copy it to a cudaArray. So an array-to-array copy doesn't seem like a solution to me…

I think they must be doing the Z-order curving on the GPU itself, resulting in horribly uncoalesced access. If you have time (that you don't mind wasting), you may want to first copy the 3D pitched pointer (allocated by cudaMalloc3D) to a host 3D pitched pointer and then copy the host 3D pitched pointer to the cudaArray. I suspect this method might do the Z-order curving on the CPU first and then copy the result back to the GPU in one shot. (Consider pinned memory for performance.)

It's my guess. It could be wrong. So, it's your time and your discretion! Good luck!

–edit-- PS:

First of all you need to profile the time taken to copy 256 MB onto host RAM. If that also gives you a 5 GB/s bandwidth, then there is no point in doing the exercise above, and I don't think you have a choice.

If you have a lot of RAM, you could pin 256 MB in memory, make a pitched pointer out of it, copy the GPU memory onto it and then copy it back to the GPU 3D array. Check the memcopy speed first before proceeding…
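
Something along these lines is what I mean (just an untested sketch; the pinned buffer and extents mirror the allocations from your first post):

[codebox]
// Pinned (page-locked) host staging buffer, same size as the device image
float* hPinned;
CUDA_SAFE_CALL (cudaMallocHost ((void**)&hPinned, 512 * 256 * 512 * sizeof(float)));

// Wrap it in a pitched pointer so cudaMemcpy3D accepts it
cudaPitchedPtr hPitched = make_cudaPitchedPtr (hPinned, 512 * sizeof(float), 512, 256);

// Step 1: device raw memory -> pinned host memory
cudaMemcpy3DParms d2h = {0};
d2h.srcPtr = ddImage;
d2h.dstPtr = hPitched;
d2h.extent = make_cudaExtent (512 * sizeof(float), 256, 512);   // byte width, no array involved
d2h.kind   = cudaMemcpyDeviceToHost;
CUDA_SAFE_CALL (cudaMemcpy3D (&d2h));

// Step 2: pinned host memory -> device cudaArray
cudaMemcpy3DParms h2d = {0};
h2d.srcPtr   = hPitched;
h2d.dstArray = imgArray;
h2d.extent   = make_cudaExtent (512, 256, 512);                 // element extent, array destination
h2d.kind     = cudaMemcpyHostToDevice;
CUDA_SAFE_CALL (cudaMemcpy3D (&h2d));
[/codebox]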

In retrospect, I think you are hitting 5 GB/s because the NV driver is doing what I described above. 5 GB/s is the PCIe bandwidth, isn't it? :-)

Maybe they don't take the liberty of using pinned memory and are copying in chunks… You may want to take that liberty yourself, since you know what is good for your system…

Anyway, one can always experiment and find out what works well!

If you find a good solution, kindly post it here for the community's benefit. Thanks!

" 5GB/s is the PCIe bandwidth, isnt it? :-) "
I also believe that for GPU computing with often+big memory transfers the PCI Bus is the bottleneck (today).
There is an cuda code example bandwithtest which shows that - hug memory copy speed differencex between devive>over PCI>device to onboard VRAM memcopy.

I ran some tests, and any of the following memcopy operations results in 4-5 GB/sec transfer speed, tested with 64MB transfer size
(d = device, h = host, a2D = 2D array, a3D = 3D array, r = raw memory):

  • d r → d a3D
  • d r → d r
  • d r → d a2D
  • h r → d r
  • h r → d a3D

So replacing my memcopy with any of those doesn't help. But copying the same amount of memory with a very simple kernel (coalesced copy, device raw memory to device raw memory) gives 36 GB/sec transfer speed. Unfortunately, I cannot use this to gain 3D texture access, since neither writing to a cudaArray in a kernel nor 3D texturing from linear memory is supported.
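
For reference, the “very simple kernel” is nothing more than a plain element-wise copy, roughly like this sketch (n is the number of floats):

[codebox]
// Straightforward coalesced copy: each thread moves one float
__global__ void copyKernel (float* dst, const float* src, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        dst[i] = src[i];
}

// launched with e.g. 256 threads per block:
// copyKernel<<< (n + 255) / 256, 256 >>> (dDst, dSrc, n);
[/codebox]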

Any other ideas?

I have a similar performance bottleneck in my application (repeatedly copying dynamic data into 3D textures for subsequent access) and I have investigated several strategies to improve performance.

When Z ordering the data into texture memory, I doubt the driver copies the data to the host and back just to avoid uncoalesced memory access. Z ordering on the host certainly wouldn’t have a better memory access pattern than on the device, wouldn’t compute the Z indices (or inverse Z indices depending on how you do it) in parallel, and would have the additional overhead of 2 PCIe copies. I’d need to see stronger evidence for this.

Calculating the Z index from the 1D index to figure out where to write the data (or alternatively calculating the 1D index from the Z index to figure out where to read the data) is surprisingly expensive and certainly contributes to the performance degradation we see when doing these copies.
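
To give an idea of the cost, a straightforward (unoptimized) 3D Morton/Z index computation interleaves the bits of x, y and z one at a time, roughly like this sketch (simplified to equal power-of-two dimensions, and not necessarily what the driver actually does):

[codebox]
// Naive 3D Morton (Z-order) index: interleave the bits of x, y and z.
// Each output element costs a long chain of shifts, ANDs and ORs.
__device__ unsigned int mortonIndex3D (unsigned int x, unsigned int y, unsigned int z)
{
    unsigned int index = 0;
    for (unsigned int bit = 0; bit < 10; ++bit)   // 10 bits covers dimensions up to 1024
    {
        index |= ((x >> bit) & 1u) << (3 * bit);
        index |= ((y >> bit) & 1u) << (3 * bit + 1);
        index |= ((z >> bit) & 1u) << (3 * bit + 2);
    }
    return index;
}
[/codebox]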

When you subsequently access your dynamic data via tex3D( … ), do you care about trilinear filtering or normalized coordinates? Or do you simply care about taking advantage of the texture cache because there is some amount of spatial locality in how you access the data?

If you don’t care about filtering or normalized coordinates, you can bind your global memory directly to a 1D texture reference. I do this in some parts of my application to avoid the slow “memcopy” into texture memory.
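
The binding itself is only a few lines, roughly like this sketch (volumeTex and dData are placeholder names; the 512x256x512 volume from the first post is used as the example size):

[codebox]
// 1D texture reference bound to ordinary linear device memory.
// No filtering and no normalized coordinates are available this way.
texture<float, 1, cudaReadModeElementType> volumeTex;

float* dData;                                      // written by your kernels
size_t numBytes = 512 * 256 * 512 * sizeof(float);
CUDA_SAFE_CALL (cudaMalloc ((void**)&dData, numBytes));
CUDA_SAFE_CALL (cudaBindTexture (0, volumeTex, dData, numBytes));

// inside a kernel, sample with a flat 1D index:
//   float v = tex1Dfetch (volumeTex, index1D);
[/codebox]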

I get significantly improved performance in my application by doing it this way, but this technique comes with a few caveats:

  • You need to sample the texture with 1D indices so you need to convert your 3D indices into 1D indices. You need to do this anyway when writing to global memory so you probably already have efficient code to do the conversion.

  • It is possible to go out of bounds with undefined results. You either need to guarantee at compile time that you will never go out of bounds, or you need to boundary-check every texture access that might go out of bounds. Doing the boundary check actually costs a non-trivial number of operations, so think carefully about this.

  • When sampling the 1D texture, the texture cache is populated with memory that is nearby in 1D. To take advantage of 3D spatial locality, you need to index your data using some non-canonical ordering. In my application I experimented with Hilbert indexing, Z indexing, and simple tiling (canonically ordered 3D tiles with canonically ordered memory within each tile). I thought Hilbert indexing was going to give me the best performance, but it actually gave the worst because computing the Hilbert index for each sample was so expensive. Simple tiling turned out to give me the best performance for my application and also imposed the fewest restrictions on the data size; Hilbert and Z indexing require equal power-of-two dimensions, whereas tiling only requires each dimension to be a multiple of the tile size (a sketch of the tiled address calculation follows this list).
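
A sketch of the simple-tiling address calculation mentioned above (the tile size and names are illustrative; my actual code differs in the details):

[codebox]
// Simple tiling: the volume is split into TILE x TILE x TILE bricks stored
// contiguously; within each brick, elements keep their canonical x/y/z order.
// Assumes every dimension is a multiple of TILE.
#define TILE 8

__device__ unsigned int tiledIndex (unsigned int x, unsigned int y, unsigned int z,
                                    unsigned int dimX, unsigned int dimY)
{
    unsigned int tilesX = dimX / TILE;
    unsigned int tilesY = dimY / TILE;

    unsigned int tileId   = (z / TILE) * tilesY * tilesX
                          + (y / TILE) * tilesX
                          + (x / TILE);
    unsigned int inTileId = (z % TILE) * TILE * TILE
                          + (y % TILE) * TILE
                          + (x % TILE);

    return tileId * (TILE * TILE * TILE) + inTileId;
}

// sample the 1D texture at the tiled address:
//   float v = tex1Dfetch (volumeTex, tiledIndex (x, y, z, dimX, dimY));
[/codebox]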

I hope that helps you or someone else :)

cheers,
mike

Mike, thanks for your post, your analysis is certainly helpful.

I do care about trilinear interpolation and texture clamping, so using a 3D texture is really convenient. However, a faster memcopy might outweigh the cost of doing the interpolation/clamping myself in the kernel. The problem with 1D linear texture access is that the limit of 2^27 elements for a 1D texture bound to linear memory might not be enough. But what I should maybe do is bind a 2D texture to linear memory. Then I get at least bilinear interpolation for free, meaning that the code only has to do bounds checking and one linear interpolation. This still needs a non-trivial memory layout (simple tiling).
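
To make the idea concrete, here is a rough, untested sketch of the sampling part (the slice-to-2D mapping is the piece that would become the tiled layout; sliceTex, sliceOrigin and sampleTrilinear are just illustrative names):

[codebox]
// 2D texture reference bound to pitch-linear device memory.
// Hardware gives bilinear filtering in x/y; the z interpolation is done by hand.
// Host side (not shown): set sliceTex.filterMode = cudaFilterModeLinear and
// bind the pitched allocation with cudaBindTexture2D.
texture<float, 2, cudaReadModeElementType> sliceTex;

// Example mapping: slices simply stacked along y. For 512x256x512 this exceeds
// the 2D texture size limits, which is why a tiled layout is needed in practice;
// this mapping is the only part that would have to change.
__device__ float2 sliceOrigin (unsigned int slice, unsigned int dimY)
{
    return make_float2 (0.0f, (float)(slice * dimY));
}

__device__ float sampleTrilinear (float x, float y, float z,
                                  unsigned int dimY, unsigned int dimZ)
{
    // clamp z and split it into the two enclosing slices plus a fraction
    z = fminf (fmaxf (z, 0.0f), (float)(dimZ - 1));
    unsigned int z0 = (unsigned int) z;
    unsigned int z1 = min (z0 + 1, dimZ - 1);
    float fz = z - (float) z0;

    // hardware bilinear fetch from each of the two slices
    // (x and y must be bounds-checked so the filter footprint stays inside one slice)
    float2 o0 = sliceOrigin (z0, dimY);
    float2 o1 = sliceOrigin (z1, dimY);
    float v0 = tex2D (sliceTex, o0.x + x + 0.5f, o0.y + y + 0.5f);
    float v1 = tex2D (sliceTex, o1.x + x + 0.5f, o1.y + y + 0.5f);

    return v0 + fz * (v1 - v0);   // manual linear interpolation in z
}
[/codebox]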

Seeing that I can get 36 GB/s with a simple kernel that does a device-to-device (raw-to-raw) memory copy, I assume it should be possible for NVIDIA to provide a similar kernel for raw-to-array copying (they know what the cudaArray memory layout looks like, we don't…). The efficient matrix transpose examples make me think that Z ordering should not be an insurmountable obstacle. Anybody from NVIDIA reading this?

Yes, the host to 3D array and device to 3D array transfer rates are rather slow. It’s a bottleneck for 3D texture related applications.