Memory copy for max coalescing

I have several host memory arrays, each of which have width x height image elements.
The different buffers include pixel values, and structs which contain results from CUDA image processing.

For maximum coalescing, I need to interleave these arrays so result elements corresponding to pixel values are adjacent in memory, as adjacent threads will process adjacent pixels.

One way would be to interleave them in host memory before copy to the device, but that would not be as efficient as interleaving them during memory copy to the device.

I am looking for a way to interleave these arrays by specifying a per element spacing in the src or dst memory.

At first I thought perhaps the srcPitch/dstPitch in cuMemcpy2D, by specifying a pitch of 1 in the src and a pitch of the size of total pixel data in the dst, would fill the bill, but now I am not optimistic.

Is there an API call suitable for interleaving buffers during copy to the device for maximum coalescing?

Your idea of using cudaMemcpy2D() for such interleaving tasks is spot on. I have used this to interleave complex data from an SOA on the host to an AOS on the device, and it worked fine:

Be warned though that 2D copies with small strides are slow, as the DMA engines that perform the copies are not optimized for this case. I seem to recall that the throughput for the above scenario was less than 1/20 of the throughput for 1D copies of large contiguous blocks. The performance many differ based on GPU and presumably differs based on stride, but I did not perform comprehensive experiments to find out.

Thanks for your input.

I see your example uses cudaMemcpy2D. Is that preferable to cuMemcpy2D?

Would one be faster than the other for small pitches?

I always use the CUDA runtime API. I was, quite literally, the first user of the CUDA runtime API, and never used the driver API ever again. Performance-wise it should make no difference whether you use cudaMemcpu2D() [runtime API] or cuMemcpy2D() [driver API], the actual work is accomplished by the same hardware DMA engine regardless of which API is used.

[Later:] To avoid misunderstandings: I did not create the CUDA runtime API, I was just the first engineer at NVIDIA to use it after it had been created by my colleague, back in 2005. He needed somebody to take this brand new interface for a spin, and so I did. Ancient history …

This was actually my first attempt to use the driver API. I just happened to find it first while looking for a more flexible memcpy, not realizing that the runtime API had an equivalent function.

Best I am aware, most of the functionality in the run-time and driver APIs is identical and there is generally no performance reason not to use the run-time API, as it is really a thin wrapper around CUDA driver functionality. As I understand it, the driver API can offer increased control and flexibility in some aspects of CUDA, and I think the manipulation of kernels (loading, compiling, launching) is one such area. The driver API is apparently providing enough incremental advantages to some CUDA developers that it has never been abandoned despite the fact that most CUDA programmers use the run-time API.

Unfortunately, the cudaMemcpy2D interleaved copying is so much slower, it negates any speedup from coalescing and is much slower overall.

I think it would be really advantageous, if there were an interleaving cudaMemcpy which was optimized for small strides. Seems like that would be useful in a lot of applications.

As I mentioned, the host<->device transfers are performed by hardware (DMA engines), and this is what primarily determines performance. Any software overhead is likely completely minimized at this point.

Improving hardware performance typically involves real costs due to increases in silicon real estate, so any such modifications tend to face high hurdles to justify the additional cost. As far as I am aware, this particular copy performance issue has not come up often at all. You may consider filing a CUDA enhancement request, via the bug reporting form linked from the registered developer website. Please prefix the subject line with "RFE: " to mark it as an enhancement request rather than an actual bug.

For the time being, if adjusting the host-side data structures isn’t an option, I would suggest trying to re-arrange the data in the desired way using a kernel running on the GPU.

I tried interleaving the data on the host, and while faster than cudaMemcpy2D, it is slower than my original non-interleaved implementation.

It seems that using a kernel would only incur the non-coalesced global memory access I’m trying to avoid in a different place with the added overhead of an additional kernel launch.

@robosmith, as @njuffa hints, if it’s just a simple stride/interleave then you could perform several properly coalesced loads, one for each host memory array type, and then interleave them in the kernel. Then you write several coalesced stores. Or just do your image processing in the same kernel.

If you want to pull the data from the host and write it out in interleaved format so it can be loaded by other kernels (more than once) then find the right number of loads to perform in order to write out native memory transactions.

For example, if you have an array of 32-bit words and 2 1-byte arrays and want to interleave and pack them into a 6-byte structure then your kernel should probably load enough array data to perform 12-byte (64-bit + 32-bit) or 24-byte (128-bit + 64-bit ) stores.

My original implementation has several coalesced loads and stores.

My goal is to reduce them to one coalesced load and one coalesced store, which I could do with properly interleaved data.

However, right now it appears to take more time to interleave the data than is saved.

This is entirely possible. Your total bandwidth requirements may well be higher if you have to touch the data multiple times to massage it into an optimized layout. If your code is already bandwidth constrained, this would lead to decreased performance.

Due to hardware improvements over the years, modern GPUs often do quite well without fully coalesced accesses. Depending on the GPU you use, you may be able to squeeze out some more performance by helping the compiler generated more loads through the texture path. The first line source code adjustments for this would be appropriate use of ‘const’ and ‘restrict’ modifiers for pointer arguments. See the Best Practices Guide.

It is really hard to provide more than vague analysis and/or suggestions without thorough knowledge of the code and its current performance characteristics. My standard advice these days is to let the CUDA profiler guide optimization efforts.