Memory copy for max coalescing

robosmith · January 6, 2015, 9:45pm

I have several host memory arrays, each of which has width x height image elements.
The buffers hold pixel values and structs containing results from CUDA image processing.

For maximum coalescing, I need to interleave these arrays so result elements corresponding to pixel values are adjacent in memory, as adjacent threads will process adjacent pixels.

One way would be to interleave them in host memory before copy to the device, but that would not be as efficient as interleaving them during memory copy to the device.

I am looking for a way to interleave these arrays by specifying a per element spacing in the src or dst memory.

At first I thought the srcPitch/dstPitch in cuMemcpy2D might fill the bill, with a pitch of 1 in the src and a pitch of the size of the total pixel data in the dst, but now I am not optimistic.

Is there an API call suitable for interleaving buffers during copy to the device for maximum coalescing?

njuffa · January 6, 2015, 9:58pm

Your idea of using cudaMemcpy2D() for such interleaving tasks is spot on. I have used this to interleave complex data from an SOA on the host to an AOS on the device, and it worked fine:

Stack Overflow: cuda - Copying data to "cufftComplex" data struct?

Be warned though that 2D copies with small strides are slow, as the DMA engines that perform the copies are not optimized for this case. I seem to recall that the throughput for the above scenario was less than 1/20 of the throughput for 1D copies of large contiguous blocks. The performance may differ based on GPU and presumably differs based on stride, but I did not perform comprehensive experiments to find out.
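To make the parameter mapping concrete, here is a minimal host-side sketch of what a pitched 2D copy does when each "row" is a single element. It is not code from the linked answer: `copy2d_model` and `interleave_complex` are hypothetical helpers that only mimic the argument order of cudaMemcpy2D() (dst, dpitch, src, spitch, width, height) in plain C++, so the layout can be checked without a GPU. With the real API, the same pitch, width, and height values would be passed along with cudaMemcpyHostToDevice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-side model of a pitched 2D copy: `height` rows of `width` bytes each,
// with consecutive rows `spitch` bytes apart in src and `dpitch` bytes apart
// in dst. Hypothetical helper; it performs no device transfer.
void copy2d_model(void* dst, std::size_t dpitch,
                  const void* src, std::size_t spitch,
                  std::size_t width, std::size_t height)
{
    auto* d = static_cast<unsigned char*>(dst);
    const auto* s = static_cast<const unsigned char*>(src);
    for (std::size_t row = 0; row < height; ++row) {
        std::memcpy(d + row * dpitch, s + row * spitch, width);
    }
}

// Interleave separate real/imaginary arrays (SoA) into
// re0, im0, re1, im1, ... (AoS).
std::vector<float> interleave_complex(const std::vector<float>& re,
                                      const std::vector<float>& im)
{
    const std::size_t n = re.size();  // assumes im.size() == n
    std::vector<float> out(2 * n);
    if (n == 0) return out;
    // Each row is one float; the destination pitch of two floats skips
    // over the slot that belongs to the other component.
    copy2d_model(out.data(),     2 * sizeof(float),
                 re.data(),      sizeof(float), sizeof(float), n);
    copy2d_model(out.data() + 1, 2 * sizeof(float),
                 im.data(),      sizeof(float), sizeof(float), n);
    return out;
}
```

Note that every row here is only four bytes wide, which is exactly the small-stride case described above as slow on the DMA engines.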

robosmith · January 6, 2015, 10:59pm

Thanks for your input.

I see your example uses cudaMemcpy2D. Is that preferable to cuMemcpy2D?

Would one be faster than the other for small pitches?

njuffa · January 6, 2015, 11:25pm

I always use the CUDA runtime API. I was, quite literally, the first user of the CUDA runtime API, and never used the driver API ever again. Performance-wise it should make no difference whether you use cudaMemcpy2D() [runtime API] or cuMemcpy2D() [driver API]: the actual work is accomplished by the same hardware DMA engine regardless of which API is used.

[Later:] To avoid misunderstandings: I did not create the CUDA runtime API, I was just the first engineer at NVIDIA to use it after it had been created by my colleague, back in 2005. He needed somebody to take this brand new interface for a spin, and so I did. Ancient history …

robosmith · January 7, 2015, 4:20pm

This was actually my first attempt to use the driver API. I just happened to find it first while looking for a more flexible memcpy, not realizing that the runtime API had an equivalent function.

njuffa · January 7, 2015, 5:41pm

As far as I am aware, most of the functionality in the run-time and driver APIs is identical and there is generally no performance reason not to use the run-time API, as it is really a thin wrapper around CUDA driver functionality. As I understand it, the driver API can offer increased control and flexibility in some aspects of CUDA, and I think the manipulation of kernels (loading, compiling, launching) is one such area. The driver API apparently provides enough incremental advantages to some CUDA developers that it has never been abandoned, despite the fact that most CUDA programmers use the run-time API.

robosmith · January 7, 2015, 5:40pm

Unfortunately, the cudaMemcpy2D interleaved copy is so much slower that it negates any speedup from coalescing, and the result is much slower overall.

I think it would be really advantageous if there were an interleaving cudaMemcpy optimized for small strides. It seems like that would be useful in a lot of applications.

njuffa · January 7, 2015, 7:04pm

As I mentioned, the host<->device transfers are performed by hardware (DMA engines), and this is what primarily determines performance. Any software overhead is likely completely minimized at this point.

Improving hardware performance typically involves real costs due to increases in silicon real estate, so any such modifications tend to face high hurdles to justify the additional cost. As far as I am aware, this particular copy performance issue has not come up often at all. You may consider filing a CUDA enhancement request, via the bug reporting form linked from the registered developer website. Please prefix the subject line with "RFE: " to mark it as an enhancement request rather than an actual bug.

For the time being, if adjusting the host-side data structures isn’t an option, I would suggest trying to re-arrange the data in the desired way using a kernel running on the GPU.

robosmith · January 7, 2015, 8:05pm

I tried interleaving the data on the host, and while faster than cudaMemcpy2D, it is slower than my original non-interleaved implementation.

It seems that using a kernel would only incur the non-coalesced global memory access I’m trying to avoid in a different place with the added overhead of an additional kernel launch.

allanmac · January 7, 2015, 8:17pm

@robosmith, as @njuffa hints, if it’s just a simple stride/interleave then you could perform several properly coalesced loads, one for each host memory array type, and then interleave them in the kernel. Then you write several coalesced stores. Or just do your image processing in the same kernel.

If you want to pull the data from the host and write it out in interleaved format so it can be loaded by other kernels (more than once) then find the right number of loads to perform in order to write out native memory transactions.

For example, if you have an array of 32-bit words and two 1-byte arrays and want to interleave and pack them into a 6-byte structure, then your kernel should probably load enough array data to perform 12-byte (64-bit + 32-bit) or 24-byte (128-bit + 64-bit) stores.
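A rough host-side sketch of the 12-byte case follows, assuming a hypothetical packed layout of one 4-byte word followed by the two 1-byte fields. `pack_pairs` and `kRecordBytes` are illustrative names, and the code is plain C++ rather than a kernel: it only shows how two 6-byte records can be assembled and then written as one 64-bit plus one 32-bit store. In a real kernel the wide stores would also have to meet the GPU's alignment requirements, which this model sidesteps by using memcpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical packed layout: 4-byte word, then two 1-byte fields.
constexpr std::size_t kRecordBytes = 6;

std::vector<std::uint8_t> pack_pairs(const std::vector<std::uint32_t>& words,
                                     const std::vector<std::uint8_t>& a,
                                     const std::vector<std::uint8_t>& b)
{
    // Assumes a.size() == b.size() == words.size() and an even element count.
    const std::size_t n = words.size();
    std::vector<std::uint8_t> out(n * kRecordBytes);
    for (std::size_t i = 0; i + 1 < n; i += 2) {
        // Assemble two 6-byte records in a 12-byte staging buffer
        // (on a GPU this would live in registers).
        std::uint8_t staging[2 * kRecordBytes];
        for (std::size_t k = 0; k < 2; ++k) {
            std::memcpy(staging + k * kRecordBytes, &words[i + k],
                        sizeof(std::uint32_t));
            staging[k * kRecordBytes + 4] = a[i + k];
            staging[k * kRecordBytes + 5] = b[i + k];
        }
        // Emit the pair as one 64-bit store followed by one 32-bit store.
        std::uint64_t lo;
        std::uint32_t hi;
        std::memcpy(&lo, staging, sizeof lo);
        std::memcpy(&hi, staging + sizeof lo, sizeof hi);
        std::uint8_t* dst = out.data() + i * kRecordBytes;
        std::memcpy(dst, &lo, sizeof lo);
        std::memcpy(dst + sizeof lo, &hi, sizeof hi);
    }
    return out;
}
```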

robosmith · January 7, 2015, 9:59pm

My original implementation has several coalesced loads and stores.

My goal is to reduce them to one coalesced load and one coalesced store, which I could do with properly interleaved data.

However, right now it appears to take more time to interleave the data than is saved.

njuffa · January 7, 2015, 10:05pm

This is entirely possible. Your total bandwidth requirements may well be higher if you have to touch the data multiple times to massage it into an optimized layout. If your code is already bandwidth constrained, this would lead to decreased performance.

Due to hardware improvements over the years, modern GPUs often do quite well without fully coalesced accesses. Depending on the GPU you use, you may be able to squeeze out some more performance by helping the compiler generate more loads through the texture path. The first-line source code adjustments for this would be appropriate use of ‘const’ and ‘__restrict__’ modifiers for pointer arguments. See the Best Practices Guide.
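As a small illustration of those qualifiers, the sketch below uses a hypothetical function, `scale_pixels`, that does not come from the poster's code. In a CUDA kernel the same qualifiers would go on the pointer parameters of the `__global__` function. The plain C++ version here only shows the spelling, and whether it changes the generated loads on a given GPU is something the profiler would have to confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// `const` marks the input as read-only; `__restrict__` promises the compiler
// that `in` and `out` never alias. `__restrict__` is a compiler extension
// accepted by nvcc, GCC, and Clang; it is not standard C++.
void scale_pixels(const float* __restrict__ in, float* __restrict__ out,
                  std::size_t n, float gain)
{
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = gain * in[i];
    }
}
```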

It is really hard to provide more than vague analysis and/or suggestions without thorough knowledge of the code and its current performance characteristics. My standard advice these days is to let the CUDA profiler guide optimization efforts.
