Can I create a pinned memory buffer to support overlapping compute/copy without cudaMallocHost overhead?

I am developing an application that takes 1D input data from the host and creates 2D output data on a GPU using CUDA. The application processes many independent 1D input arrays to create many independent 2D output planes. For this reason, I have implemented my algorithm so that multiple streams concurrently compute a few of these independent planes at a time, and I use pinned memory to enable concurrent copy and execution. I have optimized my kernels to the point where the latency is driven by the cudaMallocHost calls I make every iteration to initialize my host-side output memory. I confirmed this is the latency driver by first looping through all of the cudaMallocHost calls and then performing all of my computation: the cudaMallocHost calls took about 5 s, while the subsequent computation took only about 1.5 s.

I am using a K80; my output memory size is about 1.5 Gb in total across all ~200 output planes. I also tried allocating the output memory in one call rather than 200 separate calls, with no appreciable impact on performance.

I now have the idea of creating a pinned memory buffer to stream into from the GPU, then using some cudaStreamSynchronize calls to memcpy the pinned memory into pageable memory, so the full output space doesn't need to live in pinned memory. I implemented a version of this which performed about the same as my first implementation, where all of the output was allocated with cudaMallocHost. The basic algorithm I used looked like this:

// initialize workspace for 8 streams on the GPU
// allocate pinned host memory for 8 planes, call this outputBuffer
// allocate pageable host memory for 200 planes, call this out_planes

for (plane_index = 0; plane_index < num_planes; plane_index++) {
    streamIdx = plane_index % numStreams;
    // async copy 1D data into the GPU using streams[streamIdx]
    // perform computation on the GPU
    // async copy 2D data back to host PINNED memory using streams[streamIdx]

    // Do mem xfer out of the buffer once half the streams have been issued
    if ((streamIdx == (numStreams/2 - 1)) || (streamIdx == (numStreams - 1))) {
        if (buffFlag == -1) {
            buffFlag = 0; // used to skip first entrance at plane_index == numStreams/2 - 1
        }
        else {
            for (int ii = 0; ii < numStreams/2; ii++) {
                cudaStreamSynchronize(streams[ii + buffFlag*numStreams/2]);
                memcpy(out_planes[plane_index - (numStreams/2 - 1) + ii],
                       outputBuffer[ii + buffFlag*numStreams/2].mag,
                       plane_size);
            }
            // ping-pong to the other half of the buffer
            if (buffFlag == 0)
                buffFlag = 1;
            else
                buffFlag = 0;
        }
    }
}

There’s also a cleanup transfer for the leftover planes after the plane loop ends. My idea was that after issuing the async calls for all 8 streams, the CPU waits for the first half of the streams to complete and copies their pinned output into pageable memory, then issues the next round of async calls for the first half of the streams, then waits for the second half’s async work to finish and transfers its pinned output to the pageable space, and so on. This worked, but the performance was no better than just using cudaMallocHost to give the full output a pinned memory space.

Does anyone have advice for a better way to handle this?

Thanks in advance.

Rather than using cudaStreamSynchronize, for a pipelined operation I would use a CUDA callback (cudaLaunchHostFunc) to do the copying from the pinned buffer to the pageable buffer. The callback follows stream semantics and allows you to issue work fully asynchronously.
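A minimal sketch of that approach, assuming one callback per plane; the names CopyArgs, copyToPageable, pinnedSlot, and args are illustrative, not from the actual code under discussion:

#include <cstring>
#include <cuda_runtime.h>

struct CopyArgs {
    void*  dst;    // pageable destination plane
    void*  src;    // pinned buffer slot
    size_t bytes;
};

// Runs on a CUDA-managed host thread after all prior work in the stream finishes.
// Note: CUDA API calls are not allowed from inside the callback itself.
void CUDART_CB copyToPageable(void* userData)
{
    CopyArgs* a = static_cast<CopyArgs*>(userData);
    memcpy(a->dst, a->src, a->bytes);
}

// inside the plane loop, after the device-to-host cudaMemcpyAsync:
args[plane_index] = { out_planes[plane_index], pinnedSlot, plane_size };
cudaLaunchHostFunc(streams[streamIdx], copyToPageable, &args[plane_index]);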

Whether or not that will provide a performance benefit, I can’t say.

Why is this necessary? Wouldn’t two pinned buffers suffice, with usage ping-ponging between the two buffers? Or a very large allocation shared by all entities at appropriate offsets?

In general, no matter the kind of system (just CPU or heterogeneous CPU/GPU), the number of allocation / deallocation calls of any kind should be minimized for best performance. Recycling existing allocations is very much a thing.
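For the large-allocation variant, a sketch of what that might look like; numBufPlanes and planeBytes are illustrative names:

// Pay the pinning cost once, then reuse slots at fixed offsets
char* pool = nullptr;
cudaMallocHost(&pool, (size_t)numBufPlanes * planeBytes);

// the buffer slot reused by plane p:
char* slot = pool + (size_t)(p % numBufPlanes) * planeBytes;

// ... run the entire pipeline, recycling the slots ...

cudaFreeHost(pool);  // one deallocation at shutdown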

Thanks for the reply!

This was just a test to ensure the cudaMallocHost call(s) were driving the latency, and I did test with one massive call rather than 200 separate calls; it did not make a huge difference. The current performance version of the code doesn’t do the allocation all at once at the beginning; I just wanted to compare the latency of the cudaMallocHost calls on their own with the computation on its own.

This is what I am trying to implement now, but when I attempt to stream-sync and memcpy from pinned to pageable it causes GPU idle time, nullifying the performance benefits I am looking for. I am going to attempt @Robert_Crovella’s suggestion tonight and see if it helps - it should certainly make the confusing logic I implemented a bit more straightforward.

Your earlier post stated cudaMallocHost() takes 5 seconds. This seems high to me. Are you sure you are not including CUDA context creation time in these measurements? CUDA context creation happens lazily, typically triggered by the first CUDA API call. Does the time change if you add a cudaFree(0) at the very start of your program? This triggers CUDA context creation before execution proceeds to any allocation calls.
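As a sketch of that test (the timing scaffolding and the 1.5 GB size are just for illustration):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaFree(0);  // trigger lazy context creation here, outside the timed region

    auto t0 = std::chrono::steady_clock::now();
    void* buf = nullptr;
    cudaMallocHost(&buf, (size_t)1536 * 1024 * 1024);  // ~1.5 GB pinned allocation
    auto t1 = std::chrono::steady_clock::now();

    printf("cudaMallocHost: %.2f s\n",
           std::chrono::duration<double>(t1 - t0).count());
    cudaFreeHost(buf);
    return 0;
}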

If the context creation is slow and this is a Linux machine, make sure the persistence daemon is enabled to keep the driver module loaded at all times. CUDA context creation time is largely a function of the total memory in the system (system memory plus the memory of all GPUs), since all that memory must be mapped into a unified virtual address space via OS API calls. Is this a system with large system memory and/or many GPUs?

cudaMallocHost() is basically a thin wrapper around some OS API calls. Its performance is a function of the OS and the single-thread performance of the CPU. I recommend CPUs with >= 3.5 GHz base frequency for high-performance systems to minimize OS and driver overhead.

There is a previous cudaMallocHost call for the input data, so context creation shouldn’t be an issue for this cudaMallocHost call. Thanks for the tip, though; I added a cudaFree(0) to the beginning of my program. I reran the test just to make sure I had the numbers right, and I caught a bug in how I was calculating the output storage bytes to display - it’s actually 10.8 Gb. I apologize for the confusion there; I have a bunch of different sizes that I am testing.

This is being run on a VM with just one K80 and a 128 GiB disk, and the processor clocks in at 2.60 GHz. This is just a temporary setup, though, so perhaps the performance of the cudaMallocHost function would improve on a higher-performance VM?

I’m still hoping the cudaLaunchHostFunc approach will help me out a bit, but I probably won’t be able to put that together until tomorrow. I’ll update once I test that out, and if I end up running some tests on another VM. Thanks again for all the help!

I do not have hands-on experience with VMs but would be wary of the overhead they add. For older processors this may have gotten worse due to the mitigation measures for the security flaws in these CPUs. Slow processor clocks are sub-optimal, as many of the OS APIs needed by the host-side CUDA software, such as memory allocators, are mostly if not completely serial activities.

[Later:]

I measured the time for CUDA context creation on two different Windows systems, one with 8 GB of system memory and one low-end GPU (total memory 9 GB), the other with 32 GB of system memory and two mid-range GPUs (total memory 45 GB). Both systems were running Intel CPUs operating at 3.7 GHz at the time of the test (systems lightly loaded so not full CPU boost).

8 GB system: CUDA context creation 0.16 seconds, 1.5 GB host alloc 0.56 seconds
32 GB system: CUDA context creation 0.24 seconds, 1.5 GB host alloc 0.50 seconds

The host allocation time appears to scale (to first order) linearly with the total amount of memory allocated, regardless of how many individual chunks are requested. This jibes with OP’s observations.

The only way I can get the time for host allocation into the multi-second range is by making the allocation large enough that swapping occurred (both systems have traditional rotational mass storage devices, so this is very noticeable :-).

So the 5 seconds observed by OP for creating a 1.5 GB host allocation doesn’t seem right to me, even considering the slower CPU clock.

I rewrote my buffer implementation to use cudaLaunchHostFunc to copy from the buffer to pageable memory once the stream has finished computing the plane. One issue I ran into was that I needed to create an array of parameter structs to pass to my memcpy wrapper, because otherwise all of the memcpy calls would just use the final state of the single parameter struct. That wasn’t a big deal, and I was able to get the code working with the stream callback. However, it seems that the callback has prevented concurrent copy and execution on the device, and my overall latency remains around 5 s.
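That fix might look like the following, reusing the hypothetical CopyArgs struct and copyToPageable callback from the earlier sketch:

#include <vector>

// One CopyArgs per plane: a callback can fire long after the loop iteration
// that scheduled it, so a single shared struct would be overwritten in the meantime.
std::vector<CopyArgs> args(num_planes);

// in the plane loop:
args[plane_index] = { out_planes[plane_index], pinnedSlot, plane_size };
cudaLaunchHostFunc(streams[streamIdx], copyToPageable, &args[plane_index]);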

A callback should obey stream semantics. With respect to the stream it is in, operations in that stream will not run concurrently with it, of course. However, it should have no bearing on the behavior of operations in other streams.

Just to clarify again: my original post was incorrect, and the true output size is 10.5 Gb, which, extrapolating from your data, suggests about 3.5 s if my VM had a faster CPU clock. I will update when I have a chance to try another VM.

Gb = gigabit. I assume you meant 10.5 GB. I will re-run on my Skylake system with 32 GB of system memory and record the time for another data point, but right now it is being hammered by something like 25 threads that take up most of the available memory. Five seconds to allocate 10.5 GB of pinned host memory seems reasonable.

I don’t recall the relative efficiency of Windows vs. Linux when allocating pinned host memory. The problem for the OS is that it needs to find and lock down suitable regions of physical memory, and beyond a certain threshold (presumably a percentage of system memory) the cost of that likely rises much faster than linearly; at some utilization point I would expect it to become impossible due to existing fragmentation.

In my view, creating huge pinned host memory allocations is not a good design pattern. Modern operating systems are built around the concept of virtually contiguous memory for maximum flexibility.

[Later:] For a pinned host allocation size of 10.5 GB, I measure an allocation time of 3.55 seconds, using CUDA 9.2 on Windows 10 Pro for Workstations running on an Intel Xeon W-2133 CPU operating at 3.8 GHz.

I agree conceptually, and looking closer it doesn’t seem like concurrent copy/execute is blocked, just that it no longer tends to occur (see image for a quick nvvp screengrab of 16 streams). I tried nvprof and gmon, and neither gives me any time taken up by my callback function or the memcpy call inside it; do you have any suggestions for how I might determine the latency of the callback function? It wouldn’t let me attach multiple screengrabs, but when I varied the number of streams from 4 to 16 it drove up the latency between stream copy-execute-copy patterns from ~0.1 s to ~0.3 s, so I am a bit confused why that would be happening. I included the 16-stream screengrab because the spacing was less consistent than in the 4-stream run.

I assume by latency you mean the execution duration of the copy process once the callback is called. For that you could simply use ordinary host-based timing and print it out, or you could use NVTX-based methods.

If you actually meant “latency”, i.e. the time from when you scheduled the callback with cudaLaunchHostFunc until the time when the callback was actually initiated, I would use NVTX for that.
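A sketch of the first kind of measurement, instrumenting the hypothetical callback from earlier; the NVTX range makes the copy show up on the profiler timeline (link with -lnvToolsExt):

#include <chrono>
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>
#include <nvToolsExt.h>

void CUDART_CB copyToPageable(void* userData)
{
    CopyArgs* a = static_cast<CopyArgs*>(userData);
    nvtxRangePushA("pinned->pageable memcpy");  // visible as a range in nvvp
    auto t0 = std::chrono::steady_clock::now();
    memcpy(a->dst, a->src, a->bytes);
    auto t1 = std::chrono::steady_clock::now();
    nvtxRangePop();
    printf("callback memcpy: %.3f ms\n",
           std::chrono::duration<double, std::milli>(t1 - t0).count());
}

// For scheduled-to-started latency, record a host timestamp just before the
// cudaLaunchHostFunc call and compare it to one taken at the top of the callback.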

As for the number of streams, my assumption is that the general overhead of using streams naturally increases with the number of streams, and I would therefore try to minimize the number of streams.

With some careful balancing, a processing pipeline using host-side double buffering and just two streams should allow near-perfect overlap of compute activity with uploads and downloads (assuming a GPU with at least two copy engines). At least that has been my experience in the past. I may have missed something in the use case at hand, but that is what I would use as a design baseline.
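A rough sketch of that baseline; the kernel name computePlane and all buffer names and sizes are illustrative:

// two streams, two pinned slots; plane p ping-pongs on s = p % 2
for (int p = 0; p < num_planes; p++) {
    int s = p % 2;
    if (p >= 2) {
        cudaStreamSynchronize(streams[s]);                 // slot s still holds plane p-2
        memcpy(out_planes[p - 2], pinnedBuf[s], planeBytes);
    }
    cudaMemcpyAsync(d_in[s], h_in[p], inBytes, cudaMemcpyHostToDevice, streams[s]);
    computePlane<<<grid, block, 0, streams[s]>>>(d_in[s], d_out[s]);
    cudaMemcpyAsync(pinnedBuf[s], d_out[s], planeBytes, cudaMemcpyDeviceToHost, streams[s]);
}
// drain the final (up to) two planes
for (int p = (num_planes > 2 ? num_planes - 2 : 0); p < num_planes; p++) {
    cudaStreamSynchronize(streams[p % 2]);
    memcpy(out_planes[p], pinnedBuf[p % 2], planeBytes);
}

While the host drains and refills one stream’s slot, the other stream’s upload, kernel, and download can proceed, which is what provides the overlap.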