Optimising GPU and CPU memory transfer time (CUDA/Hardware)?



I’m looking for some general advice/options for a project I am working on that utilises the parallel processing abilities of CUDA.

For this particular project, I have a main algorithm on the CPU that iteratively calls a few functions I want to execute on the GPU, as they are the bottleneck of the overall algorithm. I know that these particular functions run significantly faster on the GPU than on the CPU (as expected); however, the overall runtime of the entire algorithm is much slower than when the algorithm and all its functions are executed entirely on the CPU. This is explained by the memory transfer time between the CPU and GPU. If these functions were only called once, this wouldn’t be a problem at all, but because they are executed hundreds of times until a condition is met, the memory transfer delay plays a big role in the runtime of the overall algorithm.

I would classify myself as a self-taught beginner at CUDA, but my knowledge of hardware is very limited. Given this, I would say that I am limited to two options to improve the overall runtime performance of my algorithm:

  1. Migrate the whole algorithm onto the GPU - I personally don’t believe this is the best solution, because the CPU side of the algorithm contains quite a few calculations that map poorly to the GPU and could also slow down the overall runtime. It may not be best practice, but I could give this a shot (it would just take a long time to implement on my part)…

  2. Get better hardware that optimises the memory transfer time between GPU and CPU - I am currently using a Jetson Nano and I am wondering if there is alternative hardware out there that either significantly reduces this problem or removes it altogether. Something like a hybrid or integrated CPU/GPU? I’ve done some research into this, but unfortunately my “hardware domain” limitations kick in and I haven’t progressed very far.

I would really appreciate anyone’s thoughts, feelings and/or opinions on this matter!

Re (1): This is actually something some major GPU-accelerated applications do: Move the entire computation to the GPU, even though some parts of the code may run with suboptimal performance on a GPU. That is, potential inefficiency in non-performance-critical parts is more than compensated by eliminating data transfers between host and device.

The other approach you might want to investigate is how to effectively overlap CPU computation, GPU computation, and data transfers. The nice thing is that PCIe is a full duplex interconnect: new source data can be shuffled to the GPU while previous results flow back to the host at the same time. And the GPU can concurrently transform more data. This involves the use of CUDA streams and double (or triple) buffering and basically creates a processing pipeline: the results of step N are downloaded to the host while step N+1 data is transformed on the GPU and step N+2 data is uploaded to the device.
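A minimal sketch of such a pipeline, assuming a hypothetical `transform` kernel and arbitrary chunk sizes (error checking omitted for brevity; note the overlap of transfers mainly benefits discrete GPUs on PCIe, since pinned host memory is required for `cudaMemcpyAsync` to be truly asynchronous):

```cuda
#include <cuda_runtime.h>

__global__ void transform(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;   // placeholder computation
}

int main() {
    const int chunk = 1 << 20, nChunks = 8;
    float *h_buf, *d_buf[2];
    cudaStream_t stream[2];
    // Pinned host memory is required for async copies to overlap with kernels.
    cudaHostAlloc(&h_buf, (size_t)nChunks * chunk * sizeof(float),
                  cudaHostAllocDefault);
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&d_buf[s], chunk * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }
    for (int c = 0; c < nChunks; ++c) {
        int s = c % 2;                        // alternate buffers/streams
        float *h = h_buf + (size_t)c * chunk;
        cudaMemcpyAsync(d_buf[s], h, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        transform<<<(chunk + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], chunk);
        cudaMemcpyAsync(h, d_buf[s], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
        // While stream s processes chunk c, the other stream can still be
        // busy with chunk c-1, so upload, kernel, and download overlap.
    }
    cudaDeviceSynchronize();
    for (int s = 0; s < 2; ++s) {
        cudaFree(d_buf[s]);
        cudaStreamDestroy(stream[s]);
    }
    cudaFreeHost(h_buf);
    return 0;
}
```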

(2) My understanding is that NVIDIA’s embedded platforms all use physically unified memory that is accessed by both the CPU and GPU parts of the chip. Which means that there should be no need for copying data around. I have not used any of these platforms, but they all have their own dedicated sub-forums which are usually more active than the CUDA sub-forums, and Jetson-specific questions will likely receive faster and/or better answers there.


Thank you so much for your detailed response - this has really cleared up a lot of uncertainty for me!

As I am short on time, I think I will implement (1) first (very good to know that I’m now not doing something highly unrecommended), and if time permits I will look further into overlapping data transfers (I hadn’t heard of this before but it makes sense).

As for (2) I’ll follow your advice and write another post on their forums just to see if there are any other alternatives.

Thanks again!

You might find this introductory article useful: How to Overlap Data Transfers in CUDA C/C++ | NVIDIA Developer Blog


How do you manage your memory buffers? Do you use the traditional approach to allocate host memory with malloc()/new and device memory with cudaMalloc(), using cudaMemcpy() to transfer between both?

For the Jetson series, it might be useful to look into zero copy memory via cudaHostAlloc() or alternatively Unified memory via cudaMallocManaged() (assuming the latter is supported on your Jetson Nano platform). This should eliminate any memory copy overhead on your platform.
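As a sketch of the managed-memory approach (hypothetical kernel; error checking omitted), a single `cudaMallocManaged` allocation is visible to both CPU and GPU, so no explicit `cudaMemcpy()` is needed:

```cuda
#include <cuda_runtime.h>

__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;   // placeholder computation
}

int main() {
    const int n = 1 << 20;
    float *data;
    // One allocation, accessible from both host and device code.
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = (float)i;  // host writes directly
    scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();   // required before host access on Jetson
    // ... use data on the host ...
    cudaFree(data);
    return 0;
}
```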

Here’s a related thread that I found. It has some links to useful resources.


Thanks for your advice! I have traditionally been using cudaMalloc() and cudaMemcpy(), and have only recently become aware of cudaHostAlloc().

How would I know if cudaMallocManaged() is supported on my device?

Thank you for the link!

Requirements are outlined here, although I am not sure where the Nano sits regarding the “non-embedded operating system” clause: Programming Guide :: CUDA Toolkit Documentation


Jetson Nano supports managed memory. The reference document was already provided by cbuchner1.

In general, support for managed memory is a query-able device property that can be retrieved with e.g. cudaGetDeviceProperties and in fact is one of the items displayed in the deviceQuery sample code.
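A short host-only sketch of that query via `cudaGetDeviceProperties` (the `managedMemory` field is the same one deviceQuery reports):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("Managed memory supported: %s\n",
           prop.managedMemory ? "yes" : "no");
    // prop.concurrentManagedAccess tells you whether CPU and GPU
    // may access managed memory at the same time (0 on Jetson).
    return 0;
}
```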

Jetson managed memory does not allow concurrent access. Therefore, you should remember to issue cudaDeviceSynchronize() sometime after launching kernels and before you access managed data in host code.
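To illustrate the required ordering, a minimal sketch with a hypothetical kernel (on Jetson, touching the managed pointer from the host while the kernel may still be running is invalid):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void increment(int *x) { *x += 1; }

int main() {
    int *x;
    cudaMallocManaged(&x, sizeof(int));
    *x = 41;                   // host access before the launch: fine
    increment<<<1, 1>>>(x);
    // Reading *x here would race with the kernel on Jetson.
    cudaDeviceSynchronize();   // wait for the GPU to finish...
    printf("%d\n", *x);        // ...then host access is safe again
    cudaFree(x);
    return 0;
}
```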
