Optimising GPU and CPU memory transfer time (CUDA/Hardware)?


Hello!

I’m looking for some general advice/options for a project I am working on that utilises the parallel processing abilities of CUDA.

For this particular project, I have a main algorithm on the CPU which iteratively calls a few functions that I want to execute on the GPU (as they are the bottleneck of the overall algorithm). I know that these particular functions run significantly faster on the GPU than on the CPU (as expected); however, the overall runtime of the entire algorithm is much slower than when the algorithm and all its functions are executed entirely on the CPU. This is explained by the memory transfer time between the CPU and GPU. If these functions were only being called once, this wouldn’t be a problem at all, but because they are executed hundreds of times until a condition is met, the memory transfer delay plays a big role in the runtime of the overall algorithm.

I would classify myself as a self-taught beginner at CUDA, and my knowledge of hardware is very limited. Given this, I would say that I am limited to two options to improve the overall runtime performance of my algorithm:

  1. Migrate the whole algorithm onto the GPU - I personally don’t believe that this is the best solution, because the CPU part of the algorithm has quite a few calculations that are a poor fit for the GPU and could also slow down the overall runtime. It may not be best practice, but I could give this a shot (it would just take a long time for me to implement)…

  2. Get better hardware that optimises the memory transfer time between GPU and CPU - I am currently using a Jetson Nano and I am wondering if there is alternative hardware out there that either significantly reduces this problem or removes it altogether, something like a hybrid or combined CPU/GPU. I’ve done some research into this, but unfortunately my “hardware domain” limitations kick in and I haven’t progressed very far.

I would really appreciate anyone’s thoughts, feelings and/or opinions on this matter!

Re (1): This is actually something some major GPU-accelerated applications do: Move the entire computation to the GPU, even though some parts of the code may run with suboptimal performance on a GPU. That is, potential inefficiency in non-performance-critical parts is more than compensated by eliminating data transfers between host and device.

The other approach you might want to investigate is how to effectively overlap CPU computation, GPU computation, and data transfers. The nice thing is that PCIe is a full-duplex interconnect: new source data can be shuffled to the GPU while previous results flow back to the host at the same time, and the GPU can concurrently transform yet more data. This involves the use of CUDA streams and double (or triple) buffering and basically creates a processing pipeline: the results of step N are downloaded to the host while step N+1’s data is transformed on the GPU and step N+2’s data is uploaded to the device. A minimal sketch of such a pipeline is shown below.
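To make the pipeline idea concrete, here is a generic double-buffering sketch with two CUDA streams. The kernel name (process_chunk), chunk count, and sizes are placeholders I made up for illustration; they are not from your algorithm:

```cpp
// Two-stream pipeline with pinned host buffers (pinned memory is required
// for asynchronous copies to actually overlap with kernel execution).
#include <cuda_runtime.h>

__global__ void process_chunk(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;          // stand-in for the real work
}

int main()
{
    const int numChunks = 8, chunkElems = 1 << 20;
    const size_t chunkBytes = chunkElems * sizeof(float);

    float *h_in, *h_out;                        // pinned (page-locked) host memory
    cudaHostAlloc(&h_in,  numChunks * chunkBytes, cudaHostAllocDefault);
    cudaHostAlloc(&h_out, numChunks * chunkBytes, cudaHostAllocDefault);

    float *d_in[2], *d_out[2];                  // double-buffered device storage
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&d_in[s],  chunkBytes);
        cudaMalloc(&d_out[s], chunkBytes);
        cudaStreamCreate(&stream[s]);
    }

    for (int c = 0; c < numChunks; ++c) {
        int s = c % 2;                          // alternate between the two streams
        // Upload chunk c, process it, and download the result. Work issued in
        // different streams may overlap; work in the same stream is serialized.
        cudaMemcpyAsync(d_in[s], h_in + c * chunkElems, chunkBytes,
                        cudaMemcpyHostToDevice, stream[s]);
        process_chunk<<<(chunkElems + 255) / 256, 256, 0, stream[s]>>>(
            d_in[s], d_out[s], chunkElems);
        cudaMemcpyAsync(h_out + c * chunkElems, d_out[s], chunkBytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();                    // wait for all streams to finish

    for (int s = 0; s < 2; ++s) {
        cudaFree(d_in[s]); cudaFree(d_out[s]); cudaStreamDestroy(stream[s]);
    }
    cudaFreeHost(h_in); cudaFreeHost(h_out);
    return 0;
}
```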

(2) My understanding is that NVIDIA’s embedded platforms all use physically unified memory that is accessed by both the CPU and GPU parts of the chip, which means there should be no need to copy data around. I have not used any of these platforms, but they all have their own dedicated sub-forums which are usually more active than the CUDA sub-forums, and Jetson-specific questions will likely receive faster and/or better answers there.


Thank you so much for your detailed response - this has really cleared up a lot of uncertainty for me!

As I am short on time, I think I will implement (1) first (very good to know that I’m not doing something highly unrecommended), and if time permits I will look further into overlapping data transfers (I hadn’t heard of this before, but it makes sense).

As for (2) I’ll follow your advice and write another post on their forums just to see if there are any other alternatives.

Thanks again!

You might find this introduction useful: How to Overlap Data Transfers in CUDA C/C++ | NVIDIA Technical Blog


How do you manage your memory buffers? Do you use the traditional approach to allocate host memory with malloc()/new and device memory with cudaMalloc(), using cudaMemcpy() to transfer between both?

For the Jetson series, it might be useful to look into zero-copy memory via cudaHostAlloc(), or alternatively unified memory via cudaMallocManaged() (assuming the latter is supported on your Jetson Nano platform). This should eliminate any memory copy overhead on your platform.
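For illustration, here is a rough zero-copy sketch along those lines: the host allocation is mapped into the device address space, so the kernel works on it directly and no cudaMemcpy is needed. The kernel and sizes are placeholders:

```cpp
// Zero-copy (mapped pinned memory) sketch, most attractive on integrated
// devices such as the Jetson series where CPU and GPU share physical memory.
#include <cuda_runtime.h>

__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main()
{
    // On recent toolkits host-memory mapping is typically enabled by default;
    // older ones require this flag before any other CUDA runtime call.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    const int n = 1 << 20;

    float* h_data;                                    // host-side pointer
    cudaHostAlloc(&h_data, n * sizeof(float), cudaHostAllocMapped);

    float* d_data;                                    // device alias of the same allocation
    cudaHostGetDevicePointer(&d_data, h_data, 0);

    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;     // fill from the CPU side

    scale<<<(n + 255) / 256, 256>>>(d_data, n);       // GPU works on it directly
    cudaDeviceSynchronize();                          // make results visible to the CPU

    cudaFreeHost(h_data);
    return 0;
}
```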

Here’s a related thread that I found. It has some links to useful resources.
https://forums.developer.nvidia.com/t/jetson-nano-device-local-memory-specifications/73524/6


Thanks for your advice! I have traditionally been using cudaMalloc() and cudaMemcpy(), and have only recently become aware of cudaHostAlloc().

How would I know if cudaMallocManaged() is supported on my device?

Thank you for the link!

Requirements are outlined here, although I am not sure where the Nano sits regarding the “non-embedded operating system” clause: CUDA C++ Programming Guide


Jetson Nano supports managed memory. The reference document was already provided by cbuchner1.

In general, support for managed memory is a queryable device property that can be retrieved with e.g. cudaGetDeviceProperties(), and it is in fact one of the items displayed in the deviceQuery sample code.
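For example, a small stand-alone check along the lines of the deviceQuery sample might look like this (device index 0 assumed):

```cpp
// Query whether the current device supports managed (unified) memory,
// and whether CPU and GPU may access it concurrently.
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0

    printf("Device: %s\n", prop.name);
    printf("Managed memory supported:  %s\n", prop.managedMemory ? "yes" : "no");
    printf("Concurrent managed access: %s\n", prop.concurrentManagedAccess ? "yes" : "no");
    return 0;
}
```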

Jetson managed memory does not allow concurrent access. Therefore, you should remember to issue cudaDeviceSynchronize() sometime after launching kernels and before you intend to access managed data in host code.
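A minimal sketch of that ordering (placeholder kernel and size) could look like this:

```cpp
// On devices without concurrent managed access, the host must not touch
// managed data between a kernel launch and cudaDeviceSynchronize().
#include <cuda_runtime.h>

__global__ void increment(int* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 1 << 20;
    int* data;
    cudaMallocManaged(&data, n * sizeof(int));   // visible to both CPU and GPU

    for (int i = 0; i < n; ++i) data[i] = 0;     // CPU access is fine before the launch

    increment<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                     // required before the CPU reads 'data' again

    int first = data[0];                         // safe only after the sync above
    (void)first;

    cudaFree(data);
    return 0;
}
```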

