Help me understand CUDA memory management - why is cudaMemcpy (not) needed?

Hi there!

Recently I have been working with nvinfer1 to run inference on an object detection model using CUDA / TensorRT. I am a beginner with both, as I realized today while working on the following:

So far, my implementation looked (very roughly) like this:

// Init stuff
void **buffers = new void *[2];

cudaMallocHost(&buffers[0], input_size*sizeof(float));
cudaMallocHost(&buffers[1], output_size*sizeof(float));

// .....
// Inference loop
// inputArray is a float array of size input_size
cudaMemcpy(buffers[0], inputArray, input_size*sizeof(float), cudaMemcpyHostToDevice);

context->executeV2(buffers); 

std::vector<float> gpu_output(output_size);
cudaMemcpy(gpu_output.data(), buffers[1], output_size*sizeof(float), cudaMemcpyDeviceToHost);
// Now the output can be accessed, e.g. using gpu_output.at(i)

But today, when researching unified memory, I noticed that this also works:

// Init stuff
void **buffers = new void *[2];

cudaMallocHost(&buffers[0], input_size*sizeof(float));
cudaMallocHost(&buffers[1], output_size*sizeof(float));

// .....
// Inference loop
// inputArray is a float array of size input_size
float *pFloat = static_cast<float *>(buffers[0]);
std::copy(inputArray, inputArray + input_size, pFloat);

context->executeV2(buffers); 

float *pFloat2 = static_cast<float *>(buffers[1]);
std::vector<float> gpu_output(pFloat2, pFloat2 + output_size);
// Now the output can be accessed, e.g. using gpu_output.at(i)

My question is: why? To my understanding, I was previously using cudaMemcpy to copy the buffers to and from the GPU. But now it seems like I can access both buffers straight from the CPU and still execute inference on the GPU without copying the data around. What am I misunderstanding? Is the call to cudaMallocHost even necessary at this point?

Thank you in advance!

You may be confused about several things.

cudaMallocHost allocates pinned host memory, not device memory. Because it is pinned, it is accessible from both host code and device code. And because it is accessible on the “device side”, cudaMemcpy with the transfer kind cudaMemcpyHostToDevice works:

cudaMemcpy(buffers[0], inputArray, input_size*sizeof(float), cudaMemcpyHostToDevice);

but you are not actually transferring data from host memory to GPU device memory. You are transferring data from host memory to host memory; the host memory at the destination pointer simply happens to also be accessible from the device, and cudaMemcpy can work that way if you wish. (You could also use the cudaMemcpyHostToHost transfer kind, and it would work.)

Because both the source and destination for this transfer are actually in host memory, std::copy works as well.
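In case it helps, here is a minimal standalone sketch (untested, with names of my own choosing) that illustrates the point: the pointer returned by cudaMallocHost is ordinary, dereferenceable host memory, so a cudaMemcpy with kind cudaMemcpyHostToDevice, a cudaMemcpy with kind cudaMemcpyHostToHost, and a plain std::copy all end up doing the same host-to-host copy.

// Minimal sketch (untested) -- every pointer here lives in host memory
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int n = 4;
    float src[n] = {1.f, 2.f, 3.f, 4.f};

    float *pinned = nullptr;
    cudaMallocHost(&pinned, n * sizeof(float));   // pinned (page-locked) host memory, not device memory

    // These three calls all perform the same host-to-host copy:
    cudaMemcpy(pinned, src, n * sizeof(float), cudaMemcpyHostToDevice); // "works", but is really host-to-host
    cudaMemcpy(pinned, src, n * sizeof(float), cudaMemcpyHostToHost);
    std::copy(src, src + n, pinned);                                    // plain CPU copy

    printf("%f\n", pinned[2]);   // directly dereferenceable from the CPU

    cudaFreeHost(pinned);
    return 0;
}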

So why use cudaMallocHost? Because another benefit of pinned memory is that it facilitates asynchronous activity, and that is the particular reason the TRT inference code is typically written like this: TRT wants to make use of the support for asynchronous activity, and it needs pinned host buffers to do so.

I haven’t actually tried it, but you could try to use non-pinned host buffers for buffers[]. I think it would still “work” but TRT would no longer be able to get full copy-compute overlap for the highest throughput.
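For reference, here is a rough, untested sketch of the fully asynchronous pattern the TRT samples tend to use, reusing your variable names (context, inputArray, input_size, output_size) and assuming the enqueueV2() call from the same IExecutionContext generation as executeV2(): device memory for the engine bindings, pinned host buffers as staging, and cudaMemcpyAsync on a stream so the copies can actually overlap with other work.

// Untested sketch -- device bindings plus pinned host staging buffers
void *deviceBindings[2];
float *hostIn = nullptr, *hostOut = nullptr;
cudaStream_t stream;

cudaMalloc(&deviceBindings[0], input_size * sizeof(float));    // real device memory
cudaMalloc(&deviceBindings[1], output_size * sizeof(float));
cudaMallocHost(&hostIn,  input_size * sizeof(float));          // pinned host staging
cudaMallocHost(&hostOut, output_size * sizeof(float));
cudaStreamCreate(&stream);

// Inference loop
std::copy(inputArray, inputArray + input_size, hostIn);

// Pinned host memory is what allows these async copies to truly overlap
cudaMemcpyAsync(deviceBindings[0], hostIn, input_size * sizeof(float),
                cudaMemcpyHostToDevice, stream);
context->enqueueV2(deviceBindings, stream, nullptr);
cudaMemcpyAsync(hostOut, deviceBindings[1], output_size * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);    // wait for this iteration's results

std::vector<float> gpu_output(hostOut, hostOut + output_size);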

To learn more about CUDA concurrency, including pinned memory, you could study section 7 in this series.

As an aside, I think you’ll generally get better help with TRT-specific questions by posting on the TRT forum.

Thank you so much for the detailed answer! It helps me a lot.

I have one follow-up question though. If the buffers are on the host anyway, why would I use cudaMemcpy(gpu_output.data(), buffers[1], output_size*sizeof(float), cudaMemcpyDeviceToHost); at the end? The way I understand it, this also just copies data from one place on the host (the buffers) to another (the gpu_output vector). Is the reason to use cudaMemcpy here that it would free the output buffer to be overwritten by the GPU again while retaining the data in another place?

There isn’t any particular reason that you can’t use buffers[1] directly. The reason to copy the output somewhere else would probably be so that you can reuse the buffers for your next TRT (i.e. inference) call.
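In other words, the loop might look something like this hypothetical sketch (untested, with a made-up frames container), where a private copy is taken out of buffers[1] before the next call overwrites it:

std::vector<std::vector<float>> allResults;
for (const auto &frame : frames) {               // "frames" is a hypothetical input sequence
    std::copy(frame.begin(), frame.end(), static_cast<float *>(buffers[0]));
    context->executeV2(buffers);
    const float *out = static_cast<const float *>(buffers[1]);
    allResults.emplace_back(out, out + output_size);   // keep a private copy of this output
}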