Help me understand CUDA memory management - why is cudaMemcpy (not) needed?

Hi there!

Recently I have been working with nvinfer1 to run inference on an object detection model using CUDA / TensorRT. I am a beginner with both, as I realized today while working on the following:

So far, my implementation looked (very roughly) like this:

// Init stuff
void **buffers = new void *[2];

cudaMallocHost(&buffers[0], input_size*sizeof(float));
cudaMallocHost(&buffers[1], output_size*sizeof(float));

// .....
// Inference loop
// inputArray is a float array of size input_size
cudaMemcpy(buffers[0], inputArray, input_size*sizeof(float), cudaMemcpyHostToDevice);

context->executeV2(buffers); 

std::vector<float> gpu_output(output_size);
cudaMemcpy(gpu_output.data(), buffers[1], output_size*sizeof(float), cudaMemcpyDeviceToHost);
// Now the output can be accessed, e.g. using gpu_output.at(i)

But today, when researching unified memory, I noticed that this also works:

// Init stuff
void **buffers = new void *[2];

cudaMallocHost(&buffers[0], input_size*sizeof(float));
cudaMallocHost(&buffers[1], output_size*sizeof(float));

// .....
// Inference loop
// inputArray is a float array of size input_size
float *pFloat = static_cast<float *>(buffers[0]);
std::copy(inputArray, inputArray + input_size, pFloat);

context->executeV2(buffers); 

float *pFloat2 = static_cast<float *>(buffers[1]);
std::vector<float> gpu_output(pFloat2, pFloat2 + output_size);
// Now the output can be accessed, e.g. using gpu_output.at(i)

My question is: why? To my understanding, I was previously using cudaMemcpy to copy the buffers to and from the GPU. But now it seems like I can access both buffers straight from the CPU and still execute inference on the GPU without copying the data around. What am I misunderstanding? Is the call to cudaMallocHost even necessary at this point?

Thank you in advance!

You may be confused about several things.

cudaMallocHost allocates pinned host memory, not device memory. Because it is pinned, it is accessible from both host code and device code. And because it is accessible on the “device side”, cudaMemcpy with the transfer kind cudaMemcpyHostToDevice works:

cudaMemcpy(buffers[0], inputArray, input_size*sizeof(float), cudaMemcpyHostToDevice);

but you are not actually transferring data from host memory to GPU device memory. You are transferring data from host memory to host memory; the host memory at the destination pointer simply happens to also be accessible from the device, and cudaMemcpy can work that way if you wish. (You could also use the cudaMemcpyHostToHost transfer kind, and it would work.)

Because both the source and destination for this transfer are actually in host memory, std::copy works as well.
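In case it helps, here is a minimal standalone sketch (untested, with names of my own choosing) that illustrates the point: the pointer returned by cudaMallocHost is ordinary, dereferenceable host memory, so a cudaMemcpy with kind cudaMemcpyHostToDevice, a cudaMemcpy with kind cudaMemcpyHostToHost, and a plain std::copy all end up doing the same host-to-host copy.

// Minimal sketch (untested) -- every pointer here lives in host memory
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int n = 4;
    float src[n] = {1.f, 2.f, 3.f, 4.f};

    float *pinned = nullptr;
    cudaMallocHost(&pinned, n * sizeof(float));   // pinned (page-locked) host memory, not device memory

    // These three calls all perform the same host-to-host copy:
    cudaMemcpy(pinned, src, n * sizeof(float), cudaMemcpyHostToDevice); // "works", but is really host-to-host
    cudaMemcpy(pinned, src, n * sizeof(float), cudaMemcpyHostToHost);
    std::copy(src, src + n, pinned);                                    // plain CPU copy

    printf("%f\n", pinned[2]);   // directly dereferenceable from the CPU

    cudaFreeHost(pinned);
    return 0;
}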

So why use cudaMallocHost? Because another benefit of pinned memory is that it facilitates asynchronous activity, and that is the particular reason the TRT inference code is typically written like this: TRT wants to make use of the support for asynchronous activity, and it needs pinned host buffers to do so.

I haven’t actually tried it, but you could try to use non-pinned host buffers for buffers[]. I think it would still “work” but TRT would no longer be able to get full copy-compute overlap for the highest throughput.
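For reference, here is a rough, untested sketch of the fully asynchronous pattern the TRT samples tend to use, reusing your variable names (context, inputArray, input_size, output_size) and assuming the enqueueV2() call from the same IExecutionContext generation as executeV2(): device memory for the engine bindings, pinned host buffers as staging, and cudaMemcpyAsync on a stream so the copies can actually overlap with other work.

// Untested sketch -- device bindings plus pinned host staging buffers
void *deviceBindings[2];
float *hostIn = nullptr, *hostOut = nullptr;
cudaStream_t stream;

cudaMalloc(&deviceBindings[0], input_size * sizeof(float));    // real device memory
cudaMalloc(&deviceBindings[1], output_size * sizeof(float));
cudaMallocHost(&hostIn,  input_size * sizeof(float));          // pinned host staging
cudaMallocHost(&hostOut, output_size * sizeof(float));
cudaStreamCreate(&stream);

// Inference loop
std::copy(inputArray, inputArray + input_size, hostIn);

// Pinned host memory is what allows these async copies to truly overlap
cudaMemcpyAsync(deviceBindings[0], hostIn, input_size * sizeof(float),
                cudaMemcpyHostToDevice, stream);
context->enqueueV2(deviceBindings, stream, nullptr);
cudaMemcpyAsync(hostOut, deviceBindings[1], output_size * sizeof(float),
                cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);    // wait for this iteration's results

std::vector<float> gpu_output(hostOut, hostOut + output_size);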

To learn more about CUDA concurrency, including pinned memory, you could study section 7 in this series.

As an aside, I think you’ll generally get better help with TRT-specific questions by posting on the TRT forum.

Thank you so much for the detailed answer! It helps me a lot.

I have one follow-up question though. If the buffers are on the host anyway, why would I use cudaMemcpy(gpu_output.data(), buffers[1], output_size*sizeof(float), cudaMemcpyDeviceToHost); at the end? The way I understand it, this also just copies data from one place on the host (the buffers) to another (the gpu_output vector). Is the reason to use cudaMemcpy here that it would free the output buffer to be overwritten by the GPU again while retaining the data in another place?

There isn’t any particular reason that you can’t use buffers[1] directly. The reason to copy the output somewhere else would probably be so that you can reuse the buffers for your next TRT (i.e. inference) call.
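In other words, the loop might look something like this hypothetical sketch (untested, with a made-up frames container), where a private copy is taken out of buffers[1] before the next call overwrites it:

std::vector<std::vector<float>> allResults;
for (const auto &frame : frames) {               // "frames" is a hypothetical input sequence
    std::copy(frame.begin(), frame.end(), static_cast<float *>(buffers[0]));
    context->executeV2(buffers);
    const float *out = static_cast<const float *>(buffers[1]);
    allResults.emplace_back(out, out + output_size);   // keep a private copy of this output
}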