Hi there!
Recently I have been working with nvinfer1 to run inference on an object detection model using CUDA / TensorRT. I am a beginner with both, as I realized today while working on the following:
So far, my implementation looked (very roughly) like this:
// Init stuff
void **buffers = new void *[2];
cudaMallocHost(&buffers[0], input_size*sizeof(float));
cudaMallocHost(&buffers[1], output_size*sizeof(float));
// .....
// Inference loop
// inputArray is a float array of size input_size
cudaMemcpy(buffers[0], inputArray, input_size*sizeof(float), cudaMemcpyHostToDevice);
context->executeV2(buffers);
std::vector<float> gpu_output(output_size);
cudaMemcpy(gpu_output.data(), buffers[1], output_size*sizeof(float), cudaMemcpyDeviceToHost);
// Now the output can be accessed, e.g. using gpu_output.at(i)
But today, when researching unified memory, I noticed that this also works:
// Init stuff
void **buffers = new void *[2];
cudaMallocHost(&buffers[0], input_size*sizeof(float));
cudaMallocHost(&buffers[1], output_size*sizeof(float));
// .....
// Inference loop
// inputArray is a float array of size input_size
float *pFloat = static_cast<float *>(buffers[0]);
std::copy(inputArray, inputArray + input_size, pFloat);
context->executeV2(buffers);
float *pFloat2 = static_cast<float *>(buffers[1]);
std::vector<float> gpu_output(pFloat2, pFloat2 + output_size);
// Now the output can be accessed, e.g. using gpu_output.at(i)
My question is: why does this work? To my understanding, I was previously using cudaMemcpy to copy the buffers to and from the GPU. But now it seems like I can access both buffers directly from the CPU and still run inference on the GPU without copying any data around. What am I misunderstanding? Is the call to cudaMallocHost even necessary at this point?
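For reference, here is a small check I was thinking of running to see what kind of memory those binding pointers actually are. This is just a sketch using cudaPointerGetAttributes (assuming CUDA 10 or newer, where the attributes struct has a type field); the helper name describePointer is my own, not part of any API:

#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical helper: report whether a binding pointer refers to pinned host,
// device, or managed memory.
void describePointer(const void *ptr, const char *name) {
    cudaPointerAttributes attr{};
    if (cudaPointerGetAttributes(&attr, ptr) != cudaSuccess) {
        std::printf("%s: attributes not available\n", name);
        return;
    }
    switch (attr.type) {
        case cudaMemoryTypeHost:    std::printf("%s: pinned host memory\n", name); break;
        case cudaMemoryTypeDevice:  std::printf("%s: device memory\n", name);      break;
        case cudaMemoryTypeManaged: std::printf("%s: managed memory\n", name);      break;
        default:                    std::printf("%s: unregistered host memory\n", name); break;
    }
}

// Usage after the allocations above:
// describePointer(buffers[0], "buffers[0]");
// describePointer(buffers[1], "buffers[1]");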
Thank you in advance!