Questions about efficient memory management for TensorRT on TX2

Hi all,

I am working on a computer vision application on the Jetson TX2 which we would like to run as a gstreamer pipeline. Since DeepStream is still lagging behind at version 1.5 for this platform, I have basically implemented something like DeepStream 2’s nvinfer plugin using standard gstreamer and TensorRT.

With help from this NVIDIA repository I managed to implement my own custom plugin in which I can configure all the things I need.

However, I am not very enthusiastic about the cudaMemcpyAsync calls used in this particular function, i.e. to feed input to the inference engine and read back its output.

void ObjectDetector::runInference() {
    util::Logger log("ObjectDetector::runInference");

    // Copy the preprocessed input from the host-side buffer to the device-side buffer.
    NV_CUDA_CHECK(cudaMemcpyAsync(trt_input_gpu, trt_input_cpu.data(), trt_input_cpu.size() * sizeof(float),
				cudaMemcpyHostToDevice, cuda_stream));

    // Launch inference asynchronously on the CUDA stream.
    trt_context->enqueue(batch_size, &trt_input_gpu, cuda_stream, nullptr);

    // Copy the results from the device-side output buffer back to the host.
    NV_CUDA_CHECK(cudaMemcpyAsync(trt_output_cpu.data(), trt_output_gpu, trt_output_cpu.size() * sizeof(float),
				cudaMemcpyDeviceToHost, cuda_stream));

    // Block until everything queued on the stream has finished.
    cudaStreamSynchronize(cuda_stream);
}

Since the Jetson shares its physical memory between CPU and GPU, it seemed pointless to copy all the inputs and outputs around. I stumbled upon cudaMallocManaged and it seemed like a good solution, so I allocate the buffers with cudaMallocManaged, add some calls to cudaDeviceSynchronize, and end up with:

void ObjectDetector::runInference() {
    util::Logger log("ObjectDetector::runInference");

    // Input and output buffers are allocated with cudaMallocManaged,
    // so there are no explicit host/device copies anymore.
    trt_context->enqueue(batch_size, &trt_input_gpu, cuda_stream, nullptr);

    cudaStreamSynchronize(cuda_stream);
    cudaDeviceSynchronize();
}
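
For reference, the allocation side is not shown above; it would look roughly like this (a sketch only, the buffer-size names are assumptions):

// Managed allocations: the same pointers are visible to both CPU and GPU,
// so the explicit cudaMemcpyAsync calls can be dropped.
// input_bytes / output_bytes are placeholders for the real tensor sizes.
NV_CUDA_CHECK(cudaMallocManaged(&trt_input_gpu, input_bytes));
NV_CUDA_CHECK(cudaMallocManaged(&trt_output_gpu, output_bytes));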

My plugin was working for my simple test pipeline.
Then at one point I decided that I wanted multiple inference engines in my gstreamer pipeline. Once I did this I started having runtime problems (the fun ones: Bus Error, Segmentation Fault), and after some debugging I found that I could not dereference some pointers to cudaMallocManaged-allocated memory, even though the allocation itself had succeeded.

After some more searching and reading I found that on the Jetson TX2 concurrentManagedAccess = 0, and I assume that this is the culprit.
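
For anyone who wants to check this on their own device, the attribute can be queried at runtime; a minimal sketch:

// Query whether the device supports concurrent managed access.
// On the Jetson TX2 this prints 0.
int concurrent_managed_access = 0;
NV_CUDA_CHECK(cudaDeviceGetAttribute(&concurrent_managed_access,
                                     cudaDevAttrConcurrentManagedAccess, 0));
std::cout << "concurrentManagedAccess = " << concurrent_managed_access << std::endl;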

cudaMemcpy(Async) works, but I would like to avoid it: if I understand correctly, I am copying data around within the same physical memory, which looks like useless work and an obvious spot for optimization.

So finally what I would like to ask is:

  • is cudaMemcpy(Async) really the recommended way and should I just avoid touching the data on the CPU?
  • can I still use cudaMallocManaged and protect it using some synchronization primitives? (note that the code is running in different threads in different gstreamer plugins)
  • would it be more efficient to use other methods like mmap?

Thanks in advance for any helpful comments.
Beerend

Hello,

I think this question will get more feedback on CUDA Programming and Performance (https://devtalk.nvidia.com/default/board/57/cuda-programming-and-performance/) or Jetson & Embedded Systems (https://devtalk.nvidia.com/default/board/139/jetson-embedded-systems/) forums.

Thanks for your reply. Can you move the topic as a moderator, or should I just re-post?

Moving this to the CUDA Programming and Performance forum.

When using cudaMallocManaged with a GPU where concurrentManagedAccess = 0, it's necessary to ensure that after any kernel launch only the GPU touches the data, until the next cudaDeviceSynchronize(); thereafter the CPU can access the data again (until the next kernel launch).

If you violate this rule you will get seg faults.

So one approach would be to make sure you are not doing that. In a multithreaded scenario this could be challenging without a careful design pattern.
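
To illustrate the rule with a sketch (the helper names and the bindings array are hypothetical, not Beerend's actual plugin code):

// concurrentManagedAccess == 0: between a kernel launch and the next
// cudaDeviceSynchronize(), only the GPU may touch managed memory.

fill_input(managed_input);          // CPU writes, no GPU work in flight yet

trt_context->enqueue(batch_size, bindings, cuda_stream, nullptr);

// Dereferencing managed_input or managed_output on the CPU here
// is exactly what produces the bus errors / seg faults.

cudaDeviceSynchronize();            // all GPU work finished

read_output(managed_output);        // CPU access is legal again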

Another approach would be to investigate using cudaHostAlloc instead of cudaMallocManaged. This returns a pointer to memory that can be used in CPU code, GPU code, or both, with no synchronization needed. In this respect both cudaMallocManaged and cudaHostAlloc allow you to avoid copying in a Jetson environment, where the memory is physically unified. However, there are detail differences with respect to things like cache behavior.
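
A minimal sketch of that alternative, assuming buffer names, sizes, and a bindings array similar to the snippets above:

// Pinned, mapped host memory: on Jetson the same physical memory backs the
// host and device views, so no copies are needed; only the stream is synchronized.
NV_CUDA_CHECK(cudaHostAlloc(&trt_input_cpu_ptr, input_bytes, cudaHostAllocMapped));
NV_CUDA_CHECK(cudaHostAlloc(&trt_output_cpu_ptr, output_bytes, cudaHostAllocMapped));

// Fill the input on the CPU, enqueue inference, then wait only for the stream
// before reading the output on the CPU.
trt_context->enqueue(batch_size, bindings, cuda_stream, nullptr);
cudaStreamSynchronize(cuda_stream);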

It is correct that TX2 doesn’t support concurrent managed access:

[url]https://devtalk.nvidia.com/default/topic/1015688/jetson-tx2/on-demand-paging/[/url]

Hi Robert,

Thanks for your reply. This clarifies things a little and I believe it solves my problem.

I checked cudaHostAlloc with the cudaHostAllocMapped flag. Next I obtained a device pointer using cudaHostGetDevicePointer, and both pointers appear to be the same, so this is pretty cool.

NV_CUDA_CHECK(cudaHostAlloc(&trt_output_cpu_ptr, output_size, cudaHostAllocMapped));
NV_CUDA_CHECK(cudaHostGetDevicePointer(&trt_output_gpu, trt_output_cpu_ptr, 0));
std::cout << "[" << std::this_thread::get_id() << "] Allocated output cpu_ptr " << trt_output_cpu_ptr << " and obtained gpu_ptr" << trt_output_gpu << std::endl;

My application ensures that the GPU input memory will only be read by the GPU after the CPU has completed writing. Conversely I use cudaStreamSynchronize to ensure that the GPU has completed its work before I start to read the results on the CPU. Do I still need other calls to e.g. cudaDeviceSynchronize to fix some caching issues?

In a UVA setting, cudaHostAlloc always returns a mapped pointer, which is the same on both host and device. In a 64-bit OS, you are always in a UVA setting. So the call to get the device pointer is really unnecessary as you’ve figured out.

I don’t believe there is anything you can do about the differences in caching behavior between cudaHostAlloc and cudaMallocManaged in a Jetson setting, either with synchronization functions or anything else. I’m not an expert on the Jetson TX2, so you may wish to ask this very specific question on the TX2 forum, or just do some Google searching.

@Robert_Crovella I just wanted to confirm that, when only a single inference engine is used, @Beerend’s implementation here is valid and should run without faults.