Questions about efficient memory management for TensorRT on TX2

Hi all,

I am working on a computer vision application on the Jetson TX2 which we would like to run as a gstreamer pipeline. Since DeepStream is still lagging behind at version 1.5 for this platform, I have basically implemented something like DeepStream 2’s nvinfer plugin using standard gstreamer and TensorRT.

With help from this NVIDIA repository I managed to implement my own custom plugin in which I can configure all the things I need.

However, I am not very enthusiastic about the cudaMemcpyAsync calls used in this particular function, i.e. to feed input to the inference engine and read back its output.

void ObjectDetector::runInference() {
    util::Logger log("ObjectDetector::runInference");

    // Copy the preprocessed input from the host-side buffer to the device-side buffer.
    NV_CUDA_CHECK(cudaMemcpyAsync(trt_input_gpu, trt_input_cpu.data(), trt_input_cpu.size() * sizeof(float),
				cudaMemcpyHostToDevice, cuda_stream));

    // Launch inference asynchronously on the CUDA stream.
    trt_context->enqueue(batch_size, &trt_input_gpu, cuda_stream, nullptr);

    // Copy the results from the device-side output buffer back to the host.
    NV_CUDA_CHECK(cudaMemcpyAsync(trt_output_cpu.data(), trt_output_gpu, trt_output_cpu.size() * sizeof(float),
				cudaMemcpyDeviceToHost, cuda_stream));

    // Block until everything queued on the stream has finished.
    cudaStreamSynchronize(cuda_stream);
}

Since the Jetson shares its physical memory between CPU and GPU, it seemed pointless to copy all the inputs and outputs around. I stumbled upon cudaMallocManaged and it seemed like a good solution, so I allocate the buffers with cudaMallocManaged, add some calls to cudaDeviceSynchronize, and end up with:

void ObjectDetector::runInference() {
    util::Logger log("ObjectDetector::runInference");

    // Input and output buffers are allocated with cudaMallocManaged,
    // so there are no explicit host/device copies anymore.
    trt_context->enqueue(batch_size, &trt_input_gpu, cuda_stream, nullptr);

    cudaStreamSynchronize(cuda_stream);
    cudaDeviceSynchronize();
}
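
For reference, the allocation side is not shown above; it would look roughly like this (a sketch only, the buffer-size names are assumptions):

// Managed allocations: the same pointers are visible to both CPU and GPU,
// so the explicit cudaMemcpyAsync calls can be dropped.
// input_bytes / output_bytes are placeholders for the real tensor sizes.
NV_CUDA_CHECK(cudaMallocManaged(&trt_input_gpu, input_bytes));
NV_CUDA_CHECK(cudaMallocManaged(&trt_output_gpu, output_bytes));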

My plugin was working for my simple test pipeline.
Then at one point I decided that I wanted multiple inference engines in my gstreamer pipeline. Once I did this I started having runtime problems (the fun ones: Bus Error, Segmentation Fault), and after some debugging I found that I could not dereference some pointers to cudaMallocManaged-allocated memory, even though the allocation itself had succeeded.

After some more searching and reading I found that on the Jetson TX2 concurrentManagedAccess = 0, and I assume that this is the culprit.
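
For anyone who wants to check this on their own device, the attribute can be queried at runtime; a minimal sketch:

// Query whether the device supports concurrent managed access.
// On the Jetson TX2 this prints 0.
int concurrent_managed_access = 0;
NV_CUDA_CHECK(cudaDeviceGetAttribute(&concurrent_managed_access,
                                     cudaDevAttrConcurrentManagedAccess, 0));
std::cout << "concurrentManagedAccess = " << concurrent_managed_access << std::endl;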

cudaMemcpy(Async) works, but I would like to avoid it: if I understand correctly, I am copying data around within the same physical memory, which looks like useless work and an obvious spot for optimization.

So finally what I would like to ask is:

  • is cudaMemcpy(Async) really the recommended way and should I just avoid touching the data on the CPU?
  • can I still use cudaMallocManaged and protect it using some synchronization primitives? (note that the code is running in different threads in different gstreamer plugins)
  • would it be more efficient to use other methods like mmap?

Thanks in advance for any helpful comments.
Beerend

Hello,

I think this question will get more feedback on CUDA Programming and Performance (https://devtalk.nvidia.com/default/board/57/cuda-programming-and-performance/) or Jetson & Embedded Systems (https://devtalk.nvidia.com/default/board/139/jetson-embedded-systems/) forums.

Thanks for your reply. Can you move the topic as a moderator, or should I just re-post?

Moving this to the CUDA Programming and Performance forum.

When using cudaMallocManaged with a GPU where concurrentManagedAccess = 0, it's necessary to ensure that after any kernel launch only the GPU touches the data, until the next cudaDeviceSynchronize(); thereafter the CPU can access the data again (until the next kernel launch).

If you violate this rule you will get seg faults.

So one approach would be to make sure you are not doing that. In a multithreaded scenario this could be challenging without a careful design pattern.
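
To illustrate the rule with a sketch (the helper names and the bindings array are hypothetical, not Beerend's actual plugin code):

// concurrentManagedAccess == 0: between a kernel launch and the next
// cudaDeviceSynchronize(), only the GPU may touch managed memory.

fill_input(managed_input);          // CPU writes, no GPU work in flight yet

trt_context->enqueue(batch_size, bindings, cuda_stream, nullptr);

// Dereferencing managed_input or managed_output on the CPU here
// is exactly what produces the bus errors / seg faults.

cudaDeviceSynchronize();            // all GPU work finished

read_output(managed_output);        // CPU access is legal again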

Another approach would be to investigate using cudaHostAlloc instead of cudaMallocManaged. This returns a pointer to memory that can be used in CPU code, GPU code, or both, with no synchronization needed. In this respect both cudaMallocManaged and cudaHostAlloc allow you to avoid copying in a Jetson environment, where the memory is physically unified. However, there are detail differences with respect to things like cache behavior.
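
A minimal sketch of that alternative, assuming buffer names, sizes, and a bindings array similar to the snippets above:

// Pinned, mapped host memory: on Jetson the same physical memory backs the
// host and device views, so no copies are needed; only the stream is synchronized.
NV_CUDA_CHECK(cudaHostAlloc(&trt_input_cpu_ptr, input_bytes, cudaHostAllocMapped));
NV_CUDA_CHECK(cudaHostAlloc(&trt_output_cpu_ptr, output_bytes, cudaHostAllocMapped));

// Fill the input on the CPU, enqueue inference, then wait only for the stream
// before reading the output on the CPU.
trt_context->enqueue(batch_size, bindings, cuda_stream, nullptr);
cudaStreamSynchronize(cuda_stream);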

It is correct that TX2 doesn’t support concurrent managed access:

[url]https://devtalk.nvidia.com/default/topic/1015688/jetson-tx2/on-demand-paging/[/url]

Hi Robert,

Thanks for your reply. This clarifies things a little and I believe it solves my problem.

I checked cudaHostAlloc with the cudaHostAllocMapped flag. Next I obtained a device pointer using cudaHostGetDevicePointer, and both pointers appear to be the same, so this is pretty cool.

NV_CUDA_CHECK(cudaHostAlloc(&trt_output_cpu_ptr, output_size, cudaHostAllocMapped));
NV_CUDA_CHECK(cudaHostGetDevicePointer(&trt_output_gpu, trt_output_cpu_ptr, 0));
std::cout << "[" << std::this_thread::get_id() << "] Allocated output cpu_ptr " << trt_output_cpu_ptr << " and obtained gpu_ptr" << trt_output_gpu << std::endl;

My application ensures that the GPU input memory will only be read by the GPU after the CPU has completed writing. Conversely I use cudaStreamSynchronize to ensure that the GPU has completed its work before I start to read the results on the CPU. Do I still need other calls to e.g. cudaDeviceSynchronize to fix some caching issues?

In a UVA setting, cudaHostAlloc always returns a mapped pointer, which is the same on both host and device. In a 64-bit OS, you are always in a UVA setting. So the call to get the device pointer is really unnecessary as you’ve figured out.

I don’t believe there is anything you can do about the differences in caching behavior between cudaHostAlloc and cudaMallocManaged in a Jetson setting, either with synchronization functions or anything else. I’m not an expert on the Jetson TX2, so you may wish to ask this very specific question on the TX2 forum, or just do some Google searching.

@Robert_Crovella I just wanted to confirm that, when only a single inference engine is used, @Beerend’s implementation here is valid and should run without faults.