Hi all,
I am working on a computer vision application on a Jetson TX2 which we would like to run as a gstreamer pipeline. Since DeepStream is still lagging behind at version 1.5 for this platform, I have basically implemented something like DeepStream 2's nvinfer plugin myself using standard gstreamer and TensorRT.
With help from this NVIDIA repository I managed to implement my own custom plugin in which I can configure everything I need.
However, I am not very enthusiastic about the cudaMemcpyAsync calls used in this particular function to feed input to the inference engine and read back the output:
void ObjectDetector::runInference() {
    util::Logger log("ObjectDetector::runInference");
    // Stage the input: host buffer -> device input buffer
    NV_CUDA_CHECK(cudaMemcpyAsync(trt_input_gpu, trt_input_cpu.data(),
                                  trt_input_cpu.size() * sizeof(float),
                                  cudaMemcpyHostToDevice, cuda_stream));
    // Run inference asynchronously on the same stream
    trt_context->enqueue(batch_size, &trt_input_gpu, cuda_stream, nullptr);
    // Copy the result back: device output buffer -> host buffer
    NV_CUDA_CHECK(cudaMemcpyAsync(trt_output_cpu.data(), trt_output_gpu,
                                  trt_output_cpu.size() * sizeof(float),
                                  cudaMemcpyDeviceToHost, cuda_stream));
    // Wait until the copies and the inference have finished
    cudaStreamSynchronize(cuda_stream);
}
Since the Jetson "shares" its physical memory between CPU and GPU, it seemed wasteful to copy all inputs and outputs around. I stumbled upon cudaMallocManaged and this seemed like a good solution: I allocate the buffers with cudaMallocManaged, add some calls to cudaDeviceSynchronize, and end up with:
void ObjectDetector::runInference() {
    util::Logger log("ObjectDetector::runInference");
    // Buffers were allocated with cudaMallocManaged, so no explicit copies
    trt_context->enqueue(batch_size, &trt_input_gpu, cuda_stream, nullptr);
    cudaStreamSynchronize(cuda_stream);
    cudaDeviceSynchronize();
}
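For completeness, the allocation side looks roughly like this (a simplified sketch; the buffer names and the size variables are placeholders standing in for my real code):

```cpp
// Sketch of the managed allocation. cudaMallocManaged returns a single
// pointer that is valid on both the CPU and the GPU, so the same buffer
// can be filled on the host and passed to TensorRT as a binding.
float* trt_input_gpu = nullptr;
float* trt_output_gpu = nullptr;
NV_CUDA_CHECK(cudaMallocManaged(&trt_input_gpu, input_size * sizeof(float)));
NV_CUDA_CHECK(cudaMallocManaged(&trt_output_gpu, output_size * sizeof(float)));
```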
My plugin was working for my simple test-pipeline.
Then at one point I decided that I wanted multiple inference engines in my gstreamer pipeline. Once I did that, I started having runtime problems (the fun ones: Bus error, Segmentation fault), and after some debugging I found that I could not dereference some pointers to cudaMallocManaged-allocated memory, even though the allocation itself had succeeded.
After some more searching and reading I found that the Jetson TX2 reports concurrentManagedAccess = 0, and I assume that this is the culprit: if I understand the documentation correctly, it means the CPU must not touch managed memory while the GPU is active, which is exactly the kind of overlap that multiple engines running in multiple threads makes likely.
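For anyone who wants to check this on their own board: I queried the attribute with a small standalone program, roughly like this (a sketch, not part of the plugin; device 0 is assumed):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int concurrent = 0;
    // cudaDevAttrConcurrentManagedAccess: 1 if CPU and GPU may access
    // managed memory concurrently, 0 otherwise (0 on my Jetson TX2).
    cudaDeviceGetAttribute(&concurrent,
                           cudaDevAttrConcurrentManagedAccess,
                           /*device=*/0);
    std::printf("concurrentManagedAccess = %d\n", concurrent);
    return 0;
}
```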
cudaMemcpy(Async) works, but I would like to avoid it: if I understand correctly, I am copying data around within the same physical memory, which looks like useless work and an obvious spot for optimization.
So finally what I would like to ask is:
- is cudaMemcpy(Async) really the recommended way, and should I simply avoid touching the data on the CPU?
- can I still use cudaMallocManaged and protect the buffers with some synchronization primitives? (Note that the code runs in different threads in different gstreamer plugins.)
- would it be more efficient to use other methods, like mmap?
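For context on the last question: one alternative I have been eyeing is zero-copy (mapped pinned) memory rather than literal mmap. It would look roughly like this (an untested sketch; the names are placeholders):

```cpp
// Sketch of the zero-copy alternative: cudaHostAllocMapped gives a host
// pointer the GPU can also reach, via the pointer that
// cudaHostGetDevicePointer returns.
float* input_host = nullptr;
float* input_device = nullptr;
NV_CUDA_CHECK(cudaHostAlloc(&input_host, input_size * sizeof(float),
                            cudaHostAllocMapped));
NV_CUDA_CHECK(cudaHostGetDevicePointer(&input_device, input_host, 0));
// input_device could then serve as the TensorRT binding, while the CPU
// reads and writes through input_host after a cudaStreamSynchronize.
```

I have no idea yet whether this runs into the same concurrency restriction as managed memory, which is partly why I am asking.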
Thanks in advance for any helpful comments.
Beerend