I am using the TensorRT sample/common library (/usr/src/tensorrt/samples/common) BufferManager to copy data to and from the device in order to run a deep learning model on the Jetson AGX platform. My problem occurs when I run two different models on two std::threads. (One has the problematic library included; the other uses the CUDA-enabled OpenCV implementation of a super-resolution algorithm, so it probably copies data differently: opencv_contrib/dnn_superres.cpp at 4.x · opencv/opencv_contrib · GitHub)
I noticed performance spikes and delved into it. What I've found is that this line of code is generating random 30 ms spikes, both when reading from and writing to the device (memcpyType 1 or 2, i.e. cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost):
CHECK(cudaMemcpy(dstPtr, srcPtr, byteSize, memcpyType)); (from TensorRT/buffers.h at main · NVIDIA/TensorRT · GitHub)
I placed chrono timers before/after that line and printed the difference (a rough sketch of the timing wrapper is below the questions). I think it is related to clashing resource allocation from the two models running simultaneously. My questions are:
- What is CHECK doing? I couldn't find a description on the internet because it is such a generic name. Removing it does not change anything; there are still random performance issues.
- What might be the cause of such high execution times of this function?
- How can I resolve it?
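For reference, this is roughly how I time the copy (timedMemcpy is just my own name for illustration, not part of the samples; CHECK comes from the samples' common.h):

#include <chrono>
#include <iostream>
#include <cuda_runtime_api.h>
#include "common.h" // for CHECK, from /usr/src/tensorrt/samples/common

// Sketch of the timing around the copy done in buffers.h; dst/src/byteSize/kind
// are whatever the BufferManager passes to cudaMemcpy.
inline void timedMemcpy(void* dst, const void* src, size_t byteSize, cudaMemcpyKind kind)
{
    auto t1 = std::chrono::steady_clock::now();
    CHECK(cudaMemcpy(dst, src, byteSize, kind)); // the line from buffers.h
    auto t2 = std::chrono::steady_clock::now();
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count();
    if (ms > 10) // only print the spikes
        std::cout << "cudaMemcpy(kind=" << kind << ") took " << ms << " ms\n";
}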
I'd appreciate it if you did not copy random links from Nvidia sources, as moderators in other sections of this forum do.
Thanks in advance,
Thank you very much for the detailed response. I found one of your answers on Stack Overflow and am including it here for further reference: cuda - About cudaMemcpyAsync Function - Stack Overflow
I tried using a new stream and the cudaMemcpyAsync() function. The performance issue of the Host <-> Device transfer is gone, but now my inference times have gone up.
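The change in the copy path is roughly this (a sketch; mStream is a stream I create myself per thread, not something from the sample code, and for truly asynchronous copies the host buffer should be pinned, e.g. allocated with cudaHostAlloc):

cudaStream_t mStream;
CHECK(cudaStreamCreate(&mStream)); // one stream per thread/model

// per inference, instead of the blocking cudaMemcpy:
CHECK(cudaMemcpyAsync(dstPtr, srcPtr, byteSize, memcpyType, mStream));
CHECK(cudaStreamSynchronize(mStream)); // waits only for this stream, not the whole device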
To be precise, executing the IExecutionContext was randomly taking 45 to 80 ms before; now it is taking 80+ ms. Probably the overhead removed from the memcpy is now showing up as conflicting resource usage inside the GPU during execution.
From what I observe, because of the multiple threads, the execution of the faster (super-resolution, SR) network starts and finishes inside the execution time of the slower (object detection, OD) network. If I disable the SR network, the execution of the OD network goes back to its original 45 ms. So it is certain that there is a resource conflict here.
Edit: I tried using enqueueV2 and cudaStreamSynchronize instead of executeV2. Inference time didn't change.
Edit 2: To be more precise, time2 - time1 is 5 ms and time3 - time2 is 75+ ms, instead of executeV2 itself taking 80+ ms. I think this is because enqueueV2 and the async copy are non-blocking. (As a reminder, the whole execution time is under 45 ms when only one network is present.)
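For clarity, the three timestamps are taken roughly like this (context, bindings and stream are placeholders for my actual objects):

auto time1 = std::chrono::steady_clock::now();
context->enqueueV2(bindings, stream, nullptr); // non-blocking kernel launch
auto time2 = std::chrono::steady_clock::now(); // time2 - time1 is ~5 ms
CHECK(cudaStreamSynchronize(stream));          // block until the GPU finishes this stream
auto time3 = std::chrono::steady_clock::now(); // time3 - time2 is 75+ ms when both networks run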
More resources on multi-threaded Synchronize calls for further reference, though this doesn't explain the reason for the increased execution times across two separate streams: multithreading - cudaStreamSynchronize behavior under multiple threads - Stack Overflow
Is there any tool or library to probe the performance or allocation of GPU resources for Jetson AGX?
Edit 3: Sadly, it seems nvprof is not available for aarch64. https://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf