CHECK(cudaMemcpy) performance issues

Hello,

I am using TensorRT sample/common library (/usr/src/tensorrt/samples/common) BufferManager to copy data into and outside the device to run a deeplearning model on Jetson AGX platform. My problem occurs when I run two different models on two std::threads. (One has the problematic library included the other one uses Cuda enabled OpenCV implementation of a superresolution algorithm, so it probably copies data differently. opencv_contrib/dnn_superres.cpp at 4.x · opencv/opencv_contrib · GitHub)

I noticed performance spikes and delved into it. What I’ve found is that this line of code is generating random 30ms spikes. Both when reading from and writing to the device (the memcpyType 1 or 2)

CHECK(cudaMemcpy(dstPtr, srcPtr, byteSize, memcpyType)); TensorRT/buffers.h at main · NVIDIA/TensorRT · GitHub

I placed chrono timers before/after the line and printed the difference. I think it is related with clashing resource allocation from two models running simultaneously. My questions are:

  1. What is the CHECK doing, couldn’t find the description on the internet because it is such a generic name. It does not change anything when I remove it. Still some random performance issues.
  2. What might be the cause of such high execution times of this function?
  3. How can I resolve it?

I’d appreciate if you do not copy random links from Nvidia sources, as moderators from other sectors of this forum do.

Thanks in advance,
Cem

  1. The CHECK is almost certainly a function or macro that is checking the return value of the cuda runtime API call (cudaMemcpy) and doing something sensible if the return value indicates an error of some sort (e.g. printing something to the console, or exiting the program, or something like that.) Removing it isn’t going to change the functional behavior of the cudaMemcpy call.

  2. When running CUDA operations in multiple threads, there is contention within the CUDA runtime API to perform various operations. This isn’t well documented, although you can find evidence and reports of it here on these forums. I’m not certain that that would cause “random 30ms spikes” but it will certainly create some variability in multi-threaded applications. cudaMemcpy also happens to be a blocking and synchronizing call, which means it will force all CUDA activity issued to the device to complete, before it commences the copy operation and before it returns control to the host thread. If you are issuing work in multiple threads, then the cudaMemcpy call will wait for all previously issued work to complete (whether in its own thread or another thread) before proceeding. If I had to guess, this is what I would guess. You are issuing work “randomly”, and “randomly” the issuance of the cudaMemcpy call happens to line up with a sequence of work that was “just issued” and so that call waits.

  3. I’m not aware of any simple or obvious way to resolve it, and I don’t really know the cause, it’s just speculation. Obviously, if multi-threading is an issue here, then doing/issuing work from one thread may improve things. There may also be Jetson AGX-specific behavior here that I am not aware of. In that case, asking on the Jetson AGX sub-forum may get you more informative answers. Profiling your application and studying the reported activity around the offending call might also shed some light on it.

Thank you very much for the detailed response. I’ve found one of your answers in stackoverflow, including it for further reference. cuda - About cudaMemcpyAsync Function - Stack Overflow

I tried to use a new stream and cudaMemcpyAsync() function. The performance issue of the Host <–> Device transfer is gone. But now my inference times have gone up.

To be precise the execution of IExecutionContext was taking 45 to 80ms randomly before, now it is taking 80+ms. This is probably due to reduced overhead from memCopy reflecting to conflicting resource usage inside the GPU.

From what I observe, due to multiple-threads, the execution of faster (superresolution, SR) network starts and finishes inside the execution time of slower (object detection, OD) network. If I disable SR network, the execution of OD network is back to its original 45ms. So it is certain that there is resource conflict here.

Edit: I tried using EnqueueV2 and cudaStreamSynchronize instead of executeV2. Inference time didn’t change.
Edit2: To be more precise
//time1
copyInputToDeviceAsync
EnqueueV2
copyOutputToHostAsync
//time2
cudaStreamSynchronize(this->mStream);
//time3

time2-time1 is 5 ms, time3-time2 is 75+ ms instead of ExecuteV2 being 80+ms itself. I think this is due to Enqueue and Async copy being non-blocking. (remind you whole execution time is <45ms for one network present)

More resources on multi-thread Synchronize calls for further reference, though this doesn’t explain the reason for increased execution times for two separate streams. multithreading - cudaStreamSynchronize behavior under multiple threads - Stack Overflow

Is there any tool or library to probe the performance or allocation of GPU resources for Jetson AGX?
Edit3: Sadly nvprof is not available for aarch64 it seems. https://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf