Understanding CudaApiSum results from Nsys


I am trying to analyze network inference (YoloV4 with a 608x608 input size) using nsys on the Jetson AGX platform. I use CUDA allocators, a new stream, an execution context, enqueue, asynchronous memory copies, and so on. I need to optimize the run time, but I cannot understand the number of calls to certain methods or their run times. I tried two runs, one with explicit synchronization calls and one without. I have separate questions for each.

Figure1: NSYS results for 23 frames with cudaStreamSynchronize() after every frame

Figure2: NSYS results for 22 frames without cudaStreamSynchronize()

My code, in pseudocode form, is as follows:

in Main()
    create a std::thread for network_infer_fnc(), pass input and output cv::Mats
    read the video frame by frame

in network_infer_fnc()
    create inference class, initialize params
    call class.builder(), which reads the .engine file, sets flags, creates a stream with cudaStreamCreate(), and creates the execution context
    in a loop controlled by Main
        call class.infer()

in class.infer()
    create buffers with cudaMalloc/cudaFree and cudaMallocHost/cudaFreeHost
    class.processInput() -> pass input pixels to host buffer
    cudaMemcpyAsync() to pass host buffer to device
    cudaMemcpyAsync() to read device output buffer into host



  1. In Figure 1, why does cudaStreamSynchronize() take 25 ms on average? How can I reduce it?
  2. In Figure 1, what is cudaStreamCreateWithFlags(), and why is it there? Why does it take a whopping 123 ms on average? How can I get rid of it? This seems to be a one-time cost, so I withdraw this question.
  3. In Figure 1, why are there 26 calls to cudaStreamSynchronize() when I called the infer method only 23 times? What might cause the extra calls?
  4. In Figure 1, why are there 424 cudaFree, 177 cudaMalloc, 347 cudaFreeHost, and 115 cudaMallocHost calls? Shouldn't each count equal the number of times I create buffers, which is 23?
  5. In Figure 1, why does cudaFree() take 3.96 ms on average? How can I reduce it?
  6. In Figure 2, the synchronization overhead seems to have moved to cudaFreeHost(). I tried to heed the advice given under "Problem 6: Excessive Synchronization" in the linked presentation. Do I need to start and finish processing and outputting the entire video sequence inside the same method to prevent certain destructors from triggering implicit synchronization? Or what else should I do? The linked presentation falls short of suggesting a working solution.

That’s a lot of questions. I’d be glad even if you could answer only some of them. I don’t think I am running network inference efficiently, and I would like to know the tricks to improve it.

Thanks in advance,

You might get better answers to Nsight Systems questions by asking on the Nsight Systems forum. Jetson AGX questions may also get better help on the Jetson AGX forum.

It’s generally difficult to answer questions about the behavior of CUDA API calls that are issued from within a library call, or without access to the source code.
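
That said, the pseudocode does show buffers being created and freed inside infer(), once per frame. A common pattern is to allocate them once and reuse them for every frame, which takes the allocation and free calls, and any implicit synchronization they may cause, out of the per-frame path. Below is a minimal sketch of that pattern, not your actual code. ReusableBuffer, infer_one_frame, run_frames and the buffer sizes are hypothetical names and values. The allocator is passed in as a function so the sketch runs without a GPU: it uses malloc/free here, and in real code you would pass thin wrappers around cudaMalloc/cudaFree or cudaMallocHost/cudaFreeHost. It will not remove calls issued from inside library code.

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical helper: owns one buffer for its whole lifetime.
// AllocFn/FreeFn are stand-ins for cudaMalloc/cudaFree or
// cudaMallocHost/cudaFreeHost wrappers in real code.
class ReusableBuffer {
public:
    using AllocFn = std::function<void*(std::size_t)>;
    using FreeFn  = std::function<void(void*)>;

    ReusableBuffer(std::size_t bytes, AllocFn alloc, FreeFn release)
        : bytes_(bytes), release_(std::move(release)), ptr_(alloc(bytes)) {
        if (!ptr_) throw std::runtime_error("allocation failed");
    }
    ~ReusableBuffer() { release_(ptr_); }  // freed once, at teardown

    ReusableBuffer(const ReusableBuffer&) = delete;
    ReusableBuffer& operator=(const ReusableBuffer&) = delete;

    void* data() const { return ptr_; }
    std::size_t size() const { return bytes_; }

private:
    std::size_t bytes_;
    FreeFn release_;
    void* ptr_;
};

// Counters that make the number of allocations and frees visible.
static int g_allocs = 0;
static int g_frees = 0;
void* counting_alloc(std::size_t n) { ++g_allocs; return std::malloc(n); }
void counting_free(void* p) { ++g_frees; std::free(p); }

// Hypothetical per-frame work: only touches existing buffers.
// Real code would fill the host buffer, issue cudaMemcpyAsync H2D,
// enqueue inference, then cudaMemcpyAsync D2H on the same stream.
void infer_one_frame(ReusableBuffer& input, ReusableBuffer& output) {
    unsigned char* in = static_cast<unsigned char*>(input.data());
    unsigned char* out = static_cast<unsigned char*>(output.data());
    in[0] = 1;
    out[0] = in[0];
}

// Allocates both buffers once, then processes every frame with them.
// Returns the total number of allocations performed so far.
int run_frames(int frames) {
    // Sizes are placeholders: 608x608x3 bytes in, 1024 bytes out.
    ReusableBuffer input(608 * 608 * 3, counting_alloc, counting_free);
    ReusableBuffer output(1024, counting_alloc, counting_free);
    for (int i = 0; i < frames; ++i) {
        infer_one_frame(input, output);
    }
    return g_allocs;
}
```

Calling run_frames(23) performs two allocations and two frees in total, regardless of the frame count. In the profiled program the equivalent change would be to create the buffers in builder() or in the class constructor and release them in the destructor, after the last frame has been processed.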