Abnormal overhead during stream creation and data copy.

I am testing multiple gpu-using applications on TX2.

Each test application creates its own stream and calls CUDA functions asynchronously, using stream callbacks. The applications use the GPU periodically.
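For reference, each application follows roughly this pattern (a simplified sketch, not the actual test code; buffer sizes and names are illustrative):

```cuda
#include <cstdio>
#include <cstdint>
#include <cuda_runtime.h>

// Host callback fired once all preceding work in the stream has finished.
// Note: CUDA API calls are not allowed inside a stream callback.
static void CUDART_CB onCopyDone(cudaStream_t stream, cudaError_t status, void *userData)
{
    printf("chunk %ld done, status = %d\n", (long)(intptr_t)userData, (int)status);
}

int main()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);   // this is the call that takes 10~20 s in my logs

    const size_t n = 1 << 20;
    float *hbuf, *dbuf;
    cudaMallocHost(&hbuf, n * sizeof(float));   // pinned host buffer
    cudaMalloc(&dbuf, n * sizeof(float));

    // Asynchronous host-to-device copy followed by a completion callback.
    cudaMemcpyAsync(dbuf, hbuf, n * sizeof(float), cudaMemcpyHostToDevice, stream);
    cudaStreamAddCallback(stream, onCopyDone, (void *)(intptr_t)0, 0);

    cudaStreamSynchronize(stream);
    cudaFreeHost(hbuf);
    cudaFree(dbuf);
    cudaStreamDestroy(stream);
    return 0;
}
```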

I have observed two abnormal behaviors during the test.

  1. Stream creation takes an abnormally long time (10~20 seconds).
  2. Copying data from host to device sometimes takes abnormally long (15 ms).

I have attached the nvprof log files in a zip file. Please import all logs using "multiple processes mode" in nvvp.
The long cudaStreamCreate calls can be observed at the beginning of each process.
You can see the abnormal copy time at around 116.27 s for the lanechange process in nvvp.

Can anyone share an opinion on what could cause this delay and how I can avoid it?

nvprof_log.zip (3.31 MB)

Chances are, the "time" is actually spent stalling, waiting for other work that is already using the GPU to finish.

What else is the GPU doing when you see the stall? What other programs are running? How big are the kernels? Are you also using a graphical display?


I am not using any graphical display. I turned off the lightdm service and set the GPU and CPU to maximum frequency using jetson_clocks.sh during the test.
According to the nvprof log, there is no other GPU execution at the time of the 15 ms copy delay.

I am still struggling with this problem, trying to find out why the copy sometimes takes so much time.
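One thing I also plan to rule out: if the host buffer is pageable, cudaMemcpy2DAsync() falls back to a staged, effectively synchronous copy, which can add large and variable latency. With a pinned (page-locked) host buffer the copy stays truly asynchronous. A minimal sketch of what I mean (dimensions are illustrative):

```cuda
#include <cuda_runtime.h>

int main()
{
    const size_t width = 1920, height = 1080, elem = sizeof(float);

    // Pinned (page-locked) host allocation instead of plain malloc().
    float *host;
    cudaMallocHost(&host, width * height * elem);

    // 2D device allocation with a padded row pitch.
    float *dev;
    size_t pitch;
    cudaMallocPitch(&dev, &pitch, width * elem, height);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // With a pinned source buffer this copy is genuinely asynchronous.
    cudaMemcpy2DAsync(dev, pitch, host, width * elem,
                      width * elem, height,
                      cudaMemcpyHostToDevice, stream);

    cudaStreamSynchronize(stream);
    cudaFreeHost(host);
    cudaFree(dev);
    cudaStreamDestroy(stream);
    return 0;
}
```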

One possibility is that some kernel operations may preempt the cudaMemcpy2DAsync().
I plan to capture all CPU-side CUDA function calls using nvprof.
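If preemption by other kernels turns out to be the cause, one mitigation I may try (an assumption on my part, not yet verified on TX2) is issuing the copy on a higher-priority stream:

```cuda
#include <cuda_runtime.h>

int main()
{
    // Query the valid priority range; "greatestPriority" is the
    // numerically lowest value and means highest priority.
    int leastPriority, greatestPriority;
    cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority);

    cudaStream_t prioStream;
    cudaStreamCreateWithPriority(&prioStream, cudaStreamNonBlocking,
                                 greatestPriority);

    // ... issue cudaMemcpy2DAsync() and latency-critical kernels
    //     on prioStream here ...

    cudaStreamDestroy(prioStream);
    return 0;
}
```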

I am using the following command to capture the trace.

sudo ./nvprof --profile-all-processes -o output.%p.log

However, it only captures applications launched by root after this command.
(I need to use "sudo" because my test application needs root privileges to assign real-time priority.)

Is there any way to capture all CUDA function calls globally across the system, regardless of user ID and of when the application was launched?
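In the meantime, my workaround is to profile each application separately under sudo; since "%p" expands to the process ID, the per-process logs can still be imported together in nvvp's multiple-processes mode (application names below are illustrative):

```shell
# Profile each test application individually, as root.
sudo nvprof -o lanechange.%p.nvprof ./lanechange
sudo nvprof -o other_app.%p.nvprof ./other_app
```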


Could you set nvpmodel to MAXN:

sudo nvpmodel -m 0
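You can confirm the active mode afterwards, and note that jetson_clocks.sh should be re-run after changing nvpmodel:

```shell
sudo nvpmodel -m 0        # switch to MAXN
sudo nvpmodel -q          # confirm the active power mode
sudo ./jetson_clocks.sh   # then pin clocks to maximum again
```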

Could you tell us which sample you are using? Or attach a sample for us to debug?