I am testing multiple GPU-using applications on a TX2.
Each test application creates its own stream and calls CUDA functions asynchronously using stream callbacks. The applications use the GPU periodically.
I have observed two abnormal behaviors during the test:
- Creating a stream takes too long (10~20 s).
- Copying data from host to device sometimes takes too long (15 ms).
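For reference, here is a minimal sketch of the pattern described above, which could be used to time the two operations outside the profiler. It is not the actual application code: the names `measure`, `on_copy_done`, `Timings`, and the 1 MiB buffer size are made up for illustration, and it assumes pinned host memory and a single stream per process. The `cudaFree(nullptr)` call is only there to force context initialization, so that it is not counted in the stream creation time.

```cuda
#include <atomic>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical result type for this sketch.
struct Timings {
    double create_ms;     // wall time of cudaStreamCreate
    double copy_ms;       // wall time of async H2D copy + callback + sync
    bool callback_fired;  // whether the stream callback ran
};

// Stream callback: runs on a runtime thread once the preceding work is done.
static void CUDART_CB on_copy_done(cudaStream_t, cudaError_t, void* user) {
    static_cast<std::atomic<bool>*>(user)->store(true);
}

static double ms_since(std::chrono::steady_clock::time_point t0) {
    auto dt = std::chrono::steady_clock::now() - t0;
    return std::chrono::duration<double, std::milli>(dt).count();
}

Timings measure(size_t bytes) {
    // Force context creation first so it is not attributed to the stream.
    CHECK(cudaFree(nullptr));

    auto t0 = std::chrono::steady_clock::now();
    cudaStream_t stream;
    CHECK(cudaStreamCreate(&stream));
    double create_ms = ms_since(t0);

    void* host = nullptr;
    void* dev = nullptr;
    CHECK(cudaMallocHost(&host, bytes));  // pinned host buffer (assumption)
    CHECK(cudaMalloc(&dev, bytes));
    memset(host, 0, bytes);

    std::atomic<bool> fired(false);
    t0 = std::chrono::steady_clock::now();
    CHECK(cudaMemcpyAsync(dev, host, bytes, cudaMemcpyHostToDevice, stream));
    CHECK(cudaStreamAddCallback(stream, on_copy_done, &fired, 0));
    CHECK(cudaStreamSynchronize(stream));  // waits for the copy and the callback
    double copy_ms = ms_since(t0);

    CHECK(cudaFree(dev));
    CHECK(cudaFreeHost(host));
    CHECK(cudaStreamDestroy(stream));
    return Timings{create_ms, copy_ms, fired.load()};
}

int main() {
    Timings t = measure(1 << 20);  // 1 MiB, arbitrary size for illustration
    printf("stream create: %.3f ms, H2D copy: %.3f ms, callback fired: %d\n",
           t.create_ms, t.copy_ms, t.callback_fired ? 1 : 0);
    return 0;
}
```

Running several instances of this at the same time should show whether the delays also appear without the rest of the application logic.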
I have attached the nvprof log files as a zip file. Please import all of the logs in nvvp using “multiple processes mode”.
The long cudaStreamCreate calls can be observed at the beginning of each process.
You can see the abnormal copy time at around 116.27 s in the lanechange process in nvvp.
Can anyone share an opinion on what could cause these delays and how I can avoid them?
nvprof_log.zip (3.31 MB)