yolo3 on Xavier system affects other applications that use CUDA

Hello,
We use Autoware as the test program on the E3550 and compare it against another system. On the E3550, the yolo3 cudaMemcpy always takes more than 100 ms and affects the GPU computation of other applications, as the nvprof output shows.
(https://drive.google.com/file/d/1htcPqccQeLo-uUM7ngs_lixjWdrIu-Rj/view?usp=sharing)
The same test program on the other system takes less than 30 ms. Is there any information about the root cause, or any suggestion for tuning? Thanks.

Below is our test program information and the NVIDIA Visual Profiler export file for analysis:
Test Program Versions:
ndt_matching_gpu from Autoware 1.10
yolo3 vision detector included in Autoware
bag file containing camera images
yolo3 weight file from: (https://pjreddie.com/media/files/yolov3.weights)

Configuration and Test Steps:
1. Use the nvprof command to record all processes using the GPU.
2. Follow the Autoware user guide to start the runtime manager and select the bag with camera images.
3. Enable the functions on the map and sensing pages.
4. Enable ndt_matching on the computing page with this config: (https://drive.google.com/file/d/1ZsSDafsJOynLs_5utcsUCiXcpig1tA7m/view?usp=sharing)
5. Select the yolo3 weight file and enable vision detection on the computing page.
6. Close the Autoware programs and then close nvprof.
7. The resulting Visual Profiler export file is available at this link, and it shows that ndt_matching is affected by yolo3. (https://drive.google.com/file/d/1XeDTO_Wy7yJZCNIVgCrZcToAWJT2dNC4/view?usp=sharing)

Dear cyan.chiu,
It could be due to a bandwidth difference between the E3550 and the other system. Could you please run the CUDA bandwidth sample on both to confirm this?
Also, could you take a look at our YOLOv3 sample in TensorRT (https://docs.nvidia.com/deeplearning/sdk/tensorrt-sample-support-guide/index.html#yolov3_onnx)? Once you have the ONNX file from the Python sample on the host, you can use the C++ APIs on the Drive AGX platform to perform inference.

Dear SivaRamaKrishna,
Thanks for your reply. I will check the CUDA bandwidth and see whether there is any more information to analyze.

Dear SivaRamaKrishna,
The CUDA bandwidth results are below for your reference. Bandwidth does not look like the root cause, since the other system's host-to-device bandwidth is lower than that of the Xavier systems.
Regarding the yolo3 sample, do you think we should use TensorRT on the NVIDIA system rather than the original yolo3?

Machine          GPU (Device 0)      Bandwidth (MB/s)
                                     Host to Device   Device to Host   Device to Device
Another System   GeForce GTX 1080    12007.4          12850.8          231363.8
Xavier-A         Xavier              32565.6          32524.1           89942.9
Xavier-B         Xavier              32701.9          32626.1          102595.3
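
As a rough sanity check on the numbers above, the ideal host-to-device copy time can be estimated from the measured bandwidth. This is only a sketch: the buffer size (one 416x416 RGB image stored as float32) is an assumption, since the thread does not state how much data the yolo3 node actually copies, and 1 MB is treated as 10**6 bytes.

```python
# Host-to-device bandwidths reported in the table above (MB/s).
H2D_MB_PER_S = {
    "GeForce GTX 1080": 12007.4,
    "Xavier-A": 32565.6,
    "Xavier-B": 32701.9,
}

# ASSUMPTION: hypothetical buffer, one 416x416 RGB image as float32.
# The real size used by the yolo3 node is not given in the thread.
BUFFER_BYTES = 416 * 416 * 3 * 4


def copy_time_ms(num_bytes, mb_per_s):
    """Ideal transfer time in milliseconds, treating 1 MB as 10**6 bytes."""
    return num_bytes / (mb_per_s * 1e6) * 1e3


if __name__ == "__main__":
    for name, bandwidth in H2D_MB_PER_S.items():
        print(f"{name}: {copy_time_ms(BUFFER_BYTES, bandwidth):.3f} ms")
```

Under these assumptions every system would need well under 1 ms for the raw transfer, which is consistent with the conclusion that bandwidth alone does not explain a cudaMemcpy call of more than 100 ms.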

Dear cyan.chiu,
I can see in the nvvp files that the cudaMemcpy execution time in the GPU context is small, but the Runtime API call takes a long time. So a large amount of time is spent between the CPU thread issuing the call and the work executing in the GPU context. Could you share details about the load on the system, such as any other CPU threads running in parallel? Is there any other GPU context running on the board (for display or for another purpose)? Is it possible to run this test alone to get more insight?

If you are looking to reduce YOLOv3 inference time, we have included a YOLOv3 TensorRT sample in TensorRT 5.1.0 RC. Please make use of that.

Dear cyan.chiu,
I zoomed into the timeline to look carefully. I need to clarify a few things here.

  • Please run the CUDA deviceQuery sample on the GTX 1080 and the Xavier iGPU to see the specs of both GPUs.
  • The Runtime API timeline shows when a call was issued from the CPU thread and when it finished.
  • I can see a couple of CUDA kernel launches before the cudaMemcpy. Note that cudaMemcpy is a blocking call and kernel launches are non-blocking calls. So in this case, the CPU is blocked until all the kernels and the memcpy operation have finished. The actual time consumed by the memcpy can be seen in the memcpy timeline, and it is much smaller. I hope this clears up the confusion.
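
The effect described in the last point can be illustrated with a toy timeline model. This is a plain Python sketch, not CUDA code: the function name `simulate`, the kernel durations, the copy duration, and the launch overhead are all made-up values for illustration and are not taken from the profile.

```python
def simulate(kernel_ms, copy_ms, launch_overhead_ms=0.01):
    """Toy model of one stream: asynchronous kernel launches, then a blocking copy.

    Returns the copy duration as seen on the Runtime API (CPU) timeline and
    as seen on the GPU memcpy timeline. All inputs are hypothetical.
    """
    cpu_t = 0.0     # current time on the CPU thread
    gpu_free = 0.0  # time at which the GPU finishes all queued work

    for duration in kernel_ms:
        cpu_t += launch_overhead_ms                 # launch returns almost at once
        gpu_free = max(gpu_free, cpu_t) + duration  # kernel runs after earlier work

    api_start = cpu_t                         # CPU thread enters the blocking copy
    copy_start = max(gpu_free, api_start)     # copy waits for the queued kernels
    copy_end = copy_start + copy_ms
    return {
        "api_ms": copy_end - api_start,       # Runtime API timeline
        "gpu_copy_ms": copy_end - copy_start  # memcpy timeline
    }


if __name__ == "__main__":
    # Three hypothetical kernels queued before a small copy.
    print(simulate([30.0, 40.0, 25.0], copy_ms=0.2))
    # Same copy with nothing queued in front of it.
    print(simulate([], copy_ms=0.2))
```

In this model the copy itself always takes 0.2 ms on the GPU, but the API-side duration grows to roughly the sum of the queued kernel times whenever kernels were launched just before it. A similar pattern in a real profile would suggest the time is being charged to the copy call rather than consumed by the copy.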

My previous post failed, so I am posting again.

Dear SivaRamaKrishna,
We recorded all processes while running the yolo3 feature alone. The nvvp file is below:
https://drive.google.com/open?id=17Tv0a5McxtbdIZ8jorg68kF6zRPSrZDd

Here are htop screenshots for your information:
yolo3 environment set up and ready: https://drive.google.com/open?id=1IPdSOOugni371Mdr-GLuZrXtTYnnlidI
yolo3 running: https://drive.google.com/open?id=1EB0pFY47oUDz-2IjDQGJQA9I_gyYF5E0

Let me confirm something about nvprof: if we start nvprof and then start yolo3, it records only the one thread running yolo3. Does this mean that only yolo3 is using the GPU at that point, or can nvprof not record threads that were started before it?
How can we find out more accurately which processes use the GPU on Xavier, to help find the root cause of this issue, given that nvidia-smi is not supported on Xavier?

Dear cyan.chiu,
You can check iGPU utilization using tegrastats (https://docs.nvidia.com/drive/active/5.0.13.0L/nvvib_docs/index.html#page/DRIVE%2520Linux%2520DDP%2520PDK%2520Development%2520Guide%2FUtilities%2FAppendixTegraStats.html%23). However, unlike nvidia-smi, it does not show which processes are using the GPU.
From what I see in the profiling results, it is expected that cudaMemcpy waits until all the GPU kernels launched before it have finished, because all CUDA calls are issued on the default stream. You would see this behaviour on any GPU with this code. It is primarily a result of the way the CUDA kernel calls are launched in the code.
You can explore the TensorRT YOLOv3 sample if you wish to get an optimized TensorRT model for the Drive AGX platform.