When I run YOLOv3 on AGX (Python version, accelerated with TensorRT), inference is fast, but NMS is slow. I'm sure NMS is very fast when tested separately. So my question is, how does AGX allocate computing resources?
May I know how you implement the NMS operation?
If you use the TensorRT implementation, it is CUDA code and should get resources similar to those of other kernel tasks.
If NMS is implemented on the CPU, it depends on the previous GPU job and might need to wait for that job to finish.
Hi, thanks for your reply.
In my code, NMS is implemented on the GPU, and the output of YOLOv3 inference is also on the GPU, so I don't think it is waiting on the previous GPU job. I timed each part of the NMS code, and the results show that “torch.nonzero” takes about 16 ms, which is too slow.
When tested separately, it only takes about 0.2 ms (as shown in the figure).
Is this a matter of computing resource allocation?
Another possible reason is that there is some dependency between other layers and NMS.
In that case, NMS needs to wait for its input to be ready.
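One way to check whether the 16 ms is wait time rather than compute: CUDA kernels launched from PyTorch run asynchronously, and torch.nonzero has to wait for the GPU because the size of its output depends on the data. Without an explicit synchronize, a timer around it may therefore also count kernels that were queued earlier. The sketch below synchronizes before starting and before stopping the clock. The helper names (gpu_sync, timed_ms) and the mask variable in the usage comment are hypothetical, not taken from the code in this thread.

```python
import time

try:
    import torch
    _HAS_CUDA = torch.cuda.is_available()
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None
    _HAS_CUDA = False


def gpu_sync():
    """Block until all queued CUDA work has finished (no-op without CUDA)."""
    if _HAS_CUDA:
        torch.cuda.synchronize()


def timed_ms(fn, *args, sync=gpu_sync):
    """Return (result, elapsed milliseconds) for fn(*args).

    Synchronizing before starting the clock keeps earlier asynchronous
    kernels (for example the forward pass) out of the measurement;
    synchronizing before stopping it makes sure fn's own kernels are counted.
    """
    sync()
    start = time.perf_counter()
    result = fn(*args)
    sync()
    return result, (time.perf_counter() - start) * 1000.0


# Hypothetical usage inside the NMS code, where `mask` is a CUDA tensor:
# indices, ms = timed_ms(torch.nonzero, mask)
# print(f"torch.nonzero: {ms:.3f} ms")
```

If the number measured this way drops close to the standalone 0.2 ms, the remaining time most likely belongs to the work queued before NMS rather than to torch.nonzero itself.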
It’s recommended to check the performance with nvprof first.
It can show you performance details as well as GPU utilization.
Please run the following command to generate the profiling file.
$ sudo /usr/local/cuda-10.2/bin/nvprof -o output.nvvp python3 [your app]
Then open output.nvvp with NVIDIA Visual Profiler on the host. The profiler is included in the CUDA Toolkit.
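To make the NMS stage easier to find on the profiler timeline, the Python code can be annotated with NVTX ranges, which PyTorch exposes through torch.cuda.nvtx. This is a minimal sketch, assuming a PyTorch build with CUDA support. The nvtx_range helper and the model and nms names in the usage comment are hypothetical.

```python
from contextlib import contextmanager

try:
    import torch
    _USE_NVTX = torch.cuda.is_available()
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None
    _USE_NVTX = False


@contextmanager
def nvtx_range(name):
    """Mark a named range on the profiler timeline (no-op without CUDA)."""
    if _USE_NVTX:
        torch.cuda.nvtx.range_push(name)
    try:
        yield
    finally:
        # Always close the range, even if the wrapped code raises.
        if _USE_NVTX:
            torch.cuda.nvtx.range_pop()


# Hypothetical usage inside the detection loop:
# with nvtx_range("inference"):
#     preds = model(images)
# with nvtx_range("nms"):
#     keep = nms(preds)
```

The named ranges should then appear alongside the kernels in the timeline, which helps show whether the NMS range is mostly idle time spent waiting for earlier kernels.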
Thanks. I will try it!