Issue Description: I am running autonomous driving processes on Orin X, which has 12 CPU cores. The perception module runs multiple inference models in parallel. When the perception process runs alone, it consumes about 200% CPU (i.e., 2 cores fully utilized). In that case, although GPU resources are limited, model inference latency is relatively short and stable. However, when other processes such as localization and planning/control are started alongside it, total system CPU usage reaches about 800%, and perception inference latency becomes longer and fluctuates significantly. The other processes do not use GPU resources. What is the problem, and how should I debug it?
Dear @qiuwen ,
Could this be caused by the perception CPU threads being migrated to different cores when the scheduler places the new processes? Did you try setting CPU affinity, or pinning the process to specific cores, to avoid migration to another core, which could introduce delays in launching GPU kernels? Also, check whether memory utilization by the other processes is affecting the perception process. You can use the nsys timeline view to get more details.
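For reference, the affinity suggestion can be tried from inside the process itself. Below is a minimal Python sketch, assuming a Linux system; the helper name `pin_to_cores` and the core numbers are made up for illustration and are not from the original setup. It only uses `os.sched_getaffinity` and `os.sched_setaffinity` from the standard library.

```python
import os

def pin_to_cores(requested, pid=0):
    """Restrict `pid` (0 = the caller) to the requested CPU cores.

    Hypothetical helper for illustration. Only cores that the process is
    already allowed to use are kept, so the call still works when a
    cpuset has narrowed the allowed set.
    """
    allowed = os.sched_getaffinity(pid)       # cores currently permitted
    target = allowed & set(requested)         # drop cores we may not use
    if not target:
        raise ValueError(
            f"none of {sorted(requested)} are in the allowed set {sorted(allowed)}"
        )
    os.sched_setaffinity(pid, target)         # apply the new mask
    return os.sched_getaffinity(pid)          # report what is now in effect
```

For example, `pin_to_cores(range(4, 9))` would request cores 4 to 8 (an assumed layout). On Linux the mask applies to the calling thread and is inherited by threads created afterwards, so it should be set early during startup. Note that pinning only keeps the perception threads on those cores; it does not stop other processes from being scheduled there too.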
Dear @SivaRamaKrishnaNV
I tried using cpuset to allocate 5 cores to the perception process, and the situation improved slightly, but the problem still persists. I also used nsys to collect performance data and found that there is no significant difference in memory bandwidth usage when running only the perception process versus running all processes. What should I do next to debug and resolve this issue?
Did you check the process timeline view in nsys? Compare the inference of a single frame in both cases and see what differs between them (for example, delays in CUDA kernel launches, CPU context switches, etc.).
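As a quick cross-check of the context-switch part outside nsys, the kernel's per-task counters can be sampled in both scenarios. This is a rough Python sketch, assuming a Linux `/proc` filesystem; the function names are hypothetical. As far as I know, `/proc/<pid>/status` reports the counters of the main thread only, so for a specific inference thread the path `/proc/<pid>/task/<tid>/status` would be the one to pass.

```python
import time

_KEYS = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")

def parse_ctxt_switches(status_text):
    """Extract the two context-switch counters from a /proc status dump."""
    counts = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in _KEYS:
            counts[key] = int(value.strip())
    return counts

def ctxt_switch_delta(status_path, interval_s=1.0):
    """Return how much each counter grew over `interval_s` seconds."""
    def read():
        with open(status_path) as f:
            return parse_ctxt_switches(f.read())
    before = read()
    time.sleep(interval_s)
    after = read()
    return {key: after[key] - before[key] for key in before}
```

Run `ctxt_switch_delta("/proc/<pid>/status")` once with perception alone and once with all processes started. A clear rise in `nonvoluntary_ctxt_switches` in the second case would suggest the perception threads are being preempted, which would be consistent with delayed kernel launches on the timeline.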