Thanks for your reply.
I have tried all the suggestions you provided earlier, but they didn't work. Due to confidentiality constraints, I can't share the source code.
I used the hybrid runtime code from GitHub - NVIDIA-AI-IOT/cuDLA-samples (YOLOv5 on Orin DLA) to deploy and profile my DLA model. Here is the tegrastats output:
07-25-2024 10:47:28 RAM 3636/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 0%@665 GR3D_FREQ 0%@[0,0] VIC_FREQ 115 APE 174 CV0@-256C CPU@47.968C Tboard@36C SOC2@44.656C Tdiode@37.25C SOC0@45.031C CV1@-256C GPU@-256C tj@47.968C SOC1@43.875C CV2@-256C VDD_GPU_SOC 1945mW/1945mW VDD_CPU_CV 388mW/388mW VIN_SYS_5V0 2214mW/2214mW NC 0mW/0mW VDDQ_VDD2_1V8AO 201mW/201mW NC 0mW/0mW
07-25-2024 10:47:29 RAM 3636/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,1%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 0%@204 GR3D_FREQ 0%@[0,0] VIC_FREQ 115 APE 174 CV0@-256C CPU@47.875C Tboard@36C SOC2@44.75C Tdiode@37.5C SOC0@45.25C CV1@-256C GPU@-256C tj@47.875C SOC1@44C CV2@-256C VDD_GPU_SOC 1945mW/1945mW VDD_CPU_CV 388mW/388mW VIN_SYS_5V0 2217mW/2215mW NC 0mW/0mW VDDQ_VDD2_1V8AO 201mW/201mW NC 0mW/0mW
07-25-2024 10:47:30 RAM 3702/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [4%@1728,12%@1728,0%@1728,0%@1728,0%@729,0%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 0%@665 GR3D_FREQ 11%@[305,0] VIC_FREQ 115 NVDLA0_FREQ @1369 APE 174 CV0@46.468C CPU@48.25C Tboard@36C SOC2@44.781C Tdiode@37.5C SOC0@45.031C CV1@45.531C GPU@42.906C tj@48.062C SOC1@44C CV2@42.437C VDD_GPU_SOC 3110mW/2333mW VDD_CPU_CV 1166mW/647mW VIN_SYS_5V0 2923mW/2451mW NC 0mW/0mW VDDQ_VDD2_1V8AO 504mW/302mW NC 0mW/0mW
07-25-2024 10:47:31 RAM 3702/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [4%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 9%@2133 GR3D_FREQ 56%@[305,0] VIC_FREQ 115 NVDLA0_FREQ @1369 APE 174 CV0@46.593C CPU@48.156C Tboard@36C SOC2@44.906C Tdiode@37.5C SOC0@45.375C CV1@45.5C GPU@42.625C tj@48.156C SOC1@44.125C CV2@42.687C VDD_GPU_SOC 3884mW/2721mW VDD_CPU_CV 1553mW/873mW VIN_SYS_5V0 4947mW/3075mW NC 0mW/0mW VDDQ_VDD2_1V8AO 1716mW/655mW NC 0mW/0mW
07-25-2024 10:47:32 RAM 3702/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 14%@2133 GR3D_FREQ 55%@[305,0] VIC_FREQ 115 NVDLA0_FREQ @1369 APE 174 CV0@46.656C CPU@48.093C Tboard@36C SOC2@44.937C Tdiode@37.5C SOC0@45.187C CV1@45.75C GPU@42.843C tj@48.093C SOC1@44.312C CV2@42.468C VDD_GPU_SOC 3884mW/2953mW VDD_CPU_CV 1553mW/1009mW VIN_SYS_5V0 5249mW/3510mW NC 0mW/0mW VDDQ_VDD2_1V8AO 1918mW/908mW NC 0mW/0mW
07-25-2024 10:47:33 RAM 3702/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [0%@729,1%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 17%@2133 GR3D_FREQ 55%@[305,0] VIC_FREQ 115 NVDLA0_FREQ @1369 APE 174 CV0@47.156C CPU@48.218C Tboard@36C SOC2@45.125C Tdiode@37.5C SOC0@45.281C CV1@45.593C GPU@42.875C tj@48.218C SOC1@44.062C CV2@42.687C VDD_GPU_SOC 3884mW/3108mW VDD_CPU_CV 1553mW/1100mW VIN_SYS_5V0 5249mW/3799mW NC 0mW/0mW VDDQ_VDD2_1V8AO 1918mW/1076mW NC 0mW/0mW
07-25-2024 10:47:34 RAM 3702/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [2%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 18%@2133 GR3D_FREQ 55%@[305,0] VIC_FREQ 115 NVDLA0_FREQ @1369 APE 174 CV0@46.625C CPU@48.281C Tboard@36C SOC2@45.062C Tdiode@37.75C SOC0@45.218C CV1@45.781C GPU@42.843C tj@48.187C SOC1@44.375C CV2@42.656C VDD_GPU_SOC 3884mW/3219mW VDD_CPU_CV 1553mW/1164mW VIN_SYS_5V0 5350mW/4021mW NC 0mW/0mW VDDQ_VDD2_1V8AO 1918mW/1196mW NC 0mW/0mW
07-25-2024 10:47:35 RAM 3637/30593MB (lfb 5031x4MB) SWAP 0/15296MB (cached 0MB) CPU [0%@729,0%@729,1%@729,0%@729,4%@729,4%@729,0%@729,0%@729,off,off,off,off] EMC_FREQ 11%@2133 GR3D_FREQ 0%@[0,0] VIC_FREQ 115 APE 174 CV0@-256C CPU@48.312C Tboard@36C SOC2@44.937C Tdiode@37.75C SOC0@45.218C CV1@-256C GPU@-256C tj@48.156C SOC1@44.156C CV2@-256C VDD_GPU_SOC 3110mW/3205mW VDD_CPU_CV 1165mW/1164mW VIN_SYS_5V0 3729mW/3984mW NC 0mW/0mW VDDQ_VDD2_1V8AO 604mW/1122mW NC 0mW/0mW
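To correlate the log with the task submission window, the GR3D_FREQ load can be pulled out of each tegrastats line with a small script like this (a minimal sketch; the regex assumes the `GR3D_FREQ <load>%@[<freq0>,<freq1>]` field format shown in the log above):

```python
import re

def gr3d_load(line):
    """Return the GR3D (GPU) load percentage from a tegrastats line, or None."""
    m = re.search(r"GR3D_FREQ (\d+)%", line)
    return int(m.group(1)) if m else None

# Abbreviated samples from the log above: idle vs. during cudlaSubmitTask.
samples = [
    "07-25-2024 10:47:28 ... GR3D_FREQ 0%@[0,0] ...",
    "07-25-2024 10:47:31 ... GR3D_FREQ 56%@[305,0] ...",
]

for line in samples:
    print(gr3d_load(line))
```

In the log, GR3D_FREQ sits at 0% before 10:47:30, jumps to 11% and then 55-56% while the DLA task is in flight, and drops back to 0% afterward.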
When the submitTask interface of cuDLA is called, the GR3D_FREQ metric rises significantly. I believe this metric indicates GPU utilization. Why does submitting a DLA task cause GPU usage to rise? My DLA model is a standalone model and is expected to run entirely on the DLA. By comparison, when the submitTask call is commented out, GR3D_FREQ stays at 0.
Here is the Nsight Systems profiling result; I can't figure out where the GPU is being used.
I want to know whether this unexpected GPU usage explains the lower-than-expected performance when the DLA model is integrated into my GPU-intensive application. Where does this GPU usage come from, and how can it be avoided?
Thanks.