Using ONNX Runtime with TensorRT on Jetson Devices

Hello,

I have been trying to use ONNX Runtime with the TensorRT Execution Provider on Jetson devices (TX2, Xavier, Nano) and have had some success with basic models (ResNets). However, when trying to load more complex models with 3D convolutions (in particular SlowFast models), I run into problems.

I have documented the problems in detail here: https://github.com/microsoft/onnxruntime/issues/3240

If anyone has any insights it would be much appreciated!

Hi,

It looks like you expanded the TensorRT workspace to 6GB, which is very close to the TX2 limit (the system itself takes 1.xG).
Could you check the device status with tegrastats to see if the physical memory is enough or not?

 $ sudo tegrastats

Thanks.

Hello, thanks for the quick response.

In the issue I mention that I get Cuda Error in free when ORT_TENSORRT_MAX_WORKSPACE is too small but it also seems to happen when it is too big.
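
For anyone reproducing this from Python, here is a minimal sketch of how the workspace size can be set. ORT_TENSORRT_MAX_WORKSPACE is the variable from this thread and takes a value in bytes. The helper names (set_trt_workspace, make_session) are hypothetical, and the providers argument assumes a reasonably recent onnxruntime build with the TensorRT provider enabled.

```python
import os

GIB = 1024 ** 3  # bytes per GiB


def set_trt_workspace(gib: float) -> int:
    """Set ORT_TENSORRT_MAX_WORKSPACE (in bytes) and return the value written.

    Hypothetical helper: it only writes the environment variable, so it must
    be called before the inference session is created.
    """
    n_bytes = int(gib * GIB)
    os.environ["ORT_TENSORRT_MAX_WORKSPACE"] = str(n_bytes)
    return n_bytes


def make_session(model_path: str):
    """Create a session that prefers TensorRT and falls back to CUDA.

    onnxruntime is imported lazily so the helper above can be used
    (and tested) on a machine without onnxruntime installed.
    """
    import onnxruntime as ort

    return ort.InferenceSession(
        model_path,
        providers=["TensorrtExecutionProvider", "CUDAExecutionProvider"],
    )


if __name__ == "__main__":
    # 2 GiB and 4 GiB are the two settings tried in this thread.
    print(set_trt_workspace(2))
    print(set_trt_workspace(4))
```

Calling set_trt_workspace(2) writes 2147483648 and set_trt_workspace(4) writes 4294967296, which are the two values I tried.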

When I set ORT_TENSORRT_MAX_WORKSPACE=2147483648 on the Jetson TX2, I run into Cuda Error in allocate: 2 (out of memory). My tegrastats output is as follows:

RAM 309/7852MB (lfb 1810x4MB) SWAP 0/3926MB (cached 0MB) CPU [1%@345,off,off,0%@345,3%@345,1%@345] EMC_FREQ 2%@1331 GR3D_FREQ 0%@114 APE 150 PLL@40.5C MCPU@40.5C PMIC@100C Tboard@37C GPU@38.5C BCPU@40.5C thermal@39.4C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 765/768 VDD_4V0_WIFI 0/0 VDD_IN 3024/3028 VDD_SYS_CPU 153/156 VDD_SYS_DDR 537/538
RAM 310/7852MB (lfb 1810x4MB) SWAP 0/3926MB (cached 0MB) CPU [1%@345,off,off,0%@345,0%@345,0%@345] EMC_FREQ 2%@1331 GR3D_FREQ 0%@114 APE 150 PLL@40C MCPU@40C PMIC@100C Tboard@37C GPU@38.5C BCPU@40C thermal@39.6C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 765/767 VDD_4V0_WIFI 0/0 VDD_IN 3024/3028 VDD_SYS_CPU 153/155 VDD_SYS_DDR 537/538
RAM 309/7852MB (lfb 1810x4MB) SWAP 0/3926MB (cached 0MB) CPU [1%@345,off,off,0%@345,0%@345,0%@345] EMC_FREQ 2%@1331 GR3D_FREQ 0%@114 APE 150 PLL@40C MCPU@40C PMIC@100C Tboard@37C GPU@38.5C BCPU@40C thermal@39.4C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 765/767 VDD_4V0_WIFI 0/0 VDD_IN 3024/3028 VDD_SYS_CPU 153/155 VDD_SYS_DDR 537/538
RAM 325/7852MB (lfb 1797x4MB) SWAP 0/3926MB (cached 0MB) CPU [40%@652,off,off,3%@652,1%@652,0%@652] EMC_FREQ 2%@1331 GR3D_FREQ 0%@114 APE 150 PLL@40.5C MCPU@40.5C PMIC@100C Tboard@37C GPU@38.5C BCPU@40.5C thermal@39.7C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 841/770 VDD_4V0_WIFI 0/0 VDD_IN 3942/3060 VDD_SYS_CPU 535/169 VDD_SYS_DDR 768/546
RAM 508/7852MB (lfb 1696x4MB) SWAP 0/3926MB (cached 0MB) CPU [23%@1961,off,off,28%@2035,7%@2036,8%@2034] EMC_FREQ 2%@1600 GR3D_FREQ 0%@114 APE 150 PLL@40.5C MCPU@40.5C PMIC@100C Tboard@37C GPU@38.5C BCPU@40.5C thermal@39.9C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 918/775 VDD_4V0_WIFI 0/0 VDD_IN 4324/3104 VDD_SYS_CPU 535/181 VDD_SYS_DDR 902/558
RAM 663/7852MB (lfb 1634x4MB) SWAP 0/3926MB (cached 0MB) CPU [33%@1537,off,off,9%@1727,5%@1728,6%@1727] EMC_FREQ 2%@1600 GR3D_FREQ 0%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 918/780 VDD_4V0_WIFI 0/0 VDD_IN 4248/3142 VDD_SYS_CPU 535/193 VDD_SYS_DDR 902/570
RAM 1032/7852MB (lfb 1527x4MB) SWAP 0/3926MB (cached 0MB) CPU [77%@2036,off,off,2%@2035,0%@2036,0%@2036] EMC_FREQ 5%@1600 GR3D_FREQ 0%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 994/787 VDD_4V0_WIFI 0/0 VDD_IN 4782/3195 VDD_SYS_CPU 917/217 VDD_SYS_DDR 1152/589
RAM 2357/7852MB (lfb 1196x4MB) SWAP 0/3926MB (cached 0MB) CPU [100%@2035,off,off,0%@2035,0%@2037,0%@2036] EMC_FREQ 9%@1600 GR3D_FREQ 0%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40.2C Tdiode@38.5C VDD_SYS_GPU 229/229 VDD_SYS_SOC 994/793 VDD_4V0_WIFI 0/0 VDD_IN 5049/3253 VDD_SYS_CPU 994/241 VDD_SYS_DDR 1324/612
RAM 2358/7852MB (lfb 1185x4MB) SWAP 0/3926MB (cached 0MB) CPU [90%@959,off,off,0%@960,0%@959,0%@960] EMC_FREQ 13%@1331 GR3D_FREQ 0%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40.2C Tdiode@38.5C VDD_SYS_GPU 229/229 VDD_SYS_SOC 994/799 VDD_4V0_WIFI 0/0 VDD_IN 4820/3300 VDD_SYS_CPU 841/259 VDD_SYS_DDR 1171/629
RAM 2389/7852MB (lfb 1148x4MB) SWAP 0/3926MB (cached 0MB) CPU [55%@1728,off,off,1%@1728,0%@1728,0%@1729] EMC_FREQ 6%@1600 GR3D_FREQ 2%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40.2C Tdiode@38.75C VDD_SYS_GPU 229/229 VDD_SYS_SOC 918/803 VDD_4V0_WIFI 0/0 VDD_IN 4324/3330 VDD_SYS_CPU 611/269 VDD_SYS_DDR 902/637
RAM 2448/7852MB (lfb 1103x4MB) SWAP 0/3926MB (cached 0MB) CPU [44%@1970,off,off,14%@2035,0%@2035,1%@2035] EMC_FREQ 5%@1600 GR3D_FREQ 1%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40.2C Tdiode@38.5C VDD_SYS_GPU 229/229 VDD_SYS_SOC 918/806 VDD_4V0_WIFI 0/0 VDD_IN 4363/3360 VDD_SYS_CPU 611/279 VDD_SYS_DDR 921/645
RAM 2508/7852MB (lfb 1059x4MB) SWAP 0/3926MB (cached 0MB) CPU [57%@2034,off,off,3%@2035,0%@2035,0%@2035] EMC_FREQ 4%@1600 GR3D_FREQ 8%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40.2C Tdiode@38.5C VDD_SYS_GPU 229/229 VDD_SYS_SOC 918/809 VDD_4V0_WIFI 0/0 VDD_IN 4324/3387 VDD_SYS_CPU 611/288 VDD_SYS_DDR 921/652
RAM 1492/7852MB (lfb 1237x4MB) SWAP 0/3926MB (cached 0MB) CPU [56%@2033,off,off,21%@2032,5%@2033,0%@2033] EMC_FREQ 4%@1600 GR3D_FREQ 0%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40.2C Tdiode@38.5C VDD_SYS_GPU 229/229 VDD_SYS_SOC 918/812 VDD_4V0_WIFI 0/0 VDD_IN 4286/3411 VDD_SYS_CPU 764/301 VDD_SYS_DDR 902/659
RAM 635/7852MB (lfb 1440x4MB) SWAP 0/3926MB (cached 0MB) CPU [1%@345,off,off,8%@345,12%@345,0%@345] EMC_FREQ 3%@1331 GR3D_FREQ 0%@114 APE 150 PLL@40.5C MCPU@40.5C PMIC@100C Tboard@37C GPU@39C BCPU@40.5C thermal@39.9C Tdiode@38.5C VDD_SYS_GPU 229/229 VDD_SYS_SOC 841/813 VDD_4V0_WIFI 0/0 VDD_IN 3214/3406 VDD_SYS_CPU 306/301 VDD_SYS_DDR 595/657
RAM 636/7852MB (lfb 1440x4MB) SWAP 0/3926MB (cached 0MB) CPU [1%@345,off,off,0%@345,0%@345,0%@345] EMC_FREQ 3%@1331 GR3D_FREQ 0%@114 APE 150 PLL@40.5C MCPU@40.5C PMIC@100C Tboard@37C GPU@39C BCPU@40.5C thermal@39.9C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 765/812 VDD_4V0_WIFI 0/0 VDD_IN 3024/3396 VDD_SYS_CPU 153/297 VDD_SYS_DDR 537/654
RAM 635/7852MB (lfb 1440x4MB) SWAP 0/3926MB (cached 0MB) CPU [0%@345,off,off,0%@345,0%@345,0%@345] EMC_FREQ 3%@1331 GR3D_FREQ 0%@114 APE 150 PLL@40.5C MCPU@40.5C PMIC@100C Tboard@37C GPU@39C BCPU@40.5C thermal@39.9C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 765/810 VDD_4V0_WIFI 0/0 VDD_IN 3024/3387 VDD_SYS_CPU 153/294 VDD_SYS_DDR 537/651
RAM 635/7852MB (lfb 1440x4MB) SWAP 0/3926MB (cached 0MB) CPU [0%@345,off,off,0%@345,0%@345,0%@345] EMC_FREQ 2%@1331 GR3D_FREQ 0%@114 APE 150 PLL@40.5C MCPU@40.5C PMIC@100C Tboard@37C GPU@39C BCPU@40.5C thermal@39.9C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 765/809 VDD_4V0_WIFI 0/0 VDD_IN 3024/3378 VDD_SYS_CPU 153/290 VDD_SYS_DDR 537/649
RAM 635/7852MB (lfb 1440x4MB) SWAP 0/3926MB (cached 0MB) CPU [0%@345,off,off,0%@345,0%@345,0%@345] EMC_FREQ 2%@1331 GR3D_FREQ 0%@114 APE 150 PLL@40.5C MCPU@40.5C PMIC@100C Tboard@37C GPU@38.5C BCPU@40.5C thermal@39.9C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 765/808 VDD_4V0_WIFI 0/0 VDD_IN 3024/3369 VDD_SYS_CPU 153/287 VDD_SYS_DDR 537/646
RAM 635/7852MB (lfb 1440x4MB) SWAP 0/3926MB (cached 0MB) CPU [1%@345,off,off,0%@345,0%@345,0%@345] EMC_FREQ 2%@1331 GR3D_FREQ 0%@114 APE 150 PLL@40.5C MCPU@40.5C PMIC@100C Tboard@37C GPU@38.5C BCPU@40.5C thermal@39.7C Tdiode@38.5C VDD_SYS_GPU 229/229 VDD_SYS_SOC 765/807 VDD_4V0_WIFI 0/0 VDD_IN 3024/3361 VDD_SYS_CPU 153/284 VDD_SYS_DDR 537/643

When I set ORT_TENSORRT_MAX_WORKSPACE=4294967296 on the Jetson TX2, I run into Cuda Error in free: 4 (unspecified launch failure). My tegrastats output is as follows:

RAM 311/7852MB (lfb 1809x4MB) SWAP 0/3926MB (cached 0MB) CPU [0%@345,off,off,0%@345,0%@345,0%@345] EMC_FREQ 2%@1331 GR3D_FREQ 0%@114 APE 150 PLL@40.5C MCPU@40.5C PMIC@100C Tboard@37C GPU@38.5C BCPU@40.5C thermal@39.7C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 765/765 VDD_4V0_WIFI 0/0 VDD_IN 3024/3024 VDD_SYS_CPU 153/153 VDD_SYS_DDR 537/537
RAM 328/7852MB (lfb 1792x4MB) SWAP 0/3926MB (cached 0MB) CPU [44%@1428,off,off,1%@1585,2%@1573,4%@1574] EMC_FREQ 1%@1600 GR3D_FREQ 0%@114 APE 150 PLL@40.5C MCPU@40.5C PMIC@100C Tboard@37C GPU@38.5C BCPU@40.5C thermal@39.7C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 841/777 VDD_4V0_WIFI 0/0 VDD_IN 4095/3202 VDD_SYS_CPU 535/216 VDD_SYS_DDR 806/581
RAM 518/7852MB (lfb 1695x4MB) SWAP 0/3926MB (cached 0MB) CPU [41%@960,off,off,9%@499,2%@499,3%@498] EMC_FREQ 3%@1600 GR3D_FREQ 0%@114 APE 150 PLL@40.5C MCPU@40.5C PMIC@100C Tboard@37C GPU@38.5C BCPU@40.5C thermal@39.7C Tdiode@38.5C VDD_SYS_GPU 229/229 VDD_SYS_SOC 918/797 VDD_4V0_WIFI 0/0 VDD_IN 4363/3368 VDD_SYS_CPU 535/262 VDD_SYS_DDR 940/633
RAM 708/7852MB (lfb 1617x4MB) SWAP 0/3926MB (cached 0MB) CPU [42%@2036,off,off,5%@2035,2%@2036,9%@2034] EMC_FREQ 3%@1600 GR3D_FREQ 0%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@38.5C BCPU@41C thermal@39.7C Tdiode@38.25C VDD_SYS_GPU 229/229 VDD_SYS_SOC 918/812 VDD_4V0_WIFI 0/0 VDD_IN 4286/3483 VDD_SYS_CPU 611/305 VDD_SYS_DDR 902/666
RAM 1233/7852MB (lfb 1476x4MB) SWAP 0/3926MB (cached 0MB) CPU [85%@2034,off,off,1%@2035,0%@2035,0%@2035] EMC_FREQ 6%@1600 GR3D_FREQ 0%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@38.5C BCPU@41C thermal@40.2C Tdiode@38.5C VDD_SYS_GPU 229/229 VDD_SYS_SOC 994/832 VDD_4V0_WIFI 0/0 VDD_IN 4858/3635 VDD_SYS_CPU 917/373 VDD_SYS_DDR 1190/724
RAM 2394/7852MB (lfb 1186x4MB) SWAP 0/3926MB (cached 0MB) CPU [100%@2036,off,off,0%@2036,0%@2037,0%@2035] EMC_FREQ 10%@1600 GR3D_FREQ 0%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40.3C Tdiode@38.75C VDD_SYS_GPU 229/229 VDD_SYS_SOC 994/849 VDD_4V0_WIFI 0/0 VDD_IN 5126/3784 VDD_SYS_CPU 1070/443 VDD_SYS_DDR 1344/786
RAM 2363/7852MB (lfb 1188x4MB) SWAP 0/3926MB (cached 0MB) CPU [75%@2028,off,off,19%@2031,0%@2032,0%@2033] EMC_FREQ 10%@1600 GR3D_FREQ 0%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40.5C Tdiode@38.75C VDD_SYS_GPU 229/229 VDD_SYS_SOC 917/855 VDD_4V0_WIFI 0/0 VDD_IN 4667/3865 VDD_SYS_CPU 917/486 VDD_SYS_DDR 1094/814
RAM 2400/7852MB (lfb 1141x4MB) SWAP 0/3926MB (cached 0MB) CPU [46%@1966,off,off,4%@2034,1%@2034,0%@2034] EMC_FREQ 7%@1600 GR3D_FREQ 28%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40.2C Tdiode@38.5C VDD_SYS_GPU 229/229 VDD_SYS_SOC 918/860 VDD_4V0_WIFI 0/0 VDD_IN 4324/3903 VDD_SYS_CPU 611/496 VDD_SYS_DDR 921/823
RAM 2452/7852MB (lfb 1098x4MB) SWAP 0/3926MB (cached 0MB) CPU [56%@2034,off,off,1%@2035,1%@2035,0%@2035] EMC_FREQ 5%@1600 GR3D_FREQ 4%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40C Tdiode@38.75C VDD_SYS_GPU 229/229 VDD_SYS_SOC 918/864 VDD_4V0_WIFI 0/0 VDD_IN 4363/3938 VDD_SYS_CPU 611/505 VDD_SYS_DDR 921/831
RAM 2511/7852MB (lfb 1056x4MB) SWAP 0/3926MB (cached 0MB) CPU [34%@1514,off,off,17%@1513,1%@1515,9%@1512] EMC_FREQ 4%@1600 GR3D_FREQ 9%@114 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40.2C Tdiode@38.5C VDD_SYS_GPU 229/229 VDD_SYS_SOC 918/868 VDD_4V0_WIFI 0/0 VDD_IN 4248/3960 VDD_SYS_CPU 535/507 VDD_SYS_DDR 902/836
RAM 2621/7852MB (lfb 1020x4MB) SWAP 0/3926MB (cached 0MB) CPU [28%@345,off,off,0%@345,5%@345,11%@345] EMC_FREQ 5%@1600 GR3D_FREQ 94%@930 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@41C BCPU@41C thermal@40.2C Tdiode@39.5C VDD_SYS_GPU 1834/336 VDD_SYS_SOC 917/871 VDD_4V0_WIFI 0/0 VDD_IN 5623/4071 VDD_SYS_CPU 382/499 VDD_SYS_DDR 1017/848
RAM 2653/7852MB (lfb 1012x4MB) SWAP 0/3926MB (cached 0MB) CPU [5%@1804,off,off,1%@2034,13%@2029,31%@2034] EMC_FREQ 13%@1600 GR3D_FREQ 99%@1134 APE 150 PLL@41.5C MCPU@41.5C PMIC@100C Tboard@37C GPU@40.5C BCPU@41.5C thermal@41.4C Tdiode@41.75C VDD_SYS_GPU 3437/529 VDD_SYS_SOC 993/879 VDD_4V0_WIFI 0/0 VDD_IN 7418/4280 VDD_SYS_CPU 458/496 VDD_SYS_DDR 1344/879
RAM 710/7852MB (lfb 1439x4MB) SWAP 0/3926MB (cached 0MB) CPU [37%@499,off,off,7%@498,12%@499,22%@498] EMC_FREQ 7%@1600 GR3D_FREQ 0%@1134 APE 150 PLL@41.5C MCPU@41.5C PMIC@100C Tboard@37C GPU@39.5C BCPU@41.5C thermal@41C Tdiode@39.25C VDD_SYS_GPU 382/521 VDD_SYS_SOC 918/881 VDD_4V0_WIFI 0/0 VDD_IN 4248/4278 VDD_SYS_CPU 611/503 VDD_SYS_DDR 921/881
RAM 710/7852MB (lfb 1439x4MB) SWAP 0/3926MB (cached 0MB) CPU [2%@345,off,off,0%@345,0%@345,0%@345] EMC_FREQ 4%@1600 GR3D_FREQ 0%@1134 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40.2C Tdiode@38.75C VDD_SYS_GPU 229/504 VDD_SYS_SOC 841/879 VDD_4V0_WIFI 0/0 VDD_IN 3444/4232 VDD_SYS_CPU 229/488 VDD_SYS_DDR 806/877
RAM 710/7852MB (lfb 1439x4MB) SWAP 0/3926MB (cached 0MB) CPU [0%@345,off,off,0%@345,0%@345,0%@345] EMC_FREQ 3%@1600 GR3D_FREQ 0%@1134 APE 150 PLL@41C MCPU@41C PMIC@100C Tboard@37C GPU@39C BCPU@41C thermal@40.2C Tdiode@38.75C VDD_SYS_GPU 153/486 VDD_SYS_SOC 841/877 VDD_4V0_WIFI 0/0 VDD_IN 3367/4186 VDD_SYS_CPU 153/470 VDD_SYS_DDR 787/872
RAM 710/7852MB (lfb 1439x4MB) SWAP 0/3926MB (cached 0MB) CPU [1%@345,off,off,0%@345,0%@345,0%@345] EMC_FREQ 2%@1600 GR3D_FREQ 0%@1134 APE 150 PLL@40.5C MCPU@40.5C PMIC@100C Tboard@37C GPU@39C BCPU@40.5C thermal@40.2C Tdiode@38.75C VDD_SYS_GPU 153/469 VDD_SYS_SOC 841/875 VDD_4V0_WIFI 0/0 VDD_IN 3367/4145 VDD_SYS_CPU 153/454 VDD_SYS_DDR 787/868

Neither of the tegrastats outputs suggests particularly high memory usage, although I guess it's quite possible that it is missing the peaks.
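
To make the logs above easier to read, here is a rough sketch that pulls the peak RAM and peak GPU load out of tegrastats lines. The function name summarize is hypothetical, and the regular expressions assume the "RAM used/totalMB" and "GR3D_FREQ n%" fields appear exactly as in the output pasted above. It can only report the highest sampled value, so a short spike between two samples would still be missed.

```python
import re

# Field layouts assumed from the tegrastats output pasted in this thread.
RAM_RE = re.compile(r"RAM (\d+)/(\d+)MB")
GPU_RE = re.compile(r"GR3D_FREQ (\d+)%")


def summarize(lines):
    """Return sample count, peak RAM (MB), total RAM (MB) and peak GPU load (%)."""
    used, gpu, total = [], [], None
    for line in lines:
        ram = RAM_RE.search(line)
        if ram is None:
            continue  # not a tegrastats sample line
        used.append(int(ram.group(1)))
        total = int(ram.group(2))
        load = GPU_RE.search(line)
        if load is not None:
            gpu.append(int(load.group(1)))
    return {
        "samples": len(used),
        "peak_ram_mb": max(used) if used else None,
        "total_ram_mb": total,
        "peak_gpu_pct": max(gpu) if gpu else None,
    }


# Three lines from the second log above, cut off after the GR3D_FREQ field.
SAMPLE = [
    "RAM 311/7852MB (lfb 1809x4MB) SWAP 0/3926MB (cached 0MB) CPU [0%@345,off,off,0%@345,0%@345,0%@345] EMC_FREQ 2%@1331 GR3D_FREQ 0%@114",
    "RAM 2653/7852MB (lfb 1012x4MB) SWAP 0/3926MB (cached 0MB) CPU [5%@1804,off,off,1%@2034,13%@2029,31%@2034] EMC_FREQ 13%@1600 GR3D_FREQ 99%@1134",
    "RAM 710/7852MB (lfb 1439x4MB) SWAP 0/3926MB (cached 0MB) CPU [37%@499,off,off,7%@498,12%@499,22%@498] EMC_FREQ 7%@1600 GR3D_FREQ 0%@1134",
]

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

On those three lines the sampled peak is 2653 MB out of 7852 MB, with the GPU at 99% just before the process exits.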

My coworker has managed to build the same version of ONNX Runtime on his desktop (also with TensorRT 6) and perform inference with the same model using the TensorRT execution provider. It therefore seems that this issue is Jetson-specific. Could it be related to the RAM being shared between the CPU and the GPU?

Hi

Are you using Jetson Xavier?
If yes, it looks like the GPU architecture in your build is incorrect.

Please apply the following change and build from source again.

Modify cmake/CMakeLists.txt
    -  set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -gencode=arch=compute_50,code=sm_50") # M series
    +  set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -gencode=arch=compute_72,code=sm_72") # Jetson support
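
Note that the sm_72 value above is for Xavier. As a small sketch, the flag for the other boards mentioned in this thread can be derived from their CUDA compute capability (5.3 for Nano, 6.2 for TX2, 7.2 for Xavier). The dictionary and the function name gencode_flag are hypothetical helpers for illustration, not part of the ONNX Runtime build scripts.

```python
# CUDA compute capability per Jetson module, written without the dot
# as it appears in nvcc's -gencode option.
JETSON_COMPUTE_CAPABILITY = {
    "nano": "53",
    "tx2": "62",
    "xavier": "72",
}


def gencode_flag(device: str) -> str:
    """Build the -gencode option to append to CMAKE_CUDA_FLAGS for a board."""
    cc = JETSON_COMPUTE_CAPABILITY[device.lower()]
    return f"-gencode=arch=compute_{cc},code=sm_{cc}"


if __name__ == "__main__":
    for name in JETSON_COMPUTE_CAPABILITY:
        print(name, gencode_flag(name))
```

For a TX2 build this yields -gencode=arch=compute_62,code=sm_62, which would take the place of the sm_72 value in the CMakeLists.txt line above.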

Thanks.