Converted the OSNet model (2.2M) on a Jetson Orin Nano 4G with workmemory=256 — why does inference use over 1 GB of memory?

I am currently running inference on the OSNet model (2.2M) using a Jetson Orin Nano 4G device. Despite setting the workmemory parameter to 256, I have observed that the memory usage during inference exceeds 1GB. This behavior seems unexpected given the relatively small size of the model.

Here are the details of my setup and observations:

  • Device: Jetson Orin Nano 4G
  • Model: OSNet (2.2M)
  • Work Memory Setting: workmemory=256
  • Memory Usage During Inference: >1 GB (see the monitoring sketch below)
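
For reference, one way to capture these numbers is to log tegrastats in the background while the benchmark runs. This is only a minimal monitoring sketch; the 1 s interval, the log-file name, and the engine path are illustrative placeholders, not part of the actual setup:

# Log RAM/GPU statistics to a file once per second (interval and path are illustrative)
sudo tegrastats --interval 1000 --logfile osnet_mem.log &
# Run the inference benchmark in the foreground (engine path is a placeholder)
trtexec --loadEngine=osnet.trt --shapes=images:1x3x128x256 --iterations=1000
# Stop the background tegrastats logger
sudo tegrastats --stop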

Hi,

Some memory is required to load the libraries, e.g. CUDA and cuDNN.

Which inference framework do you use?
If it is TensorRT, could you try running it without cuDNN?

This can be done by setting --tacticSources.

Thanks.

trtexec --onnx=PED_EXT_021.onnx --saveEngine=SKY_OSNet_f16_uSW_w1800_0124.trt --fp16 --memPoolSize=workspace:256 --useSpinWait --tacticSources=-CUDNN

Benchmark

trtexec --loadEngine=SKY_OSNet_f16_uSW_w256.trt --shapes=images:1x3x128x256 --iterations=1000 --verbose

During the benchmarking process, GPU memory usage increased from 850MB to 2.1GB.

Hi, could you tell me a better way to resolve this memory issue?

Hi,

Have you tried loading the engine without cuDNN?
Thanks.

trtexec --loadEngine=SKY_OSNet_f16_uSW_w256.trt --shapes=images:1x3x128x256 --iterations=1000 --verbose --tacticSources=-CUDNN

During the benchmarking process, GPU memory usage still increased from 850MB to 2.1GB.

Hi,

Could you share the OSNet ONNX model with us so we can check it?
Which software version are you using? Is it JetPack 6.2?
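
If it helps, the installed release can be checked directly on the device; a quick sketch using standard JetPack commands (not taken from this thread):

# Print the L4T release string
cat /etc/nv_tegra_release
# Show the installed JetPack meta-package version, if present
dpkg -l | grep nvidia-jetpack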

Thanks.

Hi,

Could you test it with the latest JetPack 6.2 release?
We ran the model on JetPack 6.2 / TensorRT 10.3 and the memory increased by ~150 MB.

Before

02-27-2025 05:16:22 RAM 2842/7620MB (lfb 19x4MB) SWAP 11/3810MB (cached 0MB) ...

Ongoing

02-27-2025 05:16:26 RAM 2992/7620MB (lfb 6x4MB) SWAP 11/3810MB (cached 0MB) CPU [8%@1510,2%@1510,0%@1510,20%@1510,41%@1510,0%@1510] EMC_FREQ 13%@2133 GR3D_FREQ 95%@[624] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@47.031C soc2@46.468C soc0@43.812C gpu@47.375C tj@47.375C soc1@45.468C VDD_IN 7710mW/6402mW VDD_CPU_GPU_CV 2931mW/2054mW VDD_SOC 1825mW/1676mW ...

After

02-27-2025 05:16:29 RAM 2846/7620MB (lfb 19x4MB) SWAP 11/3810MB (cached 0MB) CPU [14%@1510,11%@1510,0%@1510,0%@1510,1%@1510,1%@1510] EMC_FREQ 7%@2133 GR3D_FREQ 5%@[624] NVDEC off NVJPG off NVJPG1 off VIC off OFA off APE 200 cpu@46.812C soc2@46.312C soc0@44.125C gpu@46.875C tj@46.875C soc1@45.406C VDD_IN 5214mW/6560mW VDD_CPU_GPU_CV 1271mW/2160mW VDD_SOC 1512mW/1693mW ...

CUDA 11.8 introduced a lazy loading feature, which can reduce memory usage.
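
A minimal sketch of enabling lazy loading when running the engine; CUDA_MODULE_LOADING is the standard switch for this feature, and whether it actually reduces memory in your case would need to be verified:

# Enable CUDA lazy module loading so kernels are loaded on first use rather than at context creation
export CUDA_MODULE_LOADING=LAZY
trtexec --loadEngine=SKY_OSNet_f16_uSW_w256.trt --shapes=images:1x3x128x256 --iterations=1000 --tacticSources=-CUDNN
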
Thanks.