Out of memory when evaluating a model in TAO


I am trying to run an evaluation command using detectnet_v2 in TLT 3.0 on a DGX with A100-SXM4-40GB GPUs.

detectnet_v2 evaluate \
     -e /workspace/trafficcamnet/specs_data_ds_tlt_v3/trafficcamnet_train_test.txt \
     -m /workspace/trafficcamnet/experiment_dir_unpruned/model.step-12360.tlt

inside a TAO container.

This is my spec file:
spec_trafficcamnet_train.txt.txt (7.4 KB)

I got an OOM (out of memory) error and I don’t know why. Also, why is TLT using all of my GPU memory?

error output:
error_oom_evaluate_detectnet_v2.txt (30.8 KB)

gpu memory:

Thanks in advance.

Could you try running the evaluation on another GPU by adding
--gpu_index 1

Sometimes it works and sometimes it doesn’t, on either GPU 0 or GPU 1. It’s so weird.

Do you think the problem is that other people are running processes on that GPU at the same time? Before running the detectnet_v2 evaluation I always check in nvidia-smi that the GPU usage is 0, so that shouldn’t be the problem, right?
Or is there a way to clean up the GPU memory to make sure nobody else is using it?
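The manual nvidia-smi check described above can be scripted so that it runs immediately before the evaluation. The following is only a sketch: the helper names (`parse_gpu_memory`, `busy_gpus`, `read_gpu_memory`) and the 100 MiB threshold are made up for illustration, and the only external dependency is nvidia-smi's own `--query-gpu` CSV output. Note that it compares memory in use, not utilization: a GPU can report 0% utilization while another process still holds most of its memory.

```python
import subprocess

# nvidia-smi query that prints one "index, used, total" row per GPU, in MiB.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Parse 'index, used, total' rows (MiB) into a list of dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (field.strip() for field in line.split(","))
        gpus.append(
            {"index": int(index), "used_mib": int(used), "total_mib": int(total)}
        )
    return gpus


def busy_gpus(gpus, threshold_mib=100):
    """Return the indices of GPUs with more than threshold_mib already in use.

    The threshold is an arbitrary illustrative value, not an NVIDIA default.
    """
    return [gpu["index"] for gpu in gpus if gpu["used_mib"] > threshold_mib]


def read_gpu_memory():
    """Run nvidia-smi and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)
```

Calling `busy_gpus(read_gpu_memory())` right before `detectnet_v2 evaluate` gives a list of GPUs to avoid. This is still only a snapshot: on a shared machine another job can allocate memory between the check and the start of the evaluation, which would be consistent with a failure that comes and goes.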

Can you open a new terminal and run nvidia-smi outside the Docker container?
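Running the check on the host matters because nvidia-smi inside a container may not list processes that belong to other containers or users, so the process table can look empty while memory is in use. As a rough sketch of what to look for on the host, the snippet below parses the output of nvidia-smi's `--query-compute-apps` query and sums the memory held per GPU. The function names are hypothetical, and the GPU UUIDs in any sample data are placeholders.

```python
import subprocess
from collections import defaultdict

# One row per compute process: GPU UUID, PID, memory in MiB, process name.
# process_name is requested last so that commas in it do not break parsing.
QUERY = [
    "nvidia-smi",
    "--query-compute-apps=gpu_uuid,pid,used_memory,process_name",
    "--format=csv,noheader,nounits",
]


def parse_compute_apps(csv_text):
    """Parse 'gpu_uuid, pid, used_memory, process_name' rows into dicts."""
    apps = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        gpu_uuid, pid, used, name = (f.strip() for f in line.split(",", 3))
        apps.append(
            {
                "gpu_uuid": gpu_uuid,
                "pid": int(pid),
                # nvidia-smi can print a placeholder such as [N/A] here.
                "used_mib": int(used) if used.isdigit() else None,
                "name": name,
            }
        )
    return apps


def memory_by_gpu(apps):
    """Sum the known per-process memory (MiB) for each GPU UUID."""
    totals = defaultdict(int)
    for app in apps:
        if app["used_mib"] is not None:
            totals[app["gpu_uuid"]] += app["used_mib"]
    return dict(totals)


def read_compute_apps():
    """Run nvidia-smi on the host and return the parsed process list."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_compute_apps(result.stdout)
```

If `memory_by_gpu(read_compute_apps())` on the host shows memory held on a GPU that looked idle from inside the container, another job is sharing that device.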

In theory, nobody is using…

I am only using devices 0 and 1

Previously, how did you launch the TAO docker?

I’m using docker-compose

version: "3.9"
                runtime: nvidia
                        context: tlt
                        dockerfile: Dockerfile_tlt
                shm_size: 1g
                ipc: host
                stdin_open: true
                tty: true
                       memlock: -1
                       stack: 67108864
                        - NVIDIA_VISIBLE_DEVICES=${VISIBLE_GPUS}
                        - minio_data_tlt:/workspace/data_ds_tlt_v3 # data 
                        - ${MODELS_SPECS_PATH}:/workspace/trafficcamnet # models


FROM nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3

docker-compose up --build -d

Could you follow the “TAO Toolkit Launcher” page of the TAO Toolkit 3.21.11 documentation and retry? Thanks.
