leo2105
February 16, 2022, 8:23pm
1
Hi
I am trying to run an evaluation command using detectnet_v2 in TLT 3.0 on a DGX with A100-SXM4-40GB GPUs.
detectnet_v2 evaluate \
  -e /workspace/trafficcamnet/specs_data_ds_tlt_v3/trafficcamnet_train_test.txt \
  -m /workspace/trafficcamnet/experiment_dir_unpruned/model.step-12360.tlt
inside a TAO container.
This is my spec file:
spec_trafficcamnet_train.txt.txt (7.4 KB)
So, I got this OOM (out of memory) error, and I don’t know why. And another thing: why is TLT using all my GPU memory?
error output:
error_oom_evaluate_detectnet_v2.txt (30.8 KB)
gpu memory:
Thanks in advance.
Morganh
February 17, 2022, 1:45am
3
Could you try running evaluation on another GPU by adding
--gpu_index 1
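To make the suggestion concrete, here is a minimal sketch of the full command with the flag added. The spec and model paths are the ones from this thread; the `build_evaluate_cmd` helper is just for illustration, and `CUDA_VISIBLE_DEVICES` shown as an alternative is standard CUDA device masking, not something TLT-specific:

```python
import os
import shlex

# Paths taken from this thread; --gpu_index selects which visible GPU
# TLT runs the evaluation on.
SPEC = "/workspace/trafficcamnet/specs_data_ds_tlt_v3/trafficcamnet_train_test.txt"
MODEL = "/workspace/trafficcamnet/experiment_dir_unpruned/model.step-12360.tlt"

def build_evaluate_cmd(gpu_index):
    """Illustrative helper: assemble the evaluate command as an argv list."""
    return [
        "detectnet_v2", "evaluate",
        "-e", SPEC,
        "-m", MODEL,
        "--gpu_index", str(gpu_index),
    ]

# Alternative: mask devices at the CUDA level so the process can only
# see one GPU at all (an assumption worth testing in your container).
env = {**os.environ, "CUDA_VISIBLE_DEVICES": "1"}

print(shlex.join(build_evaluate_cmd(1)))
```

Either approach pins the evaluation to one device; the difference is that `--gpu_index` chooses among the GPUs the process can see, while `CUDA_VISIBLE_DEVICES` hides the others entirely.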
leo2105
February 17, 2022, 3:06pm
4
Sometimes it works on GPU 0 or 1, but sometimes it doesn’t. It’s so weird.
leo2105
February 17, 2022, 3:45pm
5
Do you think the problem is that other people are running processes on that GPU at the same time? But before running the detectnet_v2 evaluation,
I always check with nvidia-smi that GPU usage is 0, so that shouldn’t be the problem, right?
Or is there a way to clean up the GPU memory to ensure that nobody is using it?
Morganh
February 17, 2022, 3:49pm
6
Can you open a new terminal to run nvidia-smi outside the docker?
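One way to make that check scriptable is to query nvidia-smi for the processes holding GPU memory. A sketch, assuming `nvidia-smi` is on the PATH (the query flags are real nvidia-smi options, but the helper names are just for illustration); note that inside a container nvidia-smi may report memory as used without listing the owning host processes, which is one reason checking from outside the docker is more reliable:

```python
import csv
import io
import subprocess

def parse_compute_apps(csv_text):
    """Parse the CSV emitted by:
    nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
    Returns a list of (pid, process_name, used_memory) tuples; an empty
    list means no listed process currently holds GPU memory.
    """
    rows = list(csv.reader(io.StringIO(csv_text.strip())))
    return [tuple(cell.strip() for cell in row) for row in rows[1:]]

def gpu_is_free():
    """Run nvidia-smi and report whether any compute process is listed."""
    out = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv"],
        capture_output=True, text=True, check=True,
    ).stdout
    return not parse_compute_apps(out)

# Demo on sample output (values are made up for illustration):
SAMPLE = """pid, process_name, used_memory [MiB]
12345, python, 39000 MiB
"""
print(parse_compute_apps(SAMPLE))  # → [('12345', 'python', '39000 MiB')]
```

Running `gpu_is_free()` both on the host and inside the container, right before launching the evaluation, would show whether memory is being held by a process the container cannot see.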
leo2105
February 17, 2022, 4:08pm
7
In theory, nobody is using them…
I am only using devices 0 and 1.
Morganh
February 17, 2022, 4:21pm
8
Previously, how did you launch the TAO docker?
leo2105
February 17, 2022, 4:48pm
9
I’m using docker-compose:
version: "3.9"
services:
  tlt:
    runtime: nvidia
    build:
      context: tlt
      dockerfile: Dockerfile_tlt
    shm_size: 1g
    ipc: host
    stdin_open: true
    tty: true
    ulimits:
      memlock: -1
      stack: 67108864
    environment:
      - NVIDIA_VISIBLE_DEVICES=${VISIBLE_GPUS}
    volumes:
      - minio_data_tlt:/workspace/data_ds_tlt_v3 # data
      - ${MODELS_SPECS_PATH}:/workspace/trafficcamnet # models
Dockerfile:
FROM nvcr.io/nvidia/tlt-streamanalytics:v3.0-py3
And I launch it with:
VISIBLE_GPUS=0,1 docker-compose up --build -d
Morganh
February 18, 2022, 5:15am
10
system
Closed
March 15, 2022, 3:06am
13
This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.