Tlt-infer is slow

Hi,
I am learning how to use the Transfer Learning Toolkit, so I built the Docker environment using nvcr.io/nvidia/tlt-streamanalytics.
I am running the example detectnet_v2.ipynb in “/workspace/examples/detectnet_v2”. When I ran step 8, “Visualize inferences”, I found that it is very slow and my GPU utilization is close to 0. (It takes about 1.5 hours to finish the inference task.)

!tlt-infer detectnet_v2 -e $SPECS_DIR/detectnet_v2_inference_kitti_tlt.txt \
                        -o $USER_EXPERIMENT_DIR/tlt_infer_testing \
                        -i $DATA_DOWNLOAD_DIR/testing/image_2 \
                        -k $KEY

But when I run step 7, “Evaluate the retrained model”, everything works well (GPU utilization is about 30%~40%).

!tlt-evaluate detectnet_v2 -e $SPECS_DIR/detectnet_v2_retrain_resnet18_kitti.txt \
                           -m $USER_EXPERIMENT_DIR/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt \
                           -k $KEY

Can you give me some suggestions on how to improve the speed of tlt-infer?

hardware:
RAM: 64GB
GPU: NVIDIA TITAN RTX
nvidia-driver: 450.51.05

How many images are in your $DATA_DOWNLOAD_DIR/testing/image_2?
Also, if there are many objects in your images, the inference time is expected to be longer.

There are 7518 images for testing. I don’t think the slow inference is caused by my test data, because my GPU utilization is close to 0; it should be higher.
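
For reference, utilization can be watched live while tlt-infer runs with plain nvidia-smi (a generic sketch, nothing TLT-specific):

# Refresh the full nvidia-smi view once per second
watch -n 1 nvidia-smi

# Or stream just the per-second utilization counters (sm, mem, enc, dec)
nvidia-smi dmon -s u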

Could you try running the default Jupyter notebook to narrow this down? It uses the KITTI dataset for training.

Well, I am running the default Jupyter notebook example now. The problem I met is described above. I didn’t change the dataset (the KITTI dataset) or the model’s spec file; I used the default settings for training, evaluating, and testing.

Can you share the full log when you run the following?

# Running inference for detection on n images
!tlt-infer detectnet_v2 -e $SPECS_DIR/detectnet_v2_inference_kitti_tlt.txt \
                        -o $USER_EXPERIMENT_DIR/tlt_infer_testing \
                        -i $DATA_DOWNLOAD_DIR/testing/image_2 \
                        -k $KEY

Using TensorFlow backend.
2020-09-15 05:27:32.371960: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-09-15 05:27:35,920 [INFO] iva.detectnet_v2.scripts.inference: Overlain images will be saved in the output path.
2020-09-15 05:27:35,920 [INFO] iva.detectnet_v2.inferencer.build_inferencer: Constructing inferencer
2020-09-15 05:27:35.926105: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-09-15 05:27:35.928934: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:86:00.0
2020-09-15 05:27:35.928998: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-09-15 05:27:35.929096: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-09-15 05:27:35.931737: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-09-15 05:27:35.931871: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-09-15 05:27:35.935717: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-09-15 05:27:35.938657: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-09-15 05:27:35.938791: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-09-15 05:27:35.942760: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-09-15 05:27:35.942808: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-09-15 05:27:36.814358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-15 05:27:36.814410: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0 
2020-09-15 05:27:36.814419: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N 
2020-09-15 05:27:36.818009: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22242 MB memory) -> physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:86:00.0, compute capability: 7.5)
2020-09-15 05:27:36,822 [INFO] iva.detectnet_v2.inferencer.tlt_inferencer: Loading model from /workspace/tlt-experiments/detectnet_v2/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt:
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 3, 384, 1248)      0         
_________________________________________________________________
model_1 (Model)              [(None, 3, 24, 78), (None 6041719   
=================================================================
Total params: 6,041,719
Trainable params: 6,034,487
Non-trainable params: 7,232
_________________________________________________________________
2020-09-15 05:27:39,739 [INFO] iva.detectnet_v2.scripts.inference: Initialized model
2020-09-15 05:27:39,766 [INFO] iva.detectnet_v2.scripts.inference: Commencing inference
  0%|                                                   | 0/470 [00:00<?, ?it/s]2020-09-15 05:27:40.965787: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-09-15 05:27:42.547878: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
  1%|▎                                        | 4/470 [00:44<1:26:07, 11.09s/it]

The speed looks normal.

1%|▎ | 4/470 [00:44<1:26:07, 11.09s/it]

One batch has 16 images:
7518 / 470 ≈ 15.995 --> 16

16 images / 11.09 s ≈ 1.44 FPS

Is it really that slow? Why does it run so slowly? Is it drawing bboxes on the images or doing something else? Why is my GPU utilization so low?

The tlt-infer tool produces two outputs.

  1. Overlain images in $USER_EXPERIMENT_DIR/tlt_infer_testing/images_annotated
  2. Frame-by-frame bbox labels in KITTI format, located in $USER_EXPERIMENT_DIR/tlt_infer_testing/labels
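
To confirm both outputs are being written while the job runs, you can list them from inside the container (a sketch using the notebook’s environment variables; the label file name below is hypothetical):

# Count the annotated images and the KITTI-format label files produced so far
ls $USER_EXPERIMENT_DIR/tlt_infer_testing/images_annotated | wc -l
ls $USER_EXPERIMENT_DIR/tlt_infer_testing/labels | wc -l

# Peek at one label file; one line per detected object
head -n 3 $USER_EXPERIMENT_DIR/tlt_infer_testing/labels/000000.txt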

@Morganh and @xwater8

Curious what made you accept the last comment as the solution.
I am actually seeing the exact same speed when following the same tutorial.
The speed of the FP16 model is:

29%|███████████▊ | 135/470 [22:53<56:47, 10.17s/it]

And based on the output of

nvidia-smi

this is what I am seeing (please see the attached image):

Did you just accept the fact that the inference is slow?
I would appreciate your feedback.

Thanks!

See Probleme with training/pruning tlt

The tlt-infer tool will draw bboxes on the images and also write label files, so its wall-clock time does not represent the model’s inference time.

To measure the actual inference time, you can run trtexec.
Reference: Measurement model speed
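
For example, after exporting the model with tlt-export and building a TensorRT engine with tlt-converter, something like the following times the engine alone, with no image decoding or bbox drawing (a sketch; the engine path and batch size here are assumptions, and the exact flags depend on your TensorRT version):

# Benchmark only the TensorRT engine (hypothetical path), averaging over repeated runs
trtexec --loadEngine=$USER_EXPERIMENT_DIR/experiment_dir_final/resnet18_detector.engine \
        --batch=16 --iterations=100 --avgRuns=10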

@Morganh
Saw that exact link just now. Will check that when I run it on TX2.

Appreciate it