Problem with training/pruning TLT

Hello everyone,
I am planning to use YOLOv3 on a Jetson NX for object detection (one class for now).
After the training step, I'm getting odd results from pruning; here are my logs:

2020-09-14 07:59:02.401363: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
Using TensorFlow backend.
2020-09-14 07:59:08.733838: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-09-14 07:59:08.808197: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-14 07:59:08.809463: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: TITAN RTX major: 7 minor: 5 memoryClockRate(GHz): 1.77
pciBusID: 0000:01:00.0
2020-09-14 07:59:08.809513: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-09-14 07:59:08.862914: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-09-14 07:59:08.892049: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-09-14 07:59:08.903280: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-09-14 07:59:08.969313: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-09-14 07:59:09.014495: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-09-14 07:59:09.126092: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-09-14 07:59:09.126417: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-14 07:59:09.128208: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-14 07:59:09.129815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-09-14 07:59:09.130756: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-09-14 07:59:11.048474: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-09-14 07:59:11.048544: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-09-14 07:59:11.048571: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-09-14 07:59:11.048980: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-14 07:59:11.051577: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-14 07:59:11.053667: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-09-14 07:59:11.055275: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 22504 MB memory) -> physical GPU (device: 0, name: TITAN RTX, pci bus id: 0000:01:00.0, compute capability: 7.5)
2020-09-14 07:59:12,577 [INFO] modulus.pruning.pruning: Exploring graph for retainable indices
2020-09-14 07:59:15,553 [INFO] modulus.pruning.pruning: Pruning model and appending pruned nodes to new graph
2020-09-14 08:00:31,284 [INFO] iva.common.magnet_prune: Pruning ratio (pruned model / original model): 0.02986864959372703

Is a ratio of 0.0298 normal?

After retraining my pruned model, I get about 2.5 it/s (batch size 1) on a Titan RTX with FP32, which is really slow for this machine (typical TensorFlow inference was around 10 fps), and it's the same if I choose FP16.
What am I doing wrong?

“0.0298” is the prune ratio. It depends on how much you want to prune. By trying different prune ratios, end users can run experiments to find the best trade-off between mAP and fps.

When you mention 2.5 it/s, is that the result of running tlt-infer?

Yep, but I thought it was really low (it means a lot of the model was pruned, right?).
I kept the initial pruning spec and didn't change anything in the notebook (-pth 0.1).

My training and testing are still made on TITAN RTX.

Yes.

Yes, the model size will be about 2.98% of the unpruned model.
Are you training on the public KITTI dataset or your own dataset? Actually, you do not need to care much about the default pth. By trying different pth values or doing more retraining, end users can run experiments to find the best trade-off between mAP and fps.
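To make that sweep over pth values concrete, here is a minimal sketch that just builds the tlt-prune command line for each threshold. All file paths and the key are hypothetical placeholders, and the flags assume the tlt-prune interface from the TLT 2.0 notebooks (-m, -o, -k, -pth):

```python
# Sketch: build tlt-prune invocations for a sweep of pruning thresholds.
# Paths and the key below are placeholders; adjust them to your setup.
PTH_VALUES = [0.05, 0.1, 0.3, 0.5]

def prune_command(pth,
                  model="yolov3_unpruned.tlt",   # hypothetical path
                  output_dir="pruned",           # hypothetical path
                  key="YOUR_NGC_KEY"):           # placeholder key
    """Return the tlt-prune command line for one threshold value."""
    out = f"{output_dir}/yolov3_pth{pth}.tlt"
    return f"tlt-prune -m {model} -o {out} -k {key} -pth {pth}"

# One command per threshold; run each, then retrain and compare mAP/fps.
for cmd in (prune_command(p) for p in PTH_VALUES):
    print(cmd)
```

Each pruned model still needs its own retraining pass before the mAP/fps comparison is meaningful.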

For the 2.5 it/s, I am checking it. Recently, two customers asked the same question about the low speed of tlt-infer.

With that size I would have thought it would be faster :D

I am using my own dataset (50k annotated images total, about 5 hours per epoch on the Titan RTX).

Thanks. Note that when I go from FP32 to FP16 the inference speed doesn't change; it acts as if it were capped.

It seems you misunderstand what tlt-infer's timing represents. tlt-infer also draws bounding boxes onto images and writes label files,
so its throughput does not reflect the pure inference time.

To check the inference time, you can run trtexec.
Reference: Measurement model speed

Measured this way, FP32 and FP16 should show different inference speeds.
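To compare trtexec numbers with the it/s figures discussed above, it helps to convert the reported per-batch latency into a rate. A small helper, where the sample "mean: … ms" summary lines are made up for illustration and the exact log format is an assumption about typical trtexec output:

```python
import re

def latency_ms_to_fps(latency_ms):
    """Convert a per-batch latency in milliseconds to iterations per second."""
    return 1000.0 / latency_ms

def parse_mean_latency(trtexec_line):
    """Extract the mean latency (in ms) from a trtexec summary line.

    The 'mean: <x> ms' pattern is an assumption based on typical
    trtexec output; adapt the regex to your TensorRT version's log.
    """
    match = re.search(r"mean:?\s*([0-9.]+)\s*ms", trtexec_line)
    if match is None:
        raise ValueError("no mean latency found in line")
    return float(match.group(1))

# Hypothetical summary lines for the same engine built in FP32 vs FP16:
fp32_line = "GPU Compute mean: 25.0 ms"
fp16_line = "GPU Compute mean: 12.5 ms"
print(latency_ms_to_fps(parse_mean_latency(fp32_line)))  # 40.0
print(latency_ms_to_fps(parse_mean_latency(fp16_line)))  # 80.0
```

If FP16 shows no speedup at all in trtexec either, that would point to a build or hardware issue rather than to tlt-infer overhead.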

OK, thanks. So what I understand is that tlt-infer throughput does not depend on model quantization or pruning.

I have other questions:
Non-max suppression does not seem to be implemented.
YOLO input size: there is no option like 416x416x3 images or anything similar; where do we change that?
The YOLO spec file specifies output_width & output_height; what are they used for?

Should I open new topics?

NMS is implemented in yolo_v3.
Where did you see "no option like 416x416x3 images"?
For output_width & output_height, see https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#augmentation_module
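For reference, output_width and output_height live in the augmentation_config block of the YOLOv3 training spec; a minimal fragment sketched below, where the field names follow the TLT YOLOv3 spec format and the values are placeholders, not a recommendation:

```
augmentation_config {
  output_width: 416
  output_height: 416
  output_channel: 3
}
```

These set the resolution that input images are resized/augmented to during training.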

OK, thanks, I did not see that the minimum was 480. With the original YOLOv3 and a ResNet backbone, there are only three sizes available: 416x416, 320x320, and 608x608. I did not understand that you allow new values with that adaptation.

For yolo_v3, please see the requirements in the TLT user guide.

YOLOv3

Input size: C * W * H (where C = 1 or 3, W >= 128, H >= 128, W, H are multiples of 32)
Image format: JPG, JPEG, PNG
Label format: KITTI detection
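The input-size constraints above are simple enough to check programmatically before launching a long training run; a small sketch of such a validator:

```python
def is_valid_yolov3_input(c, w, h):
    """Check the YOLOv3 input-size rules quoted from the TLT user guide:
    C is 1 or 3, W and H are at least 128 and multiples of 32."""
    return (c in (1, 3)
            and w >= 128 and h >= 128
            and w % 32 == 0 and h % 32 == 0)

print(is_valid_yolov3_input(3, 416, 416))  # True
print(is_valid_yolov3_input(3, 480, 480))  # True
print(is_valid_yolov3_input(3, 100, 100))  # False (below the 128 minimum)
```

So 416x416, 480x480, and 608x608 are all valid, and any other multiple-of-32 size from 128 upward works as well.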