Tlt-infer detectnet_v2 fails - TypeError

Nezakka · June 29, 2020, 12:12pm

Hi. I am trying to use detectnet_v2 resnet18 pre-trained model with TLT. The notebook has so far given reasonable results, given I haven’t really tried to configure very much. It’s an experiment with 3 lego vehicles on a table.

The dataset is small, as I’m trying to get everything working before investing a lot of time on creating a larger dataset. Currently I have 70 training images and 6 testing images.

Training based on resnet18_detector gave good results. Pruning and retraining ran without incident.

Validation cost: 0.000396
Mean average_precision (in %): 70.6779

class name      average precision (in %)
------------  --------------------------
bus                              77.4306
followme                         45.7143
police                           88.8889

The problem came when I tried to run tlt-infer.

# Running inference for detection on n images
!tlt-infer detectnet_v2 -e $SPECS_DIR/detectnet_v2_inference_kitti_tlt_lego.txt \
                        -o $USER_EXPERIMENT_DIR/tlt_infer_testing \
                        -i $DATA_DOWNLOAD_DIR/testing/image_2 \
                        -k $KEY

Any and all help would be greatly appreciated. Apologies in advance if I have given insufficient information. Below is the output of the tlt-infer command.

Using TensorFlow backend.
2020-06-29 11:31:29,590 [INFO] iva.detectnet_v2.scripts.inference: Creating output inference directory
2020-06-29 11:31:29,590 [INFO] iva.detectnet_v2.scripts.inference: Overlain images will be saved in the output path.
2020-06-29 11:31:29,591 [INFO] iva.detectnet_v2.inferencer.build_inferencer: Constructing inferencer
2020-06-29 11:31:29,887 [INFO] iva.detectnet_v2.inferencer.tlt_inferencer: Loading model from /workspace/tlt-experiments/detectnet_v2/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt:

Layer (type) Output Shape Param #

input_1 (InputLayer) (None, 3, 544, 1408) 0

model_1 (Model) [(None, 3, 34, 88), (None 11203023

Total params: 11,203,023
Trainable params: 11,193,295
Non-trainable params: 9,728

2020-06-29 11:31:32,852 [INFO] iva.detectnet_v2.scripts.inference: Initialized model
2020-06-29 11:31:32,852 [INFO] iva.detectnet_v2.scripts.inference: Commencing inference
0%| | 0/2 [00:00<?, ?it/s]Process PoolWorker-1:
Traceback (most recent call last):
File “/usr/lib/python2.7/multiprocessing/process.py”, line 258, in _bootstrap
self.run()
File “/usr/lib/python2.7/multiprocessing/process.py”, line 114, in run
self._target(*self._args, **self._kwargs)
File “/usr/lib/python2.7/multiprocessing/pool.py”, line 102, in worker
task = get()
File “/usr/lib/python2.7/multiprocessing/queues.py”, line 378, in get
return recv()
TypeError: __new__() takes exactly 4 arguments (2 given)
Process PoolWorker-2:
Traceback (most recent call last):
File “/usr/lib/python2.7/multiprocessing/process.py”, line 258, in _bootstrap
self.run()
File “/usr/lib/python2.7/multiprocessing/process.py”, line 114, in run
self._target(*self._args, **self._kwargs)
File “/usr/lib/python2.7/multiprocessing/pool.py”, line 102, in worker
task = get()
File “/usr/lib/python2.7/multiprocessing/queues.py”, line 378, in get
return recv()
TypeError: __new__() takes exactly 4 arguments (2 given)
Process PoolWorker-3:
Traceback (most recent call last):
File “/usr/lib/python2.7/multiprocessing/process.py”, line 258, in _bootstrap
self.run()
File “/usr/lib/python2.7/multiprocessing/process.py”, line 114, in run
self._target(*self._args, **self._kwargs)
File “/usr/lib/python2.7/multiprocessing/pool.py”, line 102, in worker
task = get()
File “/usr/lib/python2.7/multiprocessing/queues.py”, line 378, in get
return recv()
TypeError: __new__() takes exactly 4 arguments (2 given)
Process PoolWorker-4:
Traceback (most recent call last):
File “/usr/lib/python2.7/multiprocessing/process.py”, line 258, in _bootstrap
self.run()
File “/usr/lib/python2.7/multiprocessing/process.py”, line 114, in run
self._target(*self._args, **self._kwargs)
File “/usr/lib/python2.7/multiprocessing/pool.py”, line 102, in worker
task = get()
File “/usr/lib/python2.7/multiprocessing/queues.py”, line 378, in get
return recv()
TypeError: __new__() takes exactly 4 arguments (2 given)

Morganh · June 29, 2020, 4:25pm

Could you please paste your $SPECS_DIR/detectnet_v2_inference_kitti_tlt_lego.txt?

Nezakka · June 30, 2020, 4:44am

Thanks for your assistance. Here is the requested file.

detectnet_v2_inference_kitti_tlt_lego.txt (2.2 KB)

Morganh · June 30, 2020, 9:14am

There does not seem to be anything wrong with your spec.
How many GPUs are in your host PC? 4?
Also, could you please share more of the log after “Using TensorFlow backend.”?
There should be some log lines about your GPUs.

Nezakka · June 30, 2020, 1:09pm

Hi. Thanks again for your quick response.

I have a 1 GPU laptop as per the spec below. I have attached all output from the Jupyter notebook.

After_Using_TensorFlow_backend.txt (3.3 KB)

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
Tue Jun 30 13:08:02 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   57C    P2    50W /  N/A |   5197MiB /  7982MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Nezakka · June 30, 2020, 1:33pm

Hi. Not sure if this helps, but I also tried a single image with very similar results.

I ran the following.

# Running inference for detection on n images
!tlt-infer detectnet_v2 -e $SPECS_DIR/detectnet_v2_inference_kitti_tlt_lego.txt \
                        -o $USER_EXPERIMENT_DIR/tlt_infer_testing \
                        -i /workspace/tlt-experiments/data/testing/image_2/lego1-74.png \
                        -k $KEY

Result was:

Morganh · July 1, 2020, 2:11am

Very strange. Could you attach the full log of your training?
Also, when tlt-infer runs, it should print some information about the GPU, as below. That part is missing from your log.

Using TensorFlow backend.
2020-06-30 09:09:52,763 [INFO] iva.detectnet_v2.scripts.inference: Overlain images will be saved in the output path.
2020-06-30 09:09:52,763 [INFO] iva.detectnet_v2.inferencer.build_inferencer: Constructing inferencer
2020-06-30 09:09:52.764218: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-06-30 09:09:52.830187: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x71b1400 executing computations on platform CUDA. Devices:
2020-06-30 09:09:52.830269: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-06-30 09:09:52.850890: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3499910000 Hz
2020-06-30 09:09:52.851949: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x721b250 executing computations on platform Host. Devices:
2020-06-30 09:09:52.852006: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2020-06-30 09:09:52.852235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 10.91GiB freeMemory: 10.15GiB
2020-06-30 09:09:52.852280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-06-30 09:09:53.078357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-30 09:09:53.078416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-06-30 09:09:53.078423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-06-30 09:09:53.078536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9800 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
2020-06-30 09:09:53,079 [INFO] iva.detectnet_v2.inferencer.tlt_inferencer: Loading model from /workspace/tlt-experiments/detectnet_v2/experiment_dir_unpruned/weights/resnet18_detector.tlt:
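
A quick way to check whether a saved log contains this kind of GPU information is to search it for the "Found device" lines. The following is only a minimal sketch: the function name gpus_reported and the two regular expressions are hypothetical helpers written against the TensorFlow 1.x log format shown above, not part of TLT.

```python
import re

# Patterns written against the sample log above (assumed TensorFlow 1.x format):
#   "... gpu_device.cc:1433] Found device 0 with properties:"
#   "name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582"
FOUND_DEVICE = re.compile(r"Found device (\d+) with properties:")
DEVICE_NAME = re.compile(r"^name: (.+?) major: \d+")


def gpus_reported(log_text):
    """Return a list of (device_index, name) pairs found in a log's text."""
    devices = []
    pending_index = None
    for line in log_text.splitlines():
        found = FOUND_DEVICE.search(line)
        if found:
            # Remember the index; the device name is expected on a later line.
            pending_index = int(found.group(1))
            continue
        name = DEVICE_NAME.match(line.strip())
        if name and pending_index is not None:
            devices.append((pending_index, name.group(1)))
            pending_index = None
    return devices
```

An empty result for a tlt-train or tlt-infer log would match the symptom described here: no GPU lines were printed at all.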

Nezakka · July 1, 2020, 6:13am

Hi Morganh, I am re-running the training to get the training log. However, there is no further info about my GPU like the one in your output for the tlt-infer step. Perhaps I misunderstand what you mean by log.

Is the log simply the output in the Jupyter notebook, or do I need to turn on a switch to get more verbose output? I have given you all that was output from the Jupyter notebook. I fear I’m missing something obvious. Can you give me a few more hints regarding the log?

Morganh · July 1, 2020, 6:26am

No switch is needed.
Can you post the training log?
Normally there should be some log lines about your device, as above.

Morganh · July 1, 2020, 6:30am

Also, please provide the CPU info of your laptop.

Nezakka · July 1, 2020, 6:34am

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Nezakka · July 1, 2020, 6:35am

training_log.txt (57.3 KB)

Morganh · July 1, 2020, 6:37am

Please share the CPU info too. Thanks.
$ cat /proc/cpuinfo

Nezakka · July 1, 2020, 6:43am

Here’s the CPU output. Sorry, misread your earlier comment.

cpuinfo.txt (15.3 KB)

Nezakka · July 1, 2020, 6:50am

Retrain log. This is the model used for inference. I cannot see any reference to GPU however.

retraining_log.txt (57.2 KB)

Morganh · July 1, 2020, 6:51am

Your training log does not contain log lines similar to the ones below. This may be the culprit.

2020-06-30 09:09:52.764218: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-06-30 09:09:52.830187: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x71b1400 executing computations on platform CUDA. Devices:
2020-06-30 09:09:52.830269: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-06-30 09:09:52.850890: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3499910000 Hz
2020-06-30 09:09:52.851949: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x721b250 executing computations on platform Host. Devices:
2020-06-30 09:09:52.852006: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): ,
2020-06-30 09:09:52.852235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 10.91GiB freeMemory: 10.15GiB
2020-06-30 09:09:52.852280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-06-30 09:09:53.078357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-30 09:09:53.078416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-06-30 09:09:53.078423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-06-30 09:09:53.078536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/

Could you double-check whether your laptop meets the software requirements?
https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#requirements

Software Requirements

Ubuntu 18.04 LTS
NVIDIA GPU Cloud account and API key - https://ngc.nvidia.com/
docker-ce installed, https://docs.docker.com/install/linux/docker-ce/ubuntu/
nvidia-docker2 installed, instructions: https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
NVIDIA GPU driver v410.xx or above

Nezakka · July 1, 2020, 7:00am

Hi again.

My system is Linux Mint, but based on Ubuntu 18.04 bionic. Attached detailed system information.

When installing Nvidia Docker 2, I registered the Ubuntu repositories manually as per this post: Using Linux Mint · Issue #848 · NVIDIA/nvidia-docker · GitHub

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

SystemInfo.txt (4.5 KB)

I would like to add that the first time, I ran through with the normal KITTI dataset as in the default TLT Jupyter notebook, and everything worked fine.

Morganh · July 1, 2020, 7:04am

So, do you mean that tlt-infer also worked fine at that time?

Nezakka · July 1, 2020, 7:07am

Yes. The first time I did this procedure I just followed exactly the Jupyter notebook. I even deployed the model to a Jetson Nano. Then I tried again with Resnet10. That also worked. It was only when I moved onto modifying the files and my own data set that things went wrong.

My default images are 1392x512px. Perhaps I have not specified the image_width and image_height correctly? I specified 1408 x 544. I wasn’t exactly sure how to set the image_width and image_height for training - but it did train.
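
One way to confirm what size the dataset images really are, before comparing against the spec file, is to read the width and height straight from each PNG header. This is a sketch using only the Python standard library; png_size and mismatched_images are hypothetical helper names, and it assumes the images are PNG files like lego1-74.png above.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(path):
    """Read (width, height) from a PNG file's IHDR chunk without decoding it."""
    with open(path, "rb") as handle:
        header = handle.read(24)
    # Bytes 0-7: signature, 8-11: chunk length, 12-15: chunk type "IHDR".
    if len(header) < 24 or header[:8] != PNG_SIGNATURE or header[12:16] != b"IHDR":
        raise ValueError("not a PNG file: %s" % path)
    # Bytes 16-23: width and height as big-endian unsigned 32-bit integers.
    return struct.unpack(">II", header[16:24])


def mismatched_images(paths, expected_size):
    """Return (path, size) pairs whose size differs from expected_size."""
    result = []
    for path in paths:
        size = png_size(path)
        if size != expected_size:
            result.append((path, size))
    return result
```

For example, calling mismatched_images with the training image paths and (1408, 544) would list every image that does not match the size given in the spec.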

Morganh · July 1, 2020, 7:24am

Thanks for the info.
So, please set the size to 1392x512 and trigger training.

augmentation_config {
  preprocessing {
    output_image_width: 1392
    output_image_height: 512
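
As a sanity check on the two sizes discussed here: the model summary in the first post shows a 3 x 544 x 1408 input and a 34 x 88 output grid, which suggests the network downsamples by a factor of 16. The sketch below assumes that stride (it is inferred from the log, not confirmed in this thread) and uses a hypothetical helper, output_grid, to check that a width and height divide evenly by it.

```python
# Assumption inferred from the model summary: 544 / 34 == 1408 / 88 == 16.
STRIDE = 16


def output_grid(width, height, stride=STRIDE):
    """Return (grid_height, grid_width) for an input size.

    Raises ValueError if either dimension is not a multiple of the stride.
    """
    if width % stride or height % stride:
        raise ValueError(
            "%dx%d is not a multiple of %d in both dimensions" % (width, height, stride)
        )
    return height // stride, width // stride


print(output_grid(1408, 544))  # (34, 88), matching the model summary
print(output_grid(1392, 512))  # (32, 87)
```

Under that assumption, both 1408x544 and the native 1392x512 divide evenly by 16.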
