Tlt-infer detectnet_v2 fails - TypeError

Hi. I am trying to use detectnet_v2 resnet18 pre-trained model with TLT. The notebook has so far given reasonable results, given I haven’t really tried to configure very much. It’s an experiment with 3 lego vehicles on a table.

The dataset is small, as I’m trying to get everything working before investing a lot of time on creating a larger dataset. Currently I have 70 training images and 6 testing images.

Training based on resnet18_detector gave good results. Pruning and retraining ran without incident.

Validation cost: 0.000396
Mean average_precision (in %): 70.6779

class name      average precision (in %)
----------      ------------------------
bus             77.4306
followme        45.7143
police          88.8889

The problem came when I tried to run tlt-infer.

Running inference for detection on n images

!tlt-infer detectnet_v2 -e $SPECS_DIR/detectnet_v2_inference_kitti_tlt_lego.txt \
                        -o $USER_EXPERIMENT_DIR/tlt_infer_testing \
                        -i $DATA_DOWNLOAD_DIR/testing/image_2 \
                        -k $KEY

Any and all help would be greatly appreciated. Apologies in advance if I have given insufficient information. Below is the output of the tlt-infer command.

Using TensorFlow backend.
2020-06-29 11:31:29,590 [INFO] iva.detectnet_v2.scripts.inference: Creating output inference directory
2020-06-29 11:31:29,590 [INFO] iva.detectnet_v2.scripts.inference: Overlain images will be saved in the output path.
2020-06-29 11:31:29,591 [INFO] iva.detectnet_v2.inferencer.build_inferencer: Constructing inferencer
2020-06-29 11:31:29,887 [INFO] iva.detectnet_v2.inferencer.tlt_inferencer: Loading model from /workspace/tlt-experiments/detectnet_v2/experiment_dir_retrain/weights/resnet18_detector_pruned.tlt:


Layer (type) Output Shape Param #

input_1 (InputLayer) (None, 3, 544, 1408) 0


model_1 (Model) [(None, 3, 34, 88), (None 11203023

Total params: 11,203,023
Trainable params: 11,193,295
Non-trainable params: 9,728


2020-06-29 11:31:32,852 [INFO] iva.detectnet_v2.scripts.inference: Initialized model
2020-06-29 11:31:32,852 [INFO] iva.detectnet_v2.scripts.inference: Commencing inference
0%| | 0/2 [00:00<?, ?it/s]Process PoolWorker-1:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    task = get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
TypeError: __new__() takes exactly 4 arguments (2 given)
Process PoolWorker-2:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    task = get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
TypeError: __new__() takes exactly 4 arguments (2 given)
Process PoolWorker-3:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    task = get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
TypeError: __new__() takes exactly 4 arguments (2 given)
Process PoolWorker-4:
Traceback (most recent call last):
  File "/usr/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python2.7/multiprocessing/process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 102, in worker
    task = get()
  File "/usr/lib/python2.7/multiprocessing/queues.py", line 378, in get
    return recv()
TypeError: __new__() takes exactly 4 arguments (2 given)

Could you please paste your $SPECS_DIR/detectnet_v2_inference_kitti_tlt_lego.txt?

Thanks for your assistance. Here is the requested file.

detectnet_v2_inference_kitti_tlt_lego.txt (2.2 KB)

It seems there is nothing wrong with your spec.
How many GPUs are in your host PC? 4?
Also, could you please share the rest of the log after “Using TensorFlow backend.”?
There should be some log lines about your GPUs.

Hi. Thanks again for your quick response.

I have a 1 GPU laptop as per the spec below. I have attached all output from the Jupyter notebook.

After_Using_TensorFlow_backend.txt (3.3 KB)

Tue Jun 30 13:08:02 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   57C    P2    50W /  N/A |   5197MiB /  7982MiB |      5%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Hi. Not sure if this helps, but I also tried a single image with very similar results.

I ran the following.

Running inference for detection on n images

!tlt-infer detectnet_v2 -e $SPECS_DIR/detectnet_v2_inference_kitti_tlt_lego.txt \
                        -o $USER_EXPERIMENT_DIR/tlt_infer_testing \
                        -i /workspace/tlt-experiments/data/testing/image_2/lego1-74.png \
                        -k $KEY

Result was:

Very strange. Could you attach the full log of your training?
Also, when tlt-infer runs, it should print some GPU info like the lines below, but that is missing from your log.

Using TensorFlow backend.
2020-06-30 09:09:52,763 [INFO] iva.detectnet_v2.scripts.inference: Overlain images will be saved in the output path.
2020-06-30 09:09:52,763 [INFO] iva.detectnet_v2.inferencer.build_inferencer: Constructing inferencer
2020-06-30 09:09:52.764218: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-06-30 09:09:52.830187: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x71b1400 executing computations on platform CUDA. Devices:
2020-06-30 09:09:52.830269: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-06-30 09:09:52.850890: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3499910000 Hz
2020-06-30 09:09:52.851949: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x721b250 executing computations on platform Host. Devices:
2020-06-30 09:09:52.852006: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2020-06-30 09:09:52.852235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 10.91GiB freeMemory: 10.15GiB
2020-06-30 09:09:52.852280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-06-30 09:09:53.078357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-30 09:09:53.078416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-06-30 09:09:53.078423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-06-30 09:09:53.078536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9800 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1080 Ti, pci bus id: 0000:65:00.0, compute capability: 6.1)
2020-06-30 09:09:53,079 [INFO] iva.detectnet_v2.inferencer.tlt_inferencer: Loading model from /workspace/tlt-experiments/detectnet_v2/experiment_dir_unpruned/weights/resnet18_detector.tlt:
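
A quick way to check whether a saved log contains device lines like the ones above is sketched below. This is only an illustration: the two patterns are copied from the sample log in this post, and treating them as "GPU was detected" markers is an assumption, not an official TLT check. The function names are made up for this example.

```python
import re

# Patterns copied from the sample log above (assumption: other TensorFlow
# versions may word these lines differently).
FOUND_DEVICE = re.compile(r"Found device \d+ with properties")
CREATED_DEVICE = re.compile(r"Created TensorFlow device \(\S*device:GPU:\d+")


def gpu_lines(log_text):
    """Return the log lines that report a detected or created GPU device."""
    return [
        line
        for line in log_text.splitlines()
        if FOUND_DEVICE.search(line) or CREATED_DEVICE.search(line)
    ]


def has_gpu_info(log_text):
    """True if the log contains at least one GPU device line."""
    return bool(gpu_lines(log_text))
```

For example, `has_gpu_info(open("training_log.txt").read())` would report whether an attached log has any such lines.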

Hi Morganh, I am re-running the training to get the training log. However, I get no GPU info like the lines in your output for the tlt-infer step. Perhaps I misunderstand what you mean by log.

Is the log simply the output in the Jupyter notebook, or do I need to turn on a switch to get more verbose output? I have given you all that was output from the Jupyter notebook. I fear I’m missing something obvious. Can you give me a few more hints regarding the log?

No switch is needed.
Can you post the training log?
Normally there should be some log lines about your device, as above.

Also, please provide the CPU info of your laptop.

root@b4f2ef224cd3:/workspace/tlt-experiments/data# nvidia-smi
Wed Jul 1 06:33:57 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.100      Driver Version: 440.100      CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 2080    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   81C    P2   199W /  N/A |   5151MiB /  7982MiB |     92%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

training_log.txt (57.3 KB)

Please share the CPU info too. Thanks.
$ cat /proc/cpuinfo
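
If the full `/proc/cpuinfo` dump is too long to read through, a short script can pull out the model name and instruction-set flags. The sketch below is only an illustration. Which flags matter for TLT is not stated in this thread; the names `avx2`, `avx512f` and `fma` are simply the lowercase forms of the instructions listed in the sample TensorFlow log line, and the function names are made up for this example.

```python
import os


def parse_cpuinfo(text):
    """Parse the first processor entry of /proc/cpuinfo-style text into a dict."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            if info:
                break  # a blank line ends the first processor block
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def cpu_flags(text):
    """Return the set of CPU feature flags reported for the first processor."""
    return set(parse_cpuinfo(text).get("flags", "").split())


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        cpuinfo_text = f.read()
    print(parse_cpuinfo(cpuinfo_text).get("model name"))
    # Flag names below are an assumption taken from the sample log line.
    print(sorted(cpu_flags(cpuinfo_text) & {"avx", "avx2", "avx512f", "fma"}))
```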

Here’s the CPU output. Sorry, misread your earlier comment.

cpuinfo.txt (15.3 KB)

Retrain log. This is the model used for inference. I cannot see any reference to GPU however.

retraining_log.txt (57.2 KB)

Your training log does not contain log lines similar to the ones below. This may be the culprit.

2020-06-30 09:09:52.764218: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
2020-06-30 09:09:52.830187: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x71b1400 executing computations on platform CUDA. Devices:
2020-06-30 09:09:52.830269: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 1080 Ti, Compute Capability 6.1
2020-06-30 09:09:52.850890: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3499910000 Hz
2020-06-30 09:09:52.851949: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x721b250 executing computations on platform Host. Devices:
2020-06-30 09:09:52.852006: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2020-06-30 09:09:52.852235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582
pciBusID: 0000:65:00.0
totalMemory: 10.91GiB freeMemory: 10.15GiB
2020-06-30 09:09:52.852280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-06-30 09:09:53.078357: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-06-30 09:09:53.078416: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-06-30 09:09:53.078423: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-06-30 09:09:53.078536: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/

Could you double-check whether your laptop meets the software requirements?
https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#requirements

Software Requirements

Ubuntu 18.04 LTS
NVIDIA GPU Cloud account and API key - https://ngc.nvidia.com/
docker-ce installed, https://docs.docker.com/install/linux/docker-ce/ubuntu/
nvidia-docker2 installed, instructions: https://github.com/nvidia/nvidia-docker/wiki/Installation-(version-2.0)
NVIDIA GPU driver v410.xx or above
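
The driver requirement in the list above can be checked against pasted `nvidia-smi` output with a few lines of Python. This is a minimal sketch: it assumes the header contains a `Driver Version: X.Y` field as in the output posted in this thread, and it compares only the major version with 410. The function names are made up for this example.

```python
import re

MIN_DRIVER_MAJOR = 410  # from "NVIDIA GPU driver v410.xx or above"


def driver_version(nvidia_smi_text):
    """Extract the driver version string from nvidia-smi output, or None."""
    match = re.search(r"Driver Version:\s*(\d+(?:\.\d+)*)", nvidia_smi_text)
    return match.group(1) if match else None


def meets_driver_requirement(nvidia_smi_text, min_major=MIN_DRIVER_MAJOR):
    """True if the reported driver major version is at least min_major."""
    version = driver_version(nvidia_smi_text)
    if version is None:
        return False  # no version found; treat as not verified
    return int(version.split(".")[0]) >= min_major
```

With the header posted earlier (`Driver Version: 440.100`), `meets_driver_requirement` returns `True`.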

Hi again.

My system is Linux Mint, but based on Ubuntu 18.04 bionic. Attached detailed system information.

When installing Nvidia Docker 2, I registered the Ubuntu repositories manually as per this post: https://github.com/NVIDIA/nvidia-docker/issues/848

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu18.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update

SystemInfo.txt (4.5 KB)

I would like to add that the first time, I ran through with the normal KITTI dataset as in the default TLT Jupyter notebook, and everything worked fine.

So, do you mean that tlt-infer also worked fine at that time?

Yes. The first time I did this procedure I just followed exactly the Jupyter notebook. I even deployed the model to a Jetson Nano. Then I tried again with Resnet10. That also worked. It was only when I moved onto modifying the files and my own data set that things went wrong.

My default images are 1392x512px. Perhaps I have not specified the image_width and image_height correctly? I specified 1408 x 544. I wasn’t exactly sure how to set the image_width and image_height for training - but it did train.
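
To compare the actual image size with the size in the spec, the width and height can be read straight from the PNG header. The sketch below is only an illustration. The multiple-of-16 check is an assumption inferred from the model summary earlier in the thread (a 544 x 1408 input gives a 34 x 88 output, i.e. a factor of 16), and the function names are made up for this example.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) read from the IHDR chunk of PNG file bytes."""
    if len(data) < 24 or data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # IHDR stores width and height as big-endian 32-bit unsigned integers.
    return struct.unpack(">II", data[16:24])


def check_size(image_size, spec_size, multiple=16):
    """List mismatches between an image's (width, height) and the spec's."""
    problems = []
    if tuple(image_size) != tuple(spec_size):
        problems.append(
            "image is %dx%d but spec says %dx%d"
            % (tuple(image_size) + tuple(spec_size))
        )
    for name, value in zip(("width", "height"), spec_size):
        if value % multiple:  # assumption: sizes should be multiples of 16
            problems.append(
                "spec %s %d is not a multiple of %d" % (name, value, multiple)
            )
    return problems
```

For a file on disk, read the first 24 bytes in binary mode and pass them to `png_size`, then call `check_size(size, (1408, 544))` to see whether the spec matches the images.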

Thanks for the info.
So, please set the output size to 1392x512 and trigger training again.

augmentation_config {
preprocessing {
output_image_width: 1392
output_image_height: 512