Tlt-train with ssd is not working on the latest container (December 29, 2020)

Hi, I just updated my TLT docker container to the latest release and started getting the following error when running tlt-train with SSD and a ResNet-18 model.

I converted my data using tlt-dataset-convert

training_spec (1.7 KB)

I am getting an error as follows:

Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 8, in <module>
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 45, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 248, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 185, in run_experiment
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1039, in fit
validation_steps=validation_steps)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
outs = f(ins)
File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2671, in _call
session)
File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2623, in _make_callable
callable_fn = session._make_callable_from_options(callable_opts)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1505, in _make_callable_from_options
return BaseSession._Callable(self, callable_options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1460, in __init__
session._session, options_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Conv2DCustomBackpropInputOp only supports NHWC.
[[{{node training/SGD/gradients/ssd_loc_0_1/convolution_grad/Conv2DBackpropInput}}]]
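For context, that "Conv2DCustomBackpropInputOp only supports NHWC" message comes from TensorFlow's CPU convolution kernel, which only handles the channels-last (NHWC) layout. TLT builds its graphs channels-first (NCHW) for GPU execution, so the error usually means the op fell back to the CPU because TensorFlow could not use the GPU. A toy illustration of the two layouts (hypothetical helper, not TLT code):

```python
def nchw_to_nhwc(shape):
    """Reorder a channels-first (N, C, H, W) shape to channels-last (N, H, W, C).

    TLT/Keras trains channels_first for the GPU; TensorFlow's CPU conv
    kernels accept only channels_last, hence the NHWC error when the
    gradient op gets placed on the CPU.
    """
    n, c, h, w = shape
    return (n, h, w, c)

# An SSD input batch of 8 RGB 300x300 images in the two layouts:
print(nchw_to_nhwc((8, 3, 300, 300)))  # (8, 300, 300, 3)
```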

The same command and spec file used to work on the previous container released on 08/04/2020.

I am not sure if it is related to my CUDA version, the NVIDIA driver, or something else. Here is the output of nvidia-smi:

nvidia-smi
Thu Jan 21 17:36:11 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P4000        Off  | 00000000:01:00.0 Off |                  N/A |
| 49%   39C    P0    28W / 105W |      0MiB /  8103MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Which tlt container did you use?

nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3

See Transfer Learning Toolkit for Video Streaming Analytics | NVIDIA NGC
The nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 container was released on 08/04/2020 and should work without issue.
The version you downloaded on December 29th is not an official release; please ignore it.

I used this command to pull the container:

docker pull nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3

So apparently I am still running the one released on 08/04/2020:

sudo docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nvcr.io/nvidia/tlt-streamanalytics v2.0_py3 eefcf17a7830 5 months ago 7.15GB

It used to work without any issues when I first experimented with it 5 months ago, but since then the machine has been through several upgrades and driver updates, and I am not sure what happened. I tried the same spec files and the dataset I converted 5 months ago, when it was working, and it still shows the same error.

Do I need a specific CUDA version or NVIDIA driver?
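Before chasing CUDA versions, it may be worth confirming that the container can see the GPU at all, since the NHWC error is exactly what appears when TensorFlow falls back to its CPU kernels. A minimal stdlib sketch (hypothetical helper names, written for the container's Python 3.6) that could be run inside the container:

```python
import shutil
import subprocess


def count_gpus(listing):
    """Count devices in `nvidia-smi -L` output ('GPU 0: Quadro P4000 (...)')."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def gpu_visible():
    """Return True if nvidia-smi is on PATH and reports at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Likely started without the NVIDIA container runtime / GPU passthrough.
        return False
    # stdout/stderr PIPE + universal_newlines instead of capture_output,
    # for Python 3.6 compatibility.
    result = subprocess.run(
        [exe, "-L"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode == 0 and count_gpus(result.stdout) > 0
```

If this returns False inside the container, the fix is usually at the docker level (making sure the container is started with GPU access) rather than in the spec file.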

First, to narrow this down, please try running the default Jupyter notebook with the public KITTI dataset instead of your own dataset.

I'm asking because another user hit the same issue: TLT Detectnet TrafficCamNet training not working

You can also search for the keywords in this TLT forum.
For example, searching gives this list:
https://forums.developer.nvidia.com/search?q=supports%20NHWC%20%23intelligent-video-analytics%3Atransfer-learning-toolkit%20

I looked at all the suggested topics, but unfortunately the solutions there do not apply to me: I am running on a local machine (not AWS), and I believe my GPU and NVIDIA driver are fine and should work with TLT. In fact, 5 months ago it worked just fine and I trained more than 50 models with it. I will try the Jupyter notebook and get back to you.

My suspicion is that there is a problem with my CUDA version. When I run nvidia-smi, it shows CUDA 11.2, which is not compatible with TensorFlow 1.15.

The 2.0_py3 release was never tested against CUDA 11.2, so its behavior there is unknown.
That docker image is compatible with CUDA 10 for sure.
But you can still give it a try under CUDA 11.2.
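For reference, NVIDIA drivers are backward compatible with containers built against earlier CUDA toolkits, so a 460.x driver satisfies the minimum required by a CUDA 10 container; the CUDA version shown by nvidia-smi is just the newest toolkit the driver supports, not what the container actually uses. A quick sketch of that check (minimum driver versions taken from NVIDIA's CUDA compatibility table; verify them for your platform):

```python
# Minimum Linux driver version required by each CUDA toolkit
# (values from NVIDIA's CUDA compatibility table; double-check
# for your platform before relying on them).
MIN_DRIVER = {
    "10.0": (410, 48),
    "10.2": (440, 33),
    "11.0": (450, 36),
    "11.2": (460, 27),
}


def driver_supports(driver, cuda):
    """True if `driver` (e.g. '460.32.03') meets the minimum for `cuda`."""
    parts = tuple(int(p) for p in driver.split("."))
    return parts[:2] >= MIN_DRIVER[cuda]


# The 460.32.03 driver above is new enough for every toolkit in the table,
# including the CUDA 10 builds that the TLT 2.0_py3 container ships with.
print(driver_supports("460.32.03", "10.0"))  # True
print(driver_supports("410.48", "11.2"))     # False
```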