Tlt-train with ssd is not working on the latest container (December 29, 2020)

Hi, I just updated my TLT docker container to the latest release and started getting the following error when running tlt-train with SSD and a ResNet-18 model.

I converted my data using tlt-dataset-convert

training_spec (1.7 KB)

I am getting an error as follows:

Traceback (most recent call last):
File "/usr/local/bin/tlt-train-g1", line 8, in <module>
sys.exit(main())
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py", line 45, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 248, in main
File "/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/ssd/scripts/train.py", line 185, in run_experiment
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1039, in fit
validation_steps=validation_steps)
File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 154, in fit_loop
outs = f(ins)
File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
return self._call(inputs)
File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2671, in _call
session)
File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2623, in _make_callable
callable_fn = session._make_callable_from_options(callable_opts)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1505, in _make_callable_from_options
return BaseSession._Callable(self, callable_options)
File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1460, in __init__
session._session, options_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Conv2DCustomBackpropInputOp only supports NHWC.
[[{{node training/SGD/gradients/ssd_loc_0_1/convolution_grad/Conv2DBackpropInput}}]]
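For context, that "Conv2DCustomBackpropInputOp only supports NHWC" message comes from TensorFlow's CPU convolution kernel, which only handles the channels-last (NHWC) layout. TLT builds its graphs channels-first (NCHW) for GPU execution, so the error usually means the op fell back to the CPU because TensorFlow could not use the GPU. A toy illustration of the two layouts (hypothetical helper, not TLT code):

```python
def nchw_to_nhwc(shape):
    """Reorder a channels-first (N, C, H, W) shape to channels-last (N, H, W, C).

    TLT/Keras trains channels_first for the GPU; TensorFlow's CPU conv
    kernels accept only channels_last, hence the NHWC error when the
    gradient op gets placed on the CPU.
    """
    n, c, h, w = shape
    return (n, h, w, c)

# An SSD input batch of 8 RGB 300x300 images in the two layouts:
print(nchw_to_nhwc((8, 3, 300, 300)))  # (8, 300, 300, 3)
```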

The same command and spec file used to work on the previous container released on 08/04/2020.

I am not sure if it is related to my CUDA version, the NVIDIA driver, or something else. Here is the output of nvidia-smi:

nvidia-smi
Thu Jan 21 17:36:11 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P4000        Off  | 00000000:01:00.0 Off |                  N/A |
| 49%   39C    P0    28W / 105W |      0MiB /  8103MiB |      1%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Which tlt container did you use?

nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3

See Transfer Learning Toolkit for Video Streaming Analytics | NVIDIA NGC
The nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3 container was released on 08/04/2020 and should work without issue.
The version you downloaded on December 29th is not an official release; please ignore it.

I used this command to pull the container:

docker pull nvcr.io/nvidia/tlt-streamanalytics:v2.0_py3

So apparently I am still running the one released on 08/04/2020:

sudo docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
nvcr.io/nvidia/tlt-streamanalytics v2.0_py3 eefcf17a7830 5 months ago 7.15GB

It used to work without any issues when I first experimented with it 5 months ago, but since then the machine has been through several upgrades and driver updates, and I am not sure what happened. I tried the same spec files and the dataset I converted 5 months ago, when it was working, and it still shows the same error.

Do I need a specific CUDA version or NVIDIA driver?
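Before chasing CUDA versions, it may be worth confirming that the container can see the GPU at all, since the NHWC error is exactly what appears when TensorFlow falls back to its CPU kernels. A minimal stdlib sketch (hypothetical helper names, written for the container's Python 3.6) that could be run inside the container:

```python
import shutil
import subprocess


def count_gpus(listing):
    """Count devices in `nvidia-smi -L` output ('GPU 0: Quadro P4000 (...)')."""
    return sum(1 for line in listing.splitlines() if line.startswith("GPU "))


def gpu_visible():
    """Return True if nvidia-smi is on PATH and reports at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        # Likely started without the NVIDIA container runtime / GPU passthrough.
        return False
    # stdout/stderr PIPE + universal_newlines instead of capture_output,
    # for Python 3.6 compatibility.
    result = subprocess.run(
        [exe, "-L"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    return result.returncode == 0 and count_gpus(result.stdout) > 0
```

If this returns False inside the container, the fix is usually at the docker level (making sure the container is started with GPU access) rather than in the spec file.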

First, to narrow this down, please try running the default Jupyter notebook with the public KITTI dataset instead of your own dataset.

I'm asking because another user hit the same issue: TLT Detectnet TrafficCamNet training not working

You can also search for the keywords in this TLT forum.
For example, searching gives this list:
https://forums.developer.nvidia.com/search?q=supports%20NHWC%20%23intelligent-video-analytics%3Atransfer-learning-toolkit%20

I looked at all the suggested topics, but unfortunately the solutions there do not apply to me: I am running on a local machine (not AWS), and I believe my GPU and NVIDIA driver are fine and should work with TLT. In fact, 5 months ago it worked just fine and I trained more than 50 models with it. I will try the Jupyter notebook and get back to you.

My suspicion is that there is a problem with my CUDA version. When I run nvidia-smi, it shows CUDA 11.2, which is not compatible with TensorFlow 1.15.

The 2.0_py3 release was never tested against CUDA 11.2, so its behavior there is unknown.
That docker image is compatible with CUDA 10 for sure.
But you can still give it a try under CUDA 11.2.
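For reference, NVIDIA drivers are backward compatible with containers built against earlier CUDA toolkits, so a 460.x driver satisfies the minimum required by a CUDA 10 container; the CUDA version shown by nvidia-smi is just the newest toolkit the driver supports, not what the container actually uses. A quick sketch of that check (minimum driver versions taken from NVIDIA's CUDA compatibility table; verify them for your platform):

```python
# Minimum Linux driver version required by each CUDA toolkit
# (values from NVIDIA's CUDA compatibility table; double-check
# for your platform before relying on them).
MIN_DRIVER = {
    "10.0": (410, 48),
    "10.2": (440, 33),
    "11.0": (450, 36),
    "11.2": (460, 27),
}


def driver_supports(driver, cuda):
    """True if `driver` (e.g. '460.32.03') meets the minimum for `cuda`."""
    parts = tuple(int(p) for p in driver.split("."))
    return parts[:2] >= MIN_DRIVER[cuda]


# The 460.32.03 driver above is new enough for every toolkit in the table,
# including the CUDA 10 builds that the TLT 2.0_py3 container ships with.
print(driver_supports("460.32.03", "10.0"))  # True
print(driver_supports("410.48", "11.2"))     # False
```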