SSD loss possible int overflow on one computer, but not the other

Computer 1 specs:
8x GeForce GTX 1080
NVIDIA-SMI: 440.44

Command to start docker:
sudo docker run --gpus all --rm -it -v /data/DeepLearning:/workspace/tlt-experiments nvcr.io/nvidia/tlt-streamanalytics:v1.0_py2 /bin/bash

Command to start SSD training:
tlt-train ssd -e computer1_train_val.txt -r output -k <KEY_OMITTED_FOR_POST> -m /workspace/tlt-experiments/tlt_resnet18_ssd/resnet18.hdf5 --gpus 8

Computer 2 specs:
4x TITAN X (Pascal)
NVIDIA-SMI: 410.48

Command to start docker:
sudo docker run --runtime=nvidia --rm -it -v /data/DeepLearning:/workspace/tlt-experiments nvcr.io/nvidia/tlt-streamanalytics:v1.0_py2 /bin/bash

Command to start SSD training:
tlt-train ssd -e computer2_train_val.txt -r output -k <KEY_OMITTED_FOR_POST> -m /workspace/tlt-experiments/tlt_resnet18_ssd/resnet18.hdf5 --gpus 4

Additional Notes:
I thought my kitti folder on Computer1 might be corrupt, so after running once on Computer1 I captured the computer1_terminal_out_first_run.txt output. I then copied the kitti directory over from Computer2 and ran again on Computer1 (after, of course, doing another tfrecord conversion on the new directory; I have also attached that spec file as spec.txt). On the second run, the same issue happened again, as shown in computer1_terminal_out_second_run.txt. As you can see, the commands are nearly identical, as are the training files. What is going on?

TFRECORD Convert Command:
tlt-dataset-convert -d spec.txt -o ./tfrecord/
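For reference, a minimal dataset-convert spec has roughly this shape (the paths and split values below are hypothetical placeholders based on the TLT documentation, not the contents of the attached spec.txt):

```
kitti_config {
  root_directory_path: "/workspace/tlt-experiments/kitti/training"
  image_dir_name: "image_2"
  label_dir_name: "label_2"
  image_extension: ".png"
  partition_mode: "random"
  num_partitions: 2
  val_split: 20
  num_shards: 10
}
image_directory_path: "/workspace/tlt-experiments/kitti/training"
```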

EDIT:
According to this post: https://github.com/ragulpr/wtte-rnn-examples/issues/2
there is an issue where a code change committed to that repo may have led to similar problems. Is this a known problem in CUDA 10.0?

Also, I tried pulling the latest docker image that I found here: https://ngc.nvidia.com/catalog and ran it, but I'm still getting a negative loss that blows up into a large number. Here is the docker image I tried: nvcr.io/nvidia/tlt-streamanalytics:v1.0.1_py2. Could this be a driver issue?
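As a quick illustration of the failure mode (a hypothetical sketch, not taken from the actual logs): a loss that grows without bound in float32 overflows to inf, and the first subtraction of infinities after that yields NaN, which matches "blows up into a large number, then NaN".

```python
import numpy as np

# Hypothetical sketch: a diverging loss overflows float32 to inf,
# after which inf - inf produces NaN.
loss = np.float32(1e30)
for _ in range(3):
    loss = loss * loss  # 1e30**2 exceeds float32 max (~3.4e38) -> inf

print(loss)         # inf
print(loss - loss)  # nan
```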

I’ve now tried to downgrade to nvidia-docker2 (which is what computer2 is using) and I’m still having the same issue.
computer1_train_val.txt (2.82 KB)
computer1_terminal_out_first_run.txt (512 KB)
computer1_terminal_out_second_run.txt (440 KB)
computer2_train_val.txt (2.82 KB)
computer2_terminal_out.txt (662 KB)
spec.txt (292 Bytes)

Hi kwindham,
Since your computer2 works but computer1 does not, could you try the experiments below on computer1 to narrow this down? Thanks.

  1. Use the same command to start docker.
  2. Use the same number of GPUs as computer2.

Hi Morganh,

I tried both of these troubleshooting steps with the same results. I have posted the output to this message. I will note my docker versions are different.

Computer1
Docker version: 19.03
API version: 1.40

Computer2
Docker version: 18.09.6
API version: 1.39

Sincerely,
kwindham

Hi kwindham,
Could you please trigger a cross checking in computer1?
You can use the default training spec in the docker, then train the default dataset, i.e., the KITTI dataset, to see if it works.

Hi Morganh,

What do you mean by "trigger a cross checking" in computer1?

I’ve tried my best to replicate everything from Computer2 on Computer1. I have installed NVIDIA driver 410.129 with CUDA 10.0 and even installed the older version of the docker container. I am still getting loss calculations that eventually hit a NaN regardless of these changes. I’m running out of ideas.

Where can I download the default KITTI dataset to test this? Last time I went to the official KITTI site, the labels did not match the images; do you have a direct download link to avoid confusion?
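In the meantime, here is a quick way to check whether a downloaded copy pairs up (a hypothetical sketch that assumes the standard KITTI layout, with training/image_2/XXXXXX.png matching training/label_2/XXXXXX.txt by basename; the demo runs against a throwaway directory, so point KITTI_DIR at a real copy instead):

```shell
# Hypothetical sanity check: every KITTI image should have a matching label.
# Demo uses a throwaway directory; replace KITTI_DIR with your training dir.
KITTI_DIR=$(mktemp -d)
mkdir -p "$KITTI_DIR/image_2" "$KITTI_DIR/label_2"
touch "$KITTI_DIR/image_2/000000.png" "$KITTI_DIR/label_2/000000.txt"

imgs=$(ls "$KITTI_DIR/image_2" | sed 's/\.png$//' | sort)
lbls=$(ls "$KITTI_DIR/label_2" | sed 's/\.txt$//' | sort)
if [ "$imgs" = "$lbls" ]; then
  echo "labels match images"
else
  echo "MISMATCH between image_2 and label_2 basenames"
fi
rm -rf "$KITTI_DIR"
```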

Sincerely,
kwindham

I mean “In computer 1, you can use the default training spec in the docker, then train the default dataset, i.e., the KITTI dataset.”

In the TLT container, there are 4 Jupyter notebooks.
You can launch the SSD notebook and follow its steps to run the KITTI dataset.

For more info on how to launch a notebook, see https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#installing_magnet_topic

Hi Morganh,

I just tried the Jupyter notebook example for SSD and I’m getting the same result. I went through each block, placed all of the files in the appropriate volumes, and the tfrecord conversion succeeded. I’ve attached a text file containing the output from the 14th block.

Any other ideas?

Sincerely,
kwindham
output.txt (124 KB)

Hi Morganh,

So I went to the 15th block and everything was looking great, until it hit a NaN again. I’ve attached the multi-GPU output, since it is new for this computer to get through multiple epochs with decreasing loss; perhaps there is a clue here that may help you resolve this issue.

Here is the command in the 15th block of the jupyter notebook:

print("For multi-GPU, please uncomment and run this instead. Change --gpus based on your machine.")
!tlt-train ssd -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
               -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
               -k $KEY \
               -m $USER_EXPERIMENT_DIR/pretrained_resnet18/tlt_resnet18_ssd_v1/resnet18.hdf5 \
               --gpus 8
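Since the run only fails after several good epochs, one generic way to catch the exact step where the loss diverges is the logic behind Keras's TerminateOnNaN callback. A small sketch of that check (not a TLT API; the function name here is hypothetical):

```python
import numpy as np

# Generic sketch (not a TLT hook): find where a stream of loss values
# first goes NaN/inf, the way keras.callbacks.TerminateOnNaN stops training.
def first_bad_step(losses):
    """Return the index of the first NaN/inf loss, or None if all finite."""
    for i, v in enumerate(losses):
        if not np.isfinite(v):
            return i
    return None

print(first_bad_step([4.1, 3.2, 2.8, float("nan")]))  # 3
print(first_bad_step([4.1, 3.2]))                     # None
```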

Sincerely,
kwindham
multiGPUoutput.txt (791 KB)

What is the CPU information for computer1?
Possibly you are using a CPU type that the TensorFlow package in the TLT container does not support.

Also, please check computer1 against the requirements:
https://docs.nvidia.com/metropolis/TLT/tlt-getting-started-guide/index.html#requirements
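One quick thing to check is the CPU flags (assuming, as is typical for prebuilt TensorFlow binaries since 1.6, that the container's TF build requires AVX; that requirement is my assumption, not something stated in the TLT docs). A sketch that parses /proc/cpuinfo-style text:

```python
# Sketch: parse /proc/cpuinfo-style text for a CPU flag such as "avx".
# The AVX requirement is an assumption about the container's TF build.
def has_cpu_flag(cpuinfo_text, flag):
    """True if any 'flags' line lists `flag` as a whole word."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            if flag in line.split(":", 1)[1].split():
                return True
    return False

sample = "flags\t\t: fpu vme de avx avx2 sse4_1 sse4_2"
print(has_cpu_flag(sample, "avx"))      # True
print(has_cpu_flag(sample, "avx512f"))  # False

# On the actual machine:
#   has_cpu_flag(open("/proc/cpuinfo").read(), "avx")
```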

Hi Morganh,

Thanks for your help and patience with this. I’ve attached my /proc/cpuinfo output. I believe it meets the requirements listed at the link you sent me.

Sincerely,
kwindham
cpuinfo.txt (51.9 KB)

Hi kwindham,
On computer1, may I know if it can run the other TLT Jupyter notebooks (detectnet_v2 and faster-rcnn) well?

Hi Morganh,

Bizarrely, detectnet worked and the loss didn’t blow up. I’ll try faster-rcnn next too. I’ve been getting better results with SSD on the other computer, though, and would still like to get it to work. See my attached output from the detectnet train command. Also, the mAP seems low in the output; is that roughly what I should expect?

Sincerely,
-kwindham
jupyterNoteBookOutput_Detectnet.txt (2.86 MB)

Hi kwindham,
From your result, it proves your computer 1 can work with the detectnet_v2 network, so I think computer 1 meets the TLT requirements. For mAP: since you are running with 8 GPUs, it needs some changes to bs/max_lr/min_lr in the training spec. That's another topic.
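As a rough sketch of that adjustment (the linear-scaling rule below is my assumption, not necessarily the exact recipe TLT applies): with N GPUs the effective batch grows N-fold, so the learning-rate bounds in the spec are commonly scaled by N as well.

```python
# Hypothetical linear-scaling sketch (an assumption, not TLT's documented
# recipe): scale the spec's learning-rate bounds by the GPU-count ratio.
def scale_for_gpus(min_lr, max_lr, num_gpus, base_gpus=1):
    factor = num_gpus / float(base_gpus)
    return min_lr * factor, max_lr * factor

# Example spec values are placeholders, not from the attached files.
min_lr, max_lr = scale_for_gpus(5e-6, 5e-4, num_gpus=8)
print(min_lr, max_lr)  # 4e-05 0.004
```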