Additional Notes:
I thought my kitti folder on Computer1 might be corrupt, so after running once on Computer1 I saved the output as computer1_terminal_out_first_run.txt. I then copied the kitti directory over from Computer2 and ran again on Computer1 (after, of course, doing another tfrecord conversion on the new directory; I have also attached that spec file as spec.txt). On the second run the same issue happened again, as shown in computer1_terminal_out_second_run.txt. As you can see, the commands are nearly identical, as are the training files. What is going on?
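For reference, the conversion on each machine was the standard tlt-dataset-convert call; it looked roughly like the following, with the paths being placeholders for my setup:
!tlt-dataset-convert -d $SPECS_DIR/spec.txt \
                     -o $DATA_DOWNLOAD_DIR/tfrecords/kitti_trainval/kitti_trainval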
I tried both of these troubleshooting steps with the same results. I have attached the output to this message. Note that my Docker versions differ between the two machines:
Computer1
Docker version: 19.03
API version: 1.40
Computer2
Docker version: 18.09.6
API version: 1.39
Hi kwindham,
Could you please run a cross-check on Computer1?
You can use the default training spec inside the docker and train on the default dataset, i.e., the KITTI dataset, to see whether training works with it.
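For example, something like the following, using the spec shipped inside the container (the exact in-container paths can vary between TLT versions, so treat these as illustrative):
tlt-train ssd -e /workspace/examples/ssd/specs/ssd_train_resnet18_kitti.txt \
              -r /workspace/output/ssd_default_check \
              -k $KEY \
              -m /path/to/pretrained/resnet18.hdf5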
What do you mean by running a cross-check on Computer1?
I’ve tried my best to replicate everything from Computer2 on Computer1. I have installed NVIDIA driver 410.129 with CUDA 10.0 and even installed the older version of the docker container. Regardless of these changes, I am still getting loss values that eventually hit NaN. I’m running out of ideas.
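For what it’s worth, this is how I verified that the driver and CUDA versions match on both machines (the second path assumes a default CUDA 10.0 install):
nvidia-smi | head -n 3              # driver version appears in the banner
cat /usr/local/cuda/version.txt     # reports the installed CUDA toolkit version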
Where can I download the default KITTI dataset to test this? Last time I downloaded from the official KITTI site, the labels did not match the images. Do you have a direct download link so there is no confusion?
I just tried the Jupyter notebook example for SSD and I’m getting the same result. I went through each block, placed all of the files in the appropriate volumes, and the tfrecord conversion succeeded. I’ve attached a text file containing the output from the 14th block.
So I moved on to the 15th block, and everything looked great until the loss hit NaN again. I’ve attached the multi-GPU run’s output, since it is novel for this computer to get through multiple epochs with decreasing loss; perhaps there is a clue in there that will help you help me resolve this issue.
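In case it’s a data problem, I also ran a quick sanity check of my own over the KITTI label files to rule out degenerate boxes, which I understand can drive a loss to NaN. This is just my own diagnostic script, and label_dir is a placeholder for my mount point:
import glob, os

label_dir = "/workspace/tlt-experiments/data/training/label_2"  # placeholder path
for path in glob.glob(os.path.join(label_dir, "*.txt")):
    with open(path) as f:
        for lineno, line in enumerate(f, 1):
            fields = line.split()
            if len(fields) < 8:
                print(f"{path}:{lineno}: too few fields")
                continue
            # KITTI label format: fields 4-7 are xmin, ymin, xmax, ymax
            xmin, ymin, xmax, ymax = map(float, fields[4:8])
            if xmax <= xmin or ymax <= ymin:
                print(f"{path}:{lineno}: degenerate box ({xmin}, {ymin}, {xmax}, {ymax})")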
Here is the command in the 15th block of the Jupyter notebook:
print("For multi-GPU, please uncomment and run this instead. Change --gpus based on your machine.")
!tlt-train ssd -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
               -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
               -k $KEY \
               -m $USER_EXPERIMENT_DIR/pretrained_resnet18/tlt_resnet18_ssd_v1/resnet18.hdf5 \
               --gpus 8
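For comparison, the single-GPU cell just above it in the notebook is the same command without the multi-GPU flag (my understanding is that --gpus defaults to 1):
!tlt-train ssd -e $SPECS_DIR/ssd_train_resnet18_kitti.txt \
               -r $USER_EXPERIMENT_DIR/experiment_dir_unpruned \
               -k $KEY \
               -m $USER_EXPERIMENT_DIR/pretrained_resnet18/tlt_resnet18_ssd_v1/resnet18.hdf5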
What is your CPU information on Computer1?
Possibly you are using a CPU type that the TensorFlow package in the TLT container does not support.
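For example, you can check whether the CPU exposes the AVX instruction sets, since prebuilt TensorFlow binaries (1.6 and later) are compiled with AVX:
grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u   # empty output means no AVX support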
Thanks for your help and patience with this. I’ve attached my /proc/cpuinfo output. I believe it meets the requirements listed at the link you sent me.
Bizarrely, detectnet worked and the loss didn’t blow up. I’ll try faster-rcnn next as well. I’ve been getting better results with SSD on the other computer, though, and would still like to get it working. See my attached output from the detectnet train command. Also, the mAP in that output seems low; is that roughly what I should expect?
Hi kwindham,
Your result shows that Computer1 can train the detectnet_v2 network, so I think Computer1 meets the TLT requirements. As for the mAP: since you are running with 8 GPUs, the bs/max_lr/min_lr values in the training spec need adjusting. That’s another topic.
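As a rough sketch: with 8 GPUs the effective batch size is 8 x batch_size_per_gpu, so the learning-rate window usually needs to scale with it. The field names below follow the SSD training spec; the values are illustrative, not a verified recipe:
training_config {
  batch_size_per_gpu: 16            # effective batch size = 16 x 8 GPUs = 128
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-5       # illustrative; scale with effective batch size
      max_learning_rate: 2e-2       # illustrative
      soft_start: 0.15
      annealing: 0.8
    }
  }
}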