Loss becomes NaN during tlt ssd train

Hi, below is my error log:

Epoch 00001: saving model to /workspace/mobilenetv2/custom_ssd_mobilenetv2/result/weights/ssd_mobilenet_v2_epoch_001.tlt
Producing predictions: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 334/334 [00:21<00:00, 15.57it/s]
Start to calculate AP for each class
*******************************
person        AP    4e-05
              mAP   4e-05
*******************************
Validation loss: 103.49606128643977
Epoch 2/30
1164/1164 [==============================] - 470s 404ms/step - loss: 7.6423

Epoch 00002: saving model to /workspace/mobilenetv2/custom_ssd_mobilenetv2/result/weights/ssd_mobilenet_v2_epoch_002.tlt
Producing predictions: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 334/334 [00:20<00:00, 16.32it/s]
Start to calculate AP for each class
*******************************
person        AP    1e-05
              mAP   1e-05
*******************************
Validation loss: 105.26585391331506
Epoch 3/30
1164/1164 [==============================] - 466s 400ms/step - loss: 7.0734

Epoch 00003: saving model to /workspace/mobilenetv2/custom_ssd_mobilenetv2/result/weights/ssd_mobilenet_v2_epoch_003.tlt
Producing predictions: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 334/334 [00:20<00:00, 16.48it/s]
Start to calculate AP for each class
*******************************
person        AP    2e-05
              mAP   2e-05
*******************************
Validation loss: 95.36924472073372
Epoch 4/30
1164/1164 [==============================] - 462s 397ms/step - loss: 7.0987

Epoch 00004: saving model to /workspace/mobilenetv2/custom_ssd_mobilenetv2/result/weights/ssd_mobilenet_v2_epoch_004.tlt
Producing predictions: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 334/334 [00:20<00:00, 16.28it/s]
Start to calculate AP for each class
*******************************
person        AP    0.00046
              mAP   0.00046
*******************************
Validation loss: 817.7595584580349
Epoch 5/30
1164/1164 [==============================] - 467s 401ms/step - loss: 101.5039

Epoch 00005: saving model to /workspace/mobilenetv2/custom_ssd_mobilenetv2/result/weights/ssd_mobilenet_v2_epoch_005.tlt
Producing predictions: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████| 334/334 [00:20<00:00, 16.25it/s]
Start to calculate AP for each class
*******************************
person        AP    3e-05
              mAP   3e-05
*******************************
Validation loss: 2493.6795383461954
Epoch 6/30
 100/1164 [=>............................] - ETA: 6:04 - loss: nan
Batch 99: Invalid loss, terminating training

Here is my spec file:
ssd_train_mobilenetv2_kittt.txt (1.4 KB)

And the command I ran:

$ tlt ssd train -e /workspace/mobilenetv2/custom_ssd_mobilenetv2/specs/ssd_train_mobilenetv2_kitti.txt -r /workspace/mobilenetv2/custom_ssd_mobilenetv2/result/ -k tlt_encode --gpus 1

My dataset info:

  • 300x300 jpeg images
  • KITTI annotations

I also tried changing the batch size, but the result was similar.

Which part of the spec should I modify?

Any help would be appreciated.

Thanks.

Please try setting a larger max_lr or a smaller batch size (bs) in the spec file.
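
For reference, both settings usually sit in the training_config block of a TLT SSD spec file. The fragment below is only a sketch. The attached spec is not shown in the thread, so the field names follow the layout documented for TLT SSD specs, assuming max_lr means max_learning_rate and bs means batch_size_per_gpu. Every value is a placeholder, not a recommendation and not the poster's actual setting.

```
# Hypothetical excerpt of ssd_train_mobilenetv2_kitti.txt (placeholder values)
training_config {
  batch_size_per_gpu: 8            # "bs" in the reply above
  num_epochs: 30                   # matches "Epoch N/30" in the log
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-5
      max_learning_rate: 2e-2      # "max_lr" in the reply above
      soft_start: 0.15             # fraction of training spent ramping up to max_learning_rate
      annealing: 0.8               # fraction of training after which the rate starts to decay
    }
  }
}
```

Changing one value at a time and re-running the same tlt ssd train command makes it easier to tell which setting affects the point where the loss turns into NaN.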