SSD terminating training due to invalid loss

Hello, I’m trying to troubleshoot why my detectnet inference is so poor after training. I had already suspected my tfrecords, and this error makes me even more convinced that something is wrong with them.
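As a first sanity check on the tfrecords themselves, something like the short TF 1.x sketch below can dump the feature keys of one record, so the stored fields (image path, class names, box coordinates) can be eyeballed. The shard filename here is just a placeholder, not my actual output path.

import tensorflow as tf

# Placeholder: substitute one of the shard files written by tlt-dataset-convert.
shard = "./tfrecord/<one-of-the-generated-shards>"

for raw in tf.python_io.tf_record_iterator(shard):
    example = tf.train.Example()
    example.ParseFromString(raw)
    # Print the feature keys of the first record so the stored fields can be inspected.
    print(sorted(example.features.feature.keys()))
    break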

OUTPUT:
2020-01-21 21:55:47.611480: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/CornerCoordToCentroids/StridedReplace/strided_slice. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?,1]
2020-01-21 21:55:47.611529: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/CornerCoordToCentroids/strided_slice_4. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611567: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/CornerCoordToCentroids/strided_slice_5. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611607: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/CornerCoordToCentroids/strided_slice_6. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611642: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/CornerCoordToCentroids/strided_slice_7. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611677: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/CornerCoordToCentroids/StridedReplace/strided_slice_1. Error: ValidateStridedSliceOp returned partial shapes [?,0] and [?,0]
2020-01-21 21:55:47.611712: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/strided_slice_6. Error: ValidateStridedSliceOp returned partial shapes [?,4] and [?,4]
2020-01-21 21:55:47.611762: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/strided_slice_5. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611796: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/CornerCoordToCentroids/StridedReplace/strided_slice. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?,1]
2020-01-21 21:55:47.611832: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/CornerCoordToCentroids/strided_slice_4. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611866: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/CornerCoordToCentroids/strided_slice_5. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611902: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/CornerCoordToCentroids/strided_slice_6. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611934: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/CornerCoordToCentroids/strided_slice_7. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611967: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/CornerCoordToCentroids/StridedReplace/strided_slice_1. Error: ValidateStridedSliceOp returned partial shapes [?,0] and [?,0]
2020-01-21 21:55:47.612000: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/strided_slice_6. Error: ValidateStridedSliceOp returned partial shapes [?,4] and [?,4]
2020-01-21 21:55:47.723128: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2020-01-21 21:55:47.754086: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x6f26910
2020-01-21 21:55:47.889884: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x6c45a70
2020-01-21 21:55:48.051511: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2020-01-21 21:55:48.122923: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x760b410
2020-01-21 21:55:48.463480: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x5d728c0
2020-01-21 21:55:49.809505: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2020-01-21 21:55:50.202033: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x6e9e420

ERROR:
2375/2633 [==========================>…] - ETA: 4:20 - loss: nan Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training

Epoch 00001: saving model to output/weights/ssd_resnet18_epoch_001.tlt
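In case it is useful context: a loss going to NaN partway through the first epoch is often a symptom of a too-high learning rate or of degenerate boxes in the labels (zero or negative width/height). Below is a rough sketch for scanning KITTI-format label files for such boxes; the label directory is a placeholder.

import glob
import os

label_dir = "/path/to/kitti/labels"  # placeholder, not my actual path

bad = 0
for path in glob.glob(os.path.join(label_dir, "*.txt")):
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 8:
                continue
            # KITTI columns 4-7 are xmin, ymin, xmax, ymax.
            xmin, ymin, xmax, ymax = map(float, fields[4:8])
            if xmax <= xmin or ymax <= ymin:
                bad += 1
                print("degenerate box:", path, fields[0], xmin, ymin, xmax, ymax)
print("total degenerate boxes:", bad)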

I will attach my training spec, tfrecord spec, and the commands I am running.

COMMANDS:
tlt-dataset-convert -d spec.txt -o ./tfrecord/

tlt-train ssd -e train_val.txt -r output -k <KEY_OMITTED_FOR_THIS_POST> -m /workspace/tlt-experiments/tlt_resnet18_ssd/resnet18.hdf5 --gpus 8
train_val.txt (3.11 KB)
spec.txt (300 Bytes)
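As a cross-check on the conversion itself, a per-shard record count (again TF 1.x; the output directory below is just the placeholder I pass to -o) can be compared against the Train/Val image counts the converter reports:

import glob
import tensorflow as tf

total = 0
# Count the serialized examples in every shard written by tlt-dataset-convert.
for shard in sorted(glob.glob("./tfrecord/*")):
    n = sum(1 for _ in tf.python_io.tf_record_iterator(shard))
    print(shard, n)
    total += n
print("total records:", total)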

Hi kwindham,
Could you please paste the full log when you run tlt-dataset-convert? Thanks.

I’ve attached the text file of the output.
output.txt (1.27 MB)

Sorry, here is the output from the tlt-dataset-convert command:

tlt-dataset-convert -d spec.txt -o ./tfrecord/
OUTPUT:
Using TensorFlow backend.
2020-01-22 20:04:23,108 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2020-01-22 20:04:23,936 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 168524 Val: 42130
2020-01-22 20:04:23,936 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2020-01-22 20:04:24,044 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
/usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:266: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2020-01-22 20:04:36,038 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2020-01-22 20:04:48,011 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2020-01-22 20:04:59,950 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2020-01-22 20:05:11,930 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 4
2020-01-22 20:05:23,895 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 5
2020-01-22 20:05:35,854 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 6
2020-01-22 20:05:47,846 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 7
2020-01-22 20:05:59,819 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 8
2020-01-22 20:06:11,746 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 9
2020-01-22 20:06:23,717 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
vehicle0: 297086
vehicle1: 318758
vehicle2: 9671
vehicle3: 14658
vehicle4: 71180
vehicle5: 7708
vehicle6: 106519
vehicle7: 36527
vehicle8: 58832
vehicle9: 34547
vehicle10: 206180
vehicle11: 112677
vehicle12: 20425
vehicle13: 8259
vehicle14: 24
vehicle15: 142809
vehicle16: 19740
vehicle17: 15363

2020-01-22 20:06:23,717 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2020-01-22 20:07:11,546 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2020-01-22 20:07:59,550 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2020-01-22 20:08:47,414 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2020-01-22 20:09:35,437 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 4
2020-01-22 20:10:23,497 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 5
2020-01-22 20:11:11,370 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 6
2020-01-22 20:11:59,226 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 7
2020-01-22 20:12:47,308 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 8
2020-01-22 20:13:35,484 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
2020-01-22 20:14:23,500 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
vehicle0: 1190116
vehicle1: 1273354
vehicle2: 38521
vehicle3: 58908
vehicle4: 284560
vehicle5: 30818
vehicle6: 425453
vehicle7: 144559
vehicle8: 234238
vehicle9: 138685
vehicle10: 826024
vehicle11: 451317
vehicle12: 82085
vehicle13: 32883
vehicle14: 42
vehicle15: 570141
vehicle16: 78636
vehicle17: 61599

2020-01-22 20:14:23,503 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2020-01-22 20:14:23,503 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
vehicle0: 1487202
vehicle1: 1592112
vehicle2: 48192
vehicle3: 73566
vehicle4: 355740
vehicle5: 38526
vehicle6: 531972
vehicle7: 181086
vehicle8: 293070
vehicle9: 173232
vehicle10: 1032204
vehicle11: 563994
vehicle12: 102510
vehicle13: 41142
vehicle14: 66
vehicle15: 712950
vehicle16: 98376
vehicle17: 76962

2020-01-22 20:14:23,503 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
vehicle0: vehicle0
vehicle1: vehicle1
vehicle2: vehicle2
vehicle3: vehicle3
vehicle4: vehicle4
vehicle5: vehicle5
vehicle6: vehicle6
vehicle7: vehicle7
vehicle8: vehicle8
vehicle9: vehicle9
vehicle10: vehicle10
vehicle11: vehicle11
vehicle12: vehicle12
vehicle13: vehicle13
vehicle14: vehicle14
vehicle15: vehicle15
vehicle16: vehicle16
vehicle17: vehicle17
2020-01-22 20:14:23,503 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

I will note that I have 22 classes defined, but 4 of them have no instances in the label files. Could this cause a conflict?

Sincerely,
-kwindham

Since there are only 18 classes in your tfrecords, please modify your previous training spec so that it only references those 18 classes, then retry.
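If you want to double-check which class names actually occur before editing the spec, a small sketch like the one below (the label directory is a placeholder) lists the distinct class names in the KITTI label files:

import glob
import os

label_dir = "/path/to/kitti/labels"  # placeholder

classes = set()
for path in glob.glob(os.path.join(label_dir, "*.txt")):
    with open(path) as f:
        for line in f:
            fields = line.split()
            if fields:
                # First KITTI column is the object class name.
                classes.add(fields[0])
print(len(classes), sorted(classes))

Only the classes that show up here (the 18 in your conversion log) should be referenced in the training spec.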