Hello, I’m trying to troubleshoot why my detectnet inference after training is awful. I had assumed the problem was my tfrecords, and the NaN loss shown below makes me even more convinced that something is wrong with them.
OUTPUT:
2020-01-21 21:55:47.611480: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/CornerCoordToCentroids/StridedReplace/strided_slice. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?,1]
2020-01-21 21:55:47.611529: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/CornerCoordToCentroids/strided_slice_4. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611567: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/CornerCoordToCentroids/strided_slice_5. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611607: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/CornerCoordToCentroids/strided_slice_6. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611642: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/CornerCoordToCentroids/strided_slice_7. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611677: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/CornerCoordToCentroids/StridedReplace/strided_slice_1. Error: ValidateStridedSliceOp returned partial shapes [?,0] and [?,0]
2020-01-21 21:55:47.611712: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_14/strided_slice_6. Error: ValidateStridedSliceOp returned partial shapes [?,4] and [?,4]
2020-01-21 21:55:47.611762: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/strided_slice_5. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611796: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/CornerCoordToCentroids/StridedReplace/strided_slice. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?,1]
2020-01-21 21:55:47.611832: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/CornerCoordToCentroids/strided_slice_4. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611866: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/CornerCoordToCentroids/strided_slice_5. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611902: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/CornerCoordToCentroids/strided_slice_6. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611934: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/CornerCoordToCentroids/strided_slice_7. Error: ValidateStridedSliceOp returned partial shapes [?,1] and [?]
2020-01-21 21:55:47.611967: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/CornerCoordToCentroids/StridedReplace/strided_slice_1. Error: ValidateStridedSliceOp returned partial shapes [?,0] and [?,0]
2020-01-21 21:55:47.612000: W ./tensorflow/core/grappler/optimizers/graph_optimizer_stage.h:241] Failed to run optimizer ArithmeticOptimizer, stage RemoveStackStridedSliceSameAxis node cond_15/strided_slice_6. Error: ValidateStridedSliceOp returned partial shapes [?,4] and [?,4]
2020-01-21 21:55:47.723128: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2020-01-21 21:55:47.754086: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x6f26910
2020-01-21 21:55:47.889884: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x6c45a70
2020-01-21 21:55:48.051511: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2020-01-21 21:55:48.122923: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x760b410
2020-01-21 21:55:48.463480: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x5d728c0
2020-01-21 21:55:49.809505: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2020-01-21 21:55:50.202033: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x6e9e420
ERROR:
2375/2633 [==========================>…] - ETA: 4:20 - loss: nan Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training
Batch 2374: Invalid loss, terminating training
Epoch 00001: saving model to output/weights/ssd_resnet18_epoch_001.tlt
I have attached my training spec and tfrecord spec, and the commands I am running are listed below.
COMMANDS:
tlt-dataset-convert -d spec.txt -o ./tfrecord/
tlt-train ssd -e train_val.txt -r output -k <KEY_OMITTED_FOR_THIS_POST> -m /workspace/tlt-experiments/tlt_resnet18_ssd/resnet18.hdf5 --gpus 8
train_val.txt (3.11 KB)
spec.txt (300 Bytes)
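Since I suspect the tfrecords, one sanity check I could run on the source labels before conversion is sketched below. It is only a rough sketch under assumptions: it assumes the labels are KITTI-format text files (class name first, bounding box as xmin ymin xmax ymax in fields 5 to 8), and the function name `find_bad_kitti_boxes` is made up for illustration, not part of TLT. It flags boxes with zero or negative size, negative coordinates, or non-numeric values, which are the kind of label problems that might lead to an invalid loss. It does not prove that the labels are the cause of the NaN.

```python
import math


def find_bad_kitti_boxes(label_text):
    """Return (line_number, reason) pairs for suspicious boxes.

    label_text is the full text of one label file, assumed to be in
    KITTI format: type truncated occluded alpha xmin ymin xmax ymax ...
    """
    problems = []
    for line_no, line in enumerate(label_text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # ignore blank lines
        if len(fields) < 8:
            problems.append((line_no, "too few fields"))
            continue
        try:
            # Fields 4..7 (0-based) hold the 2D box corners.
            xmin, ymin, xmax, ymax = (float(v) for v in fields[4:8])
        except ValueError:
            problems.append((line_no, "non-numeric bbox"))
            continue
        if not all(math.isfinite(v) for v in (xmin, ymin, xmax, ymax)):
            problems.append((line_no, "non-finite bbox"))
        elif xmax <= xmin or ymax <= ymin:
            problems.append((line_no, "zero or negative box size"))
        elif min(xmin, ymin) < 0:
            problems.append((line_no, "negative coordinate"))
    return problems
```

To use it, I would read each label file under the dataset's label directory, pass its text to the function, and print any file that returns a non-empty list.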