Training a pre-trained detectnet_v2 resnet18 model with TLT failed

yvainrouchaud24800 · March 9, 2020, 10:24am

Hi, I am trying to train a pre-trained detectnet_v2 resnet18 model with the Transfer Learning Toolkit. I do not understand the log I get at step 0: it seems that some images are missing, but when I check the folder, they are in the right place!

Here it is:
tlt-train detectnet_v2 -e /workspace/tlt-experiments/specs_files/spec_file -r /workspace/tlt-experiments/trained_models -n tlt_resnet18_detectnet_v2_v1 -k bDBwdWFsb2g4YTNvdWdnbTVhdnQ3cWpqZToyNDU2MTg3Ny02NjE5LTQ1NWEtODg5Mi1iZTg1YmY2NDc2NmQ
Using TensorFlow backend.
2020-03-09 10:09:11.304210: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-03-09 10:09:11.386041: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:998] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-03-09 10:09:11.387408: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x64f2570 executing computations on platform CUDA. Devices:
2020-03-09 10:09:11.387434: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): GeForce GTX 970M, Compute Capability 5.2
2020-03-09 10:09:11.410126: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2494110000 Hz
2020-03-09 10:09:11.410945: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x660a730 executing computations on platform Host. Devices:
2020-03-09 10:09:11.410998: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
2020-03-09 10:09:11.411272: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce GTX 970M major: 5 minor: 2 memoryClockRate(GHz): 1.038
pciBusID: 0000:01:00.0
totalMemory: 2.95GiB freeMemory: 2.55GiB
2020-03-09 10:09:11.411336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-03-09 10:09:11.412916: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-09 10:09:11.412960: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-03-09 10:09:11.412988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-03-09 10:09:11.413155: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2316 MB memory) -> physical GPU (device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0, compute capability: 5.2)
2020-03-09 10:09:11,414 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at /workspace/tlt-experiments/specs_files/spec_file.
2020-03-09 10:09:11,416 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/tlt-experiments/specs_files/spec_file
WARNING:tensorflow:From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
tf.data.TFRecordDataset(path)
2020-03-09 10:09:11,430 [WARNING] tensorflow: From ./detectnet_v2/dataloader/utilities.py:114: tf_record_iterator (from tensorflow.python.lib.io.tf_record) is deprecated and will be removed in a future version.
Instructions for updating:
Use eager execution and:
tf.data.TFRecordDataset(path)
2020-03-09 10:09:11,543 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 5985 samples with a batch size of 16; each epoch will therefore take one extra step.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-03-09 10:09:11,550 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From /usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py:91: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
2020-03-09 10:09:11,565 [WARNING] tensorflow: From /usr/local/lib/python2.7/dist-packages/horovod/tensorflow/__init__.py:91: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.

Layer (type)                    Output Shape         Param #     Connected to
input_1 (InputLayer)            (None, 3, 370, 1224) 0
conv1 (Conv2D)                  (None, 64, 185, 612) 9472        input_1[0][0]
bn_conv1 (BatchNormalization)   (None, 64, 185, 612) 256         conv1[0][0]
activation_1 (Activation)       (None, 64, 185, 612) 0           bn_conv1[0][0]
block_1a_conv_1 (Conv2D)        (None, 64, 93, 306)  36928       activation_1[0][0]
block_1a_bn_1 (BatchNormalizati (None, 64, 93, 306)  256         block_1a_conv_1[0][0]
activation_2 (Activation)       (None, 64, 93, 306)  0           block_1a_bn_1[0][0]
block_1a_conv_2 (Conv2D)        (None, 64, 93, 306)  36928       activation_2[0][0]
block_1a_conv_shortcut (Conv2D) (None, 64, 93, 306)  4160        activation_1[0][0]
block_1a_bn_2 (BatchNormalizati (None, 64, 93, 306)  256         block_1a_conv_2[0][0]
block_1a_bn_shortcut (BatchNorm (None, 64, 93, 306)  256         block_1a_conv_shortcut[0][0]
add_1 (Add)                     (None, 64, 93, 306)  0           block_1a_bn_2[0][0]
                                                                 block_1a_bn_shortcut[0][0]
activation_3 (Activation)       (None, 64, 93, 306)  0           add_1[0][0]
block_1b_conv_1 (Conv2D)        (None, 64, 93, 306)  36928       activation_3[0][0]
block_1b_bn_1 (BatchNormalizati (None, 64, 93, 306)  256         block_1b_conv_1[0][0]
activation_4 (Activation)       (None, 64, 93, 306)  0           block_1b_bn_1[0][0]
block_1b_conv_2 (Conv2D)        (None, 64, 93, 306)  36928       activation_4[0][0]
block_1b_conv_shortcut (Conv2D) (None, 64, 93, 306)  4160        activation_3[0][0]
block_1b_bn_2 (BatchNormalizati (None, 64, 93, 306)  256         block_1b_conv_2[0][0]
block_1b_bn_shortcut (BatchNorm (None, 64, 93, 306)  256         block_1b_conv_shortcut[0][0]
add_2 (Add)                     (None, 64, 93, 306)  0           block_1b_bn_2[0][0]
                                                                 block_1b_bn_shortcut[0][0]
activation_5 (Activation)       (None, 64, 93, 306)  0           add_2[0][0]
block_2a_conv_1 (Conv2D)        (None, 128, 47, 153) 73856       activation_5[0][0]
block_2a_bn_1 (BatchNormalizati (None, 128, 47, 153) 512         block_2a_conv_1[0][0]
activation_6 (Activation)       (None, 128, 47, 153) 0           block_2a_bn_1[0][0]
block_2a_conv_2 (Conv2D)        (None, 128, 47, 153) 147584      activation_6[0][0]
block_2a_conv_shortcut (Conv2D) (None, 128, 47, 153) 8320        activation_5[0][0]
block_2a_bn_2 (BatchNormalizati (None, 128, 47, 153) 512         block_2a_conv_2[0][0]
block_2a_bn_shortcut (BatchNorm (None, 128, 47, 153) 512         block_2a_conv_shortcut[0][0]
add_3 (Add)                     (None, 128, 47, 153) 0           block_2a_bn_2[0][0]
                                                                 block_2a_bn_shortcut[0][0]
activation_7 (Activation)       (None, 128, 47, 153) 0           add_3[0][0]
block_2b_conv_1 (Conv2D)        (None, 128, 47, 153) 147584      activation_7[0][0]
block_2b_bn_1 (BatchNormalizati (None, 128, 47, 153) 512         block_2b_conv_1[0][0]
activation_8 (Activation)       (None, 128, 47, 153) 0           block_2b_bn_1[0][0]
block_2b_conv_2 (Conv2D)        (None, 128, 47, 153) 147584      activation_8[0][0]
block_2b_conv_shortcut (Conv2D) (None, 128, 47, 153) 16512       activation_7[0][0]
block_2b_bn_2 (BatchNormalizati (None, 128, 47, 153) 512         block_2b_conv_2[0][0]
block_2b_bn_shortcut (BatchNorm (None, 128, 47, 153) 512         block_2b_conv_shortcut[0][0]
add_4 (Add)                     (None, 128, 47, 153) 0           block_2b_bn_2[0][0]
                                                                 block_2b_bn_shortcut[0][0]
activation_9 (Activation)       (None, 128, 47, 153) 0           add_4[0][0]
block_3a_conv_1 (Conv2D)        (None, 256, 24, 77)  295168      activation_9[0][0]
block_3a_bn_1 (BatchNormalizati (None, 256, 24, 77)  1024        block_3a_conv_1[0][0]
activation_10 (Activation)      (None, 256, 24, 77)  0           block_3a_bn_1[0][0]
block_3a_conv_2 (Conv2D)        (None, 256, 24, 77)  590080      activation_10[0][0]
block_3a_conv_shortcut (Conv2D) (None, 256, 24, 77)  33024       activation_9[0][0]
block_3a_bn_2 (BatchNormalizati (None, 256, 24, 77)  1024        block_3a_conv_2[0][0]
block_3a_bn_shortcut (BatchNorm (None, 256, 24, 77)  1024        block_3a_conv_shortcut[0][0]
add_5 (Add)                     (None, 256, 24, 77)  0           block_3a_bn_2[0][0]
                                                                 block_3a_bn_shortcut[0][0]
activation_11 (Activation)      (None, 256, 24, 77)  0           add_5[0][0]
block_3b_conv_1 (Conv2D)        (None, 256, 24, 77)  590080      activation_11[0][0]
block_3b_bn_1 (BatchNormalizati (None, 256, 24, 77)  1024        block_3b_conv_1[0][0]
activation_12 (Activation)      (None, 256, 24, 77)  0           block_3b_bn_1[0][0]
block_3b_conv_2 (Conv2D)        (None, 256, 24, 77)  590080      activation_12[0][0]
block_3b_conv_shortcut (Conv2D) (None, 256, 24, 77)  65792       activation_11[0][0]
block_3b_bn_2 (BatchNormalizati (None, 256, 24, 77)  1024        block_3b_conv_2[0][0]
block_3b_bn_shortcut (BatchNorm (None, 256, 24, 77)  1024        block_3b_conv_shortcut[0][0]
add_6 (Add)                     (None, 256, 24, 77)  0           block_3b_bn_2[0][0]
                                                                 block_3b_bn_shortcut[0][0]
activation_13 (Activation)      (None, 256, 24, 77)  0           add_6[0][0]
block_4a_conv_1 (Conv2D)        (None, 512, 24, 77)  1180160     activation_13[0][0]
block_4a_bn_1 (BatchNormalizati (None, 512, 24, 77)  2048        block_4a_conv_1[0][0]
activation_14 (Activation)      (None, 512, 24, 77)  0           block_4a_bn_1[0][0]
block_4a_conv_2 (Conv2D)        (None, 512, 24, 77)  2359808     activation_14[0][0]
block_4a_conv_shortcut (Conv2D) (None, 512, 24, 77)  131584      activation_13[0][0]
block_4a_bn_2 (BatchNormalizati (None, 512, 24, 77)  2048        block_4a_conv_2[0][0]
block_4a_bn_shortcut (BatchNorm (None, 512, 24, 77)  2048        block_4a_conv_shortcut[0][0]
add_7 (Add)                     (None, 512, 24, 77)  0           block_4a_bn_2[0][0]
                                                                 block_4a_bn_shortcut[0][0]
activation_15 (Activation)      (None, 512, 24, 77)  0           add_7[0][0]
block_4b_conv_1 (Conv2D)        (None, 512, 24, 77)  2359808     activation_15[0][0]
block_4b_bn_1 (BatchNormalizati (None, 512, 24, 77)  2048        block_4b_conv_1[0][0]
activation_16 (Activation)      (None, 512, 24, 77)  0           block_4b_bn_1[0][0]
block_4b_conv_2 (Conv2D)        (None, 512, 24, 77)  2359808     activation_16[0][0]
block_4b_conv_shortcut (Conv2D) (None, 512, 24, 77)  262656      activation_15[0][0]
block_4b_bn_2 (BatchNormalizati (None, 512, 24, 77)  2048        block_4b_conv_2[0][0]
block_4b_bn_shortcut (BatchNorm (None, 512, 24, 77)  2048        block_4b_conv_shortcut[0][0]
add_8 (Add)                     (None, 512, 24, 77)  0           block_4b_bn_2[0][0]
                                                                 block_4b_bn_shortcut[0][0]
activation_17 (Activation)      (None, 512, 24, 77)  0           add_8[0][0]
output_bbox (Conv2D)            (None, 12, 24, 77)   6156        activation_17[0][0]
output_cov (Conv2D)             (None, 3, 24, 77)    1539        activation_17[0][0]

Total params: 11,555,983
Trainable params: 11,378,831
Non-trainable params: 177,152

target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2020-03-09 10:09:35,958 [INFO] iva.detectnet_v2.scripts.train: Found 5985 samples in training set
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
2020-03-09 10:09:52,850 [INFO] iva.detectnet_v2.scripts.train: Found 1496 samples in validation set
INFO:tensorflow:Create CheckpointSaverHook.
2020-03-09 10:10:08,332 [INFO] tensorflow: Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
2020-03-09 10:10:09,681 [INFO] tensorflow: Graph was finalized.
2020-03-09 10:10:09.681988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
2020-03-09 10:10:09.682041: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-03-09 10:10:09.682073: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0
2020-03-09 10:10:09.682082: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N
2020-03-09 10:10:09.682193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 2316 MB memory) -> physical GPU (device: 0, name: GeForce GTX 970M, pci bus id: 0000:01:00.0, compute capability: 5.2)
INFO:tensorflow:Running local_init_op.
2020-03-09 10:10:13,846 [INFO] tensorflow: Running local_init_op.
INFO:tensorflow:Done running local_init_op.
2020-03-09 10:10:14,705 [INFO] tensorflow: Done running local_init_op.
INFO:tensorflow:Saving checkpoints for step-0.
2020-03-09 10:10:46,638 [INFO] tensorflow: Saving checkpoints for step-0.
2020-03-09 10:11:58.735496: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2020-03-09 10:11:59.263596: I tensorflow/core/kernels/cuda_solvers.cc:159] Creating CudaSolver handles for stream 0x664c0c0
2020-03-09 10:12:00.055130: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/006665.png; No such file or directory
2020-03-09 10:12:00.055158: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/001510.png; No such file or directory
2020-03-09 10:12:00.055181: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/006800.png; No such file or directory
2020-03-09 10:12:00.055194: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/006936.png; No such file or directory
2020-03-09 10:12:00.055258: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/003411.png; No such file or directory
2020-03-09 10:12:00.055265: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/000824.png; No such file or directory
2020-03-09 10:12:00.055379: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/005739.png; No such file or directory
2020-03-09 10:12:00.055402: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/003206.png; No such file or directory
2020-03-09 10:12:00.055539: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/004149.png; No such file or directory
2020-03-09 10:12:00.055658: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/003523.png; No such file or directory
2020-03-09 10:12:00.055850: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/004978.png; No such file or directory
2020-03-09 10:12:00.056028: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/000857.png; No such file or directory
2020-03-09 10:12:00.056222: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/001433.png; No such file or directory
2020-03-09 10:12:00.056395: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/006643.png; No such file or directory
2020-03-09 10:12:00.056569: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/002397.png; No such file or directory
2020-03-09 10:12:00.056753: W tensorflow/core/framework/op_kernel.cc:1401] OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: /workspace/tlt-experiments/image_3resized/image_2/000235.png; No such file or directory
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-2>", line 2, in main
  File "./detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "./detectnet_v2/scripts/train.py", line 633, in main
  File "./detectnet_v2/scripts/train.py", line 557, in run_experiment
  File "./detectnet_v2/scripts/train.py", line 491, in train_gridbox
  File "./detectnet_v2/scripts/train.py", line 136, in run_training_loop
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: /workspace/tlt-experiments/image_3resized/image_2/006665.png; No such file or directory
    [[node LoadFile_13/ReadFile (defined at ./modulus/processors/load_file.py:40) ]]

Caused by op u'LoadFile_13/ReadFile', defined at:
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 37, in main
  File "</usr/local/lib/python2.7/dist-packages/decorator.pyc:decorator-gen-2>", line 2, in main
  File "./detectnet_v2/utilities/timer.py", line 46, in wrapped_fn
  File "./detectnet_v2/scripts/train.py", line 633, in main
  File "./detectnet_v2/scripts/train.py", line 557, in run_experiment
  File "./detectnet_v2/scripts/train.py", line 467, in train_gridbox
  File "./detectnet_v2/scripts/train.py", line 297, in build_training_graph
  File "./detectnet_v2/dataloader/default_dataloader.py", line 203, in get_dataset_tensors
  File "./detectnet_v2/dataloader/default_dataloader.py", line 244, in _generate_images_and_ground_truth_labels
  File "./detectnet_v2/dataloader/default_dataloader.py", line 392, in _load_input_tensors
  File "./detectnet_v2/dataloader/utilities.py", line 272, in read_image
  File "./modulus/processors/processors.py", line 227, in call
  File "./modulus/processors/load_file.py", line 40, in call
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_io_ops.py", line 589, in read_file
    "ReadFile", filename=filename, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

NotFoundError (see above for traceback): /workspace/tlt-experiments/image_3resized/image_2/006665.png; No such file or directory
    [[node LoadFile_13/ReadFile (defined at ./modulus/processors/load_file.py:40) ]]
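
One way to double-check what the loader is asking for is to pull the paths out of the "Not found:" warnings and test them from inside the container. The following is only a minimal sketch and not part of TLT: `missing_paths` and `report` are hypothetical helper names, and the regular expression assumes nothing more than the `Not found: <path>; No such file or directory` shape seen in the log above (paths containing spaces would not match).

```python
import os
import re

# Matches the "Not found: <path>; No such file or directory" fragment
# that TensorFlow prints in the warnings above.
NOT_FOUND = re.compile(r"Not found: (?P<path>\S+); No such file or directory")


def missing_paths(log_text):
    """Return the unique paths reported as missing, in first-seen order."""
    seen = []
    for match in NOT_FOUND.finditer(log_text):
        path = match.group("path")
        if path not in seen:
            seen.append(path)
    return seen


def report(log_text):
    """Map each reported path to whether it exists on this filesystem."""
    return {path: os.path.exists(path) for path in missing_paths(log_text)}


if __name__ == "__main__":
    sample = (
        "OP_REQUIRES failed at whole_file_read_ops.cc:114 : Not found: "
        "/workspace/tlt-experiments/image_3resized/image_2/006665.png; "
        "No such file or directory"
    )
    for path, exists in report(sample).items():
        print(path, "exists" if exists else "MISSING")
```

Running this against the full training log, in the same container and with the same mounts as tlt-train, shows whether the exact paths in the warnings resolve, as opposed to only checking that the parent folder is there.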

The spec file used to drive the training is:

# Sample model config to instantiate a resnet18 model with pretrained weights and freeze blocks 0, 1
# with all shortcuts having projection layers.
model_config {
  arch: "resnet"
  pretrained_model_file: "/workspace/tlt_resnet18_detectnet_v2_v1/resnet18.hdf5"
  freeze_blocks: 0
  freeze_blocks: 1
  all_projections: True
  num_layers: 18
  use_pooling: False
  use_batch_norm: True
  dropout_rate: 0.0
  training_precision: {
    backend_floatx: FLOAT32
  }
  objective_set: {
    cov {}
    bbox {
      scale: 35.0
      offset: 0.5
    }
  }
}

# Sample rasterizer configs to instantiate a 3 class bbox rasterizer
bbox_rasterizer_config {
  target_class_config {
    key: "car"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "cyclist"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  target_class_config {
    key: "pedestrian"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}

postprocessing_config {
  target_class_config {
    key: "car"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "cyclist"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 20
      }
    }
  }
  target_class_config {
    key: "pedestrian"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 20
      }
    }
  }
}
cost_function_config {
  target_classes {
    name: "car"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  target_classes {
    name: "cyclist"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 1.0
    }
  }
  target_classes {
    name: "pedestrian"
    class_weight: 1.0
    coverage_foreground_weight: 0.05
    objectives {
      name: "cov"
      initial_weight: 1.0
      weight_target: 1.0
    }
    objectives {
      name: "bbox"
      initial_weight: 10.0
      weight_target: 10.0
    }
  }
  enable_autoweighting: True
  max_objective_weight: 0.9999
  min_objective_weight: 0.0001
}

training_config {
  batch_size_per_gpu: 16
  num_epochs: 80
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-6
      max_learning_rate: 5e-4
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-9
  }
  optimizer {
    adam {
      epsilon: 1e-08
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    enabled: False
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
}

# Sample augmentation config
augmentation_config {
  preprocessing {
    output_image_width: 1224
    output_image_height: 370
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    color_shift_stddev: 0.0
    hue_rotation_max: 25.0
    saturation_shift_max: 0.2
    contrast_scale_max: 0.1
    contrast_center: 0.5
  }
}

# Sample evaluation config to run evaluation in integrate mode for the given 3 class model,
# at every 10th epoch starting from the epoch 1.
evaluation_config {
  average_precision_mode: INTEGRATE
  validation_period_during_training: 10
  first_validation_epoch: 1
  minimum_detection_ground_truth_overlap {
    key: "car"
    value: 0.7
  }
  minimum_detection_ground_truth_overlap {
    key: "pedestrian"
    value: 0.5
  }
  minimum_detection_ground_truth_overlap {
    key: "cyclist"
    value: 0.5
  }
  evaluation_box_config {
    key: "car"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "pedestrian"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
  evaluation_box_config {
    key: "bicycle"
    value {
      minimum_height: 4
      maximum_height: 9999
      minimum_width: 4
      maximum_width: 9999
    }
  }
}

dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tlt-experiments/tf_records/*"
    image_directory_path: "/workspace/tlt-experiments/image_3resized"
  }
  image_extension: "png"
  target_class_mapping {
    key: "car"
    value: "car"
  }
  target_class_mapping {
    key: "van"
    value: "car"
  }
  target_class_mapping {
    key: "truck"
    value: "car"
  }
  target_class_mapping {
    key: "pedestrian"
    value: "pedestrian"
  }
  target_class_mapping {
    key: "cyclist"
    value: "cyclist"
  }
  validation_fold: 0
}
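
Comparing a failing path from the log with `image_directory_path` in this spec shows what the loader is looking for: the images are expected one level down, in an `image_2` subfolder of `/workspace/tlt-experiments/image_3resized`. Where that `image_2` prefix comes from is not visible in this thread; a plausible guess (an assumption, not confirmed here) is that it was stored in the tfrecords at conversion time. The sketch below only does the path arithmetic, and `split_expected_path` is a hypothetical helper, not a TLT function.

```python
import os


def split_expected_path(image_directory_path, failing_path):
    """Split a failing path into (subfolder, frame id, extension),
    relative to image_directory_path from the dataset_config."""
    relative = os.path.relpath(failing_path, image_directory_path)
    subfolder, filename = os.path.split(relative)
    frame_id, extension = os.path.splitext(filename)
    return subfolder, frame_id, extension.lstrip(".")


if __name__ == "__main__":
    subfolder, frame_id, extension = split_expected_path(
        "/workspace/tlt-experiments/image_3resized",
        "/workspace/tlt-experiments/image_3resized/image_2/006665.png",
    )
    print(subfolder, frame_id, extension)  # image_2 006665 png
```

If the PNG files sit directly in `image_3resized` rather than in `image_3resized/image_2`, the paths in the log would not resolve even though the folder itself exists.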

Morganh · March 11, 2020, 2:19am

How did you generate tfrecords?
Please paste the spec file you used with tlt-dataset-convert.
If possible, please also paste the log from running tlt-dataset-convert.
