ValueError: No dataset tfrecords file found at path

v.snippy · August 13, 2020, 10:51am

Hi, I’m trying to use TLT to train DetectNet_V2.

I have the data and have converted it into tfrecords. But when I run tlt-train, I get the following error:

ValueError: No dataset tfrecords file found at path: ‘/workspace/tlt-experiments/data/data_with_eval/train/tfrecords’

I have verified that the tfrecords files are present in that folder. I have attached a screenshot of the contents of the folder below.

Full log is below :

Using TensorFlow backend.
2020-08-13 10:43:50.381686: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[[39368,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: dce1b73ec71f

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.

2020-08-13 10:43:52.428678: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-08-13 10:43:52.452169: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-13 10:43:52.452820: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:09:00.0
2020-08-13 10:43:52.452849: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-13 10:43:52.452906: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-13 10:43:52.454247: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-08-13 10:43:52.454615: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-08-13 10:43:52.456395: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-08-13 10:43:52.457664: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-08-13 10:43:52.457721: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-13 10:43:52.457927: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-13 10:43:52.458647: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-13 10:43:52.459226: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-08-13 10:43:52.459269: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-13 10:43:53.068882: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-13 10:43:53.068932: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-08-13 10:43:53.068940: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-08-13 10:43:53.069198: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-13 10:43:53.069670: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-13 10:43:53.070111: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-13 10:43:53.070512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9831 MB memory) → physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:09:00.0, compute capability: 7.5)
2020-08-13 10:43:53,071 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at detect_net_config.txt.
2020-08-13 10:43:53,073 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from detect_net_config.txt
Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in <module>
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 557, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/build_dataloader.py”, line 264, in build_dataloader
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py”, line 384, in __init__
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py”, line 427, in _construct_data_sources
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataloader/drivenet_dataloader.py”, line 290, in __init__
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/modulusobject/modulusobject.py”, line 432, in wrapper
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/core/build_wheel.runfiles/ai_infra/moduluspy/modulus/blocks/data_loaders/multi_source_loader/sources/tfrecords_data_source.py”, line 62, in __init__
ValueError: No dataset tfrecords file found at path: ‘/workspace/tlt-experiments/data/data_with_eval/train/tfrecords’

tlt-dataset-convert output :

Using TensorFlow backend.
2020-08-13 10:49:20,496 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2020-08-13 10:49:20,498 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 603 Val: 98
2020-08-13 10:49:20,498 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2020-08-13 10:49:20,498 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
WARNING:tensorflow:From /home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

2020-08-13 10:49:20,498 - tensorflow - WARNING - From /home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/dataio/dataset_converter_lib.py:142: The name tf.python_io.TFRecordWriter is deprecated. Please use tf.io.TFRecordWriter instead.

/usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:273: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2020-08-13 10:49:20,511 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2020-08-13 10:49:20,517 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2020-08-13 10:49:20,523 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2020-08-13 10:49:20,530 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 4
2020-08-13 10:49:20,537 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 5
2020-08-13 10:49:20,543 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 6
2020-08-13 10:49:20,549 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 7
2020-08-13 10:49:20,555 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 8
2020-08-13 10:49:20,561 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 9
2020-08-13 10:49:20,572 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b’with_mask’: 435
b’without_mask’: 66
b’mask_weared_incorrect’: 12

2020-08-13 10:49:20,572 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2020-08-13 10:49:20,614 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2020-08-13 10:49:20,656 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2020-08-13 10:49:20,697 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2020-08-13 10:49:20,740 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 4
2020-08-13 10:49:20,782 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 5
2020-08-13 10:49:20,823 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 6
2020-08-13 10:49:20,866 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 7
2020-08-13 10:49:20,906 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 8
2020-08-13 10:49:20,947 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
2020-08-13 10:49:20,988 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b’with_mask’: 2266
b’without_mask’: 555
b’mask_weared_incorrect’: 90

2020-08-13 10:49:20,989 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2020-08-13 10:49:20,989 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO -
Wrote the following numbers of objects:
b’with_mask’: 2701
b’without_mask’: 621
b’mask_weared_incorrect’: 102

2020-08-13 10:49:20,989 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map.
Label in GT: Label in tfrecords file
b’with_mask’: b’with_mask’
b’without_mask’: b’without_mask’
b’mask_weared_incorrect’: b’mask_weared_incorrect’
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

2020-08-13 10:49:20,989 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Tfrecords generation complete.

Data convert config :

kitti_config {
  root_directory_path: "/workspace/tlt-experiments/data/data_with_eval/train"
  image_dir_name: "images"
  label_dir_name: "labels"
  image_extension: ".jpg"
  partition_mode: "random"
  num_partitions: 2
  val_split: 14
  num_shards: 10
}
image_directory_path: "/workspace/tlt-experiments/data/data_with_eval/train"

Data convert command

tlt-dataset-convert -d converter_config/config_train.txt -o data/data_with_eval/train/tfrecords/

Detectnet v2 config :

random_seed: 42
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/tlt-experiments/data/data_with_eval/train/tfrecords*"
    image_directory_path: "/workspace/tlt-experiments/data/data_with_eval/train/"
  }
  image_extension: "jpg"
  target_class_mapping {
    key: "with_mask"
    value: "with_mask"
  }
  target_class_mapping {
    key: "with_out_mask"
    value: "with_out_mask"
  }
  target_class_mapping {
    key: "mask_weared_incorrect"
    value: "mask_weared_incorrect"
  }
  validation_fold: 0
}
model_config {
  pretrained_model_file: "/workspace/tlt-experiments/tlt_peoplenet_vunpruned_v2.0/resnet34_peoplenet.tlt"
  num_layers: 34
  freeze_blocks: 0
  arch: "resnet"
  use_batch_norm: true
  objective_set {
    bbox {
      scale: 35.0
      offset: 0.5
    }
    cov {
    }
  }
  training_precision {
    backend_floatx: FLOAT32
  }
}
training_config {
  batch_size_per_gpu: 24
  num_epochs: 120
  learning_rate {
    soft_start_annealing_schedule {
      min_learning_rate: 5e-06
      max_learning_rate: 0.0005
      soft_start: 0.1
      annealing: 0.7
    }
  }
  regularizer {
    type: L1
    weight: 3e-09
  }
  optimizer {
    adam {
      epsilon: 9.9e-09
      beta1: 0.9
      beta2: 0.999
    }
  }
  cost_scaling {
    initial_exponent: 20.0
    increment: 0.005
    decrement: 1.0
  }
  checkpoint_interval: 10
}
augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 544
    output_image_channel: 3
    crop_right: 960
    crop_bottom: 544
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    zoom_min: 1.0
    zoom_max: 1.0
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
  color_augmentation {
    hue_rotation_max: 25.0
    saturation_shift_max: 0.20000000298
    contrast_scale_max: 0.10000000149
    contrast_center: 0.5
  }
}
postprocessing_config {
  target_class_config {
    key: "with_mask"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.265
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
  target_class_config {
    key: "with_out_mask"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 4
      }
    }
  }
  target_class_config {
    key: "mask_weared_incorrect"
    value: {
      clustering_config {
        coverage_threshold: 0.005
        dbscan_eps: 0.15
        dbscan_min_samples: 0.05
        minimum_bounding_box_height: 2
      }
    }
  }
}
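
A side note on the spec above: the tlt-dataset-convert log lists the label without_mask, while the target_class_mapping keys use with_out_mask, and the converter output asks for the labels in the tfrecords file to be used in the class map. The following is only a minimal Python sketch for comparing the two sets of names. The label sets are copied by hand from the log and the spec, and the helper compare_class_keys is hypothetical, not part of TLT.

```python
# Labels reported in the "Class map" section of the tlt-dataset-convert log.
labels_in_tfrecords = {"with_mask", "without_mask", "mask_weared_incorrect"}

# Keys of the target_class_mapping entries in the training spec.
spec_mapping_keys = {"with_mask", "with_out_mask", "mask_weared_incorrect"}


def compare_class_keys(tfrecord_labels, mapping_keys):
    """Return (labels with no mapping key, mapping keys with no matching label)."""
    unmapped = sorted(tfrecord_labels - mapping_keys)
    unknown = sorted(mapping_keys - tfrecord_labels)
    return unmapped, unknown


unmapped, unknown = compare_class_keys(labels_in_tfrecords, spec_mapping_keys)
print("labels without a mapping key:", unmapped)
print("mapping keys not found in tfrecords:", unknown)
```

If both lists come back empty, the names agree. This mismatch is separate from the tfrecords path error discussed in this topic.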

TLT train command

tlt-train detectnet_v2 -e detect_net_config.txt -r output_dir/ -k tlt --gpus 1

Contents of tfrecords folder :

Morganh · August 14, 2020, 2:01am

Please log in to the docker container and check whether the path below exists.

$ ls /workspace/tlt-experiments/data/data_with_eval/train/tfrecords*

v.snippy · August 14, 2020, 3:57am

Hi, I tried it, and this is the output I got.

root@dce1b73ec71f:/workspace/tlt-experiments# ls /workspace/tlt-experiments/data/data_with_eval/train/tfrecords*
-fold-000-of-002-shard-00000-of-00010 -fold-000-of-002-shard-00004-of-00010 -fold-000-of-002-shard-00008-of-00010 -fold-001-of-002-shard-00002-of-00010 -fold-001-of-002-shard-00006-of-00010
-fold-000-of-002-shard-00001-of-00010 -fold-000-of-002-shard-00005-of-00010 -fold-000-of-002-shard-00009-of-00010 -fold-001-of-002-shard-00003-of-00010 -fold-001-of-002-shard-00007-of-00010
-fold-000-of-002-shard-00002-of-00010 -fold-000-of-002-shard-00006-of-00010 -fold-001-of-002-shard-00000-of-00010 -fold-001-of-002-shard-00004-of-00010 -fold-001-of-002-shard-00008-of-00010
-fold-000-of-002-shard-00003-of-00010 -fold-000-of-002-shard-00007-of-00010 -fold-001-of-002-shard-00001-of-00010 -fold-001-of-002-shard-00005-of-00010 -fold-001-of-002-shard-00009-of-00010

Morganh · August 14, 2020, 4:27am

Please modify the following line in your spec and retry.

Change

tfrecords_path: "/workspace/tlt-experiments/data/data_with_eval/train/tfrecords*"

to

tfrecords_path: "/workspace/tlt-experiments/data/data_with_eval/train/*"

v.snippy · August 14, 2020, 4:47am

Hi, I tried that and I am getting this error.

ValueError: No dataset tfrecords file found at path: ‘/workspace/tlt-experiments/data/data_with_eval/train/labels’

It’s looking in the labels folder.

Update:

I tried renaming the tfrecords folder to labels, but it still didn’t work.

Morganh · August 14, 2020, 4:53am

Please paste the result of the command below (first run $ apt-get install tree).

$ tree /workspace/tlt-experiments/data/data_with_eval/train/

v.snippy · August 14, 2020, 4:57am

Hi,

treeout.txt (63.9 KB)

I have attached the output of the tree command.

Morganh · August 14, 2020, 5:05am

I don’t know why your train/labels folder has tfrecords files in it.

I suggest generating the tfrecords again. Set “-o” to a new folder instead of the train folder to avoid confusion, and add a prefix to each tfrecords file name, as below.

$ tlt-dataset-convert -d converter_config/config_train.txt -o data/data_with_eval/tfrecords/test

Then modify the spec and try again.
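
To see why the prefix helps, here is a minimal Python sketch. It assumes that the data loader expands tfrecords_path like a shell glob and accepts only regular files. That matches the errors in this topic but is not confirmed from the TLT source. The temporary directory layout and the helper matching_files are made up for illustration.

```python
import glob
import os
import tempfile


def matching_files(pattern):
    """Expand a glob pattern and keep only regular files (directories are dropped)."""
    return sorted(p for p in glob.glob(pattern) if os.path.isfile(p))


with tempfile.TemporaryDirectory() as root:
    # Original layout: "-o .../train/tfrecords/" ends with a slash, so the shard
    # names have an empty prefix and sit inside a directory called "tfrecords".
    old_dir = os.path.join(root, "train", "tfrecords")
    os.makedirs(old_dir)
    os.makedirs(os.path.join(root, "train", "labels"))
    open(os.path.join(old_dir, "-fold-000-of-002-shard-00000-of-00010"), "w").close()

    # Suggested layout: "-o .../tfrecords/test" gives every shard the prefix "test".
    new_dir = os.path.join(root, "tfrecords")
    os.makedirs(new_dir)
    open(os.path.join(new_dir, "test-fold-000-of-002-shard-00000-of-00010"), "w").close()

    # "train/tfrecords*" matches only the directory itself, so no files remain.
    old_names = [os.path.basename(p)
                 for p in matching_files(os.path.join(root, "train", "tfrecords*"))]
    # "train/*" matches the sub-directories, including "labels".
    wide_names = sorted(os.path.basename(p)
                        for p in glob.glob(os.path.join(root, "train", "*")))
    # "tfrecords/test*" matches the shard files directly.
    new_names = [os.path.basename(p)
                 for p in matching_files(os.path.join(root, "tfrecords", "test*"))]

print("train/tfrecords* ->", old_names)
print("train/*          ->", wide_names)
print("tfrecords/test*  ->", new_names)
```

With the new layout, and assuming the convert command is run from /workspace/tlt-experiments, the spec entry would presumably become tfrecords_path: "/workspace/tlt-experiments/data/data_with_eval/tfrecords/test*".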

v.snippy · August 14, 2020, 5:28am

Hi,

Thank you, those errors are fixed now. This is the error I’m getting now:

Using TensorFlow backend.
2020-08-14 05:26:32.386774: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0

[[33358,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
Host: dce1b73ec71f

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.

2020-08-14 05:26:34.469661: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-08-14 05:26:34.493875: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 05:26:34.494653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:09:00.0
2020-08-14 05:26:34.494689: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-14 05:26:34.494753: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-08-14 05:26:34.496255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-08-14 05:26:34.496636: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-08-14 05:26:34.498080: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-08-14 05:26:34.498970: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-08-14 05:26:34.499010: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-14 05:26:34.499124: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 05:26:34.499671: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 05:26:34.500102: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0
2020-08-14 05:26:34.500129: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-08-14 05:26:35.108887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-14 05:26:35.108942: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165] 0
2020-08-14 05:26:35.108948: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0: N
2020-08-14 05:26:35.109204: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 05:26:35.109673: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 05:26:35.110114: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-08-14 05:26:35.110513: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1304] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 9831 MB memory) → physical GPU (device: 0, name: GeForce RTX 2080 Ti, pci bus id: 0000:09:00.0, compute capability: 7.5)
2020-08-14 05:26:35,111 [INFO] iva.detectnet_v2.scripts.train: Loading experiment spec at detect_net_config.txt.
2020-08-14 05:26:35,113 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from detect_net_config.txt
2020-08-14 05:26:35,276 [INFO] iva.detectnet_v2.scripts.train: Cannot iterate over exactly 603 samples with a batch size of 24; each epoch will therefore take one extra step.

Layer (type) Output Shape Param # Connected to

input_1 (InputLayer) (None, 3, 544, 960) 0

conv1 (Conv2D) (None, 64, 272, 480) 9472 input_1[0][0]

bn_conv1 (BatchNormalization) (None, 64, 272, 480) 256 conv1[0][0]

activation_1 (Activation) (None, 64, 272, 480) 0 bn_conv1[0][0]

block_1a_conv_1 (Conv2D) (None, 64, 136, 240) 36928 activation_1[0][0]

block_1a_bn_1 (BatchNormalizati (None, 64, 136, 240) 256 block_1a_conv_1[0][0]

block_1a_relu_1 (Activation) (None, 64, 136, 240) 0 block_1a_bn_1[0][0]

block_1a_conv_2 (Conv2D) (None, 64, 136, 240) 36928 block_1a_relu_1[0][0]

block_1a_conv_shortcut (Conv2D) (None, 64, 136, 240) 4160 activation_1[0][0]

block_1a_bn_2 (BatchNormalizati (None, 64, 136, 240) 256 block_1a_conv_2[0][0]

block_1a_bn_shortcut (BatchNorm (None, 64, 136, 240) 256 block_1a_conv_shortcut[0][0]

add_1 (Add) (None, 64, 136, 240) 0 block_1a_bn_2[0][0]
block_1a_bn_shortcut[0][0]

block_1a_relu (Activation) (None, 64, 136, 240) 0 add_1[0][0]

block_1b_conv_1 (Conv2D) (None, 64, 136, 240) 36928 block_1a_relu[0][0]

block_1b_bn_1 (BatchNormalizati (None, 64, 136, 240) 256 block_1b_conv_1[0][0]

block_1b_relu_1 (Activation) (None, 64, 136, 240) 0 block_1b_bn_1[0][0]

block_1b_conv_2 (Conv2D) (None, 64, 136, 240) 36928 block_1b_relu_1[0][0]

block_1b_bn_2 (BatchNormalizati (None, 64, 136, 240) 256 block_1b_conv_2[0][0]

add_2 (Add) (None, 64, 136, 240) 0 block_1b_bn_2[0][0]
block_1a_relu[0][0]

block_1b_relu (Activation) (None, 64, 136, 240) 0 add_2[0][0]

block_1c_conv_1 (Conv2D) (None, 64, 136, 240) 36928 block_1b_relu[0][0]

block_1c_bn_1 (BatchNormalizati (None, 64, 136, 240) 256 block_1c_conv_1[0][0]

block_1c_relu_1 (Activation) (None, 64, 136, 240) 0 block_1c_bn_1[0][0]

block_1c_conv_2 (Conv2D) (None, 64, 136, 240) 36928 block_1c_relu_1[0][0]

block_1c_bn_2 (BatchNormalizati (None, 64, 136, 240) 256 block_1c_conv_2[0][0]

add_3 (Add) (None, 64, 136, 240) 0 block_1c_bn_2[0][0]
block_1b_relu[0][0]

block_1c_relu (Activation) (None, 64, 136, 240) 0 add_3[0][0]

block_2a_conv_1 (Conv2D) (None, 128, 68, 120) 73856 block_1c_relu[0][0]

block_2a_bn_1 (BatchNormalizati (None, 128, 68, 120) 512 block_2a_conv_1[0][0]

block_2a_relu_1 (Activation) (None, 128, 68, 120) 0 block_2a_bn_1[0][0]

block_2a_conv_2 (Conv2D) (None, 128, 68, 120) 147584 block_2a_relu_1[0][0]

block_2a_conv_shortcut (Conv2D) (None, 128, 68, 120) 8320 block_1c_relu[0][0]

block_2a_bn_2 (BatchNormalizati (None, 128, 68, 120) 512 block_2a_conv_2[0][0]

block_2a_bn_shortcut (BatchNorm (None, 128, 68, 120) 512 block_2a_conv_shortcut[0][0]

add_4 (Add) (None, 128, 68, 120) 0 block_2a_bn_2[0][0]
block_2a_bn_shortcut[0][0]

block_2a_relu (Activation) (None, 128, 68, 120) 0 add_4[0][0]

block_2b_conv_1 (Conv2D) (None, 128, 68, 120) 147584 block_2a_relu[0][0]

block_2b_bn_1 (BatchNormalizati (None, 128, 68, 120) 512 block_2b_conv_1[0][0]

block_2b_relu_1 (Activation) (None, 128, 68, 120) 0 block_2b_bn_1[0][0]

block_2b_conv_2 (Conv2D) (None, 128, 68, 120) 147584 block_2b_relu_1[0][0]

block_2b_bn_2 (BatchNormalizati (None, 128, 68, 120) 512 block_2b_conv_2[0][0]

add_5 (Add) (None, 128, 68, 120) 0 block_2b_bn_2[0][0]
block_2a_relu[0][0]

block_2b_relu (Activation) (None, 128, 68, 120) 0 add_5[0][0]

block_2c_conv_1 (Conv2D) (None, 128, 68, 120) 147584 block_2b_relu[0][0]

block_2c_bn_1 (BatchNormalizati (None, 128, 68, 120) 512 block_2c_conv_1[0][0]

block_2c_relu_1 (Activation) (None, 128, 68, 120) 0 block_2c_bn_1[0][0]

block_2c_conv_2 (Conv2D) (None, 128, 68, 120) 147584 block_2c_relu_1[0][0]

block_2c_bn_2 (BatchNormalizati (None, 128, 68, 120) 512 block_2c_conv_2[0][0]

add_6 (Add) (None, 128, 68, 120) 0 block_2c_bn_2[0][0]
block_2b_relu[0][0]

block_2c_relu (Activation) (None, 128, 68, 120) 0 add_6[0][0]

block_2d_conv_1 (Conv2D) (None, 128, 68, 120) 147584 block_2c_relu[0][0]

block_2d_bn_1 (BatchNormalizati (None, 128, 68, 120) 512 block_2d_conv_1[0][0]

block_2d_relu_1 (Activation) (None, 128, 68, 120) 0 block_2d_bn_1[0][0]

block_2d_conv_2 (Conv2D) (None, 128, 68, 120) 147584 block_2d_relu_1[0][0]

block_2d_bn_2 (BatchNormalizati (None, 128, 68, 120) 512 block_2d_conv_2[0][0]

add_7 (Add) (None, 128, 68, 120) 0 block_2d_bn_2[0][0]
block_2c_relu[0][0]

block_2d_relu (Activation) (None, 128, 68, 120) 0 add_7[0][0]

block_3a_conv_1 (Conv2D) (None, 256, 34, 60) 295168 block_2d_relu[0][0]

block_3a_bn_1 (BatchNormalizati (None, 256, 34, 60) 1024 block_3a_conv_1[0][0]

block_3a_relu_1 (Activation) (None, 256, 34, 60) 0 block_3a_bn_1[0][0]

block_3a_conv_2 (Conv2D) (None, 256, 34, 60) 590080 block_3a_relu_1[0][0]

block_3a_conv_shortcut (Conv2D) (None, 256, 34, 60) 33024 block_2d_relu[0][0]

block_3a_bn_2 (BatchNormalizati (None, 256, 34, 60) 1024 block_3a_conv_2[0][0]

block_3a_bn_shortcut (BatchNorm (None, 256, 34, 60) 1024 block_3a_conv_shortcut[0][0]

add_8 (Add) (None, 256, 34, 60) 0 block_3a_bn_2[0][0]
block_3a_bn_shortcut[0][0]

block_3a_relu (Activation) (None, 256, 34, 60) 0 add_8[0][0]

block_3b_conv_1 (Conv2D) (None, 256, 34, 60) 590080 block_3a_relu[0][0]

block_3b_bn_1 (BatchNormalizati (None, 256, 34, 60) 1024 block_3b_conv_1[0][0]

block_3b_relu_1 (Activation) (None, 256, 34, 60) 0 block_3b_bn_1[0][0]

block_3b_conv_2 (Conv2D) (None, 256, 34, 60) 590080 block_3b_relu_1[0][0]

block_3b_bn_2 (BatchNormalizati (None, 256, 34, 60) 1024 block_3b_conv_2[0][0]

add_9 (Add) (None, 256, 34, 60) 0 block_3b_bn_2[0][0]
block_3a_relu[0][0]

block_3b_relu (Activation) (None, 256, 34, 60) 0 add_9[0][0]

block_3c_conv_1 (Conv2D) (None, 256, 34, 60) 590080 block_3b_relu[0][0]

block_3c_bn_1 (BatchNormalizati (None, 256, 34, 60) 1024 block_3c_conv_1[0][0]

block_3c_relu_1 (Activation) (None, 256, 34, 60) 0 block_3c_bn_1[0][0]

block_3c_conv_2 (Conv2D) (None, 256, 34, 60) 590080 block_3c_relu_1[0][0]

block_3c_bn_2 (BatchNormalizati (None, 256, 34, 60) 1024 block_3c_conv_2[0][0]

add_10 (Add) (None, 256, 34, 60) 0 block_3c_bn_2[0][0]
block_3b_relu[0][0]

block_3c_relu (Activation) (None, 256, 34, 60) 0 add_10[0][0]

block_3d_conv_1 (Conv2D) (None, 256, 34, 60) 590080 block_3c_relu[0][0]

block_3d_bn_1 (BatchNormalizati (None, 256, 34, 60) 1024 block_3d_conv_1[0][0]

block_3d_relu_1 (Activation) (None, 256, 34, 60) 0 block_3d_bn_1[0][0]

block_3d_conv_2 (Conv2D) (None, 256, 34, 60) 590080 block_3d_relu_1[0][0]

block_3d_bn_2 (BatchNormalizati (None, 256, 34, 60) 1024 block_3d_conv_2[0][0]

add_11 (Add) (None, 256, 34, 60) 0 block_3d_bn_2[0][0]
block_3c_relu[0][0]

block_3d_relu (Activation) (None, 256, 34, 60) 0 add_11[0][0]

block_3e_conv_1 (Conv2D) (None, 256, 34, 60) 590080 block_3d_relu[0][0]

block_3e_bn_1 (BatchNormalizati (None, 256, 34, 60) 1024 block_3e_conv_1[0][0]

block_3e_relu_1 (Activation) (None, 256, 34, 60) 0 block_3e_bn_1[0][0]

block_3e_conv_2 (Conv2D) (None, 256, 34, 60) 590080 block_3e_relu_1[0][0]

block_3e_bn_2 (BatchNormalizati (None, 256, 34, 60) 1024 block_3e_conv_2[0][0]

add_12 (Add) (None, 256, 34, 60) 0 block_3e_bn_2[0][0]
block_3d_relu[0][0]

block_3e_relu (Activation) (None, 256, 34, 60) 0 add_12[0][0]

block_3f_conv_1 (Conv2D) (None, 256, 34, 60) 590080 block_3e_relu[0][0]

block_3f_bn_1 (BatchNormalizati (None, 256, 34, 60) 1024 block_3f_conv_1[0][0]

block_3f_relu_1 (Activation) (None, 256, 34, 60) 0 block_3f_bn_1[0][0]

block_3f_conv_2 (Conv2D) (None, 256, 34, 60) 590080 block_3f_relu_1[0][0]

block_3f_bn_2 (BatchNormalizati (None, 256, 34, 60) 1024 block_3f_conv_2[0][0]

add_13 (Add) (None, 256, 34, 60) 0 block_3f_bn_2[0][0]
block_3e_relu[0][0]

block_3f_relu (Activation) (None, 256, 34, 60) 0 add_13[0][0]

block_4a_conv_1 (Conv2D) (None, 512, 34, 60) 1180160 block_3f_relu[0][0]

block_4a_bn_1 (BatchNormalizati (None, 512, 34, 60) 2048 block_4a_conv_1[0][0]

block_4a_relu_1 (Activation) (None, 512, 34, 60) 0 block_4a_bn_1[0][0]

block_4a_conv_2 (Conv2D) (None, 512, 34, 60) 2359808 block_4a_relu_1[0][0]

block_4a_conv_shortcut (Conv2D) (None, 512, 34, 60) 131584 block_3f_relu[0][0]

block_4a_bn_2 (BatchNormalizati (None, 512, 34, 60) 2048 block_4a_conv_2[0][0]

block_4a_bn_shortcut (BatchNorm (None, 512, 34, 60) 2048 block_4a_conv_shortcut[0][0]

add_14 (Add) (None, 512, 34, 60) 0 block_4a_bn_2[0][0]
block_4a_bn_shortcut[0][0]

block_4a_relu (Activation) (None, 512, 34, 60) 0 add_14[0][0]

block_4b_conv_1 (Conv2D) (None, 512, 34, 60) 2359808 block_4a_relu[0][0]

block_4b_bn_1 (BatchNormalizati (None, 512, 34, 60) 2048 block_4b_conv_1[0][0]

block_4b_relu_1 (Activation) (None, 512, 34, 60) 0 block_4b_bn_1[0][0]

block_4b_conv_2 (Conv2D) (None, 512, 34, 60) 2359808 block_4b_relu_1[0][0]

block_4b_bn_2 (BatchNormalizati (None, 512, 34, 60) 2048 block_4b_conv_2[0][0]

add_15 (Add) (None, 512, 34, 60) 0 block_4b_bn_2[0][0]
block_4a_relu[0][0]

block_4b_relu (Activation) (None, 512, 34, 60) 0 add_15[0][0]

block_4c_conv_1 (Conv2D) (None, 512, 34, 60) 2359808 block_4b_relu[0][0]

block_4c_bn_1 (BatchNormalizati (None, 512, 34, 60) 2048 block_4c_conv_1[0][0]

block_4c_relu_1 (Activation) (None, 512, 34, 60) 0 block_4c_bn_1[0][0]

block_4c_conv_2 (Conv2D) (None, 512, 34, 60) 2359808 block_4c_relu_1[0][0]

block_4c_bn_2 (BatchNormalizati (None, 512, 34, 60) 2048 block_4c_conv_2[0][0]

add_16 (Add) (None, 512, 34, 60) 0 block_4c_bn_2[0][0]
block_4b_relu[0][0]

block_4c_relu (Activation) (None, 512, 34, 60) 0 add_16[0][0]

output_bbox (Conv2D) (None, 12, 34, 60) 6156 block_4c_relu[0][0]

output_cov (Conv2D) (None, 3, 34, 60) 1539 block_4c_relu[0][0]

Total params: 21,322,319
Trainable params: 21,295,695
Non-trainable params: 26,624

Traceback (most recent call last):
File “/usr/local/bin/tlt-train-g1”, line 8, in <module>
sys.exit(main())
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/common/magnet_train.py”, line 55, in main
File “”, line 2, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/utilities/timer.py”, line 46, in wrapped_fn
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 773, in main
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 691, in run_experiment
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 582, in train_gridbox
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/scripts/train.py”, line 221, in build_rasterizers
File “/home/vpraveen/.cache/dazel/_dazel_vpraveen/715c8bafe7816f3bb6f309cd506049bb/execroot/ai_infra/bazel-out/k8-py3-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/detectnet_v2/rasterizers/bbox_rasterizer.py”, line 95, in __init__
AssertionError

Morganh · August 14, 2020, 5:43am

Your new error is caused by a missing “bbox_rasterizer_config” section in the training spec.
Please refer to the sample specs inside the docker or to the TLT user guide.

If it is still not fixed, please create a new topic, since the original issue is resolved.
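
For anyone hitting the same AssertionError, here is a rough sketch of how one might check a spec file for the section before launching tlt-train. It is not part of TLT: the helper name missing_sections, the REQUIRED_SECTIONS list and the example fragment are illustrative assumptions. The field names follow the layout of the DetectNet_v2 sample specs as far as I know, and "car" is only a placeholder class name, so compare against the specs shipped in the docker before relying on it.

```python
import re

# Sections this sketch looks for. The list is an assumption based on the
# reply above; adjust it to match the sample spec you are working from.
REQUIRED_SECTIONS = ("dataset_config", "model_config", "bbox_rasterizer_config")


def missing_sections(spec_text, required=REQUIRED_SECTIONS):
    """Return the required section names that never open a block in spec_text."""
    # Drop '#' comments so a commented-out section does not count as present.
    # (Simplification: a '#' inside a quoted string is also treated as a comment.)
    uncommented = re.sub(r"#.*", "", spec_text)
    # Collect every identifier that starts a "name {" block at the start of a line.
    found = set(re.findall(r"^\s*(\w+)\s*\{", uncommented, flags=re.MULTILINE))
    return [name for name in required if name not in found]


# Illustrative fragment only. The empty blocks are placeholders, "car" is a
# hypothetical class name, and the numbers are example values, not tuning advice.
EXAMPLE_SPEC = """
dataset_config {
}
model_config {
}
bbox_rasterizer_config {
  target_class_config {
    key: "car"
    value: {
      cov_center_x: 0.5
      cov_center_y: 0.5
      cov_radius_x: 0.4
      cov_radius_y: 0.4
      bbox_min_radius: 1.0
    }
  }
  deadzone_radius: 0.67
}
"""

print(missing_sections(EXAMPLE_SPEC))
# -> []
print(missing_sections("dataset_config {\n}\nmodel_config {\n}\n"))
# -> ['bbox_rasterizer_config']
```

In a real spec you would add one target_class_config entry per target class. The output_cov layer in the log above has 3 channels, which suggests three classes in this experiment.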
