Error while evaluating with a separate validation set

Hello! I have two separate datasets: one for training (RGB, PNG images) and one for validation (grayscale, JPEG images). I was able to get training to work by commenting out validation_data_source, but re-enabling it for tlt-evaluate causes a crash for a reason I cannot find documented. Here is the stack trace:

root@00cf90c34bcf:/workspace/tlt-test# tlt-evaluate retinanet -k results/key -m results/weights/retinanet_resnet_epoch_010.tlt -e retinanet_spec 
Using TensorFlow backend.
2020-07-07 18:03:28,665 [INFO] iva.retinanet.scripts.evaluate: Loading experiment spec at retinanet_spec.
2020-07-07 18:03:28,666 [INFO] /usr/local/lib/python2.7/dist-packages/iva/retinanet/utils/spec_loader.pyc: Merging specification from retinanet_spec
Traceback (most recent call last):
File "/usr/local/bin/tlt-evaluate", line 8, in <module>
    sys.exit(main())
File "./common/magnet_evaluate.py", line 42, in main
File "./retinanet/scripts/evaluate.py", line 114, in main
File "./retinanet/scripts/evaluate.py", line 86, in evaluate
File "./retinanet/builders/data_generator.py", line 51, in __init__
File "./detectnet_v2/dataloader/default_dataloader.py", line 206, in get_dataset_tensors
File "./detectnet_v2/dataloader/default_dataloader.py", line 232, in _generate_images_and_ground_truth_labels
File "./modulus/processors/processors.py", line 227, in __call__
File "./detectnet_v2/dataloader/utilities.py", line 60, in call
File "./modulus/processors/tfrecords_iterator.py", line 143, in process_records
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1508, in split
    axis=axis, num_split=num_or_size_splits, value=value, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 8883, in split
    "Split", split_dim=axis, value=value, num_split=num_split, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 709, in _apply_op_helper
    (key, op_type_name, attr_value.i, attr_def.minimum))
ValueError: Attr 'num_split' of 'Split' Op passed 0 less than minimum 1.

Here is my training spec:

# TODO Tune this; this is copied-and-pasted
eval_config {
	validation_period_during_training: 10
	# average_precision_mode: SAMPLE
	matching_iou_threshold: 0.5
}

# TODO Tune this; this is copied-and-pasted
nms_config {
 confidence_threshold: 0.05
 clustering_iou_threshold: 0.5
 top_k: 200
}

# TODO Tune this
augmentation_config {
  preprocessing {
    output_image_width: 640
    output_image_height: 640
    output_image_channel: 3
    min_bbox_width: 1.0
    min_bbox_height: 1.0
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 0.7
    zoom_max: 1.8
    translate_max_x: 8.0
    translate_max_y: 8.0
  }
}

dataset_config {
	data_sources: {
		tfrecords_path: "/workspace/tlt-test/dataset/syn/syn_tf/*"
		image_directory_path: "/workspace/tlt-test/dataset/syn"
	}
	target_class_mapping {
		key: "a"
		value: "a"
	}
	target_class_mapping {
		key: "b"
		value: "b"
	}
	image_extension: "jpg"
	validation_data_source: {
		tfrecords_path: "/workspace/tlt-test/dataset/real/real_tf/*"
		image_directory_path: "/workspace/tlt-test/dataset/real/images"
	}
}

# TODO Tune this; this is copied-and-pasted
retinanet_config {
	aspect_ratios_global: "[1.0, 2.0, 0.5]"
	scales: "[0.05, 0.15, 0.3, 0.45, 0.6, 0.75]"
	two_boxes_for_ar1: false
	clip_boxes: false
	loss_loc_weight: 1.0
	focal_loss_alpha: 0.25
	focal_loss_gamma: 2.0
	variances: "[0.1, 0.1, 0.2, 0.2]"
	arch: "resnet"
	nlayers: 18
	n_kernels: 2
	feature_size: 256
	freeze_bn: False
	freeze_blocks: 0
}

training_config {
	batch_size_per_gpu: 12
	num_epochs: 10
	learning_rate {
		soft_start_annealing_schedule {
			min_learning_rate: 5e-06
			max_learning_rate: 0.0005
			soft_start: 0.1
			annealing: 0.7
		}
	}
	regularizer {
		type: L1
		weight: 3e-09
	}
	optimizer {
		adam {
			epsilon: 9.9e-09
			beta1: 0.9
			beta2: 0.999
		}
	}
	cost_scaling {
		initial_exponent: 20.0
		increment: 0.005
		decrement: 1.0
	}
	checkpoint_interval: 10
}

And here is the conversion spec for the real (validation) dataset:

kitti_config {
  root_directory_path: "/workspace/tlt-test/dataset/yuma"
  image_dir_name: "test"
  label_dir_name: "kitti_labels"
  image_extension: ".jpg"
  partition_mode: "random"
  num_partitions: 2
  val_split: 90
  num_shards: 10
}
image_directory_path: "/workspace/tlt-test/dataset/yuma"

If you change the spec from

validation_data_source: {
	tfrecords_path: "/workspace/tlt-test/dataset/real/real_tf/*"
	image_directory_path: "/workspace/tlt-test/dataset/real/images"
}

to

validation_fold: 0

does the command below work?

$ tlt-evaluate retinanet -k results/key -m results/weights/retinanet_resnet_epoch_010.tlt -e retinanet_spec

Morgan, thank you very much for the response. Unfortunately, adding validation_fold and removing validation_data_source gives the same error. So does leaving the source in. In fact, the only way the error does not appear is when commenting out both validation_data_source and validation_fold, in which case the command fails anyway due to the lack of a validation set to work with.

I am afraid there is something wrong with your tfrecords.
Could you please run

$ ls -l <your_tfrecords_folder>

Could you paste the log when you run tlt-dataset-convert?
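Also, this particular ValueError ('num_split' of 'Split' passed 0) often means the dataloader found zero tfrecord shards to read, e.g. because the wildcard in tfrecords_path matched no files. A quick sanity check (just a sketch; the pattern below is your validation path, substitute whichever one you are debugging):

```python
import glob

def matched_shards(pattern):
    """Return the shard files a tfrecords_path wildcard expands to."""
    return sorted(glob.glob(pattern))

# Substitute the tfrecords_path from your spec.
shards = matched_shards("/workspace/tlt-test/dataset/real/real_tf/*")
print("matched %d shard file(s)" % len(shards))
```

If this prints 0, the iterator has nothing to split, which would be consistent with the error in your trace.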

In your spec, 90% of the images are validation data. How many images are there in total?

val_split: 90

Thank you for the insight. I was trying to use tlt-dataset-convert to create separate tfrecords for evaluation. I followed the docs’ recommendation that “partition_mode” be set to random with an arbitrary train/val split. Assuming they (the docs) are up-to-date and complete, shouldn’t my choice of a 90% val_split be irrelevant? Anyway, I tried it with a 10% split just in case - no luck.

Total images are 10126.
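For what it's worth, the Train/Val counts in the converter log below follow directly from that percentage (this is just arithmetic as a check, not necessarily TLT's exact rounding):

```python
total = 10126
val_split = 90  # percent of images assigned to the validation fold (fold 0)

val = round(total * val_split / 100.0)  # 10126 * 0.90 = 9113.4 -> 9113
train = total - val                     # 1013
print(train, val)
```

Those numbers match the converter log's "Train: 1013	Val: 9113".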

Running tlt-dataset-convert:

Using TensorFlow backend.
2020-07-09 16:45:56,243 - iva.detectnet_v2.dataio.build_converter - INFO - Instantiating a kitti converter
2020-07-09 16:45:56,330 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Num images in
Train: 1013	Val: 9113
2020-07-09 16:45:56,330 - iva.detectnet_v2.dataio.kitti_converter_lib - INFO - Validation data in partition 0. Hence, while choosing the validationset during training choose validation_fold 0.
2020-07-09 16:45:56,334 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 0
/usr/local/lib/python2.7/dist-packages/iva/detectnet_v2/dataio/kitti_converter_lib.py:266: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
2020-07-09 16:45:58,060 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 1
2020-07-09 16:45:59,541 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 2
2020-07-09 16:46:00,951 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 3
2020-07-09 16:46:02,316 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 4
2020-07-09 16:46:03,727 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 5
2020-07-09 16:46:05,261 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 6
2020-07-09 16:46:06,766 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 7
2020-07-09 16:46:08,221 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 8
2020-07-09 16:46:09,683 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 0, shard 9
2020-07-09 16:46:11,228 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - 
Wrote the following numbers of objects:
a: 21253
b: 3832

2020-07-09 16:46:11,228 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 0
2020-07-09 16:46:11,406 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 1
2020-07-09 16:46:11,575 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 2
2020-07-09 16:46:11,754 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 3
2020-07-09 16:46:11,914 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 4
2020-07-09 16:46:12,067 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 5
2020-07-09 16:46:12,221 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 6
2020-07-09 16:46:12,389 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 7
2020-07-09 16:46:12,559 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 8
2020-07-09 16:46:12,738 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Writing partition 1, shard 9
2020-07-09 16:46:12,916 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - 
Wrote the following numbers of objects:
a: 2368
b: 423

2020-07-09 16:46:12,916 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Cumulative object statistics
2020-07-09 16:46:12,917 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - 
Wrote the following numbers of objects:
a: 23621
b: 4255

2020-07-09 16:46:12,917 - iva.detectnet_v2.dataio.dataset_converter_lib - INFO - Class map. 
Label in GT: Label in tfrecords file 
a: a
b: b
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

An example kitti label:

a 0.0 1 0 739.0 304.0 763.0 320.0 0 0 0 0 0 0 0

Listing the tfrecords:

-rw-r--r-- 1 root root  77858 Jul  9 17:24 yuma_tf-fold-000-of-002-shard-00000-of-00010
-rw-r--r-- 1 root root  78171 Jul  9 17:24 yuma_tf-fold-000-of-002-shard-00001-of-00010
-rw-r--r-- 1 root root  75900 Jul  9 17:24 yuma_tf-fold-000-of-002-shard-00002-of-00010
-rw-r--r-- 1 root root  78052 Jul  9 17:24 yuma_tf-fold-000-of-002-shard-00003-of-00010
-rw-r--r-- 1 root root  77229 Jul  9 17:24 yuma_tf-fold-000-of-002-shard-00004-of-00010
-rw-r--r-- 1 root root  77794 Jul  9 17:24 yuma_tf-fold-000-of-002-shard-00005-of-00010
-rw-r--r-- 1 root root  78173 Jul  9 17:24 yuma_tf-fold-000-of-002-shard-00006-of-00010
-rw-r--r-- 1 root root  77474 Jul  9 17:24 yuma_tf-fold-000-of-002-shard-00007-of-00010
-rw-r--r-- 1 root root  77351 Jul  9 17:24 yuma_tf-fold-000-of-002-shard-00008-of-00010
-rw-r--r-- 1 root root  79484 Jul  9 17:24 yuma_tf-fold-000-of-002-shard-00009-of-00010
-rw-r--r-- 1 root root 697479 Jul  9 17:24 yuma_tf-fold-001-of-002-shard-00000-of-00010
-rw-r--r-- 1 root root 701998 Jul  9 17:24 yuma_tf-fold-001-of-002-shard-00001-of-00010
-rw-r--r-- 1 root root 701449 Jul  9 17:24 yuma_tf-fold-001-of-002-shard-00002-of-00010
-rw-r--r-- 1 root root 702497 Jul  9 17:24 yuma_tf-fold-001-of-002-shard-00003-of-00010
-rw-r--r-- 1 root root 700506 Jul  9 17:24 yuma_tf-fold-001-of-002-shard-00004-of-00010
-rw-r--r-- 1 root root 701412 Jul  9 17:24 yuma_tf-fold-001-of-002-shard-00005-of-00010
-rw-r--r-- 1 root root 700455 Jul  9 17:24 yuma_tf-fold-001-of-002-shard-00006-of-00010
-rw-r--r-- 1 root root 702036 Jul  9 17:24 yuma_tf-fold-001-of-002-shard-00007-of-00010
-rw-r--r-- 1 root root 697853 Jul  9 17:24 yuma_tf-fold-001-of-002-shard-00008-of-00010
-rw-r--r-- 1 root root 703858 Jul  9 17:24 yuma_tf-fold-001-of-002-shard-00009-of-00010

Anyway, thank you so much for helping me out, and I hope we can get this resolved soon.

I am a little confused.

  1. In your spec, the training tfrecords is

	data_sources: {
		tfrecords_path: "/workspace/tlt-test/dataset/syn/syn_tf/*"
		image_directory_path: "/workspace/tlt-test/dataset/syn"
	}

and the validation tfrecords is

	validation_data_source: {
		tfrecords_path: "/workspace/tlt-test/dataset/real/real_tf/*"
		image_directory_path: "/workspace/tlt-test/dataset/real/images"
	}

Please set

	data_sources: {
		tfrecords_path: "/workspace/tlt-test/dataset/syn/syn_tf/*"
		image_directory_path: "/workspace/tlt-test/dataset/syn"
	}

and

	validation_fold: 0

Then, what is the result of tlt-evaluate?

If possible, please increase num_epochs to 20 as below and trigger training again. Do you see the validation result during training? The 10th and 20th epochs will print the validation info.

num_epochs: 20
validation_period_during_training: 10

  1. Can you run
    $ ls -l /workspace/tlt-test/dataset/syn/syn_tf/

    and
    $ ls -l /workspace/tlt-test/dataset/real/real_tf/

  2. You said there is a training dataset (RGB with png images) and one for validation (grayscale with jpg). Which of the above folders is for training (RGB), and which is for validation (grayscale)?
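Also, if `ls -l` shows shards of suspicious size, you can count the records inside a shard without TLT by walking the TFRecord framing. This is just a sketch based on the public TFRecord file format (8-byte little-endian payload length, 4-byte length CRC, payload, 4-byte payload CRC); the CRCs are skipped, not verified:

```python
import struct

def count_tfrecords(path):
    """Count records in a TFRecord file by skipping over its framing."""
    n = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)         # little-endian uint64 payload length
            if len(header) < 8:
                break
            (length,) = struct.unpack("<Q", header)
            f.seek(4 + length + 4, 1)  # skip length CRC, payload, payload CRC
            n += 1
    return n

# Hypothetical usage -- point it at one of your shard files:
# print(count_tfrecords("syn_tf-fold-000-of-002-shard-00000-of-00010"))
```

A shard that reports 0 records would explain an empty split.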

  1. I apologize for any confusion. Please ignore the directory name discrepancies. I am training on “syn” and testing on “yuma”. You can safely assume that the “real” directory on my machine does not exist, and that any instance of “real” has been replaced with “yuma”.
  2. $ ls -l /path/to/syn_tf :
-rw-r--r-- 1 root root  703397 Jul  6 19:54 syn_tf-fold-000-of-002-shard-00000-of-00010
-rw-r--r-- 1 root root  706982 Jul  6 19:54 syn_tf-fold-000-of-002-shard-00001-of-00010
-rw-r--r-- 1 root root  703818 Jul  6 19:54 syn_tf-fold-000-of-002-shard-00002-of-00010
-rw-r--r-- 1 root root  706918 Jul  6 19:54 syn_tf-fold-000-of-002-shard-00003-of-00010
-rw-r--r-- 1 root root  708797 Jul  6 19:54 syn_tf-fold-000-of-002-shard-00004-of-00010
-rw-r--r-- 1 root root  707580 Jul  6 19:54 syn_tf-fold-000-of-002-shard-00005-of-00010
-rw-r--r-- 1 root root  701474 Jul  6 19:54 syn_tf-fold-000-of-002-shard-00006-of-00010
-rw-r--r-- 1 root root  704718 Jul  6 19:54 syn_tf-fold-000-of-002-shard-00007-of-00010
-rw-r--r-- 1 root root  707946 Jul  6 19:55 syn_tf-fold-000-of-002-shard-00008-of-00010
-rw-r--r-- 1 root root  706990 Jul  6 19:55 syn_tf-fold-000-of-002-shard-00009-of-00010
-rw-r--r-- 1 root root 2831557 Jul  6 19:55 syn_tf-fold-001-of-002-shard-00000-of-00010
-rw-r--r-- 1 root root 2830761 Jul  6 19:55 syn_tf-fold-001-of-002-shard-00001-of-00010
-rw-r--r-- 1 root root 2828583 Jul  6 19:55 syn_tf-fold-001-of-002-shard-00002-of-00010
-rw-r--r-- 1 root root 2834077 Jul  6 19:55 syn_tf-fold-001-of-002-shard-00003-of-00010
-rw-r--r-- 1 root root 2821016 Jul  6 19:55 syn_tf-fold-001-of-002-shard-00004-of-00010
-rw-r--r-- 1 root root 2820715 Jul  6 19:55 syn_tf-fold-001-of-002-shard-00005-of-00010
-rw-r--r-- 1 root root 2827766 Jul  6 19:55 syn_tf-fold-001-of-002-shard-00006-of-00010
-rw-r--r-- 1 root root 2832791 Jul  6 19:55 syn_tf-fold-001-of-002-shard-00007-of-00010
-rw-r--r-- 1 root root 2817866 Jul  6 19:55 syn_tf-fold-001-of-002-shard-00008-of-00010
-rw-r--r-- 1 root root 2836085 Jul  6 19:55 syn_tf-fold-001-of-002-shard-00009-of-00010

(Assume real_tf directory does not exist, and that it is instead /workspace/tlt-test/dataset/yuma/yuma_tf/)

  1. “syn” is for training (RGB, png) and “yuma” is for validation (grayscale, jpg).

  2. Here is my log if I try running tlt-train with validation_data_source or validation_fold uncommented:

root@7710997a55f8:/workspace/tlt-test# tlt-train retinanet -e retinanet_spec -r results -k results/key
Using TensorFlow backend.
2020-07-10 17:16:50,730 [INFO] iva.retinanet.scripts.train: Loading experiment spec at retinanet_spec.
2020-07-10 17:16:50,731 [INFO] /usr/local/lib/python2.7/dist-packages/iva/retinanet/utils/spec_loader.pyc: Merging specification from retinanet_spec
2020-07-10 17:16:50,736 [INFO] iva.retinanet.scripts.train: Building model from spec file...
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
target/truncation is not updated to match the crop areaif the dataset contains target/truncation.
Traceback (most recent call last):
  File "/usr/local/bin/tlt-train-g1", line 8, in <module>
    sys.exit(main())
  File "./common/magnet_train.py", line 40, in main
  File "./retinanet/scripts/train.py", line 247, in main
  File "./retinanet/scripts/train.py", line 113, in run_experiment
  File "./retinanet/builders/input_builder.py", line 73, in build
  File "./retinanet/builders/data_generator.py", line 51, in __init__
  File "./detectnet_v2/dataloader/default_dataloader.py", line 206, in get_dataset_tensors
  File "./detectnet_v2/dataloader/default_dataloader.py", line 232, in _generate_images_and_ground_truth_labels
  File "./modulus/processors/processors.py", line 227, in __call__
  File "./detectnet_v2/dataloader/utilities.py", line 60, in call
  File "./modulus/processors/tfrecords_iterator.py", line 143, in process_records
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1508, in split
    axis=axis, num_split=num_or_size_splits, value=value, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 8883, in split
    "Split", split_dim=axis, value=value, num_split=num_split, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 709, in _apply_op_helper
    (key, op_type_name, attr_value.i, attr_def.minimum))
ValueError: Attr 'num_split' of 'Split' Op passed 0 less than minimum 1.

Could the “target/truncation…” error messages have something to do with it?
If I comment both fields, validation happens on a subset of the training set.

Please ignore the “target/truncation…” messages; they are harmless.
I still cannot understand why training hits this error when you set

data_sources: {
	tfrecords_path: "/workspace/tlt-test/dataset/syn/syn_tf/*"
	image_directory_path: "/workspace/tlt-test/dataset/syn"
}

and

validation_fold: 0

With this setting, training will use the “syn_tf-fold-001-xxx” shards for training and the “syn_tf-fold-000-xxx” shards for validation.
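To illustrate how the fold index embedded in each shard name determines that split (this is an illustration of the naming convention only, not TLT's actual loader code):

```python
import re

# Example shard names following the converter's fold-NNN-of-MMM pattern.
shards = [
    "syn_tf-fold-000-of-002-shard-00000-of-00010",
    "syn_tf-fold-001-of-002-shard-00000-of-00010",
]

validation_fold = 0

def fold_of(name):
    """Extract the fold index from a shard file name."""
    return int(re.search(r"fold-(\d+)-of-", name).group(1))

val = [s for s in shards if fold_of(s) == validation_fold]
train = [s for s in shards if fold_of(s) != validation_fold]
```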

Can you paste the spec for this configuration here?