Tensor reshape error when evaluating TrafficCamNet

Hello, I am running into an issue similar to this one when trying to run evaluation on an unpruned, untrained TrafficCamNet (detectnet_v2) using TAO Toolkit version 4.0.1.

Command:

%%bash -s "$tao_container_id" "$TRAFFIC_CAM_MODEL_SPEC_PATH" "$TRAFFIC_CAM_MODEL_PATH" "$KEY"

docker exec $1 detectnet_v2 evaluate --experiment_spec_file $2 --model_file $3 --key $4 --framework tlt

produces the following exception:

2023-07-28 15:40:44.038041: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
2023-07-28 15:40:52,430 [INFO] iva.detectnet_v2.spec_handler.spec_loader: Merging specification from /workspace/tao-eval/specs/traffic_cam_model_spec.txt
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-07-28 15:40:52,435 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:153: The name tf.get_default_graph is deprecated. Please use tf.compat.v1.get_default_graph instead.

2023-07-28 15:40:53,235 [INFO] root: Loading model weights.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

2023-07-28 15:40:53,678 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:517: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

2023-07-28 15:40:53,689 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:4138: The name tf.random_uniform is deprecated. Please use tf.random.uniform instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

2023-07-28 15:40:53,721 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:1834: The name tf.nn.fused_batch_norm is deprecated. Please use tf.compat.v1.nn.fused_batch_norm instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

2023-07-28 15:40:54,945 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:174: The name tf.get_default_session is deprecated. Please use tf.compat.v1.get_default_session instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

2023-07-28 15:40:54,945 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:190: The name tf.global_variables is deprecated. Please use tf.compat.v1.global_variables instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

2023-07-28 15:40:54,946 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:199: The name tf.is_variable_initialized is deprecated. Please use tf.compat.v1.is_variable_initialized instead.

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

2023-07-28 15:40:55,298 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py:206: The name tf.variables_initializer is deprecated. Please use tf.compat.v1.variables_initializer instead.

/usr/local/lib/python3.6/dist-packages/keras/engine/saving.py:292: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
  warnings.warn('No training configuration found in save file: '
2023-07-28 15:40:56,006 [INFO] iva.detectnet_v2.objectives.bbox_objective: Default L1 loss function will be used.
2023-07-28 15:40:56,006 [INFO] root: Building dataloader.
2023-07-28 15:40:57,331 [INFO] root: Sampling mode of the dataloader was set to user_defined.
2023-07-28 15:40:57,332 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Serial augmentation enabled = False
2023-07-28 15:40:57,332 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Pseudo sharding enabled = False
2023-07-28 15:40:57,332 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: Max Image Dimensions (all sources): (0, 0)
2023-07-28 15:40:57,332 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: number of cpus: 24, io threads: 48, compute threads: 24, buffered batches: 4
2023-07-28 15:40:57,333 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: total dataset size 10552, number of sources: 1, batch size per gpu: 4, steps: 2638
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

2023-07-28 15:40:57,373 [WARNING] tensorflow: From /usr/local/lib/python3.6/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

WARNING:tensorflow:Entity <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7ff743a1fe10>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7ff743a1fe10>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-07-28 15:40:57,415 [WARNING] tensorflow: Entity <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7ff743a1fe10>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method DriveNetTFRecordsParser.__call__ of <iva.detectnet_v2.dataloader.drivenet_dataloader.DriveNetTFRecordsParser object at 0x7ff743a1fe10>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-07-28 15:40:57,439 [INFO] iva.detectnet_v2.dataloader.default_dataloader: Bounding box coordinates were detected in the input specification! Bboxes will be automatically converted to polygon coordinates.
2023-07-28 15:40:57,747 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: shuffle: False - shard 0 of 1
2023-07-28 15:40:57,755 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: sampling 1 datasets with weights:
2023-07-28 15:40:57,755 [INFO] modulus.blocks.data_loaders.multi_source_loader.data_loader: source: 0 weight: 1.000000
WARNING:tensorflow:Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ff71fff6dd8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ff71fff6dd8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-07-28 15:40:57,774 [WARNING] tensorflow: Entity <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ff71fff6dd8>> could not be transformed and will be executed as-is. Please report this to the AutoGraph team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output. Cause: Unable to locate the source code of <bound method Processor.__call__ of <modulus.blocks.data_loaders.multi_source_loader.processors.asset_loader.AssetLoader object at 0x7ff71fff6dd8>>. Note that functions defined in certain environments, like the interactive Python shell do not expose their source code. If that is the case, you should to define them in a .py source file. If you are certain the code is graph-compatible, wrap the call using @tf.autograph.do_not_convert. Original error: could not get source code
2023-07-28 15:40:58,081 [INFO] iva.detectnet_v2.evaluation.build_evaluator: Found 10552 samples in validation set
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1607, in _create_c_op
    c_op = c_api.TF_FinishOperation(op_desc)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot reshape a tensor with 130560 elements to shape [4,1,4,34,60] (32640 elements) for 'reshape_1_1/Reshape' (op: 'Reshape') with input shapes: [4,16,34,60], [5] and with input tensors computed as partial shapes: input[1] = [4,1,4,34,60].

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "</usr/local/lib/python3.6/dist-packages/iva/detectnet_v2/scripts/evaluate.py>", line 3, in <module>
  File "<frozen iva.detectnet_v2.scripts.evaluate>", line 204, in <module>
  File "<frozen iva.detectnet_v2.scripts.evaluate>", line 194, in <module>
  File "<decorator-gen-2>", line 2, in main
  File "<frozen iva.detectnet_v2.utilities.timer>", line 46, in wrapped_fn
  File "<frozen iva.detectnet_v2.scripts.evaluate>", line 177, in main
  File "<frozen iva.detectnet_v2.evaluation.build_evaluator>", line 158, in build_evaluator_for_trained_gridbox
  File "<frozen iva.detectnet_v2.model.utilities>", line 30, in _fn_wrapper
  File "<frozen iva.detectnet_v2.model.detectnet_model>", line 736, in build_validation_graph
  File "<frozen iva.detectnet_v2.model.utilities>", line 30, in _fn_wrapper
  File "<frozen iva.detectnet_v2.model.detectnet_model>", line 690, in build_inference_graph
  File "<frozen iva.detectnet_v2.model.detectnet_model>", line 306, in predictions_to_dict
  File "<frozen iva.detectnet_v2.objectives.base_objective>", line 98, in reshape_output
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/base_layer.py", line 457, in __call__
    output = self.call(inputs, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/layers/core.py", line 401, in call
    return K.reshape(inputs, (K.shape(inputs)[0],) + self.target_shape)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 1969, in reshape
    return tf.reshape(x, shape)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/array_ops.py", line 131, in reshape
    result = gen_array_ops.reshape(tensor, shape, name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/ops/gen_array_ops.py", line 8115, in reshape
    "Reshape", tensor=tensor, shape=shape, name=name)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/util/deprecation.py", line 513, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1770, in __init__
    control_input_ops)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/framework/ops.py", line 1610, in _create_c_op
    raise ValueError(str(e))
ValueError: Cannot reshape a tensor with 130560 elements to shape [4,1,4,34,60] (32640 elements) for 'reshape_1_1/Reshape' (op: 'Reshape') with input shapes: [4,16,34,60], [5] and with input tensors computed as partial shapes: input[1] = [4,1,4,34,60].
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: __init__() missing 4 required positional arguments: 'code', 'msg', 'hdrs', and 'fp'
Execution status: FAIL

Please provide the following information when requesting support.

• Hardware: Tesla T4
• Network Type: Detectnet_v2 (TrafficCamNet, unpruned)
• Training spec file (see below)

random_seed: 23

model_config {
    arch: "resnet"
    pretrained_model_file: "/workspace/tao-eval/trafficcamnet/unpruned_v1.0/trafficcamnet_vunpruned_v1.0/resnet18_trafficcamnet.tlt"
    num_layers: 18
    use_batch_norm: true
    objective_set {
        bbox {
            scale: 35.0
            offset: 0.5
        }
        cov {
        }
    }
    training_precision {
        backend_floatx: FLOAT32
    }
    all_projections: true
}

evaluation_config {
    validation_period_during_training: 10
    first_validation_epoch: 30
    average_precision_mode: INTEGRATE
    minimum_detection_ground_truth_overlap {
        key: "car"
        value: 0.6
    }
    evaluation_box_config {
        key: "car"
        value {
            minimum_height: 20
            maximum_height: 9999
            minimum_width: 10
            maximum_width: 9999
        }
    }
}

dataset_config {
    data_sources {
        tfrecords_path: "/workspace/tao-eval/data/tfrecords/*"
        image_directory_path: "/workspace/tao-eval/coco/data"
    }
    image_extension: "jpg"
    target_class_mapping {
        key: "car"
        value: "car"
    }
    validation_fold: 0
}

augmentation_config {
    preprocessing {
        output_image_width: 960
        output_image_height: 544
        min_bbox_width: 1.0
        min_bbox_height: 1.0
        output_image_channel: 3
        enable_auto_resize: true
    }
    spatial_augmentation {
        hflip_probability: 0.5
        zoom_min: 1.0
        zoom_max: 1.0
        translate_max_x: 8.0
        translate_max_y: 8.0
    }
    color_augmentation {
        hue_rotation_max: 25.0
        saturation_shift_max: 0.2
        contrast_scale_max: 0.1
        contrast_center: 0.5
    }
}

# Taken from notebook example:
postprocessing_config {
    target_class_config {
        key: "car"
        value {
            clustering_config {
                clustering_algorithm: DBSCAN
                dbscan_confidence_threshold: 0.9
                coverage_threshold: 0.0
                dbscan_eps: 0.2
                dbscan_min_samples: 0.05
                minimum_bounding_box_height: 20
            }
        }
    }
}

# Taken from notebook example:
cost_function_config {
    target_classes {
        name: "car"
        class_weight: 1.0
        coverage_foreground_weight: 0.05
        objectives {
            name: "cov"
            initial_weight: 1.0
            weight_target: 1.0
        }
        objectives {
            name: "bbox"
            initial_weight: 10.0
            weight_target: 10.0
        }
    }
}

# Taken from notebook example:
training_config {
    batch_size_per_gpu: 4
    num_epochs: 120
    learning_rate {
        soft_start_annealing_schedule {
            min_learning_rate: 5e-06
            max_learning_rate: 5e-04
            soft_start: 0.1
            annealing: 0.7
        }
    }
    regularizer {
        type: L1
        weight: 3e-09
    }
    optimizer {
        adam {
            epsilon: 10e-09
            beta1: 0.9
            beta2: 1.0
        }
    }
    cost_scaling {
        initial_exponent: 20.0
        increment: 0.005
        decrement: 1.0
    }
    visualizer {
        enabled: true
        num_images: 3
        scalar_logging_frequency: 50
        infrequent_logging_frequency: 5
        target_class_config {
            key: "car"
            value: {
                coverage_threshold: 0.005
            }
        }
    }
    checkpoint_interval: 10
}

bbox_rasterizer_config {
    target_class_config {
        key: "car"
        value {
            cov_center_x: 0.5
            cov_center_y: 0.5
            cov_radius_x: 1.0
            cov_radius_y: 1.0
            bbox_min_radius: 1.0
        }
    }
    deadzone_radius: 0.4
}

My COCO spec file, coco_traffic_cam_spec.txt, is used for dataset conversion:

coco_config {
    root_directory_path: "/workspace/tao-eval/coco/"
    img_dir_names: ["data"]
    annotation_files: ["labels.json"]
    num_partitions: 1
    num_shards: [8]
}
image_directory_path: "/workspace/tao-eval/coco/"
target_class_mapping {
    key: "vehicle" # Label from dataset
    value: "car" # Label in TrafficCamNet
}

which works properly for the following command:

%%bash -s "$tao_container_id" "$TRAFFIC_CAM_DATA_SPEC_PATH" "$CONVERT_OUTPUT_DIR"

docker exec $1 detectnet_v2 dataset_convert --dataset_export_spec $2 --output_filename $3

with output:

WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Using TensorFlow backend.
2023-07-28 15:40:11,559 [INFO] iva.detectnet_v2.dataio.build_converter: Instantiating a coco converter
2023-07-28 15:40:12,287 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 0, shard 0
2023-07-28 15:40:12,613 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 0, shard 1
2023-07-28 15:40:12,951 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 0, shard 2
2023-07-28 15:40:13,293 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 0, shard 3
2023-07-28 15:40:13,612 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 0, shard 4
2023-07-28 15:40:13,925 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 0, shard 5
2023-07-28 15:40:14,214 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 0, shard 6
2023-07-28 15:40:14,489 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 0, shard 7
2023-07-28 15:40:14,765 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: 
Wrote the following numbers of objects:
b'vehicle': 38956

2023-07-28 15:40:14,765 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 1, shard 0
2023-07-28 15:40:15,097 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 1, shard 1
2023-07-28 15:40:15,439 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 1, shard 2
2023-07-28 15:40:15,758 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 1, shard 3
2023-07-28 15:40:16,084 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 1, shard 4
2023-07-28 15:40:16,406 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 1, shard 5
2023-07-28 15:40:16,695 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 1, shard 6
2023-07-28 15:40:16,983 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Writing partition 1, shard 7
2023-07-28 15:40:17,276 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: 
Wrote the following numbers of objects:
b'vehicle': 38956

2023-07-28 15:40:17,276 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Cumulative object statistics
2023-07-28 15:40:17,276 [INFO] iva.detectnet_v2.dataio.dataset_converter_lib: 
Wrote the following numbers of objects:
b'vehicle': 77912

2023-07-28 15:40:17,276 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Class map. 
Label in GT: Label in tfrecords file 
vehicle: vehicle
2023-07-28 15:40:17,276 [INFO] iva.detectnet_v2.dataio.coco_converter_lib: Tfrecords generation complete.
loading annotations into memory...
Done (t=0.26s)
creating index...
index created!
loading annotations into memory...
Done (t=0.42s)
creating index...
index created!
For the dataset_config in the experiment_spec, please use labels in the tfrecords file, while writing the classmap.

Note that in my training spec I have preprocessing in augmentation_config set to the output dimensions expected by TrafficCamNet, which is 960x544.

My TFRecord input files have uniform dimensions (640x480) and a single label class. I confirmed the dimension uniformity with this script:

import os

import tensorflow as tf

tfrecord_dir = "/workspace/tao-eval/data/tfrecords/"
files = os.listdir(tfrecord_dir)

expected_frame_width = 640
expected_frame_height = 480
found_width = []
found_height = []
for file in files:
    full_path = os.path.join(tfrecord_dir, file)
    # TF1-style record iterator, matching the TF 1.15 runtime in the container.
    tfrecords_iter = tf.python_io.tf_record_iterator(full_path)

    for record in tfrecords_iter:
        example = tf.train.Example()
        example.ParseFromString(record)

        features = example.features.feature

        # "frame/id" is the record identifier written by the TAO converter.
        frame_id = features["frame/id"].bytes_list.value
        frame_width = features["frame/width"].int64_list.value
        frame_height = features["frame/height"].int64_list.value

        # Collect every distinct dimension seen across all records.
        if frame_width[0] not in found_width:
            found_width.append(frame_width[0])
        if frame_height[0] not in found_height:
            found_height.append(frame_height[0])

        # Report any record that deviates from the expected dimensions.
        if (frame_width[0] != expected_frame_width
                or frame_height[0] != expected_frame_height):
            print("The frame_id is: {}\n".format(frame_id))
            print("The frame_width is: {}\n".format(frame_width))
            print("The frame_height is: {}\n".format(frame_height))

print(found_width)
print(found_height)

I expect the resizing from 640x480 to the required 960x544 to be handled by enable_auto_resize: true in the augmentation_config.preprocessing block.
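As a side note, a quick arithmetic sketch of my own (assuming auto-resize simply rescales each axis independently): the two resolutions have different aspect ratios, so the scaling is non-uniform.

# Per-axis scale factors for the assumed auto-resize.
src_w, src_h = 640, 480   # TFRecord frame size
dst_w, dst_h = 960, 544   # TrafficCamNet input size
print(dst_w / src_w)      # 1.5
print(dst_h / src_h)      # ~1.133 -> frames stretch more horizontally than vertically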

So my guess is that the issue is related to my source dataset having 5 classes (the only one I care about is "vehicle", so I only map "vehicle" to "car" in target_class_mapping), while TrafficCamNet was trained on 4 classes. Are there best practices for dealing with unused classes from the source dataset? I have also tried evaluating on a subset of my training data that contains only "vehicle" detections, with the same issue.

Last thing: it seems that you usually run evaluate after train, but I am running evaluate first in order to establish a performance baseline before fine-tuning.

Any advice appreciated. Thank you.

What is the exact model passed as "$3"?
Can you share its full name?

Hi Morganh,

I am using the following command to download the model:

ngc registry model download-version nvidia/tao/trafficcamnet:unpruned_v1.0

using ngc version: NGC CLI 3.25.0

This results in a file named resnet18_trafficcamnet.tlt that I reference in the model config spec. When running the evaluate command, I pass this .tlt file to the --model_file argument.

Thank you for your help.

Hi @Morganh ,

I figured out that the issue in my case was that I had to specify all of the classes TrafficCamNet expects in the experiment spec (the evaluation, postprocessing, cost function, and bbox rasterizer sections). Since TrafficCamNet was trained on 4 classes, it seems to require all 4 in the configuration.
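For anyone hitting the same ValueError: the element counts in the error message line up with this explanation. A quick back-of-the-envelope check (my own sketch; the batch size and the 34x60 grid are taken from the error above):

batch, grid_h, grid_w = 4, 34, 60
model_classes = 4   # car, bicycle, person, road_sign baked into the .tlt
spec_classes = 1    # my original spec only declared "car"

# The bbox head emits num_classes * 4 box coordinates per grid cell.
print(batch * model_classes * 4 * grid_h * grid_w)  # 130560 elements the model produces
print(batch * spec_classes * 4 * grid_h * grid_w)   # 32640 elements in the target [4,1,4,34,60]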

Thank you

OK, could you please share your latest spec file? Thanks a lot.

@Morganh sure, please see below. Something I'm still wondering about is best practices for the (probably common) scenario in which the classes of the original model dataset and the fine-tuning dataset only partially overlap. For example, if the model is trained on classes (A, B, C) and the fine-tuning dataset contains classes (C, D, E), and one wants to train/evaluate on the intersection only (class C), what is the proper way to specify this? There is the target class mapping, e.g. "human" → "person", but there is seemingly no documentation on handling unused classes from the fine-tuning data, or on satisfying class expectations from the model side. Hopefully that makes sense. Thanks for your help.
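To make the question concrete, here is a hypothetical dataset_config fragment (illustration only, not something I have verified) for that scenario:

# Model trained on classes A, B, C; fine-tuning data labeled C, D, E.
target_class_mapping {
    key: "C"     # label in the fine-tuning dataset
    value: "C"   # label the model was trained on
}
# D and E are simply left unmapped here - but is silently dropping them the
# intended mechanism, and do A and B still need entries in evaluation_config,
# postprocessing_config, cost_function_config, etc. even though the data
# contains no such labels?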

random_seed: 23

model_config {
    arch: "resnet"
    pretrained_model_file: "/workspace/trafficcamnet/unpruned_v1.0/trafficcamnet_vunpruned_v1.0/resnet18_trafficcamnet.tlt"
    num_layers: 18
    use_batch_norm: true
    objective_set {
        bbox {
            scale: 35.0
            offset: 0.5
        }
        cov {
        }
    }
    training_precision {
        backend_floatx: FLOAT32
    }
    all_projections: true
}

evaluation_config {
    validation_period_during_training: 10
    first_validation_epoch: 30
    average_precision_mode: INTEGRATE
    minimum_detection_ground_truth_overlap {
        key: "car"
        value: 0.6
    }
    minimum_detection_ground_truth_overlap {
        key: "bicycle"
        value: 0.6
    }
    minimum_detection_ground_truth_overlap {
        key: "road_sign"
        value: 0.6
    }
    minimum_detection_ground_truth_overlap {
        key: "person"
        value: 0.6
    }
    evaluation_box_config {
        key: "car"
        value {
            minimum_height: 20
            maximum_height: 9999
            minimum_width: 10
            maximum_width: 9999
        }
    }
    evaluation_box_config {
        key: "bicycle"
        value {
            minimum_height: 20
            maximum_height: 9999
            minimum_width: 10
            maximum_width: 9999
        }
    }
    evaluation_box_config {
        key: "road_sign"
        value {
            minimum_height: 20
            maximum_height: 9999
            minimum_width: 10
            maximum_width: 9999
        }
    }
    evaluation_box_config {
        key: "person"
        value {
            minimum_height: 20
            maximum_height: 9999
            minimum_width: 10
            maximum_width: 9999
        }
    }
}

dataset_config {
    data_sources {
        tfrecords_path: "/workspace/data/tfrecords/*"
        image_directory_path: "/workspace/coco-dataset/"
    }
    image_extension: "jpg"
    target_class_mapping {
        key: "car"
        value: "car"
    }
    target_class_mapping {
        key: "bicycle"
        value: "bicycle"
    }
    validation_fold: 0
}

augmentation_config {
    preprocessing {
        output_image_width: 960
        output_image_height: 544
        min_bbox_width: 1.0
        min_bbox_height: 1.0
        output_image_channel: 3
        enable_auto_resize: true
    }
    spatial_augmentation {
        hflip_probability: 0.5
        zoom_min: 1.0
        zoom_max: 1.0
        translate_max_x: 8.0
        translate_max_y: 8.0
    }
    color_augmentation {
        hue_rotation_max: 25.0
        saturation_shift_max: 0.2
        contrast_scale_max: 0.1
        contrast_center: 0.5
    }
}

postprocessing_config {
    target_class_config {
        key: "car"
        value {
            clustering_config {
                clustering_algorithm: DBSCAN
                dbscan_confidence_threshold: 0.9
                coverage_threshold: 0.0
                dbscan_eps: 0.2
                dbscan_min_samples: 0.05
                minimum_bounding_box_height: 20
            }
        }
    }
    target_class_config {
        key: "bicycle"
        value {
            clustering_config {
                clustering_algorithm: DBSCAN
                dbscan_confidence_threshold: 0.9
                coverage_threshold: 0.0
                dbscan_eps: 0.2
                dbscan_min_samples: 0.05
                minimum_bounding_box_height: 20
            }
        }
    }
    target_class_config {
        key: "road_sign"
        value {
            clustering_config {
                clustering_algorithm: DBSCAN
                dbscan_confidence_threshold: 0.9
                coverage_threshold: 0.0
                dbscan_eps: 0.2
                dbscan_min_samples: 0.05
                minimum_bounding_box_height: 20
            }
        }
    }
    target_class_config {
        key: "person"
        value {
            clustering_config {
                clustering_algorithm: DBSCAN
                dbscan_confidence_threshold: 0.9
                coverage_threshold: 0.0
                dbscan_eps: 0.2
                dbscan_min_samples: 0.05
                minimum_bounding_box_height: 20
            }
        }
    }
}

cost_function_config {
    target_classes {
        name: "car"
        class_weight: 1.0
        coverage_foreground_weight: 0.05
        objectives {
            name: "cov"
            initial_weight: 1.0
            weight_target: 1.0
        }
        objectives {
            name: "bbox"
            initial_weight: 10.0
            weight_target: 10.0
        }
    }
    target_classes {
        name: "bicycle"
        class_weight: 1.0
        coverage_foreground_weight: 0.05
        objectives {
            name: "cov"
            initial_weight: 1.0
            weight_target: 1.0
        }
        objectives {
            name: "bbox"
            initial_weight: 10.0
            weight_target: 10.0
        }
    }
    target_classes {
        name: "road_sign"
        class_weight: 1.0
        coverage_foreground_weight: 0.05
        objectives {
            name: "cov"
            initial_weight: 1.0
            weight_target: 1.0
        }
        objectives {
            name: "bbox"
            initial_weight: 10.0
            weight_target: 10.0
        }
    }
    target_classes {
        name: "person"
        class_weight: 1.0
        coverage_foreground_weight: 0.05
        objectives {
            name: "cov"
            initial_weight: 1.0
            weight_target: 1.0
        }
        objectives {
            name: "bbox"
            initial_weight: 10.0
            weight_target: 10.0
        }
    }
}

training_config {
    batch_size_per_gpu: 4
    num_epochs: 120
    learning_rate {
        soft_start_annealing_schedule {
            min_learning_rate: 5e-06
            max_learning_rate: 5e-04
            soft_start: 0.1
            annealing: 0.7
        }
    }
    regularizer {
        type: L1
        weight: 3e-09
    }
    optimizer {
        adam {
            epsilon: 10e-09
            beta1: 0.9
            beta2: 1.0
        }
    }
    cost_scaling {
        initial_exponent: 20.0
        increment: 0.005
        decrement: 1.0
    }
    visualizer {
        enabled: true
        num_images: 3
        scalar_logging_frequency: 50
        infrequent_logging_frequency: 5
        target_class_config {
            key: "car"
            value: {
                coverage_threshold: 0.005
            }
        }
        target_class_config {
            key: "bicycle"
            value: {
                coverage_threshold: 0.005
            }
        }
        target_class_config {
            key: "road_sign"
            value: {
                coverage_threshold: 0.005
            }
        }
    }
    checkpoint_interval: 10
}

bbox_rasterizer_config {
    target_class_config {
        key: "car"
        value {
            cov_center_x: 0.5
            cov_center_y: 0.5
            cov_radius_x: 1.0
            cov_radius_y: 1.0
            bbox_min_radius: 1.0
        }
    }
    target_class_config {
        key: "bicycle"
        value {
            cov_center_x: 0.5
            cov_center_y: 0.5
            cov_radius_x: 1.0
            cov_radius_y: 1.0
            bbox_min_radius: 1.0
        }
    }
    target_class_config {
        key: "road_sign"
        value {
            cov_center_x: 0.5
            cov_center_y: 0.5
            cov_radius_x: 1.0
            cov_radius_y: 1.0
            bbox_min_radius: 1.0
        }
    }
    target_class_config {
        key: "person"
        value {
            cov_center_x: 0.5
            cov_center_y: 0.5
            cov_radius_x: 1.0
            cov_radius_y: 1.0
            bbox_min_radius: 1.0
        }
    }
    deadzone_radius: 0.4
}

Hi @Morganh, unrelated issue, but since my spec file is already provided above: do you see any reason why evaluation would run extremely slowly here, even with a GPU specified? For example, it is taking over 90 minutes to evaluate about 10K images at batch size 32 on a T4. GPU memory usage stays at 0% except for rare peaks, similar to the thread "GPU utilization is 0% during evaluation". Thank you.

How about running the command below?
$ nvidia-smi

Hi @Morganh ,

thank you for your reply.

When I run nvidia-smi -l 1 to monitor live GPU usage, memory use stays at 0% except for rare peaks, similar to the user in the thread linked above.

Could you share the result of nvidia-smi? What is the driver?

@Morganh please see below. Thanks

$ nvidia-smi
Tue Aug  1 16:15:42 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.116.04   Driver Version: 525.116.04   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:05.0 Off |                    0 |
| N/A   76C    P0    32W /  70W |    271MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Did you add --runtime=nvidia when running the docker run command?

@Morganh I currently just use tao detectnet_v2 to start the container and then docker exec to run TAO commands in that container from the host. I would expect a TAO container to use the NVIDIA container runtime by default - is that not the case? Do you have any recommendations on that front? Inference runs quickly and consumes VRAM, so do you think it could instead be an issue with my evaluation or augmentation config? Thank you.
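For reference, the kind of explicit launch I could try instead (hypothetical command; <image> stands in for whatever TAO TF1 container tag the launcher pulls):

docker run --runtime=nvidia -it --rm <image> /bin/bash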

To narrow down, please try to run the default detectnet_v2 notebook from NVIDIA NGC against the default KITTI dataset mentioned in it. Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.