Inference fails after batch size of 32

quinn · August 25, 2022, 6:47pm

I am running inference at varying batch sizes, and it fails with internal errors once the batch size exceeds 32. Below is the error for a batch size of 33:

tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Input to reshape is a tensor with 587520 values, but the requested shape requires a multiple of 33
	 [[{{node proposal_1/Reshape_6}}]]
	 [[proposal_1/cond_411/switch_t/_2978]]
  (1) Invalid argument: Input to reshape is a tensor with 587520 values, but the requested shape requires a multiple of 33
	 [[{{node proposal_1/Reshape_6}}]]

Note that this shape mismatch does not happen with a batch size of 31, which executes normally. Here is the error with a larger batch size of 64:

Traceback (most recent call last):
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/faster_rcnn/scripts/inference.py", line 301, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/faster_rcnn/scripts/inference.py", line 289, in <module>
  File "/root/.cache/bazel/_bazel_root/ed34e6d125608f91724fda23656f1726/execroot/ai_infra/bazel-out/k8-fastbuild/bin/magnet/packages/iva/build_wheel.runfiles/ai_infra/iva/faster_rcnn/scripts/inference.py", line 218, in main
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1169, in predict
    steps=steps)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_arrays.py", line 294, in predict_loop
    batch_outs = f(ins_batch)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow_core/python/client/session.py", line 1472, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: Incompatible shapes: [1175040] vs. [587520]
	 [[{{node proposal_1/mul_4}}]]
	 [[proposal_1/cond_122/Min/Switch/_3675]]
  (1) Invalid argument: Incompatible shapes: [1175040] vs. [587520]
	 [[{{node proposal_1/mul_4}}]]
0 successful operations.

The inference is being performed on an RTX 3090, which is only 60% utilized with a batch size of 32, leading me to believe it is not a memory issue. Only the batch_size in the inference_config section of the experiment spec is changed between runs.
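
The element counts in the two errors can be checked against the spec values in this post. The sketch below assumes the RPN feature map is the input size divided by rpn_stride, rounded up, and that there are 3 scales x 3 ratios = 9 anchors per position; neither assumption is confirmed in this thread. Under those assumptions, 587520 is exactly the anchor count for 32 images and 1175040 is the count for 64, which suggests (as a hypothesis only) that one tensor in the proposal layer stays sized for 32 images regardless of the requested batch size.

```python
import math

# Values copied from the spec and the error messages in this post.
HEIGHT, WIDTH = 540, 960
RPN_STRIDE = 16
NUM_ANCHORS = 3 * 3          # 3 scales x 3 ratios
REPORTED_SMALL = 587520      # appears in both errors
REPORTED_LARGE = 1175040     # appears in the batch-size-64 error

# Assumption: feature map size = ceil(input size / stride).
feat_h = math.ceil(HEIGHT / RPN_STRIDE)
feat_w = math.ceil(WIDTH / RPN_STRIDE)
anchors_per_image = feat_h * feat_w * NUM_ANCHORS


def implied_batch(num_values):
    """Return the batch size num_values corresponds to, or None if it does not divide evenly."""
    quotient, remainder = divmod(num_values, anchors_per_image)
    return quotient if remainder == 0 else None


print(anchors_per_image)              # 18360
print(implied_batch(REPORTED_SMALL))  # 32
print(implied_batch(REPORTED_LARGE))  # 64
print(REPORTED_SMALL % 33)            # 21, non-zero, so a reshape to a multiple of 33 fails
```

This is also consistent with a batch size of 31 running normally, although the arithmetic alone does not show where the value 32 comes from.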

Full spec below:

random_seed: 42
enc_key: 'zeroeyes'
verbose: True
model_config {
  input_image_config {
    image_type: RGB
    image_channel_order: 'bgr'
    size_height_width {
      height: 540
      width: 960
    }
    image_channel_mean {
      key: 'b'
      value: 114.244538027499
    }
    image_channel_mean {
      key: 'g'
      value: 117.13666031604566
    }
    image_channel_mean {
      key: 'r'
      value: 116.52070103424707
    }
    image_scaling_factor: 1
    max_objects_num_per_image: 10
  }
  arch: "resnet:34"
  anchor_box_config {
    scale: 20
    scale: 40
    scale: 90
    ratio: 1.0
    ratio: 0.5
    ratio: 2.0
  }
  freeze_bn: True
  roi_mini_batch: 256
  rpn_stride: 16
  use_bias: False
  roi_pooling_config {
    pool_size: 7
    pool_size_2x: False
  }
  all_projections: True
  use_pooling: False
}
dataset_config {
  data_sources: {
    tfrecords_path: "/workspace/ZNT/Z_35/tfrecords/train/tfrecord*"
    image_directory_path: "/workspace/DAB/D_39"
  }
  image_extension: 'jpg'
  target_class_mapping {
    key: 'p_1'
    value: 'P'
  }
  target_class_mapping {
    key: 'r_1'
    value: 'R'
  }
  target_class_mapping {
    key: 'ca_1'
    value: 'DontCare'
  }
  target_class_mapping {
    key: 'dc_0'
    value: 'DontCare'
  }
  validation_data_source: {
    tfrecords_path: "/workspace/ZNT/Z_35/tfrecords/val/tfrecord*"
    image_directory_path: "/workspace/DAB/D_39"
  }
}
augmentation_config {
  preprocessing {
    output_image_width: 960
    output_image_height: 540
    output_image_channel: 3
    min_bbox_width: 0.0
    min_bbox_height: 0.0
    enable_auto_resize: False
  }
  spatial_augmentation {
    hflip_probability: 0.5
    vflip_probability: 0.0
    zoom_min: 0.75
    zoom_max: 1.25
    translate_max_x: 192
    translate_max_y: 104
    rotate_rad_max: 0.7
  }
  color_augmentation {
    hue_rotation_max: 50
    saturation_shift_max: 0.3
    contrast_scale_max: 0.25
    contrast_center: 0.5
  }
}
training_config {
  checkpoint_interval: 1
  pretrained_weights: "/workspace/pretrained_models/resnet_34.hdf5"
  output_model: "/workspace/ZNT/Z_35/weights/model.tlt"
  enable_augmentation: True
  enable_qat: True
  batch_size_per_gpu: 12
  num_epochs: 100
  rpn_min_overlap: 0.3
  rpn_max_overlap: 0.7
  classifier_min_overlap: 0.0
  classifier_max_overlap: 0.5
  gt_as_roi: False
  std_scaling: 1.0
  classifier_regr_std {
    key: 'x'
    value: 10
  }
  classifier_regr_std {
    key: 'y'
    value: 10
  }
  classifier_regr_std {
    key: 'w'
    value: 5
  }
  classifier_regr_std {
    key: 'h'
    value: 5
  }
  rpn_mini_batch: 256
  rpn_pre_nms_top_N: 1000
  rpn_nms_max_boxes: 200
  rpn_nms_overlap_threshold: 0.6
  regularizer {
    type: L2
    weight: 1e-05
  }
  optimizer {
    adam {
      lr: 0.00001
      beta_1: 0.9
      beta_2: 0.999
      decay: 0.0
    }
  }
  learning_rate {
    step {
      base_lr: 2e-05
      gamma: 0.75
      step_size: 10
    }
  }
  lambda_rpn_regr: 1.0
  lambda_rpn_class: 1.0
  lambda_cls_regr: 1.0
  lambda_cls_class: 1.0
}
inference_config {
  images_dir: '/workspace/DAB/D_infer_test/128'
  model: '/workspace/ZNT/Z_35/models/model.epoch31.tlt'
  batch_size: 64
  detection_image_output_dir: '/workspace/DAB/D_infer_test/128_results/images'
  labels_dump_dir: '/workspace/DAB/D_infer_test/128_results/labels'
  rpn_pre_nms_top_N: 3000
  rpn_nms_max_boxes: 500
  rpn_nms_overlap_threshold: 0.7
  object_confidence_thres: 0.0001
  bbox_visualize_threshold: 0.9
  classifier_nms_max_boxes: 100
  classifier_nms_overlap_threshold: 0.3
}
evaluation_config {
  model: '/workspace/ZNT/Z_35/weights/model.epoch20.tlt'
  batch_size: 12
  validation_period_during_training: 1
  rpn_pre_nms_top_N: 3000
  rpn_nms_max_boxes: 500
  rpn_nms_overlap_threshold: 0.7
  classifier_nms_max_boxes: 100
  classifier_nms_overlap_threshold: 0.3
  object_confidence_thres: 0.0001
  use_voc07_11point_metric: False
  gt_matching_iou_threshold: 0.5
}
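
Since only the batch_size inside inference_config changes between runs, the spec variants for a sweep can be generated from one base file. The helper below is a hypothetical sketch, not part of the toolkit: the function name and the SAMPLE string are made up for illustration, and the regular expression assumes inference_config contains no nested braces, as in the spec above. It leaves the batch_size in evaluation_config untouched.

```python
import re

# Hypothetical helper: rewrites only the batch_size inside inference_config.
# Assumption: the inference_config block has no nested braces.
_INFER_BATCH = re.compile(r"(inference_config\s*\{[^{}]*?\bbatch_size:\s*)\d+")


def set_inference_batch_size(spec_text: str, batch_size: int) -> str:
    """Return a copy of spec_text with the inference batch_size replaced."""
    new_text, count = _INFER_BATCH.subn(
        lambda m: m.group(1) + str(batch_size), spec_text, count=1
    )
    if count != 1:
        raise ValueError("no batch_size found inside inference_config")
    return new_text


# Shortened stand-in for the full spec, for demonstration only.
SAMPLE = """\
inference_config {
  model: '/workspace/ZNT/Z_35/models/model.epoch31.tlt'
  batch_size: 64
}
evaluation_config {
  batch_size: 12
}
"""

for size in (31, 32, 33):
    variant = set_inference_batch_size(SAMPLE, size)
    # Line index 2 of the sample is the inference batch_size line.
    print(variant.splitlines()[2].strip())
```

Writing each variant to its own file and running inference on it would narrow down exactly where the failure starts, without hand-editing the spec each time.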

Morganh · August 26, 2022, 5:06pm

There has been no update from you for a while, so we assume this is no longer an issue.
We are closing this topic. If you need further support, please open a new one.
Thanks

How about running evaluation?

system · September 13, 2022, 1:34am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.