TensorRT export of PoseNet: batch size problems

Hello,

I would like to ask how I can solve a problem with inference on a model (bpnet) exported to TensorRT.

Here are important cells from my notebook:

  1. Env variables setup

%set_env MAX_BATCH_SIZE=32
%set_env OPT_BATCH_SIZE=32

  2. Export of the trainable bpnet model (I don't need to do any training)

!tao bpnet export -m $USER_EXPERIMENT_DIR/pretrained_model/bodyposenet_vtrainable_v1.0/model.tlt \
                  -e $SPECS_DIR/bpnet_custom_m1.yaml \
                  -o $USER_EXPERIMENT_DIR/models/exported/bpnet_model_$MAX_BATCH_SIZE.etlt \
                  -k $KEY \
                  -t tfonnx \
                  --batch_size=$OPT_BATCH_SIZE \
                  --max_batch_size=$MAX_BATCH_SIZE

  3. Setup of inference parameters

%set_env IN_HEIGHT=244
%set_env IN_WIDTH=320
%set_env IN_CHANNELS=3
%set_env INPUT_SHAPE=224x320x3
%set_env INPUT_NAME=input_1:0

  4. Convert to fp16 TensorRT

!tao converter $USER_EXPERIMENT_DIR/models/exported/bpnet_model_$MAX_BATCH_SIZE.etlt \
               -k $KEY \
               -t fp16 \
               -e $USER_EXPERIMENT_DIR/models/tensorRT/bpnet_model.$IN_HEIGHT.$IN_WIDTH.$MAX_BATCH_SIZE.fp16.engine \
               -p ${INPUT_NAME},1x$INPUT_SHAPE,${OPT_BATCH_SIZE}x$INPUT_SHAPE,${MAX_BATCH_SIZE}x$INPUT_SHAPE
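
For reference, my understanding is that the -p argument above corresponds to a TensorRT optimization profile (min/opt/max input shapes). A minimal sketch of the equivalent with the plain TensorRT Python API is below; it is only an illustration on an unencrypted ONNX file, not on the encrypted .etlt that tao converter actually consumes, and the paths are placeholders:

import tensorrt as trt

# Illustration only: build an fp16 engine with the same min/opt/max profile
# that the -p argument expresses. Paths are placeholders; this parses a plain
# ONNX file, not the encrypted .etlt.
logger = trt.Logger(trt.Logger.INFO)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
with open("bpnet_model.onnx", "rb") as f:          # placeholder path
    assert parser.parse(f.read())

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)              # same intent as `-t fp16`

profile = builder.create_optimization_profile()
profile.set_shape("input_1:0",                     # $INPUT_NAME
                  min=(1, 224, 320, 3),            # 1x$INPUT_SHAPE
                  opt=(32, 224, 320, 3),           # ${OPT_BATCH_SIZE}x$INPUT_SHAPE
                  max=(32, 224, 320, 3))           # ${MAX_BATCH_SIZE}x$INPUT_SHAPE
config.add_optimization_profile(profile)

engine_bytes = builder.build_serialized_network(network, config)
with open("bpnet_model.fp16.engine", "wb") as f:   # placeholder path
    f.write(engine_bytes)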

  5. Output of the fp16 TensorRT conversion:

2023-04-14 11:17:02,796 [INFO] root: Registry: ['nvcr.io']
2023-04-14 11:17:02,839 [INFO] tlt.components.instance_handler.local_instance: Running command in container: nvcr.io/nvidia/tao/tao-toolkit:4.0.0-tf1.15.5
[INFO] [MemUsageChange] Init CUDA: CPU +328, GPU +0, now: CPU 340, GPU 1326 (MiB)
[INFO] [MemUsageChange] Init builder kernel library: CPU +442, GPU +116, now: CPU 837, GPU 1483 (MiB)
[WARNING] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
[INFO] ----------------------------------------------------------------
[INFO] Input filename: /tmp/filewJavQn
[INFO] ONNX IR version: 0.0.5
[INFO] Opset version: 10
[INFO] Producer name: tf2onnx
[INFO] Producer version: 1.9.2
[INFO] Domain:
[INFO] Model version: 0
[INFO] Doc string:
[INFO] ----------------------------------------------------------------
[INFO] Detected input dimensions from the model: (-1, -1, -1, 3)
[INFO] Model has dynamic shape. Setting up optimization profiles.
[INFO] Using optimization profile min shape: (1, 224, 320, 3) for input: input_1:0
[INFO] Using optimization profile opt shape: (1, 224, 320, 3) for input: input_1:0
[INFO] Using optimization profile max shape: (32, 224, 320, 3) for input: input_1:0
[INFO] [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +854, GPU +362, now: CPU 1757, GPU 1857 (MiB)
[INFO] [MemUsageChange] Init cuDNN: CPU +126, GPU +58, now: CPU 1883, GPU 1915 (MiB)
[INFO] Local timing cache in use. Profiling results in this builder pass will not be stored.
[INFO] Some tactics do not have sufficient workspace memory to run. Increasing workspace size will enable more tactics, please check verbose output for requested sizes.
[INFO] Total Activation Memory: 2754638848
[INFO] Detected 1 inputs and 2 output network tensors.
[INFO] Total Host Persistent Memory: 88384
[INFO] Total Device Persistent Memory: 821248
[INFO] Total Scratch Memory: 0
[INFO] [MemUsageStats] Peak memory usage of TRT CPU/GPU memory allocators: CPU 34 MiB, GPU 1410 MiB
[INFO] [BlockAssignment] Started assigning block shifts. This will take 45 steps to complete.
[INFO] [BlockAssignment] Algorithm ShiftNTopDown took 0.399243ms to assign 5 blocks to 45 nodes requiring 614728192 bytes.
[INFO] Total Activation Memory: 614728192
[WARNING] TensorRT encountered issues when converting weights between types and that could affect accuracy.
[WARNING] If this is not the desired behavior, please modify the weights or retrain with regularization to adjust the magnitude of the weights.
[WARNING] Check verbose logs for the list of affected weights.
[WARNING] - 39 weights are affected by this issue: Detected subnormal FP16 values.
[WARNING] - 34 weights are affected by this issue: Detected values less than smallest positive FP16 subnormal value and converted them to the FP16 minimum subnormalized value.
[INFO] [MemUsageChange] TensorRT-managed allocation in building engine: CPU +32, GPU +33, now: CPU 32, GPU 33 (MiB)
2023-04-14 11:17:48,972 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

tl;dr everything seems to be okay

  6. Inference on the converted TensorRT model

!tao bpnet inference --inference_spec $SPECS_DIR/infer_spec_custom_small.yaml \
                     --model_filename $USER_EXPERIMENT_DIR/models/tensorRT/bpnet_model.$IN_HEIGHT.$IN_WIDTH.$MAX_BATCH_SIZE.fp16.engine \
                     --input_type dir \
                     --input $USER_EXPERIMENT_DIR/data/images/ \
                     --results_dir $USER_EXPERIMENT_DIR/$TENSORT_INFER_DIR \
                     --dump_visualizations

  7. Here is the trace after !tao bpnet inference (…):

WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-5n7t1b0l because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
Using TensorFlow backend.
WARNING:tensorflow:Deprecation warnings have been disabled. Set TF_ENABLE_DEPRECATION_WARNINGS=1 to re-enable them.
/usr/local/lib/python3.6/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.5) or chardet (3.0.4) doesn't match a supported version!
RequestsDependencyWarning)
INFO 2023-04-14 09:18:00,151 | main: Reading from directory: /workspace/tao-experiments/bpnet/data/images/
/usr/local/lib/python3.6/dist-packages/driveix/common/utilities/path_processing.py:74: YAMLLoadWarning: calling yaml.load() without Loader=… is deprecated, as the default Loader is unsafe. Please read https://msg.pyyaml.org/load for full details.
/workspace/tao-experiments/bpnet/results/exp_m1_tensorRT_244_320_32
INFO 2023-04-14 09:18:00,160 | main: Loading /workspace/tao-experiments/bpnet/models/tensorRT/bpnet_model.244.320.32.fp16.engine for inference.
INFO 2023-04-14 09:18:00,688 | driveix.common.inferencer.trt_inferencer: Loading TensorRT engine: /workspace/tao-experiments/bpnet/models/tensorRT/bpnet_model.244.320.32.fp16.engine
INFO 2023-04-14 09:18:00,688 | driveix.bpnet.inferencer.bpnet_inferencer: /workspace/tao-experiments/bpnet/models/tensorRT/bpnet_model.244.320.32.fp16.engine
INFO 2023-04-14 09:18:00,688 | driveix.bpnet.inferencer.bpnet_inferencer: Successfully loaded /workspace/tao-experiments/bpnet/models/tensorRT/bpnet_model.244.320.32.fp16.engine
0%| | 0/31 [00:00<?, ?it/s][04/14/2023-09:18:00] [TRT] [W] CUDA lazy loading is not enabled. Enabling it can significantly reduce device memory usage. See CUDA_MODULE_LOADING in https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars
[04/14/2023-09:18:00] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
ERROR 2023-04-14 09:18:00,750 | driveix.bpnet.inferencer.bpnet_inferencer: TRT execution failed. Please ensure that the input_shape matches the model input dims
ERROR 2023-04-14 09:18:00,750 | driveix.bpnet.inferencer.bpnet_inferencer: _start
Traceback (most recent call last):
File "", line 209, in _create_context
ValueError: Unhandled shape: (-1, 224, 320, 3)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "</usr/local/lib/python3.6/dist-packages/driveix/bpnet/scripts/inference.py>", line 3, in <module>
File "", line 195, in <module>
File "", line 190, in main
File "", line 486, in run
File "", line 277, in run_pipeline
File "", line 172, in infer
File "", line 167, in infer
File "", line 66, in predict
File "", line 348, in infer
File "/usr/lib/python3.6/contextlib.py", line 81, in __enter__
return next(self.gen)
File "", line 233, in _create_context
File "", line 321, in _release_context
AttributeError: _start
[04/14/2023-09:18:01] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::42] Error Code 1: Cuda Runtime (invalid argument)
[04/14/2023-09:18:01] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::42] Error Code 1: Cuda Runtime (invalid argument)
[04/14/2023-09:18:01] [TRT] [E] 1: [defaultAllocator.cpp::deallocate::42] Error Code 1: Cuda Runtime (invalid argument)
[04/14/2023-09:18:01] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaStream::47] Error Code 1: Cuda Runtime (invalid device context)
[04/14/2023-09:18:01] [TRT] [E] 1: [cudaResources.cpp::~ScopedCudaEvent::24] Error Code 1: Cuda Runtime (context is destroyed)
Telemetry data couldn't be sent, but the command ran successfully.
[WARNING]: <urlopen error [Errno -2] Name or service not known>
Execution status: FAIL
2023-04-14 11:18:02,159 [INFO] tlt.components.docker_handler.docker_handler: Stopping container.

tl;dr

ERROR 2023-04-14 09:18:00,750 | driveix.bpnet.inferencer.bpnet_inferencer: _start
Traceback (most recent call last):
File "", line 209, in _create_context
ValueError: Unhandled shape: (-1, 224, 320, 3)

Everything works flawlessly with this setup (inference completes and I get good results):

%set_env MAX_BATCH_SIZE=1
%set_env OPT_BATCH_SIZE=1

but with any other batch size I get the trace above during inference. How can I fix it?
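
For what it's worth, my understanding is that with an explicit-batch engine built for dynamic shapes the execution context needs a concrete input shape before inference, which seems to be the step that fails here on the (-1, 224, 320, 3) input. A minimal standalone sketch of that step with the TensorRT 8.x Python API plus pycuda is below; the paths, the batch value, and the float32 I/O dtype are my assumptions, and this is not the actual bpnet inferencer code:

import numpy as np
import pycuda.autoinit  # noqa: F401 (creates a CUDA context)
import pycuda.driver as cuda
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
with open("bpnet_model.244.320.32.fp16.engine", "rb") as f:   # placeholder path
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

print(engine.get_binding_shape(0))      # (-1, 224, 320, 3); binding 0 is the input here
batch = 8                               # any value within the profile's min/max range
context.set_binding_shape(0, (batch, 224, 320, 3))

# Allocate buffers from the now-resolved shapes and run one batch.
host_in = np.random.rand(batch, 224, 320, 3).astype(np.float32)
d_in = cuda.mem_alloc(host_in.nbytes)
bindings, host_out, dev_out = [int(d_in)], [], []
for i in range(1, engine.num_bindings):
    out = np.empty(tuple(context.get_binding_shape(i)), dtype=np.float32)  # dtype assumed
    dev = cuda.mem_alloc(out.nbytes)
    host_out.append(out)
    dev_out.append(dev)
    bindings.append(int(dev))

cuda.memcpy_htod(d_in, host_in)
context.execute_v2(bindings)
for out, dev in zip(host_out, dev_out):
    cuda.memcpy_dtoh(out, dev)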


Additional specs:
infer_spec_small.yaml

train_spec: /workspace/examples/bpnet/specs/bpnet_retrain_m1_coco.yaml
input_shape: [224, 320]
# choose from: {pad_image_input, adjust_network_input, None}
keep_aspect_ratio_mode: pad_image_input
output_stage_to_use: null
output_upsampling_factor: [8, 8]
heatmap_threshold: 0.1
paf_threshold: 0.05
multi_scale_inference: False
scales: [0.5, 1.0, 1.5, 2.0]

bpnet_retrain_m1_coco.yaml
Same as in your quick start notebooks.

PS: I had to remove all links from the trace because of your forum policy.

Could you please use the method below to check the above TensorRT engine?

$ python -m pip install colored
$ python -m pip install polygraphy --index-url https://pypi.ngc.nvidia.com
$ polygraphy inspect model xxx.engine

!python -m pip install colored
import os
os.environ["POLYGRAPHY_AUTOINSTALL_DEPS"] = "1"
!python -m pip install polygraphy --index-url https://pypi.ngc.nvidia.com
!polygraphy inspect model $LOCAL_EXPERIMENT_DIR/models/tensorRT/bpnet_model.244.320.32.fp16.engine -v

Verbose result:

Requirement already satisfied: colored in /home/~/miniconda3/envs/nvidia-TAO-venv/lib/python3.6/site-packages (1.4.4)
Looking in indexes: https://pypi.ngc.nvidia.com
Requirement already satisfied: polygraphy in /home/~/miniconda3/envs/nvidia-TAO-venv/lib/python3.6/site-packages (0.46.2)
[V] Model: ~/bpnet/bpnet/models/tensorRT/bpnet_model.244.320.32.fp16.engine
[V] Loaded Module: polygraphy | Version: 0.46.2 | Path: ['/home/~/miniconda3/envs/nvidia-TAO-venv/lib/python3.6/site-packages/polygraphy']
[V] Loaded Module: tensorrt | Version: 8.6.0 | Path: ['/home/~/miniconda3/envs/nvidia-TAO-venv/lib/python3.6/site-packages/tensorrt']
[I] Loading bytes from /home/~/bpnet/bpnet/models/tensorRT/bpnet_model.244.320.32.fp16.engine
[V] Loaded engine size: 32 MiB
[E] 1: [runtime.cpp::parsePlan::314] Error Code 1: Serialization (Serialization assertion plan->header.magicTag == rt::kPLAN_MAGIC_TAG failed.)
[!] Could not deserialize engine. See log for details.

This is for the broken batch=32 engine. I got the same error for the batch=1 engine (which works).
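
In case it is useful, an equivalent check with the TensorRT Python runtime is sketched below; the engine path is a placeholder, and my understanding is that the tensorrt wheel doing the deserialization has to be the same version as the one that serialized the engine inside the TAO container, otherwise it fails with an assertion like the one above:

import tensorrt as trt

logger = trt.Logger(trt.Logger.VERBOSE)
with open("bpnet_model.244.320.32.fp16.engine", "rb") as f:   # placeholder path
    engine = trt.Runtime(logger).deserialize_cuda_engine(f.read())

for i in range(engine.num_bindings):
    kind = "input" if engine.binding_is_input(i) else "output"
    print(kind, engine.get_binding_name(i), engine.get_binding_shape(i))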

Could you elaborate on "which works"? Do you mean the bs=1 engine can run inference well?

Currently, it looks like this:

  • Inference works correctly on a converted model with max batch size = 1 (fp16 engine).

  • Inference does not work correctly on a converted model with max batch size > 1 (fp16 engine; I tried bs=16 and bs=32); the stack trace is the same as in the first post.

  • Polygraphy inspect model does not give results for either engine.

I need to convert the model to a .trt engine to deploy it on Triton, which requires bs=32.
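
For context, the Triton model configuration I am aiming for is roughly the sketch below. It is not a verified config; the model name, the output entries, and the TYPE_FP32 dtype are my assumptions based on the shapes above:

# models/bpnet/config.pbtxt (hypothetical)
name: "bpnet"
platform: "tensorrt_plan"
max_batch_size: 32
input [
  {
    name: "input_1:0"
    data_type: TYPE_FP32
    dims: [ 224, 320, 3 ]   # batch dimension omitted when max_batch_size > 0
  }
]
# output entries omitted; they would follow the engine's two output bindings
dynamic_batching { }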

Thanks for the info, I will check if I can reproduce.

There has been no update from you for a while, so we assume this is no longer an issue and are closing this topic. If you need further support, please open a new one. Thanks.

Please run bodypose with https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps/tree/master/apps/tao_others/deepstream-bodypose2d-app

By default, the batch size is 32 in https://github.com/NVIDIA-AI-IOT/deepstream_tao_apps/blob/master/configs/bodypose2d_tao/bodypose2d_pgie_config.txt#L58
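
The relevant setting there is the nvinfer batch-size key in the [property] section of that file, i.e. (abridged):

[property]
batch-size=32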

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.