Unet tlt model files vs checkpoint tlt files after X epochs

I am training a TAO 5.1 UNet model with the following command:


!tao model unet train --gpus $NUM_GPUS \
                      --gpu_index $GPU_INDEX \
                      -e $SPECS_DIR/unet_train_vgg_6S250.txt \
                      -r $USER_EXPERIMENT_DIR/unpruned \
                      -m $USER_EXPERIMENT_DIR/pretrained_vgg16/pretrained_semantic_segmentation_vvgg16/vgg_16.hdf5 \
                      -n 6SBan003

The spec file specifies 250 epochs.

The final model will be at unpruned/weights/6SBan003.tlt

Also, there is a file at unpruned/model.epoch-250.tlt

Are they the same? Can I use model.epoch-250.tlt as input to the prune operation, or export it as-is to a
TensorRT engine?

Thanks!

David

During UNet training, the model from the last training step is saved to the weights directory. Refer to https://github.com/NVIDIA/tao_tensorflow1_backend/blob/c7a3926ddddf3911842e057620bceb45bb5303cc/nvidia_tao_tf1/cv/unet/scripts/train.py#L301 and https://github.com/NVIDIA/tao_tensorflow1_backend/blob/main/nvidia_tao_tf1/cv/unet/model/utilities.py#L207.

You can use it to run pruning or exporting.
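
For example, a minimal sketch of pruning from that checkpoint; the output path, pruning threshold and key variable below are placeholders rather than values taken from this thread:

# Prune using the last-epoch checkpoint (placeholder output path, threshold and key).
!tao model unet prune -e $SPECS_DIR/unet_train_vgg_6S250.txt \
                      -m $USER_EXPERIMENT_DIR/unpruned/model.epoch-250.tlt \
                      -o $USER_EXPERIMENT_DIR/pruned/6SBan003_pruned.tlt \
                      -pth 0.1 \
                      -k $KEY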

@Morganh Thanks! Very enlightening to see the code.

@Morganh I exported the unpruned .tlt to TensorRT and got errors when loading the engine:

1: [stdArchiveReader.cpp::StdArchiveReader::29] Error Code 1: Serialization (Serialization assertion magicTagRead == magicTag failed.Magic tag does not match)
4: [runtime.cpp::deserializeCudaEngine::76] Error Code 4: Internal Error (Engine deserialization failed.)
terminate called after throwing an instance of 'std::runtime_error'
what(): Unable to load tensorRT engine. /mnt/DATA/MP/export/trtfp32.6SR003Unpruned.engine

Exported with

# Convert to TensorRT engine(FP32).
!tao deploy unet gen_trt_engine --gpu_index $GPU_INDEX \
                                -m $USER_EXPERIMENT_DIR/export/model.epoch-500.onnx \
                                -e $SPECS_DIR/unet_train_vgg_6S.txt \
                                -r $USER_EXPERIMENT_DIR/export \
                                --data_type fp32 \
                                --engine_file $USER_EXPERIMENT_DIR/export/trtfp32.6SR003Unpruned.engine \
                                --max_batch_size 3

Verified the exported model with

!tao deploy unet evaluate --gpu_index $GPU_INDEX -e $SPECS_DIR/unet_train_vgg_6S.txt \
                          -m $USER_EXPERIMENT_DIR/export/trtfp32.6SR003Unpruned.engine \
                          -r $USER_EXPERIMENT_DIR/export/

My TensorRT version is 8.0.1-1+cuda11.3.

The spec file is attached: unet_train_vgg_6S.txt (1.8 KB)

tao info

Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.2.0
published_date: 12/06/2023

This is a common error when the TensorRT version used to build the engine is different from the TensorRT version used during inference.
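
One quick way to compare the two environments, assuming both are Debian/Ubuntu based, is to list the installed TensorRT packages on each side:

# Run this in the environment that built the engine and again where the engine is loaded;
# the libnvinfer versions reported should match.
dpkg -l | grep nvinfer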

There should be no issue when you run tao deploy unet gen_trt_engine to generate the TensorRT engine and then run tao deploy unet evaluate to evaluate that engine, right?

@Morganh

I understand, but how do I get the correct TensorRT and CUDA versions?

Correct. Within the deploy container everything works, but I need to use the engine from C++. Either I install compatible toolchain versions, which are unknown to me, or I use another method to convert the model to TensorRT, as there was in TAO 3.
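
One such alternative, sketched here only as an assumption (the file names are taken from the commands above, and dynamic-shape flags may be needed to match the exported model), is to copy the exported ONNX to the target machine and rebuild the engine there with the trtexec tool that ships with TensorRT:

# Rebuild the engine with the TensorRT version installed on the target machine.
# For models exported with a dynamic batch dimension, --minShapes/--optShapes/--maxShapes
# for the model's input tensor may also be required.
/usr/src/tensorrt/bin/trtexec --onnx=model.epoch-500.onnx \
                              --saveEngine=trtfp32.6SR003Unpruned.engine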

Thanks!

I ran

docker run -it --rm --gpus all nvcr.io/nvidia/tao/tao-toolkit:5.2.0-deploy

to get a prompt inside the container, and then ran

dpkg -l | grep nvinfer

and got what I think is the answer:

TensorRT 8.6.1.6-1+cuda12.0

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Yes, you can check the TensorRT version in the environment where you generate the engine. Then make sure that the TensorRT version is the same when you run inference.
