Unet tlt model files vs checkpoint tlt files after X epochs

I am training a TAO 5.1 UNet model with the following command:


!tao model unet train --gpus $NUM_GPUS \
                      --gpu_index $GPU_INDEX \
                      -e $SPECS_DIR/unet_train_vgg_6S250.txt \
                      -r $USER_EXPERIMENT_DIR/unpruned \
                      -m $USER_EXPERIMENT_DIR/pretrained_vgg16/pretrained_semantic_segmentation_vvgg16/vgg_16.hdf5 \
                      -n 6SBan003

The spec file specifies 250 epochs.

The final model will be at unpruned/weights/6SBan003.tlt

Also, there is a file at unpruned/model.epoch-250.tlt

Are they the same? Can I use model.epoch-250.tlt as input to the prune operation, or export it as-is to a
TensorRT engine?

Thanks!

David

During UNet training, the model from the last training step is saved to the weights directory. Refer to https://github.com/NVIDIA/tao_tensorflow1_backend/blob/c7a3926ddddf3911842e057620bceb45bb5303cc/nvidia_tao_tf1/cv/unet/scripts/train.py#L301 and https://github.com/NVIDIA/tao_tensorflow1_backend/blob/main/nvidia_tao_tf1/cv/unet/model/utilities.py#L207.

You can use it to run pruning or exporting.
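
For example, a minimal sketch of pruning from that checkpoint; the output path, pruning threshold and key variable below are placeholders rather than values taken from this thread:

# Prune using the last-epoch checkpoint (placeholder output path, threshold and key).
!tao model unet prune -e $SPECS_DIR/unet_train_vgg_6S250.txt \
                      -m $USER_EXPERIMENT_DIR/unpruned/model.epoch-250.tlt \
                      -o $USER_EXPERIMENT_DIR/pruned/6SBan003_pruned.tlt \
                      -pth 0.1 \
                      -k $KEY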

@Morganh Thanks! Very enlightening to see the code.

@Morganh I exported the unpruned .tlt to TensorRT and got errors when loading the engine:

1: [stdArchiveReader.cpp::StdArchiveReader::29] Error Code 1: Serialization (Serialization assertion magicTagRead == magicTag failed.Magic tag does not match)
4: [runtime.cpp::deserializeCudaEngine::76] Error Code 4: Internal Error (Engine deserialization failed.)
terminate called after throwing an instance of 'std::runtime_error'
what(): Unable to load tensorRT engine. /mnt/DATA/MP/export/trtfp32.6SR003Unpruned.engine

Exported with

# Convert to TensorRT engine(FP32).
!tao deploy unet gen_trt_engine --gpu_index $GPU_INDEX \
                                -m $USER_EXPERIMENT_DIR/export/model.epoch-500.onnx \
                                -e $SPECS_DIR/unet_train_vgg_6S.txt \
                                -r $USER_EXPERIMENT_DIR/export \
                                --data_type fp32 \
                                --engine_file $USER_EXPERIMENT_DIR/export/trtfp32.6SR003Unpruned.engine \
                                --max_batch_size 3

Verified the exported model with

!tao deploy unet evaluate --gpu_index $GPU_INDEX -e $SPECS_DIR/unet_train_vgg_6S.txt \
                          -m $USER_EXPERIMENT_DIR/export/trtfp32.6SR003Unpruned.engine \
                          -r $USER_EXPERIMENT_DIR/export/

My TensorRT version is 8.0.1-1+cuda11.3.

The spec file is attached: unet_train_vgg_6S.txt (1.8 KB)

tao info

Configuration of the TAO Toolkit Instance
task_group: ['model', 'dataset', 'deploy']
format_version: 3.0
toolkit_version: 5.2.0
published_date: 12/06/2023

This is a common error when the TensorRT version used to build the engine is different from the TensorRT version used during inference.
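
One quick way to compare the two environments, assuming both are Debian/Ubuntu based, is to list the installed TensorRT packages on each side:

# Run this in the environment that built the engine and again where the engine is loaded;
# the libnvinfer versions reported should match.
dpkg -l | grep nvinfer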

There should be no issue when you run tao deploy unet gen_trt_engine to generate the TensorRT engine and then run tao deploy unet evaluate to evaluate that engine, right?

@Morganh

I understand, but how do I get the correct TensorRT and CUDA versions?

Correct. Within the deploy container everything works, but I need to use the engine from C++. Either I install compatible toolchain versions, which are unknown to me, or I use another method to convert the model to TensorRT, as there was in TAO 3.
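
One such alternative, sketched here only as an assumption (the file names are taken from the commands above, and dynamic-shape flags may be needed to match the exported model), is to copy the exported ONNX to the target machine and rebuild the engine there with the trtexec tool that ships with TensorRT:

# Rebuild the engine with the TensorRT version installed on the target machine.
# For models exported with a dynamic batch dimension, --minShapes/--optShapes/--maxShapes
# for the model's input tensor may also be required.
/usr/src/tensorrt/bin/trtexec --onnx=model.epoch-500.onnx \
                              --saveEngine=trtfp32.6SR003Unpruned.engine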

Thanks!

I ran

docker run -it --rm --gpus all nvcr.io/nvidia/tao/tao-toolkit:5.2.0-deploy

to get a prompt inside the container, and then ran

dpkg -l | grep nvinfer

and got what I think is the answer:

TensorRT 8.6.1.6-1+cuda12.0

There has been no update from you for a while, so we assume this is no longer an issue.
Hence we are closing this topic. If you need further support, please open a new one.
Thanks

Yes, you can check the TensorRT version in the environment where you generate the engine. Then make sure that the TensorRT version is the same when you run inference.
