Classification inference huge performance degradation

alberto12 · January 26, 2022, 12:15pm

• Hardware (Nano)
• Network Type (EfficientNet B1)
• TLT Version (TAO 3-21.11)

Hello,

We are using NVIDIA TAO to train a classification model with the EfficientNet B1 architecture, and we then deploy the model on a Jetson Nano using DeepStream. We run the TAO export command to generate the .etlt file, which DeepStream later loads to build the .engine file used for inference.

We have an issue with the performance. The precision obtained in TAO training is around 90%, but on the Jetson it drops to around 50%. We use the same pictures for testing in both TAO and on the Jetson, to rule out any other factor that may affect performance and to make sure we have an apples-to-apples comparison.

We have tested this on the Jetson Nano using both DeepStream and a standalone ad hoc Python script, and in both cases the performance is similar. We have also tested different precision modes (INT8, FP16, FP32), and in all cases the performance is similar too.

We think the issue might be in the export phase, but we are not sure. We are using the command below to export to .etlt, as described in the documentation.

tao classification export \
  -m trained_model.tlt \
  -o output_model.etlt \
  -k key \
  --cal_data_file $USER_EXPERIMENT_DIR/export/calibration.tensor \
  --data_type int8 \
  --batches 10 \
  --cal_cache_file cahe_file.bin \
  -v

We have mainly two questions:
• Do you know what could be causing this large performance degradation?
• Is there any way to test the .etlt file, so that we can check at which step the performance degradation occurs?

Thanks in advance,
Alberto
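
A minimal sketch of how the per-step check asked about above could be scripted, assuming the top-1 prediction for each test image has already been collected from each stage (for example from the .tlt model inside TAO and from the .engine on the Jetson). The function names, file names and class names are hypothetical and only serve as illustration; nothing here calls TAO, TensorRT or DeepStream.

```python
from typing import Dict, List


def top1_accuracy(predictions: Dict[str, str], labels: Dict[str, str]) -> float:
    """Fraction of labelled images whose predicted class matches the label."""
    scored = [name for name in labels if name in predictions]
    if not scored:
        raise ValueError("no overlapping image names between predictions and labels")
    hits = sum(predictions[name] == labels[name] for name in scored)
    return hits / len(scored)


def disagreements(stage_a: Dict[str, str], stage_b: Dict[str, str]) -> List[str]:
    """Sorted image names on which two stages predict different classes."""
    common = sorted(set(stage_a) & set(stage_b))
    return [name for name in common if stage_a[name] != stage_b[name]]


# Hypothetical data: image names, class names and predictions are made up.
labels = {"img_001.jpg": "cat", "img_002.jpg": "dog",
          "img_003.jpg": "cat", "img_004.jpg": "dog"}
tlt_preds = {"img_001.jpg": "cat", "img_002.jpg": "dog",
             "img_003.jpg": "cat", "img_004.jpg": "cat"}
engine_preds = {"img_001.jpg": "cat", "img_002.jpg": "cat",
                "img_003.jpg": "dog", "img_004.jpg": "cat"}

print("tlt accuracy:   ", top1_accuracy(tlt_preds, labels))
print("engine accuracy:", top1_accuracy(engine_preds, labels))
print("disagree on:    ", disagreements(tlt_preds, engine_preds))
```

Listing the images on which two stages disagree, rather than only comparing the two accuracy figures, makes it possible to inspect individual inputs and outputs at the step where the predictions first diverge.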

Morganh · January 26, 2022, 1:29pm

It is similar to the topic Tao Classifier Mobilenetv2 very low accuracy compared to effecientnet b0 & Resnet - #15 by tane.vanderboon, but I cannot reproduce it.
That user fixed the issue by using the 3.21.08 docker. Please check whether it also works on your side.
TAO Toolkit for Computer Vision | NVIDIA NGC
docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3

alberto12 · January 26, 2022, 1:36pm

Hi,

I think it is a little bit different from the topic Tao Classifier Mobilenetv2 very low accuracy compared to effecientnet b0 & Resnet - #15 by tane.vanderboon, as our issue is not in the training phase.

In that topic, the performance is bad during TAO training.

Our training in TAO achieves quite good performance (around 90%). The performance degradation occurs in the inference phase, using an exported model on the Jetson Nano.

Best regards,
Alberto

Morganh · January 26, 2022, 1:40pm

To narrow down, can you generate tensorrt engine directly in the tao docker and then run your inference code against the engine?

alberto12 · January 26, 2022, 1:43pm

Yes, we are doing it right now and will come back to you as soon as we get the results.

Thank you,
Alberto

alberto12 · January 28, 2022, 2:53pm

Hi,

We have generated the TensorRT engine inside the Docker container and the inference is still bad (around 50%).

We have used the following versions:

• TAO image: nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.11-tf1.15.5-py3
• CUDA: 11
• TensorRT: 8.0.1-1

Do you know how we can test the .etlt file generated in the export process?

Regards,
Alberto

Morganh · January 28, 2022, 3:54pm

First of all, may I know how you checked the inference? With your own scripts, right?

alberto12 · January 28, 2022, 6:41pm

Hi,

We have used both DeepStream and our own script, and in both cases the performance is degraded compared to the one we obtain using TAO. Is there any “official” script you have that can be used?

I attach our script.

    import os
    import time
    import cv2
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    import tensorrt as trt
    from PIL import Image
    import pdb
    import codecs
    import glob
    import datetime
    import shutil


    class HostDeviceMem(object):
        def __init__(self, host_mem, device_mem):
            self.host = host_mem
            self.device = device_mem

        def __str__(self):
            return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

        def __repr__(self):
            return self.__str__()


    def load_engine(trt_runtime, engine_path):
        with open(engine_path, "rb") as f:
            engine_data = f.read()
        engine = trt_runtime.deserialize_cuda_engine(engine_data)
        return engine


    # Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
    # def allocate_buffers(engine, batch_size=-1):
    def allocate_buffers(engine, batch_size=1):
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()
        for binding in engine:
            # pdb.set_trace()
            size = trt.volume(engine.get_binding_shape(binding)) * batch_size
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            bindings.append(int(device_mem))
            # Append to the appropriate list.
            if engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
                # print(f"input: shape:{engine.get_binding_shape(binding)} dtype:{engine.get_binding_dtype(binding)}")
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))
                # print(f"output: shape:{engine.get_binding_shape(binding)} dtype:{engine.get_binding_dtype(binding)}")
        return inputs, outputs, bindings, stream


    def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
        # Transfer input data to the GPU.
        [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
        # Run inference.
        context.execute_async(
            batch_size=batch_size, bindings=bindings, stream_handle=stream.handle
        )
        # Transfer predictions back from the GPU.
        [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
        # Synchronize the stream
        stream.synchronize()
        # Return only the host outputs.
        return [out.host for out in outputs]


    def post_processing(label_ids, classes):
        top_five_indexes = label_ids[0].argsort()[-5:][::-1]
        top_five_classes = []
        for index in top_five_indexes:
            # [[class, probability], [class, probability], ...]
            top_five_classes.append([classes[index], label_ids[0][index]])

        # iterate label using label ids
        max_value_index = top_five_indexes[0]
        max_value = top_five_classes[0][1]

        print("Index max value: " + str(max_value_index))
        print("Max value: " + str(max_value))

        return top_five_classes


    def model_loading(trt_engine_path, input_shape):
        # TensorRT logger singleton
        os.environ["CUDA_VISIBLE_DEVICES"] = "1"
        TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
        # trt_engine_path = "/opt/smarg/surveillance_gateway_prod/surveillance_ai_model/x86_64/Secondary_NumberPlateClassification/lpr_us_onnx_b16.engine"
        trt_runtime = trt.Runtime(TRT_LOGGER)
        # pdb.set_trace()
        trt_engine = load_engine(trt_runtime, trt_engine_path)
        # Execution context is needed for inference
        context = trt_engine.create_execution_context()
        # NPR input shape
        # input_shape = (3,48,96)
        context.set_binding_shape(0, input_shape)
        # This allocates memory for network inputs/outputs on both CPU and GPU
        inputs, outputs, bindings, stream = allocate_buffers(trt_engine)
        return inputs, outputs, bindings, stream, context


    def infer_image(classes, imageToInfer, model_parameters):
        image_count = 1
        start_time = datetime.datetime.now()

        print("Image name :", imageToInfer)
        image = [cv2.imread(imageToInfer)]
        image = np.array([(cv2.resize(img, (240, 240))) for img in image], dtype=np.float32)
        image = image.transpose(0, 3, 1, 2)
        np.copyto(model_parameters['inputs'][0].host, image.ravel())
        output = do_inference(model_parameters['context'], bindings=model_parameters['bindings'], inputs=model_parameters['inputs'], outputs=model_parameters['outputs'], stream=model_parameters['stream'])
        top_five_classes = post_processing(output, classes)

        print("TOP FIVE PREDICTIONS: " + str(top_five_classes))
        print("BEST PREDICTION: " + str(top_five_classes[0]))
        """
        for image_path in glob.glob(images_folder_path + "*.jpg"):
            print("Image name :", image_path)
            image = [cv2.imread(image_path)]
            image = np.array([(cv2.resize(img, (240 , 240))) for img in image], dtype=np.float32)
            image= image.transpose(0 , 3 , 1 , 2)
            np.copyto(model_parameters['inputs'][0].host, image.ravel())
            output = do_inference(model_parameters['context'], bindings=model_parameters['bindings'], inputs=model_parameters['inputs'], outputs=model_parameters['outputs'], stream=model_parameters['stream'])
            top_five_classes = post_processing(output, classes)
            image_count += 1
            print("TOP FIVE PREDICTIONS: " + str(top_five_classes))
            print("BEST PREDICTION: " + str(top_five_classes[0]))
        """

        end_time = datetime.datetime.now()
        total_time = end_time - start_time
        print("Total image processed : {} Total Time : {} ".format(image_count, total_time))
        return top_five_classes

Regards,
Alberto
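
One thing that may be worth checking in a standalone script like the one above: infer_image passes the resized BGR pixels from cv2.imread to the engine as raw float32 values, without any mean subtraction or scaling. If the model was trained with input normalization, the engine would see a different input distribution than during training. The sketch below shows the three Keras-style preprocess_input conventions ("caffe", "torch", "tf") in NumPy. Whether the trained model used one of them, and which one, is an assumption that has to be verified against the training spec; the function name preprocess is hypothetical.

```python
import numpy as np

# ImageNet per-channel means in BGR order (Keras "caffe" convention).
CAFFE_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)
# ImageNet mean and std in RGB order (Keras "torch" convention).
TORCH_MEAN_RGB = np.array([0.485, 0.456, 0.406], dtype=np.float32)
TORCH_STD_RGB = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess(image_bgr: np.ndarray, mode: str = "caffe") -> np.ndarray:
    """Convert an HxWx3 uint8 BGR image (as returned by cv2.imread and
    cv2.resize) into a 1x3xHxW float32 array.

    The mode must match what was used at training time; the default here
    is only an assumption.
    """
    x = image_bgr.astype(np.float32)
    if mode == "caffe":
        # Keep BGR order, subtract the per-channel mean, no scaling.
        x = x - CAFFE_MEAN_BGR
    elif mode == "torch":
        # BGR -> RGB, scale to [0, 1], then normalize per channel.
        x = x[..., ::-1] / 255.0
        x = (x - TORCH_MEAN_RGB) / TORCH_STD_RGB
    elif mode == "tf":
        # BGR -> RGB, scale to [-1, 1].
        x = x[..., ::-1] / 127.5 - 1.0
    else:
        raise ValueError("unknown preprocessing mode: " + mode)
    # HWC -> CHW, then add the batch dimension.
    x = x.transpose(2, 0, 1)[np.newaxis, ...]
    return np.ascontiguousarray(x, dtype=np.float32)


# Synthetic mid-gray image standing in for a resized test picture.
dummy = np.full((240, 240, 3), 128, dtype=np.uint8)
batch = preprocess(dummy, mode="caffe")
print(batch.shape, batch.dtype)
```

In the script above, the result of such a function would take the place of the array built from cv2.resize and transpose before it is copied into the input host buffer with np.copyto.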

Morganh · January 29, 2022, 3:41am

Please refer to Issue with image classification tutorial and testing with deepstream-app - #21 by Morganh and Issue with image classification tutorial and testing with deepstream-app - #26 by Morganh

Officially, please try to run inference with triton-app. Integrating TAO CV Models with Triton Inference Server — TAO Toolkit 3.22.05 documentation

alberto12 · January 31, 2022, 2:30pm

Hi,

In the two links you provided, inference is done using an “engine” file. We have checked it again, also including the changes mentioned in those threads, and the performance is still low.

We know that the “.engine” performance is degraded compared to the “.tlt” one. We wonder if there is any way to check the “.etlt” performance, as it is the intermediate step.

Do you have any way of checking it?

Best Regards,
Alberto

Morganh · February 4, 2022, 4:09pm

The .etlt is not the cause of the degradation.
Officially, end users can only deploy the .etlt model in DeepStream to run inference.
As mentioned in the link above, Issue with image classification tutorial and testing with deepstream-app - #21 by Morganh, there are several hints for improving the inference accuracy:

• Generate the video file with gstreamer instead of ffmpeg:
gst-launch-1.0 multifilesrc location="/tmp/%d.jpg" caps="image/jpeg,framerate=30/1" ! jpegdec ! x264enc ! avimux ! filesink location="out.avi"
Then run inference against this avi file with DeepStream.
• Add the parameter below to your config file (config_as_primary_gie.txt):
scaling-filter=5
Issue with image classification tutorial and testing with deepstream-app - #32 by Morganh

No, it is not expected. And previously we found that the .engine can have similar result as .tlt model, using deepstream way or standalone inference way.

system · February 18, 2022, 4:10pm

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
