Classification inference huge performance degradation

• Hardware (Nano)
• Network Type (EfficientNet B1)
• TLT Version (TAO 3-21.11)

Hello,

We are using NVIDIA TAO to train a classification model with the EfficientNet B1 architecture. We then deploy the model on a Jetson Nano using DeepStream. We run the TAO export command to generate the .etlt file, which DeepStream later loads to build the .engine file used for inference.

We have an issue with accuracy. The precision obtained in TAO training is around 90%, and on the Jetson it drops to around 50%. We use the same pictures for testing in both TAO and on the Jetson, to rule out any other factor that may affect the results and to make sure we have an apples-to-apples comparison.

We have tested this on the Jetson Nano using both DeepStream and an external ad hoc Python script, and in both cases the results are similar. We have also tested different precision modes (INT8, FP16, FP32), and in all cases the results are similar too.
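A comparison like this can be made more precise by recording the predicted label per image on each side and then diffing the two runs. The sketch below is only an illustration: the function names `top1_accuracy` and `disagreements`, the file names, and the labels are all hypothetical, and it assumes each run can be dumped to a mapping from image name to predicted class.

```python
from collections import Counter


def top1_accuracy(predictions, ground_truth):
    """Fraction of shared images whose predicted label matches the ground truth."""
    shared = [name for name in ground_truth if name in predictions]
    if not shared:
        raise ValueError("no images in common between predictions and ground truth")
    correct = sum(predictions[name] == ground_truth[name] for name in shared)
    return correct / len(shared)


def disagreements(reference, candidate):
    """Count (reference_label, candidate_label) pairs where the two runs differ."""
    pairs = Counter()
    for name, ref_label in reference.items():
        cand_label = candidate.get(name)
        if cand_label is not None and cand_label != ref_label:
            pairs[(ref_label, cand_label)] += 1
    return pairs


# Hypothetical data: image name -> class label.
ground_truth = {"img1.jpg": "cat", "img2.jpg": "dog", "img3.jpg": "dog", "img4.jpg": "cat"}
tao_preds = {"img1.jpg": "cat", "img2.jpg": "dog", "img3.jpg": "dog", "img4.jpg": "dog"}
jetson_preds = {"img1.jpg": "cat", "img2.jpg": "cat", "img3.jpg": "cat", "img4.jpg": "dog"}

tao_acc = top1_accuracy(tao_preds, ground_truth)
jetson_acc = top1_accuracy(jetson_preds, ground_truth)
confusions = disagreements(tao_preds, jetson_preds)

print("TAO accuracy:", tao_acc)
print("Jetson accuracy:", jetson_acc)
print("Label pairs where the two runs differ:", dict(confusions))
```

If the disagreements concentrate on a few label pairs, a label-order mismatch is worth checking; if they are spread evenly, a difference in input preprocessing is a more likely suspect.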

We think the issue might be in the export phase, but we are not sure. We are using the command below to export to .etlt, as described in the documentation.

tao classification export \
  -m trained_model.tlt \
  -o output_model.etlt \
  -k key \
  --cal_data_file $USER_EXPERIMENT_DIR/export/calibration.tensor \
  --data_type int8 \
  --batches 10 \
  --cal_cache_file cahe_file.bin \
  -v

We have two main questions:
• Do you know what could be causing this large drop in accuracy?
• Is there any way to test the .etlt file, in order to check at which step the degradation occurs?

Thanks in advance,
Alberto

It is similar to the topic Tao Classifier Mobilenetv2 very low accuracy compared to effecientnet b0 & Resnet - #15 by tane.vanderboon, but I cannot reproduce it.
That user fixed the issue by using the 3.21.08 docker. Please check whether it also works on your side.
TAO Toolkit for CV | NVIDIA NGC
docker pull nvcr.io/nvidia/tao/tao-toolkit-tf:v3.21.08-py3

Hi,

I think it is a little different from the topic Tao Classifier Mobilenetv2 very low accuracy compared to effecientnet b0 & Resnet - #15 by tane.vanderboon, as our issue does not occur during the training phase.

In that topic, the poor performance appears during TAO training.

Our training in TAO achieves quite good performance (around 90%). The degradation occurs in the inference phase, using an exported model on the Jetson Nano.

Best regards,
Alberto

To narrow this down, can you generate the TensorRT engine directly in the TAO docker and then run your inference code against that engine?

Yes, we are doing it right now and will come back to you as soon as we get the results.

Thank you,
Alberto

Hi,

We have generated the TensorRT engine inside the Docker container and the inference accuracy is still bad (around 50%).

We have used the following versions:

Do you know how we can test the .etlt file generated in the export process?

Regards,
Alberto

First of all, may I know how you checked the inference? With your own scripts, right?

Hi,

We have used both DeepStream and our own script, and in both cases the accuracy is degraded compared to what we obtain using TAO. Is there any “official” script that can be used?

I attach our script.

    import os
    import time
    import cv2
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as cuda
    import tensorrt as trt
    from PIL import Image
    import pdb
    import codecs
    import glob
    import datetime
    import shutil

    class HostDeviceMem(object):
        def __init__(self, host_mem, device_mem):
            self.host = host_mem
            self.device = device_mem

        def __str__(self):
            return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

        def __repr__(self):
            return self.__str__()

    def load_engine(trt_runtime, engine_path):
        with open(engine_path, "rb") as f:
            engine_data = f.read()
        engine = trt_runtime.deserialize_cuda_engine(engine_data)
        return engine

    # Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
    # def allocate_buffers(engine, batch_size=-1):
    def allocate_buffers(engine, batch_size=1):
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()
        for binding in engine:
            # pdb.set_trace()
            size = trt.volume(engine.get_binding_shape(binding)) * batch_size
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            bindings.append(int(device_mem))
            # Append to the appropriate list.
            if engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
                # print(f"input: shape:{engine.get_binding_shape(binding)} dtype:{engine.get_binding_dtype(binding)}")
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))
                # print(f"output: shape:{engine.get_binding_shape(binding)} dtype:{engine.get_binding_dtype(binding)}")
        return inputs, outputs, bindings, stream

    def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
        # Transfer input data to the GPU.
        [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
        # Run inference.
        context.execute_async(
            batch_size=batch_size, bindings=bindings, stream_handle=stream.handle
        )
        # Transfer predictions back from the GPU.
        [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
        # Synchronize the stream
        stream.synchronize()
        # Return only the host outputs.
        return [out.host for out in outputs]

    def post_processing(label_ids, classes):
        top_five_indexes = label_ids[0].argsort()[-5:][::-1]
        top_five_classes = []
        for index in top_five_indexes:
            # [ [class, probability], [class, probability], ... ]
            top_five_classes.append([classes[index], label_ids[0][index]])

        # iterate label using label ids
        max_value_index = top_five_indexes[0]
        max_value = top_five_classes[0][1]

        print("Index max value: " + str(max_value_index))
        print("Max value: " + str(max_value))

        return top_five_classes

    def model_loading(trt_engine_path, input_shape):
        # TensorRT logger singleton
        os.environ["CUDA_VISIBLE_DEVICES"] = "1"
        TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
        # trt_engine_path = "/opt/smarg/surveillance_gateway_prod/surveillance_ai_model/x86_64/Secondary_NumberPlateClassification/lpr_us_onnx_b16.engine"
        trt_runtime = trt.Runtime(TRT_LOGGER)
        # pdb.set_trace()
        trt_engine = load_engine(trt_runtime, trt_engine_path)
        # Execution context is needed for inference
        context = trt_engine.create_execution_context()
        # NPR input shape
        # input_shape = (3,48,96)
        context.set_binding_shape(0, input_shape)
        # This allocates memory for network inputs/outputs on both CPU and GPU
        inputs, outputs, bindings, stream = allocate_buffers(trt_engine)
        return inputs, outputs, bindings, stream, context

    def infer_image(classes, imageToInfer, model_parameters):
        image_count = 1
        start_time = datetime.datetime.now()

        print("Image name :", imageToInfer)
        image = [cv2.imread(imageToInfer)]
        image = np.array([(cv2.resize(img, (240 , 240))) for img in image], dtype=np.float32)
        image = image.transpose(0 , 3 , 1 , 2)
        np.copyto(model_parameters['inputs'][0].host, image.ravel())
        output = do_inference(model_parameters['context'], bindings=model_parameters['bindings'], inputs=model_parameters['inputs'], outputs=model_parameters['outputs'], stream=model_parameters['stream'])
        top_five_classes = post_processing(output, classes)

        print("TOP FIVE PREDICTIONS: " + str(top_five_classes))
        print("BEST PREDICTION: " + str(top_five_classes[0]))
        """
        for image_path in glob.glob(images_folder_path + "*.jpg"):
            print("Image name :", image_path)
            image = [cv2.imread(image_path)]
            image = np.array([(cv2.resize(img, (240 , 240))) for img in image], dtype=np.float32)
            image= image.transpose(0 , 3 , 1 , 2)
            np.copyto(model_parameters['inputs'][0].host, image.ravel())
            output = do_inference(model_parameters['context'], bindings=model_parameters['bindings'], inputs=model_parameters['inputs'], outputs=model_parameters['outputs'], stream=model_parameters['stream'])
            top_five_classes = post_processing(output, classes)
            image_count += 1
            print("TOP FIVE PREDICTIONS: " + str(top_five_classes))
            print("BEST PREDICTION: " + str(top_five_classes[0]))
        """

        end_time = datetime.datetime.now()
        total_time = end_time - start_time
        print("Total image processed : {} Total Time : {} ".format(image_count, total_time))
        return top_five_classes

Regards,
Alberto

Please refer to Issue with image classification tutorial and testing with deepstream-app - #21 by Morganh and Issue with image classification tutorial and testing with deepstream-app - #26 by Morganh

Officially, please try to run inference with triton-app. Integrating TAO CV Models with Triton Inference Server — TAO Toolkit 3.21.11 documentation

Hi,

In the two links you provided, inference is done using an “engine” file. We have checked it again, including the changes mentioned in those threads, and the accuracy is still low.

We know that the “.engine” accuracy is degraded compared to the “.tlt” one. We wonder if there is any way to check the “.etlt” accuracy, as it is the intermediate step.

Do you have any way of checking it?

Best Regards,
Alberto

The .etlt is not the cause of the degradation.
Officially, end users can only deploy the .etlt model in DeepStream to run inference.
As mentioned in the link above, Issue with image classification tutorial and testing with deepstream-app - #21 by Morganh, there are several hints for improving the inference accuracy.
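One thing worth checking in a standalone script is input preprocessing. The script attached earlier in this thread passes raw 0-255 pixel values to the engine after resizing, with no mean subtraction. The sketch below shows what caffe-style preprocessing could look like in NumPy. It is only an illustration under assumptions: that the model was trained with caffe-style preprocessing (BGR channel order, per-channel mean subtraction, no scaling), and that the ImageNet means shown are the ones used. Both should be verified against the training spec. The names `CAFFE_BGR_MEAN` and `preprocess_caffe` are hypothetical.

```python
import numpy as np

# Assumption: ImageNet per-channel means in BGR order, as used by
# caffe-style preprocessing. Verify against the training configuration.
CAFFE_BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def preprocess_caffe(image_bgr_hwc):
    """Convert an HxWx3 uint8 BGR image (e.g. from cv2.imread followed by
    cv2.resize) into a 1x3xHxW float32 batch with the channel means subtracted."""
    x = image_bgr_hwc.astype(np.float32)
    x = x - CAFFE_BGR_MEAN       # broadcasts over the last (channel) axis
    x = x.transpose(2, 0, 1)     # HWC -> CHW
    return np.ascontiguousarray(x[np.newaxis, ...])


# Stand-in for a resized 240x240 image; a real script would use the cv2 output.
dummy = np.full((240, 240, 3), 128, dtype=np.uint8)
batch = preprocess_caffe(dummy)
print(batch.shape, batch.dtype)
```

The resulting array can be flattened with `batch.ravel()` and copied into the engine's host input buffer in place of the unnormalized image. On the DeepStream side, the corresponding preprocessing is controlled through the nvinfer configuration rather than in Python.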

No, it is not expected. Previously we found that the .engine can achieve results similar to the .tlt model, using either DeepStream or standalone inference.
