Discrepancy between results from tlt-infer and trt engine

I’m running inference with a YOLOv3 TensorRT engine converted by tlt-converter, but the results from the TensorRT engine differ from those of tlt-infer. I think this might be due to differences in the pre-processing stage. Since I cannot access the pre-processing code of tlt-infer, I’ve attached the pre-processing I use for my TensorRT engine below:

import cv2
import numpy as np
import torch

frame = cv2.imread(img_path)
reso = (416, 416)
ratio_h0, ratio_w0 = 416 / frame.shape[0], 416 / frame.shape[1]
frame = cv2.resize(frame, (reso[0], reso[1]))
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB) # BGR → RGB
mean = np.array([123.68, 116.779, 103.939], dtype=np.float32).reshape((3,1,1))
frame = frame.transpose(2, 0, 1).astype('float32') - mean
input0 = torch.as_tensor(frame).unsqueeze(0).to(device_m)

For example, here are two results for an input image randomly chosen from the VOC dataset. The upper one is the result from my TRT engine, the lower one is from tlt-infer.

For preprocessing an RGB image:
inf_img = np.array(inf_img).astype(np.float32)
inference_input = preprocess_input(inf_img.transpose(2, 0, 1))

OK, so to be clear, we only need those two transformations (without other processing such as padding, mean subtraction, etc.) to run inference with the TRT engine?

Padding is needed.
Mean subtraction is also needed. It is included in preprocess_input of “keras.applications.imagenet_utils”.

All right, could you please describe the complete preprocessing stage for yolov3 trt engine generated from TLT? That would be very helpful.

For the preprocessing of an RGB image:

  1. resize (img.resize) the original image without changing its aspect ratio
  2. create (image.new) an image whose size matches the model input width/height
  3. paste (image.paste) the result of (1) onto (2)
  4. inf_img = np.array(inf_img).astype(np.float32)
     inference_input = preprocess_input(inf_img.transpose(2, 0, 1))
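
The four steps above can be sketched roughly as follows. This is only an illustration under several assumptions that the thread does not confirm: the image is handled with Pillow, the resized image is pasted at the top-left corner of a black canvas, bilinear interpolation is used, and preprocess_input runs in Keras' default "caffe" mode on channels-first data (RGB to BGR, then per-channel ImageNet mean subtraction, no scaling). The function name preprocess_rgb is hypothetical.

```python
import numpy as np
from PIL import Image

# ImageNet per-channel means in BGR order (the values quoted in this thread).
BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def preprocess_rgb(img, input_w=416, input_h=416):
    """Letterbox a PIL RGB image and apply caffe-style mean subtraction."""
    # Step 1: resize while keeping the aspect ratio.
    ratio = min(input_w / img.width, input_h / img.height)
    new_w = int(round(img.width * ratio))
    new_h = int(round(img.height * ratio))
    resized = img.resize((new_w, new_h), Image.BILINEAR)  # interpolation is an assumption

    # Step 2: blank (black) canvas with the model input size.
    canvas = Image.new("RGB", (input_w, input_h))

    # Step 3: paste the resized image; the (0, 0) offset is an assumption.
    canvas.paste(resized, (0, 0))

    # Step 4: HWC uint8 -> CHW float32, then RGB -> BGR and mean subtraction
    # (a NumPy stand-in for preprocess_input in "caffe" mode, channels first).
    chw = np.array(canvas).astype(np.float32).transpose(2, 0, 1)
    bgr = chw[::-1, :, :]
    return bgr - BGR_MEAN.reshape(3, 1, 1)
```

Calling preprocess_rgb(Image.open(path).convert("RGB")) yields a (3, 416, 416) float32 array; a batch dimension still has to be added before feeding the engine.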

Also, I suggest running DeepStream inference first to check the result.
Make sure DeepStream can run your TRT engine correctly compared with tlt-infer.

Ok, understood, I’ll try with those steps. Thanks a lot for those details and suggestions!

By the way, to be precise in the implementation: does “do not change aspect ratio” in step 1 imply a padding operation? If so, could you specify which function or padding pattern is needed, please?

In addition, I also tried the preprocessing configuration in deepstream_tlt_apps, which does not work very well (https://github.com/NVIDIA-AI-IOT/deepstream_tlt_apps/blob/master/pgie_yolov3_tlt_config.txt).
In this file, I found only three preprocessing steps and no padding (lines 48-50):

// normalization scale is 1
// mean subtraction in BGR order
// 1 refers to channels in BGR order

You can consider step (3) as padding.

Please make sure you can run DeepStream properly.
You can try running the default YOLO models in NVIDIA-AI-IOT/deepstream_tao_apps (sample apps that demonstrate how to deploy models trained with TAO on DeepStream).

def letterbox(img, new_shape=(416, 416), color=(128, 128, 128)):
    # Resize image to a 32-pixel-multiple rectangle https://github.com/ultralytics/yolov3/issues/232
    shape = img.shape[:2]  # current shape [height, width]
    # Scale ratio (new / old)
    r = min(new_shape[0] / shape[0], new_shape[1] / shape[1])

    # Compute padding
    new_unpad = int(round(shape[1] * r)), int(round(shape[0] * r))
    dw, dh = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # wh padding
    dw, dh = np.mod(dw, 32), np.mod(dh, 32)  # wh padding

    dw /= 2  # divide padding into 2 sides
    dh /= 2
    if shape[::-1] != new_unpad:  # resize
        img = cv2.resize(img, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    img = cv2.copyMakeBorder(img, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # add border

    return img

Here is the code I used for padding; however, adding this padding makes the result even worse. Is this the right way to do it? Or could you perhaps provide example code for the padding, please?
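
One property of the padding function above can be checked without running the engine: the np.mod(..., 32) step pads only up to the next multiple of 32 rather than up to new_shape, so the output is generally not 416×416. If the engine expects a fixed 416×416 input, as the earlier snippet suggests, that alone could cause a mismatch. The sketch below reproduces just the shape arithmetic in plain Python; the helper name letterbox_output_shape and its flag are hypothetical and exist only for this illustration.

```python
def letterbox_output_shape(height, width, new_shape=(416, 416),
                           pad_to_multiple_of_32=True):
    """Return the (height, width) the letterbox arithmetic above produces."""
    # Same scale ratio and unpadded size as in the function above.
    r = min(new_shape[0] / height, new_shape[1] / width)
    new_w, new_h = int(round(width * r)), int(round(height * r))

    # Total padding needed to reach new_shape.
    dw, dh = new_shape[1] - new_w, new_shape[0] - new_h
    if pad_to_multiple_of_32:
        # This is the step that shrinks the padding to the next 32-multiple.
        dw, dh = dw % 32, dh % 32

    # Split the padding between the two sides, as the original does.
    dw, dh = dw / 2, dh / 2
    top, bottom = int(round(dh - 0.1)), int(round(dh + 0.1))
    left, right = int(round(dw - 0.1)), int(round(dw + 0.1))
    return new_h + top + bottom, new_w + left + right


# A 500x375 (width x height) image, a common VOC size.
print(letterbox_output_shape(375, 500))
print(letterbox_output_shape(375, 500, pad_to_multiple_of_32=False))
```

With the modulo step the result is 320 rows by 416 columns; without it the result is the full 416 by 416. Note also that this function centres the image and pads with grey (128), whereas the steps described earlier paste the resized image onto a newly created canvas.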

Please refer to the steps I listed and commented on above. I think I have already described them clearly.
You can also use DeepStream to run inference. It is the default example for deploying etlt models or TRT engines.

Right. If you agree with my padding process, then I think I’ve already implemented all the steps you mentioned; still, I cannot get results consistent with tlt-infer.

As for validating with DeepStream, I’m afraid it’s not an option for me, as I’m validating on images rather than video streams.

DeepStream can run inference on images:

./deepstream-custom -c pgie_config_file -i <H264 or JPEG filename> [-b BATCH] [-d]
    -h: print help info
    -c: pgie config file, e.g. pgie_frcnn_tlt_config.txt
    -i: H264 or JPEG input file
    -b: batch size, this will override the value of "batch-size" in pgie config file
    -d: enable display, otherwise dump to output H264 or JPEG file

Hi Morganh,
After implementing the preprocessing steps you mentioned before, the bounding boxes and scores from the TRT engine are getting closer to the results of tlt-infer, with some minor differences.

However, I get a lot of wrong label assignments, and I observe that each wrong label is exactly the one preceding the correct label in the list. Here is my list of VOC categories, along with some visualizations:
classes_voc = ["aeroplane", "bicycle", "bird", "boat", "bottle", "bus", "car", "cat", "chair", "cow", "diningtable", "dog", "horse", "motorbike", "person", "pottedplant", "sheep", "sofa", "train", "tvmonitor"]

Do you have any idea what might cause this mislabeling?

Can you deploy the same TRT engine with DeepStream to check whether the issue can be reproduced?

I also tried with DeepStream; however, I still have problems with the label assignment.

Can you narrow down your issue by doing the following?

  1. deploy the etlt model in DeepStream
  2. run the default YOLO Jupyter notebook and then deploy its etlt model or TRT engine in DeepStream

Hi Morganh,
I’m now trying a similar experiment with RetinaNet, to see whether I can get consistent results. Could you please tell me the preprocessing steps for RetinaNet?

In TLT, the pre-processing for RetinaNet is as follows:

  • assume RGB input values in the range 0.0 to 255.0 as float

  • change from RGB to BGR

  • then subtract 103.939, 116.779 and 123.68 from the B, G and R channels respectively.
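
As a rough illustration, the three RetinaNet steps listed above could look like this in NumPy. The sketch assumes the input is an H×W×3 array in RGB order; the memory layout, any resizing, and the function name retinanet_preprocess are assumptions that are not stated in this thread.

```python
import numpy as np

# Per-channel means in BGR order, as quoted above.
BGR_MEAN = np.array([103.939, 116.779, 123.68], dtype=np.float32)


def retinanet_preprocess(rgb_hwc):
    """RGB (H, W, 3) values in [0, 255] -> mean-subtracted BGR float32."""
    x = np.asarray(rgb_hwc, dtype=np.float32)  # values as float in 0.0..255.0
    bgr = x[..., ::-1]                         # reverse channel axis: RGB -> BGR
    return bgr - BGR_MEAN                      # broadcasts over the last axis


pixel = np.full((2, 2, 3), [255, 128, 0], dtype=np.uint8)  # R=255, G=128, B=0
print(retinanet_preprocess(pixel)[0, 0])
```

For the example pixel the output channels are B = 0 - 103.939, G = 128 - 116.779 and R = 255 - 123.68. No scaling is applied, matching the description above.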