TX2 SCRFD model TensorRT conversion faulty

Hello!

I am trying to run the SCRFD 500m model from https://github.com/deepinsight/insightface/blob/master/detection/scrfd on a TX2. I managed to convert the model into a TensorRT engine; however, the outputs are wildly different from the version I have deployed on my laptop, which runs the ONNX version of the model as explained in the model repo.

I am aware that the TensorRT conversion might not include some of the preprocessing and postprocessing steps, but there are no minimal examples for SCRFD showing how to handle these inputs and outputs. For example, can I feed the TRT engine the exact same input as the ONNX model, or do I need to apply some preprocessing first? How can I tell which steps, if any, are not included in the TensorRT version?

Can I do something to make the TRT engine run exactly the same as the ONNX version? For example,

from insightface.model_zoo import SCRFD
scrfd_model_path = "face/saved_models/det_500m.onnx"
...
scrfd_detector = SCRFD(scrfd_model_path, scrfd_session)
scrfd_bboxes_orig, _ = scrfd_detector.detect(rgb_orig)  # rgb_orig is image as an array

Yields me the outputs I want, but

input_data = preprocess_image(image_path, input_shape)  # just shuffles dimensions and channels
...
np.copyto(inputs[0]["host"], input_data.ravel())  # stage input in page-locked host memory
cuda.memcpy_htod_async(inputs[0]["device"], inputs[0]["host"], stream)  # host -> device
context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)  # run inference
cuda.memcpy_dtoh_async(outputs[0]["host"], outputs[0]["device"], stream)  # device -> host
stream.synchronize()
output_data = outputs[0]["host"].reshape(5, -1)

gives a very different output. Even if I assume the outputs are unfiltered bboxes, they are not meaningful when overlaid on the input image (many are smaller than 10x10 pixels). This suggests the input needs some preprocessing, but I don't know what is required. If possible, I want to see exactly what TensorRT included in the engine.
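For comparison, the insightface ONNX wrapper appears to normalize the frame before inference (mean 127.5, scale 1/128, BGR to RGB, HWC to NCHW), and a TRT engine built from the bare ONNX graph would expect the same normalized tensor. A minimal sketch of that normalization, assuming a 640x640 engine input; `preprocess_for_scrfd` is a hypothetical helper, and the exact constants should be checked against your insightface version:

```python
import numpy as np

def preprocess_for_scrfd(bgr_img: np.ndarray) -> np.ndarray:
    """Replicate the normalization the insightface ONNX wrapper applies.
    Resizing/letterboxing to the engine's input size must happen before this."""
    rgb = bgr_img[..., ::-1].astype(np.float32)  # BGR -> RGB
    rgb = (rgb - 127.5) / 128.0                  # zero-center and scale
    chw = np.transpose(rgb, (2, 0, 1))           # HWC -> CHW
    return chw[np.newaxis, ...]                  # add batch dim -> NCHW

# e.g. a dummy 640x640 frame
blob = preprocess_for_scrfd(np.zeros((640, 640, 3), dtype=np.uint8))
print(blob.shape)  # (1, 3, 640, 640)
```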

Hi,

When deploying the model on the desktop, do you get the correct output?
If so, which TensorRT version do you use on the desktop?

Thanks.

Hello!

It turns out the TensorRT conversion changed the output order, so the postprocessing was working with mixed-up values. For future reference, the TensorRT engine outputs for SCRFD are fine and identical to the ONNX version, just in a different order, as follows:

SCRFD outputs scores, bboxes and keypoints for each stride, so there are 9 output lists. Given the strides

self._feat_stride_fpn = [8, 16, 32]
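As a sanity check, the row count of each of the nine outputs depends on the stride; a quick sketch assuming a 640x640 input and SCRFD's two anchors per feature-map cell (verify against your engine's actual binding shapes):

```python
input_size = 640
num_anchors = 2  # SCRFD places 2 anchors per feature-map cell
rows_per_stride = {}
for stride in (8, 16, 32):
    cells = (input_size // stride) ** 2      # feature-map positions at this stride
    rows_per_stride[stride] = cells * num_anchors
print(rows_per_stride)  # {8: 12800, 16: 3200, 32: 800}
# shapes per stride: scores (n, 1), bboxes (n, 4), keypoints (n, 10)
```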

ONNX outputs were ordered as follows:

stride1_scores
stride2_scores
stride3_scores
stride1_bboxes
stride2_bboxes
stride3_bboxes
stride1_keypoints
stride2_keypoints
stride3_keypoints

However, the TRT engine yields outputs in the following order:

stride1_scores
stride1_bboxes
stride1_keypoints
stride2_scores
stride2_bboxes
stride2_keypoints
stride3_scores
stride3_bboxes
stride3_keypoints
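An alternative to editing the loop is to permute the TRT output list back into ONNX order once, right after inference, and leave the original postprocessing untouched. A minimal sketch with placeholder names, assuming three strides with three outputs each:

```python
# TRT order:  grouped per stride  -> [s1_scores, s1_bboxes, s1_kps, s2_scores, ...]
# ONNX order: grouped per kind    -> [s1_scores, s2_scores, s3_scores, s1_bboxes, ...]
trt_outputs = [f"stride{s}_{kind}" for s in (1, 2, 3)
               for kind in ("scores", "bboxes", "keypoints")]
perm = [stride * 3 + kind for kind in range(3) for stride in range(3)]
onnx_order = [trt_outputs[i] for i in perm]
print(onnx_order)  # ['stride1_scores', 'stride2_scores', 'stride3_scores', ...]
```

In practice it may be more robust to match outputs by their tensor/binding names in the engine rather than by position, since the conversion preserves the ONNX output names.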

So after inference, change this section of the prediction code:

fmc = self.fmc
for idx, stride in enumerate(self._feat_stride_fpn):
  if self.batched:
    scores = net_outs[idx][0]
    bbox_preds = net_outs[idx + fmc][0]
    bbox_preds = bbox_preds * stride
    if self.use_kps:
      kps_preds = net_outs[idx + fmc * 2][0] * stride
  else:
    scores = net_outs[idx]
    bbox_preds = net_outs[idx + fmc]
    bbox_preds = bbox_preds * stride
    if self.use_kps:
      kps_preds = net_outs[idx + fmc * 2] * stride

To something like

fmc = self.fmc
indx = 0
for stride in self._feat_stride_fpn:
  if self.batched:
    scores = net_outs[indx][0]
    bbox_preds = net_outs[indx + 1][0]
    bbox_preds = bbox_preds * stride
    if self.use_kps:
      kps_preds = net_outs[indx + 2][0] * stride
  else:
    scores = net_outs[indx]
    bbox_preds = net_outs[indx + 1]
    bbox_preds = bbox_preds * stride
    if self.use_kps:
      kps_preds = net_outs[indx + 2] * stride
  indx += 3  # each stride contributes 3 consecutive outputs (scores, bboxes, keypoints)

To obtain the correct output.