A problem with batch size when converting an ONNX file to a TensorRT engine

I have an ONNX model and want to convert it to a TensorRT engine. When the batch size is set to 1, the output is correct, but when the batch size is 16, the result is wrong. I have verified that batchsize=16 produces correct results in both PyTorch and ONNX Runtime.
My test environment is:

  • TensorRT 7.2.2
  • cuda 11.1
  • onnx 1.7.0 (opset=11)
  • pytorch 1.4.0
  • onnxruntime 1.8.1

The specific test process is as follows:
  1. With batch_size = 1, the result is

max score: 0.271 | min score: 0.006

  2. With batch_size = 16, using the same image as input and repeating the matrix 16 times so that the input dimension is [16, 1, 224, 224], correct results are obtained in PyTorch and ONNX Runtime, but the result under TensorRT is wrong:

max score: 1.783 | min score: 1.771
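For reference, the batch-16 input in step 2 is built by repeating the single image along the batch axis. A minimal sketch (random data stands in for the real preprocessed image here):

```python
import numpy as np

# Single preprocessed image, shape [1, 1, 224, 224] (grayscale, NCHW).
# Random data is used as a placeholder for the real image.
img = np.random.rand(1, 1, 224, 224).astype(np.float32)

# Repeat the same image 16 times along the batch axis -> [16, 1, 224, 224].
batch = np.repeat(img, 16, axis=0)
print(batch.shape)  # (16, 1, 224, 224)

# Every row of the batch is identical, so the per-image scores should
# match the batch_size=1 run exactly.
assert np.array_equal(batch[0], batch[15])
```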

I can upload the code and ONNX file to reproduce the error; I hope to get your reply.

stpm_tensorrt.zip (39.9 MB)
I have used trtexec, onnx2trt, and the TensorRT Python API to generate the engine. All of these methods produce the same wrong result when batchsize=16.
Attached are the ONNX file and Python code.
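For reproduction, the trtexec build step looks roughly like this (the model/engine file names and the input tensor name `input` are placeholders; substitute the actual input name from the ONNX file):

```shell
# Build an explicit-batch engine with a fixed 16-image input shape.
# With a dynamic batch dimension in the ONNX model, --shapes fixes it at 16.
trtexec --onnx=stpm.onnx \
        --explicitBatch \
        --shapes=input:16x1x224x224 \
        --saveEngine=stpm_b16.engine
```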