Different results when loading multiple TensorRT models sequentially in the same script


  • AGX Xavier
  • JetPack 4.6
  • TensorRT 8.0.1
  • CUDA 10.2
  • cuDNN 8.2.1

I am running a script that runs a face detection model, then a tracker, then a classifier model on the detected faces. I am facing the following issue:

  • When I load the face detection model first, then load the face classifier model and run some tests, I get stray values of -1.875 and -1 in the classifier output. My code:
#load in the models
stream = cuda.Stream()
TRT_LOGGER = trt.Logger()
explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
#load in the face detector tensorrt model
with open("weights/yolov5s-face-448x800.trt", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
	fd_engine = runtime.deserialize_cuda_engine(f.read())
for binding in fd_engine:
	if fd_engine.binding_is_input(binding):
		fd_device_input = cuda.mem_alloc(trt.volume(fd_engine.get_binding_shape(binding)) * fd_engine.max_batch_size * np.dtype(np.float32).itemsize)
		fd_host_output = cuda.pagelocked_empty(trt.volume(fd_engine.get_binding_shape(binding)) * fd_engine.max_batch_size, dtype=np.float32)
		fd_device_output = cuda.mem_alloc(fd_host_output.nbytes)
fd_context = fd_engine.create_execution_context()
#load in the face classifier tensorrt model
with open("weights/resnet34_fc_fp16.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
	fc_engine = runtime.deserialize_cuda_engine(f.read())
for binding in fc_engine:
	if fc_engine.binding_is_input(binding):
		fc_device_input = cuda.mem_alloc(abs(trt.volume(fc_engine.get_binding_shape(binding))) * fc_engine.max_batch_size * np.dtype(np.float32).itemsize)
		fc_host_output = cuda.pagelocked_empty(abs(trt.volume(fc_engine.get_binding_shape(binding))) * fc_engine.max_batch_size, dtype=np.float32)
		fc_device_output = cuda.mem_alloc(fc_host_output.nbytes)
fc_context = fc_engine.create_execution_context()

#Test face classifier
test_batchsize = 3
fc_context.set_binding_shape(0, (test_batchsize, 3, 224, 224))
batch_list = []
for j in range(test_batchsize):
	img = Image.open("test_materials/turkish_coffee.jpg")
	img = img.resize((224, 224))
	if img.mode == "RGBA":
		img = img.convert("RGB")
	convert_to_tensor = transforms.ToTensor()
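	normalize = transforms.Normalize(mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225])
	# preprocess and collect the image (this step appears to be missing from the posted code)
	batch_list.append(normalize(convert_to_tensor(img)))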
input = torch.stack(batch_list)
fc_host_input = np.array(input.numpy(), dtype=np.float32, order='C')
cuda.memcpy_htod_async(fc_device_input, fc_host_input, stream)
fc_context.execute_async_v2(bindings=[int(fc_device_input), int(fc_device_output)], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(fc_host_output, fc_device_output, stream)

I get this result:
As can be seen, the -1.875 and -1 values are strays.
But if I remove the loading of the face detector TensorRT model:

#load in the models
stream = cuda.Stream()
TRT_LOGGER = trt.Logger()
explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
#load in the face classifier tensorrt model
with open("weights/resnet34_fc_fp16.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
	fc_engine = runtime.deserialize_cuda_engine(f.read())
for binding in fc_engine:
	if fc_engine.binding_is_input(binding):
		fc_device_input = cuda.mem_alloc(abs(trt.volume(fc_engine.get_binding_shape(binding))) * fc_engine.max_batch_size * np.dtype(np.float32).itemsize)
		fc_host_output = cuda.pagelocked_empty(abs(trt.volume(fc_engine.get_binding_shape(binding))) * fc_engine.max_batch_size, dtype=np.float32)
		fc_device_output = cuda.mem_alloc(fc_host_output.nbytes)
fc_context = fc_engine.create_execution_context()

#Test face classifier
test_batchsize = 3
fc_context.set_binding_shape(0, (test_batchsize, 3, 224, 224))
batch_list = []
for j in range(test_batchsize):
	img = Image.open("test_materials/turkish_coffee.jpg")
	img = img.resize((224, 224))
	if img.mode == "RGBA":
		img = img.convert("RGB")
	convert_to_tensor = transforms.ToTensor()
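	normalize = transforms.Normalize(mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225])
	# preprocess and collect the image (this step appears to be missing from the posted code)
	batch_list.append(normalize(convert_to_tensor(img)))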
input = torch.stack(batch_list)
fc_host_input = np.array(input.numpy(), dtype=np.float32, order='C')
cuda.memcpy_htod_async(fc_device_input, fc_host_input, stream)
fc_context.execute_async_v2(bindings=[int(fc_device_input), int(fc_device_output)], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(fc_host_output, fc_device_output, stream)

I instead get this:

Could someone advise why this is happening?


This might be an issue.
Would you mind sharing the testing image and model so we can reproduce this on our side?


Hi, gladly!

The image is turkish_coffee. yolov5s-face-448x800.trt is the TensorRT model for face detection, with height 448 and width 800. resnet34_fc_fp16.engine is a TensorRT model for image classification, with height 224 and width 224.

I’ve also attached the code I used to convert and test the classifier. The yolov5s-face-448x800.trt model is converted from the deepcam-cn/yolov5-face GitHub repository (YOLO5Face: Why Reinventing a Face Detector, https://arxiv.org/abs/2105.12931, ECCV Workshops 2022).

Thank you for your help!

yolov5s-face-448x800.trt (20.8 MB)
resnet34_fc_fp16.engine (40.9 MB)
convert_classifier.py (3.8 KB)


Thanks for the source and data.

Confirmed that we can reproduce the same issue in our environment.
Will share more information with you later.


Thank you.


We found this is not a bug.

Based on the source, the batch size is set to 3 during inference:

test_batchsize = 3
fc_context.set_binding_shape(0, (test_batchsize, 3, 224, 224))

So only the part of the output buffer that corresponds to this batch holds valid results.
In your case, that is the first 6 items ( output size (=2) * batch (=3) ).

The unexpected values occur in the invalid region beyond those items.
Since these values should not be used, this is not a bug.
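
To make the point above concrete, here is a minimal sketch of reading only the valid region of an oversized output buffer. It uses NumPy alone so it runs without TensorRT; the helper name valid_output and the sample numbers are made up for illustration and are not part of the original script.

```python
import numpy as np

def valid_output(host_output, batch_size, output_size):
    """Return only the part of a flat output buffer written for this batch.

    host_output: flat buffer copied back from the device (may be oversized).
    batch_size:  batch used in set_binding_shape for this inference.
    output_size: number of output values per sample (2 for this classifier).
    """
    n_valid = batch_size * output_size
    if host_output.size < n_valid:
        raise ValueError("output buffer is smaller than batch_size * output_size")
    # Everything past n_valid is leftover memory and must be ignored.
    return host_output[:n_valid].reshape(batch_size, output_size)

# Stand-in for fc_host_output: an oversized buffer pre-filled with a stray value.
# These numbers are illustrative only, not real model outputs.
buffer = np.full(16, -1.875, dtype=np.float32)
buffer[:6] = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7]

scores = valid_output(buffer, batch_size=3, output_size=2)
print(scores.shape)  # (3, 2)
```

In the script from the question this would roughly correspond to valid_output(fc_host_output, test_batchsize, 2), read after calling stream.synchronize() so that the asynchronous device-to-host copy has finished.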


