Different results when loading multiple TensorRT models sequentially in the same script

realimposter · February 23, 2022, 3:31pm

Specifications:

AGX Xavier
Jetpack 4.6
TensorRT 8.0.1
CUDA 10.2
cuDNN 8.2.1

I am running a script that runs a face detection model, then a tracker, then a classifier model on the detected faces. I am facing the following issue:

When I load the face detection model first, then load the face classifier model and run a test, the classifier output contains stray values of -1.875 and -1. My code:

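# imports needed by this snippet (pycuda.autoinit is an assumption: it creates the CUDA context)
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import tensorrt as trt
import torch
from PIL import Image
from torchvision import transforms

#load in the models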
stream = cuda.Stream()
TRT_LOGGER = trt.Logger()
explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
#load in the face detector tensorrt model
with open("weights/yolov5s-face-448x800.trt", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
	fd_engine = runtime.deserialize_cuda_engine(f.read())
for binding in fd_engine:
	if fd_engine.binding_is_input(binding):
		fd_device_input = cuda.mem_alloc(trt.volume(fd_engine.get_binding_shape(binding)) * fd_engine.max_batch_size * np.dtype(np.float32).itemsize)
	else:
		fd_host_output = cuda.pagelocked_empty(trt.volume(fd_engine.get_binding_shape(binding)) * fd_engine.max_batch_size, dtype=np.float32)
		fd_device_output = cuda.mem_alloc(fd_host_output.nbytes)
fd_context = fd_engine.create_execution_context()
#load in the face classifier tensorrt model
with open("weights/resnet34_fc_fp16.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
	fc_engine = runtime.deserialize_cuda_engine(f.read())
for binding in fc_engine:
	if fc_engine.binding_is_input(binding):
		fc_device_input = cuda.mem_alloc(abs(trt.volume(fc_engine.get_binding_shape(binding))) * fc_engine.max_batch_size * np.dtype(np.float32).itemsize)
	else:
		fc_host_output = cuda.pagelocked_empty(abs(trt.volume(fc_engine.get_binding_shape(binding))) * fc_engine.max_batch_size, dtype=np.float32)
		fc_device_output = cuda.mem_alloc(fc_host_output.nbytes)
fc_context = fc_engine.create_execution_context()

#Test face classifier
test_batchsize = 3
fc_context.set_binding_shape(0, (test_batchsize, 3, 224, 224))
batch_list = []
for j in range(test_batchsize):
	img = Image.open("test_materials/turkish_coffee.jpg")
	img = img.resize((224, 224))
	if img.mode == "RGBA":
		img = img.convert("RGB")
	convert_to_tensor = transforms.ToTensor()
	normalize = transforms.Normalize(mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225])
	batch_list.append(normalize(convert_to_tensor(img)))
input = torch.stack(batch_list)
fc_host_input = np.array(input.numpy(), dtype=np.float32, order='C')
cuda.memcpy_htod_async(fc_device_input, fc_host_input, stream)
fc_context.execute_async_v2(bindings=[int(fc_device_input), int(fc_device_output)], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(fc_host_output, fc_device_output, stream)
stream.synchronize()
print(fc_host_output)

I get this result:

As can be seen, the -1.875 and -1 are stray values.
But if I remove the loading of the face detector TensorRT model:

#load in the models
stream = cuda.Stream()
TRT_LOGGER = trt.Logger()
explicit_batch = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
#load in the face classifier tensorrt model
with open("weights/resnet34_fc_fp16.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
	fc_engine = runtime.deserialize_cuda_engine(f.read())
for binding in fc_engine:
	if fc_engine.binding_is_input(binding):
		fc_device_input = cuda.mem_alloc(abs(trt.volume(fc_engine.get_binding_shape(binding))) * fc_engine.max_batch_size * np.dtype(np.float32).itemsize)
	else:
		fc_host_output = cuda.pagelocked_empty(abs(trt.volume(fc_engine.get_binding_shape(binding))) * fc_engine.max_batch_size, dtype=np.float32)
		fc_device_output = cuda.mem_alloc(fc_host_output.nbytes)
fc_context = fc_engine.create_execution_context()

#Test face classifier
test_batchsize = 3
fc_context.set_binding_shape(0, (test_batchsize, 3, 224, 224))
batch_list = []
for j in range(test_batchsize):
	img = Image.open("test_materials/turkish_coffee.jpg")
	img = img.resize((224, 224))
	if img.mode == "RGBA":
		img = img.convert("RGB")
	convert_to_tensor = transforms.ToTensor()
	normalize = transforms.Normalize(mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225])
	batch_list.append(normalize(convert_to_tensor(img)))
input = torch.stack(batch_list)
fc_host_input = np.array(input.numpy(), dtype=np.float32, order='C')
cuda.memcpy_htod_async(fc_device_input, fc_host_input, stream)
fc_context.execute_async_v2(bindings=[int(fc_device_input), int(fc_device_output)], stream_handle=stream.handle)
cuda.memcpy_dtoh_async(fc_host_output, fc_device_output, stream)
stream.synchronize()
print(fc_host_output)

I instead get this:

Could someone advise why this is happening?

AastaLLL · February 24, 2022, 3:42am

Hi,

This might be an issue.
Would you mind sharing the testing image and model so we can reproduce this on our side?

Thanks.

realimposter · February 24, 2022, 5:41am

Hi, gladly!

The image is turkish_coffee. yolov5s-face-448x800.trt is the TensorRT model for face detection, with input height 448 and width 800. resnet34_fc_fp16.engine is a TensorRT model for image classification, with input height 224 and width 224.

I’ve also attached the code I used to convert and test the classifier. The yolov5s-face-448x800.trt model is converted from the GitHub repository deepcam-cn/yolov5-face (YOLO5Face: Why Reinventing a Face Detector, https://arxiv.org/abs/2105.12931, ECCV Workshops 2022).

Thank you for your help!

turkish_coffee
yolov5s-face-448x800.trt (20.8 MB)
resnet34_fc_fp16.engine (40.9 MB)
convert_classifier.py (3.8 KB)

AastaLLL · February 25, 2022, 8:32am

Hi,

Thanks for the source and data.

Confirmed that we can reproduce the same issue in our environment.
Will share more information with you later.

Thanks.

realimposter · February 26, 2022, 1:03am

Thank you.

AastaLLL · March 1, 2022, 4:59am

Hi,

We found that this is not a bug.

Based on the source, the batch size is set to 3 for inference:

test_batchsize = 3
fc_context.set_binding_shape(0, (test_batchsize, 3, 224, 224))

So only the corresponding part of the output buffer holds valid results.
In your case, that is the first 6 items (output size (= 2) × batch (= 3)).

The unexpected values appear in the invalid region beyond those 6 items, which this inference does not write.
Those values should not be used, so this is not a bug.
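
To illustrate the point, here is a minimal sketch of reading only the valid region of the host output buffer. It uses plain NumPy as a stand-in for fc_host_output. The helper name valid_outputs, the maximum batch size of 8, and the sample values are assumptions made for illustration and do not come from the attached script or engine.

```python
import numpy as np


def valid_outputs(host_output, batch_size, output_size):
    """Return only the entries written by an inference of `batch_size` items."""
    n_valid = batch_size * output_size
    if n_valid > host_output.size:
        raise ValueError("buffer is smaller than batch_size * output_size")
    # Everything past n_valid was not written by this inference.
    return host_output[:n_valid].reshape(batch_size, output_size)


# Stand-in for fc_host_output: a buffer sized for a hypothetical maximum
# batch of 8, filled with NaN to mimic a tail that holds leftover data.
max_batch, output_size, batch_size = 8, 2, 3
host_output = np.full(max_batch * output_size, np.nan, dtype=np.float32)

# Made-up values standing in for what the engine would write for 3 images.
host_output[:batch_size * output_size] = [0.1, 0.9, 0.1, 0.9, 0.1, 0.9]

logits = valid_outputs(host_output, batch_size, output_size)
print(logits.shape)  # one row per image in the batch
```

In the original script this would correspond to slicing fc_host_output to its first test_batchsize * 2 entries after stream.synchronize(). Since cuda.pagelocked_empty returns uninitialized memory, the tail can differ depending on what was loaded earlier in the process, which likely explains why the stray values change when the face detector is loaded first. Calling fc_host_output.fill(0) before inference should make the unused tail deterministic.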

Thanks.

system · March 23, 2022, 6:19am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.