Very bad results from TLT MobileNetV2 TensorRT engine

Description

Hello everyone.
Thanks for nvidia tlt:v3.0-py3. I get very good results from my TLT MobileNetV2 classification model.
Unfortunately, after converting the .tlt model to a TensorRT engine (both FP16 and INT8), I get very bad results.
My training data has 3 classes in total, and the TensorRT prediction always gives the highest score to the third class.
To make it clearer: I tested both the TLT model and the TensorRT engine on the same dataset. The TLT model gives very good results (99% accuracy), while the TensorRT engine (FP16 and INT8) gives very bad results.

Environment

TensorRT Version: 7.2.1
GPU Type: RTX 2080ti

Reproduce

Here is my TensorRT inference code. Please help me take a look at it.

import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np
import tensorrt as trt
import cv2
import torch


TRT_LOGGER = trt.Logger()

# Simple helper data class that's a little nicer to use than a 2-tuple.
class HostDeviceMem(object):
	def __init__(self, host_mem, device_mem):
		self.host = host_mem
		self.device = device_mem

	def __str__(self):
		return "Host:\n" + str(self.host) + "\nDevice:\n" + str(self.device)

	def __repr__(self):
		return self.__str__()

# Allocates all buffers required for an engine, i.e. host/device inputs/outputs.
def allocate_buffers(engine):
	inputs = []
	outputs = []
	bindings = []
	stream = cuda.Stream()
	out_shapes = []
	input_shapes = []
	out_names = []
	max_batch_size = engine.max_batch_size
	for binding in engine:
		binding_shape = engine.get_binding_shape(binding)
		#Fix -1 dimension for proper memory allocation for batch_size > 1
		if binding_shape[0] == -1:
			binding_shape = (1,) + binding_shape[1:]
		size = trt.volume(binding_shape) * max_batch_size
		dtype = trt.nptype(engine.get_binding_dtype(binding))
		# Allocate host and device buffers
		host_mem = cuda.pagelocked_empty(size, dtype)
		device_mem = cuda.mem_alloc(host_mem.nbytes)
		# Append the device buffer to device bindings.
		bindings.append(int(device_mem))
		# Append to the appropriate list.
		if engine.binding_is_input(binding):
			inputs.append(HostDeviceMem(host_mem, device_mem))
			input_shapes.append(engine.get_binding_shape(binding))
		else:
			outputs.append(HostDeviceMem(host_mem, device_mem))
			#Collect original output shapes and names from engine
			out_shapes.append(engine.get_binding_shape(binding))
			out_names.append(binding)
	return inputs, outputs, bindings, stream, input_shapes, out_shapes, out_names, max_batch_size

# This function is generalized for multiple inputs/outputs.
# inputs and outputs are expected to be lists of HostDeviceMem objects.
def do_inference(context, bindings, inputs, outputs, stream):
	# Transfer input data to the GPU.
	[cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
	# Run inference.
	context.execute_async(bindings=bindings, stream_handle=stream.handle)
	# Transfer predictions back from the GPU.
	[cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
	# Synchronize the stream
	stream.synchronize()
	# Return only the host outputs.
	return [out.host for out in outputs]

class TrtModel(object):
	def __init__(self, model):
		self.engine_file = model
		self.engine = None
		self.inputs = None
		self.outputs = None
		self.bindings = None
		self.stream = None
		self.context = None
		self.input_shapes = None
		self.out_shapes = None
		self.max_batch_size = 1
		self.cuda_ctx = cuda.Device(0).make_context()
		if self.cuda_ctx:
			self.cuda_ctx.push()


	def build(self):
		
		with open(self.engine_file, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
			self.engine = runtime.deserialize_cuda_engine(f.read())
		self.inputs, self.outputs, self.bindings, self.stream, self.input_shapes, self.out_shapes, self.out_names, self.max_batch_size = allocate_buffers(
			self.engine)

		self.context = self.engine.create_execution_context()
		self.context.active_optimization_profile = 0
		if self.cuda_ctx:
			self.cuda_ctx.pop()

	def run(self, input, deflatten: bool = True, as_dict=False):
		
		# lazy load implementation
		if self.engine is None:
			self.build()
		if self.cuda_ctx:
			self.cuda_ctx.push()

		input = np.asarray(input)
		batch_size = input.shape[0]
		allocate_place = np.prod(input.shape)
		
		self.inputs[0].host[:allocate_place] = input.flatten(order='C').astype(np.float32)
		self.context.set_binding_shape(0, input.shape[1:])
		trt_outputs = do_inference(
			self.context, bindings=self.bindings,
			inputs=self.inputs, outputs=self.outputs, stream=self.stream)

		if self.cuda_ctx:
			self.cuda_ctx.pop()
		#Reshape TRT outputs to original shape instead of flattened array
		if deflatten:
			trt_outputs = [torch.from_numpy(output.reshape(shape)) for output, shape in zip(trt_outputs, self.out_shapes)]
		if as_dict:
			return {name: trt_outputs[i] for i, name in enumerate(self.out_names)}

		return trt_outputs



engine = TrtModel("/data/mobilenet_tensort721.engine")
engine.build()
image = cv2.imread("/data/dataset/train/withmask/2_WORLD_Coronavirus_083975_withmask_4.jpg")
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
resized_rgb_image = cv2.resize(image_rgb, (224, 224))
normed_image = resized_rgb_image.astype('float32') / 255.0
trt_input = np.transpose(normed_image, (2, 0, 1))
trt_input = np.expand_dims(trt_input, axis=0)

trt_outputs = engine.run(trt_input)
print("trt_output: ", np.array(trt_outputs[0]).shape)

Hi,
We request you to share the model, script, profiler, and performance output, if not shared already, so that we can help you better.
Alternatively, you can try running your model with the trtexec command.
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
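
For example, a minimal check on your serialized engine (assuming the engine path from your script) could look like:

 trtexec --loadEngine=/data/mobilenet_tensort721.engine \
         --batch=1 \
         --dumpOutput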

While measuring model performance, make sure you consider only the latency and throughput of the network inference, excluding the data pre- and post-processing overhead.
Please refer to the links below for more details:
https://docs.nvidia.com/deeplearning/tensorrt/archives/tensorrt-722/best-practices/index.html#measure-performance
https://docs.nvidia.com/deeplearning/tensorrt/best-practices/index.html#model-accuracy
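
For example, one rough way to measure this with the script above (a sketch; it reuses the engine and trt_input objects from your code and excludes the OpenCV preprocessing):

import time

# Warm up first so lazy engine loading does not skew the numbers.
for _ in range(10):
	engine.run(trt_input)

# Time only the TensorRT inference call, excluding image decode/resize/normalization.
n_runs = 100
start = time.perf_counter()
for _ in range(n_runs):
	engine.run(trt_input)
elapsed = time.perf_counter() - start
print("mean latency: %.2f ms, throughput: %.1f img/s"
	% (1000.0 * elapsed / n_runs, n_runs / elapsed))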

Thanks!

Hi @NVES
Thanks for your quick response. Because the accuracy is bad, I cannot provide performance numbers right now.

After getting the .tlt model, I convert it to a TensorRT engine as follows:

 !tlt-converter $USER_EXPERIMENT_DIR/export/final_model.etlt \
                -k $KEY \
                -c $USER_EXPERIMENT_DIR/export/final_model_int8_cache.bin \
                -o predictions/Softmax \
                -d 3,224,224 \
                -i nchw \
                -m 64 -t int8 \
                -e $USER_EXPERIMENT_DIR/export/mobilenet_tensort7221_batch1.engine \
                -b 1

mobilenet_tensort721.engine (2.8 MB)
Model config (1.4 KB)
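
If it helps, here is a quick sanity check of the engine bindings (a sketch using the same TensorRT 7.2 Python API as the script above), to confirm the input is 3x224x224 NCHW and the output is the 3-class softmax:

import tensorrt as trt

TRT_LOGGER = trt.Logger()

# Deserialize the engine and print each binding's name, direction, shape and dtype.
with open("/data/mobilenet_tensort721.engine", "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
	engine = runtime.deserialize_cuda_engine(f.read())

for binding in engine:
	direction = "input" if engine.binding_is_input(binding) else "output"
	print(binding, direction,
		engine.get_binding_shape(binding),
		engine.get_binding_dtype(binding))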

For more detail, my full TLT process is in the attached notebook:
classification.ipynb (893.7 KB)

Hi,

This looks TLT related. We recommend that you post your concern on the TAO forum to get better help. We also recommend using the latest TensorRT version.

Thank you.

Thank you

Your result is not just bad, it is abnormal: you misused cuda_ctx.push() and pop(). However, I am not going to tell you how to use them correctly.
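
For reference, a minimal sketch of one common push/pop pattern, assuming a single GPU and a wrapper class that owns its own CUDA context (TrtWrapper is a hypothetical name, not the class from the thread):

import pycuda.driver as cuda
# Note: do not also import pycuda.autoinit when managing the context manually,
# otherwise a second context ends up on the stack.

class TrtWrapper(object):  # hypothetical name, for illustration only
	def __init__(self):
		cuda.init()
		# make_context() both creates the context and pushes it onto this
		# thread's context stack, so an extra push() here would double-push.
		self.cuda_ctx = cuda.Device(0).make_context()
		# ... deserialize the engine / allocate buffers while the context is current ...
		self.cuda_ctx.pop()  # leave the stack clean between calls

	def run(self, batch):
		self.cuda_ctx.push()  # make the context current for this call
		try:
			pass  # ... host-to-device copies, execute_async, device-to-host copies, synchronize ...
		finally:
			self.cuda_ctx.pop()  # always balance the push

	def __del__(self):
		self.cuda_ctx.detach()  # release the context when the wrapper is destroyed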