TensorRT Wrong/No Model Output DeTr Jetson AGX Xavier

Description

Hello,

I am currently working on a transformer project based on DETR (GitHub - facebookresearch/detr: End-to-End Object Detection with Transformers).
The model is to be deployed on a Jetson AGX Xavier.
So I converted the model to a TensorRT engine, which worked fine (fp32, fp16, best, …).

Now I am trying to run the TRT model on the embedded device. The problem is that, no matter which TRT model I load, either the model does not produce proper results or I am reading the results incorrectly. The output shapes are correct, but the contents are always 0.

Environment

nvidia-tensorrt 4.6-b199
tensorrt 8.0.1.6-1+cuda10.2
Jetson AGX Xavier
Cuda compilation tools, release 10.2, V10.2.300
Build cuda_10.2_r440.TC440_70.29663091_0

your_model.trt (87.9 MB)
test3.py (6.2 KB)

I can't upload bigger models due to the upload limit.

Hi,

This looks like a Jetson issue. Please refer to the samples below in case they are useful.

For any further assistance, we will move this post to the Jetson-related forum.

Thanks!

Thank you in advance.

I believe that the TRT models were built correctly, but I am not 100% sure about that either.

I have debugged most of my code, and only one section does not seem to work. Could you take a rough look at exactly this section and confirm whether this is the right way to allocate memory and run inference?

Memory Allocation:

    self.input_dimension = np.empty([
        self.batch_size,
        self.channel_size,
        self.image_size,
        self.image_size],
        dtype=self.PRECISION)
    self.output_boxes_dimension = np.empty([
        self.batch_size,
        self.n_predicitons,
        4],
        dtype=self.PRECISION)
    self.output_logits_dimension = np.empty([
        self.batch_size,
        self.n_predicitons,
        self.n_CLASSES+1],
        dtype=self.PRECISION)

    input_batch = torch.from_numpy(self.input_dimension)
    output_boxes = torch.from_numpy(self.output_boxes_dimension)
    output_logits = torch.from_numpy(self.output_logits_dimension)

    cuda_inputs = cuda.mem_alloc(input_batch.detach().cpu().numpy().nbytes)
    cuda_outputs_boxes = cuda.mem_alloc(output_boxes.detach().cpu().numpy().nbytes)
    cuda_outputs_logits = cuda.mem_alloc(output_logits.detach().cpu().numpy().nbytes)

    bindings = [int(cuda_inputs), int(cuda_outputs_boxes), int(cuda_outputs_logits)]
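
One thing that may be worth checking at this point: TensorRT expects the `bindings` list to follow the engine's own binding indices, which are fixed when the engine is built and need not match the order assumed in the script. The I/O data type is also a property of the engine, and an engine built with FP16 enabled can still expose FP32 inputs and outputs, so it could differ from `self.PRECISION`. The following is only a minimal sketch using the TensorRT 8.x binding queries (`num_bindings`, `get_binding_name`, `binding_is_input`, `get_binding_shape`, `get_binding_dtype`). The helper names, the `FakeEngine` stand-in, and the binding names `inputs`, `pred_logits`, `pred_boxes` are assumptions for illustration, not taken from the uploaded engine.

```python
def describe_bindings(engine):
    """Return one dict per binding, in the engine's binding-index order."""
    rows = []
    for i in range(engine.num_bindings):
        rows.append({
            "index": i,
            "name": engine.get_binding_name(i),
            "is_input": engine.binding_is_input(i),
            "shape": tuple(engine.get_binding_shape(i)),
            "dtype": str(engine.get_binding_dtype(i)),
        })
    return rows


def ordered_bindings(engine, device_ptrs):
    """Build the `bindings` list in the order the engine expects.

    `device_ptrs` maps binding name -> device pointer (anything int() accepts,
    e.g. the result of cuda.mem_alloc).
    """
    return [int(device_ptrs[engine.get_binding_name(i)])
            for i in range(engine.num_bindings)]


class FakeEngine:
    """Hypothetical stand-in for trt.ICudaEngine so the sketch runs without a GPU.

    The names, shapes, and dtypes below are made up for the demo.
    """
    _layout = [
        ("inputs", True, (1, 3, 800, 800), "DataType.FLOAT"),
        ("pred_logits", False, (1, 100, 92), "DataType.FLOAT"),
        ("pred_boxes", False, (1, 100, 4), "DataType.FLOAT"),
    ]
    num_bindings = len(_layout)

    def get_binding_name(self, i):
        return self._layout[i][0]

    def binding_is_input(self, i):
        return self._layout[i][1]

    def get_binding_shape(self, i):
        return self._layout[i][2]

    def get_binding_dtype(self, i):
        return self._layout[i][3]


engine = FakeEngine()  # on the Jetson this would be the deserialized engine
for row in describe_bindings(engine):
    print(row)

# Placeholder pointers, keyed by name rather than by assumed position.
ptrs = {"inputs": 0x1000, "pred_boxes": 0x2000, "pred_logits": 0x3000}
bindings = ordered_bindings(engine, ptrs)
```

If the printed order on the real engine lists the logits before the boxes, the hand-written list `[input, boxes, logits]` would hand each output the wrong buffer; building the list by name avoids that assumption.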

Prediction:


    boxes = self.output_boxes_dimension
    logits = self.output_logits_dimension

    start_time = time.time()

    cuda.memcpy_htod_async(cuda_inputs, np_image, stream)
    context.execute_async_v2(bindings, stream.handle, None)
    cuda.memcpy_dtoh_async(boxes, cuda_outputs_boxes, stream)
    cuda.memcpy_dtoh_async(logits, cuda_outputs_logits, stream)
    stream.synchronize()

    print("[perception_detr] Batch runtime: " + str(time.time()-start_time))

    self.cfx.pop()

    return boxes, logits
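
Two details in this step might be worth a second look. PyCUDA documents that the host side of `memcpy_htod_async` / `memcpy_dtoh_async` must be page-locked memory (for example from `cuda.pagelocked_empty`), whereas the arrays above come from `np.empty`. Also, `execute_async_v2` returns a bool that the snippet ignores. Below is a hedged sketch of the same copy, execute, copy, synchronize sequence with the return value checked. The function name `run_inference` and the injectable `cuda` parameter are my own assumptions, added so that the flow can be dry-run without a GPU; this is not the poster's code or an official sample.

```python
def run_inference(context, bindings, stream, host_in, dev_in, outputs, cuda=None):
    """Enqueue the H2D copy, inference, and D2H copies on one stream, then wait.

    outputs: list of (host_buffer, device_buffer) pairs.
    cuda:    the pycuda.driver module; passed in only so the flow can be
             exercised with a stub. Defaults to the real module.
    Host buffers are assumed to be page-locked and C-contiguous, with the
    dtype the engine reports for each binding.
    """
    if cuda is None:
        import pycuda.driver as cuda  # only needed on the real device

    # Copy the preprocessed image to the device.
    cuda.memcpy_htod_async(dev_in, host_in, stream)

    # Enqueue inference; False means the kernels were not enqueued.
    ok = context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    if not ok:
        stream.synchronize()  # let the pending input copy finish first
        raise RuntimeError("execute_async_v2 reported that enqueueing failed")

    # Copy every output back to its host buffer.
    for host_out, dev_out in outputs:
        cuda.memcpy_dtoh_async(host_out, dev_out, stream)

    # The host buffers hold valid results only after the stream has finished.
    stream.synchronize()
    return [host_out for host_out, _ in outputs]
```

With the names from the post, a call could look like `run_inference(context, bindings, stream, np_image, cuda_inputs, [(boxes, cuda_outputs_boxes), (logits, cuda_outputs_logits)])`, assuming `np_image`, `boxes`, and `logits` were allocated as page-locked arrays beforehand.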

Thank you in advance.

Dear @zml-koop,
Do you still have this issue?