Torch & TRT Hybrid Python solution - TRT Python migration from pycuda to cuda-python interface

Description

I get different results when running inference on the same torch tensor data through the TRT Python interface and through a torch forward pass.

Environment

TensorRT Version: 8.6.1.6
GPU Type: RTX 4090 mobile
Nvidia Driver Version: 546.24
CUDA Version: 11.8
CUDNN Version: 8.9.7
Operating System + Version: Windows 11 Pro 10.0.22631 Build 22631
Python Version (if applicable): 3.10.11
TensorFlow Version (if applicable): NA
PyTorch Version (if applicable): 2.2.1+cu118
Baremetal or Container (if container which image + tag): Baremetal

Overview

I am trying to bring up a proof-of-concept (POC) solution as quickly as possible.

For that reason, I want the POC to be Python only.

For various reasons, the POC must use both the Torch and TRT libraries together.

Recently, I noticed that the TRT Python samples package, which is part of the TRT SDK, moved from the pycuda interface to cuda-python.

The major changes are in the file common.py, which wraps all the required cuda-python API calls, such as memory allocation, memory copy, and stream creation.

My question is about how a user can load torch tensor data into a TRT binding.

The implementation in common.py demonstrates a two-step technique:

  1. The user copies the CPU array to page-locked (host) memory allocated with cudart.cudaMallocHost.

  2. The data is then copied from the page-locked memory to device memory allocated with cudart.cudaMalloc, which is declared as the TRT binding.
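
The two steps above can be sketched roughly as follows. This is a minimal illustration and not the actual common.py code: check and staged_upload are hypothetical helper names, device_ptr is assumed to come from an earlier cudart.cudaMalloc call, and the cuda-python error enum is assumed to convert to an integer where 0 means cudaSuccess.

```python
import ctypes


def check(result):
    """Unpack a cuda-python style return tuple (err, *values); raise on failure."""
    err, *values = result
    if int(err) != 0:  # assumption: 0 corresponds to cudaSuccess
        raise RuntimeError(f"CUDA call failed: {err!r}")
    if not values:
        return None
    return values[0] if len(values) == 1 else tuple(values)


def staged_upload(host_array, device_ptr, stream):
    """Copy a NumPy array to device memory through a pinned staging buffer."""
    # Imported lazily so that check() can be used on a machine without a GPU.
    import numpy as np
    from cuda import cudart

    nbytes = host_array.nbytes
    # Step 1: allocate page-locked host memory and copy the CPU array into it.
    pinned_ptr = check(cudart.cudaMallocHost(nbytes))
    try:
        ctype = np.ctypeslib.as_ctypes_type(host_array.dtype)
        pinned = np.ctypeslib.as_array(
            ctypes.cast(pinned_ptr, ctypes.POINTER(ctype)), (host_array.size,)
        )
        np.copyto(pinned, host_array.ravel())
        # Step 2: copy from pinned memory to the device buffer used as binding.
        check(
            cudart.cudaMemcpyAsync(
                device_ptr,
                pinned_ptr,
                nbytes,
                cudart.cudaMemcpyKind.cudaMemcpyHostToDevice,
                stream,
            )
        )
        # Wait for the copy before the staging buffer is released.
        check(cudart.cudaStreamSynchronize(stream))
    finally:
        check(cudart.cudaFreeHost(pinned_ptr))
```

In a real pipeline the pinned buffer would normally be allocated once and reused; it is allocated per call here only to keep the sketch short.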

But what if I want to minimize the latency?

For example, I already have a torch tensor on the CUDA device and I want to run TRT inference on it with minimum latency.

Instead of:

  1. Download the torch tensor data from the GPU to the CPU and convert it to numpy using torch.Tensor.cpu().numpy()

  2. Copy it to the page-locked host memory using numpy.copyto

  3. Upload it to the device memory (TRT binding) using cudart.cudaMemcpyAsync

I want to copy the torch tensor data, which is already on the CUDA device, directly to the device memory used as the TRT binding, i.e. only step 3 above.

I don’t want the CPU to be involved at all.

Using the torch.Tensor.data_ptr() method, I was able to get the raw GPU pointer of the torch tensor.

But when I called cudart.cudaMemcpyAsync with the appropriate adaptations (source and destination addresses, cudaMemcpyKind, etc.), the TRT inference did not return the same results as the Torch forward operation on the same input tensor.
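
For reference, a minimal sketch of the direct device-to-device path described above, under stated assumptions. It is not the exact code that produced the mismatch: tensor_copy_args and copy_tensor_to_binding are hypothetical helper names, binding_ptr is assumed to be a device pointer from cudart.cudaMalloc, and the integer stream handle from Torch is assumed to be accepted by cudart.cudaMemcpyAsync. The checks cover conditions that can change results without raising any CUDA error: a non-contiguous tensor (data_ptr() then points at strided memory), a dtype or shape that differs from what the engine expects, and a copy issued on a different stream from the one Torch used to produce the tensor.

```python
def tensor_copy_args(tensor, expected_shape, expected_dtype):
    """Validate a CUDA tensor and return (source pointer, byte count)."""
    if not tensor.is_cuda:
        raise ValueError("tensor is not on a CUDA device")
    if not tensor.is_contiguous():
        # A raw memcpy ignores strides, so the layout must be dense.
        raise ValueError("tensor is not contiguous; call .contiguous() first")
    if tuple(tensor.shape) != tuple(expected_shape):
        raise ValueError(f"shape {tuple(tensor.shape)} != {tuple(expected_shape)}")
    if tensor.dtype != expected_dtype:
        raise ValueError(f"dtype {tensor.dtype} != {expected_dtype}")
    return tensor.data_ptr(), tensor.numel() * tensor.element_size()


def copy_tensor_to_binding(tensor, binding_ptr, expected_shape, expected_dtype):
    """Copy a CUDA tensor into a TRT binding buffer without touching the CPU."""
    # Imported lazily so tensor_copy_args() can be used without a GPU.
    import torch
    from cuda import cudart

    src_ptr, nbytes = tensor_copy_args(tensor, expected_shape, expected_dtype)
    # Use Torch's current stream so the copy is ordered after the kernels
    # that wrote the tensor.
    stream_handle = torch.cuda.current_stream().cuda_stream
    (err,) = cudart.cudaMemcpyAsync(
        binding_ptr,
        src_ptr,
        nbytes,
        cudart.cudaMemcpyKind.cudaMemcpyDeviceToDevice,
        stream_handle,  # assumption: an int handle is accepted here
    )
    if int(err) != 0:  # assumption: 0 corresponds to cudaSuccess
        raise RuntimeError(f"cudaMemcpyAsync failed: {err!r}")
    # Enqueue TRT on this same handle so inference runs after the copy.
    return stream_handle
```

If the engine is then enqueued on the returned stream handle (for example through execute_async_v2 in TensorRT 8.6), the copy and the inference are both ordered after the Torch kernels. A mismatch that disappears once the tensor is made contiguous or the streams are aligned would point at one of those causes rather than at cuda-python itself.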

No CUDA error was reported at all, and if I go back to running TRT the longer way, it works fine and the results match those of the Torch forward operation.

Do you have any experience with something like this?

I tried to find examples online but couldn’t find anything…

I hope cuda-python has no usage limitation here, because the sequence above works totally fine with the C++ interfaces.

Regards,