PyTorch CUDA tensors as TRT engine bindings

Description

I want to do inference with a TensorRT engine on PyTorch GPU tensors. However, using the code below, if I create the tensors after I have created my execution context, I get the following error:

import tensorrt as trt
import torch
import pycuda.driver as cuda
import pycuda.autoinit

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

with open("model.engine", "rb") as f, \
        trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, \
        engine.create_execution_context() as context:

    output_buffer = cuda.mem_alloc(4*288*768*4)
    stream = cuda.Stream()

    for i in range(1):
        # use float32 explicitly; Python's built-in float maps to torch.float64
        tensor = torch.randn((4, 288, 768, 4), dtype=torch.float32,
                             device=torch.device('cuda'))
        context.execute_async_v2(bindings=[int(tensor.data_ptr()), int(output_buffer)],
                                 stream_handle=stream.handle)
        stream.synchronize()

[TensorRT] ERROR: …/rtExt/cuda/cudaGatherRunner.cpp (111) - Cuda Error in execute: 400 (invalid resource handle)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception

If I make the tensor before I create the execution context, there are no errors.

import tensorrt as trt
import torch
import pycuda.driver as cuda
import pycuda.autoinit

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

tensor = torch.randn((4, 288, 768, 4), dtype=torch.float32, device=torch.device('cuda'))

with open("model.engine", "rb") as f, \
        trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, \
        engine.create_execution_context() as context:
    
    output_buffer = cuda.mem_alloc(4*288*768*4)
    stream = cuda.Stream()

    for i in range(1):
        context.execute_async_v2(bindings=[int(tensor.data_ptr()), int(output_buffer)],
                                 stream_handle=stream.handle)
        stream.synchronize()

Is there any way to create a TRT execution context and then perform inference on PyTorch tensors that are created afterwards? I assume this has to do with CUDA contexts?

Environment

TensorRT Version: 7.2
GPU Type: Quadro RTX 3000
Nvidia Driver Version: 460.56
CUDA Version: 11.1
CUDNN Version:
Operating System + Version:
Python Version: 3.6
TensorFlow Version (if applicable):
PyTorch Version: 1.8

Hi,
Could you share the ONNX model and the script, if not already shared, so that we can assist you better?
In the meantime, you can try a few things:

  1. Validate your model with the snippet below.

check_model.py

import onnx

# NOTE: placeholder path - point this at your actual ONNX model
filename = "your_model.onnx"
model = onnx.load(filename)
onnx.checker.check_model(model)

  2. Try running your model with the trtexec command (an example invocation is shown below).
https://github.com/NVIDIA/TensorRT/tree/master/samples/opensource/trtexec
In case you are still facing issues, please share the trtexec "--verbose" log for further debugging.
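For reference, a minimal trtexec invocation might look like this (model.onnx is a placeholder for your model file):

trtexec --onnx=model.onnx --verbose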
Thanks!

Thanks for replying. I noticed it works when pushing and popping a CUDA context. I found that two different scenarios work, but I have to say I don't actually know why, because of my limited experience with CUDA contexts.

Scenario one: pushing/popping the CUDA context around tensor creation.

import tensorrt as trt
import torch
import pycuda.autoinit
import pycuda.driver as cuda

# create a CUDA context (make_context also pushes it onto the context stack)
ctx = cuda.Device(0).make_context()

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

with open("model.engine", "rb") as f, \
        trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, \
        engine.create_execution_context() as context:

    output_buffer = cuda.mem_alloc(8*288*768*4)
    stream = cuda.Stream()

    for i in range(10):
        ctx.push()
        tensor = torch.ones((8, 288, 768, 4), dtype=torch.float,
                            device=torch.device('cuda'))
        ctx.pop()

        context.execute_async_v2(bindings=[int(tensor.data_ptr()), int(output_buffer)],
                                 stream_handle=stream.handle)
        stream.synchronize()


ctx.pop()
exit()

Scenario two: pushing/popping the CUDA context around TRT engine inference.

import tensorrt as trt
import torch
import pycuda.autoinit
import pycuda.driver as cuda

# create a CUDA context (make_context also pushes it onto the context stack)
ctx = cuda.Device(0).make_context()

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

with open("model.engine", "rb") as f, \
        trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, \
        engine.create_execution_context() as context:

    output_buffer = cuda.mem_alloc(8*288*768*4)
    stream = cuda.Stream()

    for i in range(10):
        tensor = torch.ones((8, 288, 768, 4), dtype=torch.float,
                            device=torch.device('cuda'))

        ctx.push()
        context.execute_async_v2(bindings=[int(tensor.data_ptr()), int(output_buffer)],
                                 stream_handle=stream.handle)
        stream.synchronize()
        ctx.pop()


ctx.pop()

Hi @rneven,

Looks like you're using both PyTorch and PyCUDA. We recommend using PyTorch device tensors directly and dropping PyCUDA completely; PyTorch already provides the CUDA functionality you need (device memory, streams), so there is no second CUDA context to juggle.
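Below is a minimal sketch of that approach for the engine above, with both bindings backed by PyTorch device tensors and a PyTorch CUDA stream. The output shape (4, 288, 768) is only an assumption inferred from the 4*288*768*4-byte buffer in your snippets; adjust it to your engine's actual output binding.

import tensorrt as trt
import torch

TRT_LOGGER = trt.Logger(trt.Logger.INFO)

with open("model.engine", "rb") as f, \
        trt.Runtime(TRT_LOGGER) as runtime, \
        runtime.deserialize_cuda_engine(f.read()) as engine, \
        engine.create_execution_context() as context:

    # both bindings live in PyTorch-managed device memory; with no PyCUDA
    # involved, everything stays in the device's primary CUDA context
    input_tensor = torch.randn((4, 288, 768, 4), dtype=torch.float32,
                               device=torch.device('cuda'))
    # assumed output shape - query engine.get_binding_shape(1) for the real one
    output_tensor = torch.empty((4, 288, 768), dtype=torch.float32,
                                device=torch.device('cuda'))

    # a PyTorch CUDA stream; .cuda_stream exposes the raw stream handle
    stream = torch.cuda.Stream()

    context.execute_async_v2(
        bindings=[int(input_tensor.data_ptr()), int(output_tensor.data_ptr())],
        stream_handle=stream.cuda_stream)
    stream.synchronize()

Since the tensors, the stream, and the execution context all live in the same primary CUDA context, no push/pop bookkeeping is needed.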

Thank you.