I have 3 scripts:
1- My main script, where I load a TRT engine that has 2 inputs and 1 output and then reads the two types of inputs (here I am just creating random tensors with the same shapes as the real inputs). The audio_data tensors need to be moved to the GPU so I can preprocess them with torchaudio (there is no MKL support for ARM CPUs) and then moved back to the CPU to pass them to the TRT engine; a rough sketch of that preprocessing step is shown after the script below.
import torch

from libs.model_classes.trt_model import TRTMODEL

model = TRTMODEL("./sample.engine", 1)
model._load_model()

for _ in range(10):
    video_data = torch.rand((1, 8, 3, 224, 224))
    audio_data = torch.rand((1, 8, 18, 64)).to("cuda:0")
    input = [video_data, audio_data.to("cpu")]
    preds, preds_label = model._run_inference(input)
    print(preds, preds_label, sep=":")
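In the real pipeline the random audio_data above is replaced by GPU-side torchaudio preprocessing roughly like the sketch below; the MelSpectrogram transform, its parameters, and the waveform shape are just placeholders for illustration, not the actual preprocessing code:

import torch
import torchaudio

# Placeholder for the real preprocessing: run the transform on the GPU because
# torchaudio's CPU path relies on MKL, which is not available on ARM.
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64).to("cuda:0")

waveform = torch.rand((8, 16000), device="cuda:0")  # dummy audio batch
audio_data = mel(waveform)                          # computed on cuda:0
audio_data = audio_data.to("cpu")                   # back to CPU before handing it to the TRT engine

The only point here is that the audio tensor has to live on cuda:0 for that step, which is exactly what seems to trigger the error.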
2- My TRT model class, which initializes the model and passes its parameters to an inference function:
from libs.model_classes.abstract_model_class import ABSTRACT_MODEL
from libs.video_frame_prediction_fun import video_frame_prediction_trt

import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit
import numpy as np


class TRTMODEL(ABSTRACT_MODEL):
    def __init__(self,
                 model_path: str,
                 batch_size: int):
        self.model_path = model_path
        self.batch_size = batch_size

    def _load_model(self):
        self.runtime = trt.Runtime(trt.Logger(trt.Logger.WARNING))
        with open(self.model_path, "rb") as f:
            self.engine = self.runtime.deserialize_cuda_engine(f.read())
        self.context = self.engine.create_execution_context()
        bindings = [(bind, self.engine.binding_is_input(bind)) for bind in self.engine]  # [(input_bind, True), (output_bind, False)]

        # TRT MODEL DATA
        self.input_shape = []
        self.input_size = []
        self.device_input = []
        # self.output_shape = []
        # self.host_output = []
        # self.device_output = []

        # Create a stream in which to copy inputs/outputs and run inference.
        self.stream = cuda.Stream()

        for bind, isInput in bindings:
            temp_shape = self.engine.get_binding_shape(bind)
            print("[DEBUG] Shape for ", bind, " isInput = ", isInput, " shape: ", temp_shape)
            if isInput:  # input layer
                self.input_shape.append(temp_shape)
                self.input_size.append(trt.volume(self.input_shape[-1]) * self.engine.max_batch_size * np.dtype(np.float32).itemsize)  # in bytes
                self.device_input.append(cuda.mem_alloc(self.input_size[-1]))
            else:  # TODO handle multiple outputs
                self.output_shape = temp_shape
                # create page-locked memory buffers (i.e. won't be swapped to disk)
                self.host_output = cuda.pagelocked_empty(trt.volume(self.output_shape) * self.engine.max_batch_size, dtype=np.float32)
                self.device_output = cuda.mem_alloc(self.host_output.nbytes)

    def _run_inference(self, input_frames):
        preds, preds_label = video_frame_prediction_trt(input_frames, self.device_input, self.device_output,
                                                        self.stream, self.context, self.host_output, self.batch_size)
        return preds, preds_label
3- My inference function:
import numpy as np
import torch
import torch.nn.functional as F
import pycuda.driver as cuda


def video_frame_prediction_trt(input_frames, device_input, device_output, stream, context, host_output, batch_size):
    host_input = []
    for i in range(len(input_frames)):
        # copy each CPU tensor into a contiguous float32 array and upload it to its device buffer
        host_input.append(np.array(input_frames[i].numpy(), dtype=np.float32, order='C'))
        cuda.memcpy_htod_async(device_input[i], host_input[-1], stream)
    # run inference
    bindings_device_list = [int(dev) for dev in device_input]
    bindings_device_list.append(int(device_output))
    context.execute_async(bindings=bindings_device_list, stream_handle=stream.handle)
    cuda.memcpy_dtoh_async(host_output, device_output, stream)
    stream.synchronize()
    preds = torch.Tensor(host_output).reshape(batch_size, -1, 2)
    preds = F.softmax(preds, dim=-1)
    _, preds_label = preds.max(dim=-1)
    preds_label = preds_label.reshape(1, -1)[0]
    return preds, preds_label
I get the CUDA runtime error only when I move audio_data to the cuda:0 device; if I don't, everything works as expected. The problem is that I need the audio data on cuda:0, so any suggestions on how to solve this are really appreciated, thank you!
The error that I get is:
[TensorRT] ERROR: 1: [reformat.cu::NCHWToNCHHW2::1038] Error Code 1: Cuda Runtime (invalid resource handle)
I am assuming that moving my audio data to cuda:0 somehow messes up the memory internally, but I am not sure how to properly check that. I also tried different models and changed the workspace size while converting my ONNX model to a TRT engine, thinking I might be running out of memory, but according to my tests that wasn't the case.
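To make the question concrete, this is the kind of check I would like to do to see whether PyTorch and pycuda end up with different CUDA contexts (a minimal sketch using pycuda.driver.Context.get_current() and torch.cuda.is_initialized(); I am not sure this is even the right way to inspect it):

import pycuda.driver as cuda
import pycuda.autoinit  # the context the TRT buffers and stream are created in
import torch

# State right after pycuda.autoinit, before PyTorch touches the GPU
print("driver-API current context:", cuda.Context.get_current())
print("torch CUDA initialized?", torch.cuda.is_initialized())

# Moving a tensor with .to("cuda:0") makes PyTorch initialize its own CUDA state
x = torch.rand(4).to("cuda:0")
print("torch CUDA initialized?", torch.cuda.is_initialized())
print("driver-API current context:", cuda.Context.get_current())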