TensorRT error: Cuda Runtime (invalid resource handle): torchaudio on GPU + TRT engine gives wrong results

Description

I have a simple audio classifier model. It first extracts a Mel spectrogram with torchaudio on the GPU, then runs model inference on the same GPU, but the result is wrong.
Strangely, if I extract the Mel spectrogram on the CPU and run inference on the GPU, the result is correct.

Here is my code:

import copy

import numpy as np
import torch
import torch.nn.functional as F
import torchaudio
import tensorrt as trt
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401  (creates and pushes PyCUDA's CUDA context)


def wav_to_frames(wave_data, win_len=int(16000 * 6.5)):
    # Pad or truncate the waveform so its length is an exact multiple
    # of win_len, then split it into (num_frames, 1, win_len) windows.
    num_frames = round(len(wave_data) / win_len)
    frame_len, wave_len = num_frames * win_len, len(wave_data)
    if frame_len > wave_len:
        x = F.pad(wave_data, (0, frame_len - wave_len))
    elif frame_len < wave_len:
        x = wave_data[:frame_len]
    else:
        x = wave_data
    return x.view(-1, 1, win_len)
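For context, a quick sanity check of what this function produces (illustrative, not part of the attached script): a 10 s clip at 16 kHz rounds to two 6.5 s windows, and the second window is zero-padded out to the full length.

# Hypothetical check: 10 s of 16 kHz audio -> two 6.5 s windows,
# the second zero-padded out to the full window length.
wave = torch.randn(16000 * 10)
frames = wav_to_frames(wave)
print(frames.shape)  # torch.Size([2, 1, 104000])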

def torchaudio_extract(waveform):
    # Build the Mel transform on the GPU and apply it to the (CUDA) waveform.
    torchaudio_melspec = torchaudio.transforms.MelSpectrogram(
        sample_rate=16000,
        n_fft=512,
        win_length=512,
        hop_length=160,
        center=True,
        pad_mode="reflect",
        power=2.0,
        norm='slaney',
        onesided=True,
        n_mels=64,
    ).to(torch.device('cuda'))(waveform)

    # (batch, n_mels, time) -> (batch, time, n_mels)
    return torchaudio_melspec.transpose(1, 2)
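A quick shape check for reference (illustrative): one 6.5 s window at 16 kHz with hop_length=160 and center=True yields 651 frames of 64 mel bins, which is where the reshape(1, 1, 651, 64) further down comes from.

# Hypothetical check: 104000 samples / hop 160 with center=True -> 651 frames.
x = torch.randn(1, int(16000 * 6.5), device='cuda')
print(torchaudio_extract(x).shape)  # torch.Size([1, 651, 64])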


class HostDeviceMem:
    # Pairs a pagelocked host buffer with its device allocation.
    def __init__(self, host_mem, device_mem):
        self.host = host_mem
        self.device = device_mem


class TRTGPUdev():
    def __init__(self, model_path, onnx_path=None):
        self.TRT_LOGGER = trt.Logger(trt.Logger.WARNING)
        self.engine = self.get_engine(model_path)
        self.context = self.engine.create_execution_context()
        self.inputs, self.outputs, self.bindings, self.stream = self.allocate_buffers()

    def allocate_buffers(self):
        # Allocate one pagelocked host buffer and one device buffer per binding.
        inputs, outputs, bindings = [], [], []
        stream = cuda.Stream()
        for binding in self.engine:
            size = trt.volume(self.engine.get_binding_shape(binding)) * self.engine.max_batch_size
            trt_dtype = trt.nptype(self.engine.get_binding_dtype(binding))
            host_mem = cuda.pagelocked_empty(size, trt_dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            bindings.append(int(device_mem))
            if self.engine.binding_is_input(binding):
                inputs.append(HostDeviceMem(host_mem, device_mem))
            else:
                outputs.append(HostDeviceMem(host_mem, device_mem))
        return inputs, outputs, bindings, stream

    def get_engine(self, trt_path):
        with open(trt_path, "rb") as f, trt.Runtime(self.TRT_LOGGER) as runtime:
            return runtime.deserialize_cuda_engine(f.read())

    def do_inference_v2(self):
        # Copy inputs host->device, run the engine, copy outputs device->host.
        for inp in self.inputs:
            cuda.memcpy_htod_async(inp.device, inp.host, self.stream)
        self.context.execute_async_v2(bindings=self.bindings, stream_handle=self.stream.handle)
        for out in self.outputs:
            cuda.memcpy_dtoh_async(out.host, out.device, self.stream)
        self.stream.synchronize()
        return [out.host for out in self.outputs]

    def trt_engine(self, audio_path):
        # 1. Load the audio and move it to the GPU.
        wave_data, sr = torchaudio.load(audio_path)
        wave_data = wave_data.to(torch.device('cuda'))
        wavs = wav_to_frames(wave_data[0], int(6.5 * 16000))

        # 2. Extract features with torchaudio on the GPU.
        feats = [torchaudio_extract(i).reshape(1, 1, 651, 64) for i in wavs]

        # 3. Run inference with the TRT engine.
        result, msg = [], []
        for index, data in enumerate(feats):
            feed_data = data.cpu().detach().numpy()
            # Copy into the pagelocked buffer rather than rebinding .host,
            # so the async memcpy still reads from pinned memory.
            np.copyto(self.inputs[0].host, feed_data.ravel())
            trt_outputs = self.do_inference_v2()

            if trt_outputs[0][1] > 0.8 and index < 6:
                msg.append(self.time_tagging[index])
            result.append(copy.deepcopy(trt_outputs[0]))
        return result

Result:


[01/06/2022-17:51:41] [TRT] [W] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.3 but loaded cuBLAS/cuBLAS LT 11.5.1
[01/06/2022-17:51:42] [TRT] [W] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0
[01/06/2022-17:51:42] [TRT] [W] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.3 but loaded cuBLAS/cuBLAS LT 11.5.1
[01/06/2022-17:51:42] [TRT] [W] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0
[01/06/2022-17:51:48] [TRT] [E] 1: [scaleRunner.cpp::execute::144] Error Code 1: Cuda Runtime (invalid resource handle)
[01/06/2022-17:51:48] [TRT] [E] 1: [scaleRunner.cpp::execute::144] Error Code 1: Cuda Runtime (invalid resource handle)
[01/06/2022-17:51:48] [TRT] [E] 1: [scaleRunner.cpp::execute::144] Error Code 1: Cuda Runtime (invalid resource handle)

trt:    [array([0., 0.], dtype=float32), array([0., 0.], dtype=float32), array([0., 0.], dtype=float32)] []

But when I extract the features on the CPU (just changing .to(torch.device('cuda')) to .to(torch.device('cpu'))), I get the correct result, as follows:

[01/06/2022-18:18:46] [TRT] [W] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.3 but loaded cuBLAS/cuBLAS LT 11.5.1
[01/06/2022-18:18:46] [TRT] [W] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0
[01/06/2022-18:18:46] [TRT] [W] TensorRT was linked against cuBLAS/cuBLAS LT 11.6.3 but loaded cuBLAS/cuBLAS LT 11.5.1
[01/06/2022-18:18:46] [TRT] [W] TensorRT was linked against cuDNN 8.2.1 but loaded cuDNN 8.2.0

trt:    [array([9.9998915e-01, 1.1265278e-05], dtype=float32), array([0.9989116 , 0.00110829], dtype=float32), array([9.9990976e-01, 8.9466572e-05], dtype=float32)] []

Environment

TensorRT Version: 8.2.1
GPU Type: Tesla P100-PCIE-16GB
Nvidia Driver Version: 450.80.02
CUDA Version: 11.0
CUDNN Version:
Operating System + Version: CentOS
Python Version (if applicable): 3.6
PyTorch Version (if applicable): 1.10, torchaudio: 0.10.1

So, what went wrong, and how should I fix it?

Hi,

This looks like a CUDA context issue. Could you please share the complete script with us and, if possible, resources to reproduce the issue, for better debugging?

Are you using PyTorch and PyCUDA simultaneously?

Thank you.

demo.trt (14.4 MB)
trt_inference.py (5.7 KB)

Hi,
I really appreciate your reply; the attachments are my code. I am using PyTorch and PyCUDA simultaneously, and I had no idea that using them together could be a problem.
This has bothered me for a long time. I'm looking forward to your reply.

Best.

Hi,

I think you need to avoid using PyTorch (on GPU) and PyCUDA together. Instead of making allocations with PyCUDA, you can use torch tensors directly with TRT; specifically, the data_ptr() method gives the device memory address:
https://pytorch.org/docs/stable/generated/torch.Tensor.data_ptr.html
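
As a minimal sketch of that approach (illustrative: the function name and the (1, 2) output shape are assumptions, not from the attached script), allocate the input and output as CUDA torch tensors, pass their data_ptr() addresses as the bindings, and reuse PyTorch's current CUDA stream so everything runs in a single CUDA context:

import torch

def infer_with_torch(context, feats_gpu):
    # Input must be contiguous so data_ptr() points at one dense buffer.
    feats_gpu = feats_gpu.contiguous()
    # Output allocated by torch on the same device; (1, 2) matches the
    # two-class output in this thread; in general, query the engine's binding shape.
    output = torch.empty((1, 2), dtype=torch.float32, device='cuda')
    # data_ptr() gives the raw device addresses TensorRT expects as bindings.
    bindings = [int(feats_gpu.data_ptr()), int(output.data_ptr())]
    # Reuse PyTorch's current CUDA stream instead of a separate PyCUDA stream.
    stream = torch.cuda.current_stream()
    context.execute_async_v2(bindings=bindings, stream_handle=stream.cuda_stream)
    stream.synchronize()
    return output

With this pattern, the features produced by torchaudio_extract can be fed to the engine directly, without the round trip through pagelocked host memory, and no PyCUDA allocations or streams are involved.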

Please refer to the following issue for more details:

Thank you.