Please Help : TensorRT with Thread ERROR : 1: [pointWiseV2Helpers.h::launchPwgenKernel::532] Error Code 1: Cuda Driver (invalid resource handle)

Hello,

I get an error in my notebook when executing a TensorRT model from a thread. It works fine without threading. Can someone help me solve this?

The TensorRT engine was built with TensorRT 8205 compiled from source (GitHub) on my Jetson Nano.

I also use that compiled TensorRT 8205 Python API on my Jetson Nano.

The engine was created from an ONNX model exported from TensorFlow's SSD-MobileNetV2 320x320 detector (ssd_mobilenet_v2_fpnlite_320x320_coco17_tpu-8) using the latest Python script on GitHub.

Again, it works very well when I do not use threading: I can call the engine through the TensorRT API and the model detects objects with their classes… I get no error.

The model includes preprocessing, and the shapes seem to be dynamic because "dynamic" appears in the log… could that be the problem? Do I need to convert the ONNX to static shapes before converting it with TensorRT?
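As a side note, the dynamic-shape question can be answered directly from the deserialized engine: TensorRT reports a dynamic axis as -1 in the shape returned by `engine.get_binding_shape(i)`. A minimal sketch of the check (the helper below is hypothetical, demonstrated with shapes taken from the verbose log rather than a live engine):

```python
def has_dynamic_axes(shape):
    """Return True if any axis of a TensorRT binding shape is dynamic.

    TensorRT marks a dynamic axis as -1 in the shape returned by
    engine.get_binding_shape(i), so a single scan is enough.
    """
    return any(int(d) == -1 for d in shape)

# Shapes as they appear in the verbose log (fully static here):
print(has_dynamic_axes((1, 320, 320, 3)))   # False: static input
print(has_dynamic_axes((-1, 320, 320, 3)))  # True: dynamic batch axis
```

If every binding comes back static, the "dynamic" mentions in the build log are probably not the cause, and re-exporting the ONNX with fixed shapes should not be necessary.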

Thank you very much!

I have uploaded the engine, the notebook, and the ONNX file too.

My code is :

model.engine (9.1 MB)

Essai.ipynb (11.8 KB)
model (1).onnx (10.4 MB)

import tensorrt as trt

# Logger class definition
class MyLogger(trt.ILogger):
    def __init__(self):
        trt.ILogger.__init__(self)

    def log(self, severity, msg):
        print("%s : %s" % (severity, msg))
import threading

class myThread(threading.Thread):
    def __init__(self, func):
        threading.Thread.__init__(self)
        self.func = func

    def run(self):
        print("Starting ")
        self.func()
        print("Exiting ")
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
import threading
import time


class TRTInference:
    def __init__(self,repertoire_engine):
        # Initialize the TensorRT runtime
        self.logger = MyLogger()
        trt.init_libnvinfer_plugins(self.logger, namespace="")
        self.runtime = trt.Runtime(self.logger)

        # Load the engine
        print("Loading engine...")
        with open(repertoire_engine, "rb") as f:
            self.engine = self.runtime.deserialize_cuda_engine(f.read())
        
        # Create the CUDA context and the TensorRT execution context
        self.cfx = cuda.Device(0).make_context()
        self.stream = cuda.Stream()
        self.context = self.engine.create_execution_context()
        
        # Allocate memory for the input
        print("Allocating memory...")
        size_input = trt.volume(self.engine.get_binding_shape(0)) * self.engine.max_batch_size
        self.input_host_mem = cuda.pagelocked_empty(size_input, trt.nptype(trt.float32))
        self.input_device_mem = cuda.mem_alloc(self.input_host_mem.nbytes)

        # Allocate memory for the outputs
        self.output_device_mem = []
        format_sorties = []
        types_sorties = []

        for i in range(self.engine.num_bindings):
            if not self.engine.binding_is_input(i):
                size_output = trt.volume(self.engine.get_binding_shape(i))*self.engine.max_batch_size
                output_hm = cuda.pagelocked_empty(size_output, trt.nptype(trt.float32))
                self.output_device_mem.append(cuda.mem_alloc(output_hm.nbytes))
                format_sorties.append(self.engine.get_binding_shape(i))
                types_sorties.append(trt.nptype(self.engine.get_binding_dtype(i)))

        # Collect the GPU addresses of the input/output buffers
        binding_entree = int(self.input_device_mem)
        binding_sorties = []

        for output_ in self.output_device_mem:
            binding_sorties.append(int(output_))
        self.bindings = [binding_entree] + binding_sorties

        # Allocate host memory for the outputs
        self.output_host_mem = []
        for i in range(len(self.output_device_mem)):
            self.output_host_mem.append(np.zeros(format_sorties[i],types_sorties[i]))
        
        # Input tensor
        self.image = np.zeros((320,320,3), dtype=trt.nptype(self.engine.get_binding_dtype(0)))

        
    # Inference
    def CalculModele(self):
        self.cfx.push()

        # Copy the image into the input tensor
        x = self.image.astype(np.float32)
        x = np.expand_dims(x, axis=0)                   # (1,320,320,3)
        np.copyto(self.input_host_mem, x.ravel())

        # Transfer the input to the GPU
        cuda.memcpy_htod(self.input_device_mem, self.input_host_mem)

        # Run the model
        self.context.execute(batch_size=1, bindings=self.bindings)

        # Copy the outputs back to the host
        for i in range(len(self.output_host_mem)):
            cuda.memcpy_dtoh(self.output_host_mem[i], self.output_device_mem[i])
        self.cfx.pop()
        
    def destory(self):
        self.cfx.pop()

trt_inference_wrapper = TRTInference(repertoire_engine="model.engine")

Severity.VERBOSE : Registered plugin creator - ::BatchTilePlugin_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::BatchedNMS_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::BatchedNMSDynamic_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::CoordConvAC version 1
Severity.VERBOSE : Registered plugin creator - ::CropAndResize version 1
Severity.VERBOSE : Registered plugin creator - ::CropAndResizeDynamic version 1
Severity.VERBOSE : Registered plugin creator - ::DecodeBbox3DPlugin version 1
Severity.VERBOSE : Registered plugin creator - ::DetectionLayer_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::EfficientNMS_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::EfficientNMS_ONNX_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::EfficientNMS_Explicit_TF_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::EfficientNMS_Implicit_TF_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::FlattenConcat_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::GenerateDetection_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::GridAnchor_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::GridAnchorRect_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::InstanceNormalization_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::LReLU_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::MultilevelCropAndResize_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::MultilevelProposeROI_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::DMHA version 1
Severity.VERBOSE : Registered plugin creator - ::NMS_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::NMSDynamic_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::Normalize_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::PillarScatterPlugin version 1
Severity.VERBOSE : Registered plugin creator - ::PriorBox_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::ProposalLayer_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::Proposal version 1
Severity.VERBOSE : Registered plugin creator - ::ProposalDynamic version 1
Severity.VERBOSE : Registered plugin creator - ::PyramidROIAlign_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::Region_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::Reorg_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::ResizeNearest_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::RPROI_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::ScatterND version 1
Severity.VERBOSE : Registered plugin creator - ::SpecialSlice_TRT version 1
Severity.VERBOSE : Registered plugin creator - ::Split version 1
Severity.VERBOSE : Registered plugin creator - ::VoxelGeneratorPlugin version 1
Severity.INFO : [MemUsageChange] Init CUDA: CPU +197, GPU +0, now: CPU 236, GPU 1416 (MiB)
Loading engine…
Severity.INFO : Loaded engine size: 9 MB
Severity.INFO : [MemUsageSnapshot] deserializeCudaEngine begin: CPU 245 MiB, GPU 1434 MiB
Severity.VERBOSE : Using cublas a tactic source
Severity.INFO : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +158, GPU +246, now: CPU 404, GPU 1681 (MiB)
Severity.VERBOSE : Using cuDNN as a tactic source
Severity.INFO : [MemUsageChange] Init cuDNN: CPU +241, GPU +354, now: CPU 645, GPU 2035 (MiB)
Severity.INFO : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +0, GPU +0, now: CPU 645, GPU 2035 (MiB)
Severity.VERBOSE : Deserialization required 6288056 microseconds.
Severity.INFO : [MemUsageSnapshot] deserializeCudaEngine end: CPU 645 MiB, GPU 2035 MiB
Severity.INFO : [MemUsageSnapshot] ExecutionContext creation begin: CPU 838 MiB, GPU 2228 MiB
Severity.VERBOSE : Using cublas a tactic source
Severity.INFO : [MemUsageChange] Init cuBLAS/cuBLASLt: CPU +158, GPU +157, now: CPU 996, GPU 2385 (MiB)
Severity.VERBOSE : Using cuDNN as a tactic source
Severity.INFO : [MemUsageChange] Init cuDNN: CPU +240, GPU +240, now: CPU 1236, GPU 2625 (MiB)
Severity.VERBOSE : Total per-runner device memory is 7467008
Severity.VERBOSE : Total per-runner host memory is 203456
Severity.VERBOSE : Allocated activation device memory of size 29763584
Severity.INFO : [MemUsageSnapshot] ExecutionContext creation end: CPU 1238 MiB, GPU 2661 MiB
Allocating memory…

thread1 = myThread(trt_inference_wrapper.CalculModele)

# Start the new thread
thread1.start()
thread1.join()
trt_inference_wrapper.destory()
print("Exiting Main Thread")

Starting
Severity.ERROR : 1: [pointWiseV2Helpers.h::launchPwgenKernel::532] Error Code 1: Cuda Driver (invalid resource handle)
Exiting
Exiting Main Thread

Here is a verbose log with some information from CUDA. I hope it helps…

Starting
Severity.VERBOSE : About to execute: Name: preprocessor/transpose, LayerType: Shuffle, LayerName: Shuffle, Inputs: [ { Name: input_tensor, Dimensions: [1,320,320,3], Format/Datatype: Row major linear FP32 }], Outputs: [ { Name: preprocessor/transpose:0_0, Dimensions: [1,3,320,320], Format/Datatype: Row major linear FP32 }], InputRegions: [{ Name: input_tensor, Type: USER, IsBroadcastAcrossN: false, Dimensions: [1,320,320,3], Strides: [307200,960,3,1], Region Format/DataType: Row major linear FP32}], OutputRegions: [{ Name: preprocessor/transpose:0_0, Type: LINEAR, IsBroadcastAcrossN: false, Dimensions: [1,3,320,320], Strides: [307200,102400,320,1], Region Format/DataType: Row major linear FP32}], ParameterType: Shuffle, FirstTranspose: [0,3,1,2], Reshape: “nbDims=-1”, SecondTranspose: [0,1,2,3], ZeroIsPlaceholder: 1, TacticValue: 0x0

Severity.VERBOSE : Debug synchronize completed successfully after execute: Name: preprocessor/transpose, LayerType: Shuffle, LayerName: Shuffle, Inputs: [ { Name: input_tensor, Dimensions: [1,320,320,3], Format/Datatype: Row major linear FP32 }], Outputs: [ { Name: preprocessor/transpose:0_0, Dimensions: [1,3,320,320], Format/Datatype: Row major linear FP32 }], InputRegions: [{ Name: input_tensor, Type: USER, IsBroadcastAcrossN: false, Dimensions: [1,320,320,3], Strides: [307200,960,3,1], Region Format/DataType: Row major linear FP32}], OutputRegions: [{ Name: preprocessor/transpose:0_0, Type: LINEAR, IsBroadcastAcrossN: false, Dimensions: [1,3,320,320], Strides: [307200,102400,320,1], Region Format/DataType: Row major linear FP32}], ParameterType: Shuffle, FirstTranspose: [0,3,1,2], Reshape: “nbDims=-1”, SecondTranspose: [0,1,2,3], ZeroIsPlaceholder: 1, TacticValue: 0x0

Severity.VERBOSE : About to execute: Name: PWN(preprocessor/mean_value:0, PWN(preprocessor/scale_value:0 + preprocessor/scale, preprocessor/mean)), LayerType: PointWiseV2, LayerName: PointWiseV2, Inputs: [ { Name: preprocessor/transpose:0_0, Dimensions: [1,3,320,320], Format/Datatype: Row major linear FP32 }], Outputs: [ { Name: preprocessor/mean:0_2, Dimensions: [1,3,320,320], Format/Datatype: Row major linear FP32 }], InputRegions: [{ Name: preprocessor/transpose:0_0, Type: LINEAR, IsBroadcastAcrossN: false, Dimensions: [1,3,320,320], Strides: [307200,102400,320,1], Region Format/DataType: Row major linear FP32}], OutputRegions: [{ Name: preprocessor/mean:0_2, Type: LINEAR, IsBroadcastAcrossN: false, Dimensions: [1,3,320,320], Strides: [307200,102400,320,1], Region Format/DataType: Row major linear FP32}], ParameterType: PointWise, ParameterSubType: PointWiseExpression, NbInputArgs: 1, InputArgs: [“arg0”], NbOutputVars: 1, OutputVars: [“var1”], NbParams: 0, Params: , NbLiterals: 3, Literals: [“1.000000e+00f”, “0.000000e+00f”, “7.843138e-03f”], NbOperations: 2, Operations: [“const auto var0 = pwgen::iMul(arg0, literal2);”, “const auto var1 = pwgen::iMinus(var0, literal0);”], TacticValue: 0x8

Severity.ERROR : 1: [pointWiseV2Helpers.h::launchPwgenKernel::532] Error Code 1: Cuda Driver (invalid resource handle)

Exiting
Exiting Main Thread

I found a solution: I use retain_primary_context() instead of make_context() wherever a context is needed, and now it works.
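For anyone landing here with the same error, a minimal sketch of the change (assuming the TRTInference class above; this needs a CUDA device and PyCUDA, so it is shown as a fragment rather than a runnable script):

```python
import pycuda.driver as cuda

class TRTInference:
    def __init__(self, repertoire_engine):
        # ... logger / runtime / engine deserialization as above ...

        # Before (fails when CalculModele runs in a worker thread):
        #     self.cfx = cuda.Device(0).make_context()
        # After: retain the device's primary context instead of creating
        # a fresh one, so the worker thread pushes the same context the
        # engine and its kernels were prepared in.
        cuda.init()
        self.cfx = cuda.Device(0).retain_primary_context()
```

Note that, unlike make_context(), retain_primary_context() does not make the context current on creation, so the explicit self.cfx.push() / self.cfx.pop() pair around the inference in CalculModele is still required.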

Glad to know the issue is resolved, thanks for the update!

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.