Cuda Error in launchPwgenKernel when running a specific engine asynchronously

Description

When I run a specific engine file (YOLO v3) in Python asynchronously using streams and threads, I get the following error when starting a thread:
ERROR: …/rtExt/cuda/pointwiseV2Helpers.h (538) - Cuda Error in launchPwgenKernel: 400 (invalid resource handle)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception

It is a C++-level error that does not crash the run, but inference is obviously not applied in this case.
Running the same code asynchronously with a different model produces no errors at all, so the issue seems to be related to a specific node that fails when run asynchronously.

Environment

TensorRT Version: 7.1.2
GPU Type:
Nvidia Driver Version:
CUDA Version: 11.0
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 3.6
TensorFlow Version (if applicable): The model was trained with TF 1.15, converted to ONNX, and then converted to a TensorRT engine.

Notes:

  • Inference runs correctly on the ONNX model
  • I can run inference on this engine file when the code runs in non-async mode
  • I can run the same code asynchronously on some other engine file (a standard ImageNet classification model, for example)
  • The engine file is of a YOLO v3 model
  • When converting the model from ONNX to a TRT engine, I can see “PointWiseV2” mentioned several times, e.g. “>>>>>>>>>>>>>>> Chose Runner Type: PointWiseV2 Tactic: 9”, which again hints at the specific node that might be causing it.

I can’t share the model, but this is the general logic for the multithreading:

import numpy as np
import pycuda.driver as cuda
import tensorrt as trt
from threading import Thread

# args, filenames and the preprocessing module are defined elsewhere in the full script
cuda.init()


class myThread(Thread):

    def __init__(self, func, args):
        Thread.__init__(self)
        self.func = func
        self.args = args

    def run(self):
        print("Starting " + self.args[0])
        self.func(*self.args)
        print("Exiting " + self.args[0])


class TRTInference:

    def __init__(self, engine, trt_engine_datatype, batch_size, num_classes, N_run):
        self.cfx = cuda.Device(0).make_context()
        stream = cuda.Stream()

        trt.init_libnvinfer_plugins(TRT_LOGGER, '')

        context = engine.create_execution_context()

        # prepare buffers
        host_inputs  = []
        cuda_inputs  = []
        host_outputs = []
        cuda_outputs = []
        bindings = []

        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
            host_mem = cuda.pagelocked_empty(size, np.float32)
            cuda_mem = cuda.mem_alloc(host_mem.nbytes)

            bindings.append(int(cuda_mem))
            if engine.binding_is_input(binding):
                host_inputs.append(host_mem)
                cuda_inputs.append(cuda_mem)
            else:
                host_outputs.append(host_mem)
                cuda_outputs.append(cuda_mem)

        # store
        self.stream  = stream
        self.context = context
        self.engine  = engine
        self.batch_size = batch_size
        self.num_classes = num_classes

        self.host_inputs = host_inputs
        self.cuda_inputs = cuda_inputs
        self.host_outputs = host_outputs
        self.cuda_outputs = cuda_outputs
        self.bindings = bindings

        self.preprocess_func = preprocessing.preprocess_detector_yolo
        self.N_run = N_run


# main script
trt_engine_path = args.engine
max_batch_size = args.batch_size

# deserialize engine
TRT_LOGGER = trt.Logger(trt.Logger.INFO)
runtime = trt.Runtime(TRT_LOGGER)
with open(trt_engine_path, 'rb') as f:
    buf = f.read()
    engine = runtime.deserialize_cuda_engine(buf)

trt_inference_wrapper = TRTInference(engine,
                                     trt_engine_datatype=trt.DataType.FLOAT,
                                     batch_size=max_batch_size, num_classes=args.num_classes,
                                     N_run=args.n_run)

# assign a thread for each image
threads_list = []
# but first apply warmup inference without a thread
n_warmup = args.n_warmup
for path_id, input_img_path in enumerate(filenames):

    cur_thread = myThread(trt_inference_wrapper.infer_async, [input_img_path])
    threads_list.append(cur_thread)
    cur_thread.start()
The error occurs when I call cur_thread.start().
I would appreciate any help understanding why this might occur in the scenario described above :)

Hi @weissrael,
Can you please share verbose logs so we can assist you better?
Thanks!

I attach here the verbose logs for the inference and for the converter (ONNX --> engine file). The errors at the end of the inference log are repeated several times, once each time a thread starts, so I included only the first occurrence.
verbose_logs_converter_detector.txt (416.7 KB) verbose_logs_inference_detector.txt (1.8 KB)

Hi @weissrael,
The issue might be due to an environment mismatch.
Can you please try the inference in the environment you used to build the engine?
If it works in the same environment, you probably need to update the driver in the failing environment.
Thanks!

Hi,
I run inference in the same environment in which I built the engine. With this engine, inference works without errors only in synchronous mode; asynchronously (multiple streams and threads) it fails… Is it still related to drivers in this case?

Hi @weissrael
Can you try your ONNX model with trtexec to check if the issue persists?
https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html#trtexec
Thanks!

I tried running it with trtexec and the --threads flag, and it runs inference successfully with trtexec. Hmmm…

So it looks like the issue is with your script.

“invalid resource handle” is probably caused by the CUDA stream.
CUDA streams and CUDA pointers are bound to a CUDA context; a stream or memory allocation created in one CUDA context cannot be used with another CUDA context.
If it is, that exact error happens.
For the synchronous API, the default nullptr stream is used, and that nullptr stream works with any CUDA context.
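
In a PyCUDA-based script like the one above, a common pattern that follows from this is to make the context created in __init__ (self.cfx) current in the worker thread before touching the stream or device buffers, and to release it afterwards. A rough sketch only, with _do_inference standing in as a hypothetical helper for the usual copy / execute_async / copy-back / synchronize logic:

# rough sketch of the workaround (not a drop-in implementation): push the
# context that owns the stream/buffers before using them from a worker thread,
# and pop it when done; _do_inference is a hypothetical helper holding the
# actual inference logic
def infer_async(self, input_img_path):
    self.cfx.push()                  # bind self.cfx to the calling thread
    try:
        return self._do_inference(input_img_path)
    finally:
        self.cfx.pop()               # detach it so the thread exits cleanly

If each thread makes many inference calls, the push/pop pair can also be moved to the start and end of myThread.run instead of wrapping every call.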