How to run TensorRT in multiple threads?

I ran a TensorRT engine in a single thread and inference took 6 ms. Running the same engine in two threads took 10 ms per inference, so the threads appear to interfere with each other. When I run it in two separate processes at the same time instead, each still takes 6 ms. Why does this happen, and what can I do if I need to run several TensorRT inferences at the same time?
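To compare the single-thread and multi-thread cases on equal terms, a small timing harness along these lines may help. It is only a sketch: `infer` is a placeholder for whatever function performs one synchronous inference in your program, and `measure_latency` and the iteration count are made up for illustration.

```python
import threading
import time


def measure_latency(infer, num_threads=1, iterations=100):
    """Return the mean seconds per call of `infer` across `num_threads` concurrent threads.

    `infer` is a hypothetical zero-argument callable that runs one inference
    and returns only after the result is ready.
    """
    per_thread_mean = [0.0] * num_threads
    barrier = threading.Barrier(num_threads)

    def worker(index):
        barrier.wait()  # release all threads together so their calls overlap
        start = time.perf_counter()
        for _ in range(iterations):
            infer()
        per_thread_mean[index] = (time.perf_counter() - start) / iterations

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(per_thread_mean) / num_threads
```

Comparing `measure_latency(infer, 1)` with `measure_latency(infer, 2)` gives the per-call cost in each case. If `infer` only enqueues work on a CUDA stream without synchronizing, the numbers reflect launch time rather than inference time.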


TensorRT supports multiple threads so long as each is used with a separate execution context.
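As a rough illustration of that rule, the sketch below creates one execution context per work item from a shared engine and hands each context to exactly one thread. `engine` is assumed to be an already deserialized `tensorrt.ICudaEngine` (only its `create_execution_context()` method is used), while `run_with_private_contexts` and the `infer(context, item)` callback are hypothetical names for this example.

```python
import threading


def run_with_private_contexts(engine, infer, inputs):
    """Run `infer(context, item)` for every item, one thread and one context per item.

    `engine` is expected to be a tensorrt.ICudaEngine; `infer` is a
    hypothetical user function that performs one inference.
    """
    # Contexts are created up front on the calling thread; the engine is shared.
    contexts = [engine.create_execution_context() for _ in inputs]
    results = [None] * len(inputs)

    def worker(index):
        # Each thread touches only its own context and its own result slot.
        results[index] = infer(contexts[index], inputs[index])

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(len(inputs))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Creating the contexts before the threads start keeps engine access on one thread; the workers then never share a context.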


I know that. What I mean is that inference takes longer with multiple threads than with a single thread on one GPU. How can I solve this timing problem?


Sorry for the unclear explanation.

Have you launched the TensorRT models with separate execution contexts?
This is essential for running inference in parallel; otherwise, sharing GPU resources will add latency.


In my program, each thread is bound to its own TensorRT context.
In addition, I use the TensorRT API and a plugin (containing some kernel functions) to create my network.

I think the cause may be the sharing of GPU resources (it’s just my guess), but is there any solution?


It’s recommended to profile GPU utilization with nvprof first.

Another possible reason is limited memory bandwidth.
Have you used memcopy in your plugin implementation?


Have you used memcopy in your plugin implementation?
What’s wrong with that, and do you have any suggestions?


Due to a hardware issue, we don’t support asynchronous memory copy on Jetson.
This may reduce the parallelism of TensorRT if memory copy is used.


I run my program on an x86 machine. Does the problem still exist there?


Asynchronous memory copy works well on x86 machines.

Usually, concurrency requires executing on multiple CUDA streams.
Have you launched each TensorRT context with an independent CUDA stream?
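A minimal sketch of what "one context, one stream per thread" could look like in Python follows. It assumes `context` behaves like `tensorrt.IExecutionContext` (using the implicit-batch `execute_async` call seen elsewhere in this thread) and `stream` like `pycuda.driver.Stream`; the `StreamWorker` class and its attributes are invented for this example.

```python
import threading


class StreamWorker(threading.Thread):
    """Thread that owns one execution context and one CUDA stream.

    `context` is expected to behave like tensorrt.IExecutionContext and
    `stream` like pycuda.driver.Stream. `bindings` holds the device
    pointers of this worker's own input and output buffers.
    """

    def __init__(self, context, stream, bindings, batch_size=1):
        super().__init__()
        self.context = context
        self.stream = stream
        self.bindings = bindings
        self.batch_size = batch_size
        self.ok = None

    def run(self):
        # Enqueue on this worker's own stream so its work is not queued
        # behind another worker's work on a shared stream.
        self.ok = self.context.execute_async(
            batch_size=self.batch_size,
            bindings=self.bindings,
            stream_handle=self.stream.handle,
        )
        # Block this thread only; other workers' streams keep running.
        self.stream.synchronize()
```

Each worker also needs its own device buffers, since two streams writing to the same bindings would race. With PyCUDA, the worker thread additionally needs the CUDA context to be current before it enqueues anything.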

Here is a tutorial for your reference:


@AastaLLL Hi, how about using MPS (CUDA Multi-Process Service)? Can that achieve concurrency?


You can find some suggestions for TensorRT with multithread here:


Two questions:

  1. If I want to be more efficient, should I use batching or multithreading?
  2. The TensorRT runtime can create multiple contexts, and one engine can also create multiple contexts. What’s the difference between contexts created in these two ways?

Dear @AastaLLL,

I need your help!
I have read this document, but I still have no idea how exactly to do this in Python.

Currently I have a sample that runs successfully on TensorRT.
Now I just want to run TensorRT inference from multiple threads with some really simple code.
(I have already generated the TensorRT engine, so I will load the engine and run inference in multiple threads.)

Here is my code. (The TensorRT code itself is omitted.)

import threading
import time
from my_tensorrt_code import TRTInference, trt

exitFlag = 0

class myThread(threading.Thread):
    def __init__(self, func, args):
        # Initialize the base Thread class before setting our own attributes
        threading.Thread.__init__(self)
        self.func = func
        self.args = args

    def run(self):
        print("Starting " + self.args[0])
        # Call the target function with the stored arguments
        self.func(*self.args)
        print("Exiting " + self.args[0])

if __name__ == '__main__':
    # Create new threads
    # format thread:
    #     - func: function names, function that we wished to use
    #     - arguments: arguments that will be used for the func's arguments

    trt_engine_path = './tensorrt_engine.trt'

    max_batch_size = 1
    # The remaining constructor arguments are cut off in the original post
    trt_inference_wrapper = TRTInference(trt_engine_path, ...)

    # Get TensorRT SSD model output
    input_img_path = './testimage.png'

    thread1 = myThread(trt_inference_wrapper.infer, [input_img_path])

    # Start new Threads
    thread1.start()
    thread1.join()
    print("Exiting Main Thread")

However, when I run this code, I always get the error messages below.

[TensorRT] ERROR: ../rtSafe/cuda/caskConvolutionRunner.cpp (290) - Cask Error in checkCaskExecError<false>: 7 (Cask Convolution execution)
[TensorRT] ERROR: FAILED_EXECUTION: std::exception

I found that the error occurs inside the do_inference function.

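def do_inference(context, bindings, inputs, outputs, stream, batch_size=1):
    # Copy inputs to the GPU (.host/.device as in the TensorRT samples' HostDeviceMem)
    [cuda.memcpy_htod_async(inp.device, inp.host, stream) for inp in inputs]
    context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
    # Copy predictions back to the host
    [cuda.memcpy_dtoh_async(out.host, out.device, stream) for out in outputs]
    # Wait for the queued copies and inference to finish before reading results
    stream.synchronize()
    return [out.host for out in outputs]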

Could you share some suggestions on how to fix this error?
This error happens not only on desktop but also on Jetson devices…

Thank you so much!

Best regards,
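One possible cause, assuming the wrapper relies on PyCUDA (for example through `pycuda.autoinit`), is that the CUDA context created on the main thread is not current on the worker thread when inference runs. The sketch below shows one way to make a context current around the call. `cuda_ctx` is assumed to be a `pycuda.driver.Context` such as the one returned by `pycuda.driver.Device(0).make_context()`; `cuda_context_current` and `InferenceThread` are hypothetical helpers, and this is not confirmed to be the fix for the error above.

```python
import contextlib
import threading


@contextlib.contextmanager
def cuda_context_current(cuda_ctx):
    """Make `cuda_ctx` current on the calling thread for the duration of the block.

    `cuda_ctx` is expected to be a pycuda.driver.Context.
    """
    cuda_ctx.push()
    try:
        yield cuda_ctx
    finally:
        cuda_ctx.pop()  # always release, even if inference raises


class InferenceThread(threading.Thread):
    """Thread that runs `func(*args)` with the given CUDA context current."""

    def __init__(self, cuda_ctx, func, args):
        super().__init__()
        self.cuda_ctx = cuda_ctx
        self.func = func
        self.args = args
        self.result = None

    def run(self):
        with cuda_context_current(self.cuda_ctx):
            self.result = self.func(*self.args)
```

The engine, execution context, buffers, and stream would need to be created while the same CUDA context is current, so that the worker thread and the TensorRT objects agree on which context they use.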

Is there any example available for multithreaded use?