How to implement TensorRT as an inference server?

Hi, all

I am new to TensorRT and I am trying to implement an inference server with it. Everything works fine when I run the engine in a single thread, but I run into problems when I try to run it in multiple threads. For example, the following is sample code from tensorrt/samples/python/end_to_end_tensorflow_mnist/; I only modified it to run in a multi-threaded manner:

def infer(model_file, data_path):

    with build_engine(model_file) as engine:
        # Build an engine, allocate buffers and create a stream.
        # For more information on buffer allocation, refer to the introductory samples.
        inputs, outputs, bindings, stream = common.allocate_buffers(engine)
        with engine.create_execution_context() as context:
            case_num = load_normalized_test_case(data_path, pagelocked_buffer=inputs[0].host)
            # For more information on performing inference, refer to the introductory samples.
            # The common.do_inference function will return a list of outputs - we only have one in this case.
            [output] = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
            pred = np.argmax(output)
            print("Test Case: " + str(case_num))
            print("Prediction: " + str(pred))

def main():
    data_path = common.find_sample_data(description="Runs an MNIST network using a UFF model file", subfolder="mnist")
    model_file = ModelData.MODEL_FILE

    # This works fine
    infer(model_file, data_path)

    # Error: running the same inference in a second thread fails
    t = threading.Thread(target=infer, args=(model_file, data_path))
    t.start()
    t.join()

When I try to run it in a different thread, it gives me the following error:

pycuda._driver.LogicError: explicit_context_dependent failed: invalid device context - no currently active context?
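A CUDA context is bound to the thread that created it: `import pycuda.autoinit` creates a context only on the main (importing) thread, so a worker thread spawned later has no currently active context, which is exactly what the error says. The same per-thread-state behaviour can be reproduced in pure Python with `threading.local` (a rough analogy only; `get_context` and the context strings here are illustrative stand-ins, not a PyCUDA API):

```python
import threading

# Stand-in for per-thread state such as a CUDA context stack.
_tls = threading.local()

def get_context():
    # Each thread sees only the attributes it set itself on the
    # threading.local object, just as each CPU thread has its own
    # CUDA context stack.
    if not hasattr(_tls, "ctx"):
        _tls.ctx = "context-for-" + threading.current_thread().name
    return _tls.ctx

results = []
t = threading.Thread(target=lambda: results.append(get_context()), name="worker-0")
t.start()
t.join()

print(get_context())  # context created in the main thread
print(results[0])     # the worker had to create its own: context-for-worker-0
```

The fix is therefore to give each worker thread a context of its own (or push/pop a shared one around every CUDA call), which is what Try 1 below attempts.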

Try 1:

It seems the CUDA context in the new thread is not initialized. So I commented out `import pycuda.autoinit` and tried to initialize the CUDA context manually by adding the following code at the beginning and end of the `infer()` function:

import pycuda.driver as cuda

cuda.init()
device = cuda.Device(0)
ctx = device.make_context()  # create a context and push it onto this thread's stack
# infer body
ctx.pop()  # pop the context at the end so it is not leaked

This works fine for the MNIST example. However, when I try another engine with a CNN, I get the following error:

[TensorRT] ERROR: cuda/cudaConvolutionLayer.cpp (163) - Cudnn Error in execute: 7
[TensorRT] ERROR: cuda/cudaConvolutionLayer.cpp (163) - Cudnn Error in execute: 7

And again, this engine works fine as long as I only run it in a single thread.
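cuDNN error 7 corresponds to CUDNN_STATUS_MAPPING_ERROR, which typically shows up when inference runs under a different CUDA context than the one the engine and its buffers were created with. Since the engine is known to work on a single thread, a robust pattern is to confine the engine (and its context) to one dedicated worker thread and serialize requests to it through a queue. A minimal sketch, assuming a `run_engine` callable that wraps your actual `common.do_inference` call (the class and names here are illustrative, not a TensorRT API):

```python
import queue
import threading

class InferenceServer:
    """Serialize all inference through one worker thread that owns the engine."""

    def __init__(self, run_engine):
        # run_engine: callable taking one input and returning one output.
        # In a real server this worker thread would also build the engine,
        # allocate buffers, and create the CUDA context, so that everything
        # CUDA-related lives on a single thread.
        self._run_engine = run_engine
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._loop, daemon=True)
        self._worker.start()

    def _loop(self):
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            data, done, box = item
            box.append(self._run_engine(data))
            done.set()

    def infer(self, data):
        # Safe to call from any number of client threads.
        done = threading.Event()
        box = []
        self._requests.put((data, done, box))
        done.wait()
        return box[0]

    def close(self):
        self._requests.put(None)
        self._worker.join()

# Usage with a dummy "engine" that just doubles its input.
server = InferenceServer(lambda x: x * 2)
print(server.infer(21))  # -> 42
server.close()
```

This avoids per-thread context push/pop entirely at the cost of serializing inference, which is often acceptable since a single engine executes requests sequentially on one CUDA stream anyway.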

Now I have no idea how to solve these problems. Does anyone have suggestions on how to implement TensorRT as a multi-threaded inference server? Any advice would be appreciated.


Hello qliang, I have the same problem but no solution. Marking this topic so we can help each other.

Did you solve this problem? I also hit it in a child process. That confused me!! Please help.