How to implement TensorRT as an inference server?

Hi, all

I am new to TensorRT and I am trying to implement an inference server with it. Everything works fine when I run the engine in a single thread, but I run into problems when I try to run it in multiple threads. For example, the following is sample code from tensorrt/samples/python/end_to_end_tensorflow_mnist/; I only modified it to run in a multi-threaded manner:

def infer(model_file, data_path):

    with build_engine(model_file) as engine:
        # Build an engine, allocate buffers and create a stream.
        # For more information on buffer allocation, refer to the introductory samples.
        inputs, outputs, bindings, stream = common.allocate_buffers(engine)
        with engine.create_execution_context() as context:
            case_num = load_normalized_test_case(data_path, pagelocked_buffer=inputs[0].host)
            # For more information on performing inference, refer to the introductory samples.
            # The common.do_inference function will return a list of outputs - we only have one in this case.
            [output] = common.do_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream)
            pred = np.argmax(output)
            print("Test Case: " + str(case_num))
            print("Prediction: " + str(pred))

def main():
    data_path = common.find_sample_data(description="Runs an MNIST network using a UFF model file", subfolder="mnist")
    model_file = ModelData.MODEL_FILE

    # This works fine
    infer(model_file, data_path)

    # Error: running the same inference in a second thread fails
    t = threading.Thread(target=infer, args=(model_file, data_path))
    t.start()
    t.join()

When I try to run it in a different thread, it gives me the following error:

pycuda._driver.LogicError: explicit_context_dependent failed: invalid device context - no currently active context?
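A CUDA context is bound to the thread that created it: `import pycuda.autoinit` creates a context only on the main (importing) thread, so a worker thread spawned later has no currently active context, which is exactly what the error says. The same per-thread-state behaviour can be reproduced in pure Python with `threading.local` (a rough analogy only; `get_context` and the context strings here are illustrative stand-ins, not a PyCUDA API):

```python
import threading

# Stand-in for per-thread state such as a CUDA context stack.
_tls = threading.local()

def get_context():
    # Each thread sees only the attributes it set itself on the
    # threading.local object, just as each CPU thread has its own
    # CUDA context stack.
    if not hasattr(_tls, "ctx"):
        _tls.ctx = "context-for-" + threading.current_thread().name
    return _tls.ctx

results = []
t = threading.Thread(target=lambda: results.append(get_context()), name="worker-0")
t.start()
t.join()

print(get_context())  # context created in the main thread
print(results[0])     # the worker had to create its own: context-for-worker-0
```

The fix is therefore to give each worker thread a context of its own (or push/pop a shared one around every CUDA call), which is what Try 1 below attempts.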

Try 1:

It seems the CUDA context in the new thread is not initialized. So I commented out `import pycuda.autoinit` and tried to initialize the CUDA context manually by adding the following code at the beginning and end of the `infer()` function:

import pycuda.driver as cuda

cuda.init()
device = cuda.Device(0)
ctx = device.make_context()  # create a context and push it onto this thread's stack
# infer body
ctx.pop()  # pop the context at the end so it is not leaked

This works fine for the MNIST example. However, when I try another engine with a CNN, I get the following error:

[TensorRT] ERROR: cuda/cudaConvolutionLayer.cpp (163) - Cudnn Error in execute: 7
[TensorRT] ERROR: cuda/cudaConvolutionLayer.cpp (163) - Cudnn Error in execute: 7

And again, this engine works fine as long as I only run it in a single thread.
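cuDNN error 7 corresponds to CUDNN_STATUS_MAPPING_ERROR, which typically shows up when inference runs under a different CUDA context than the one the engine and its buffers were created with. Since the engine is known to work on a single thread, a robust pattern is to confine the engine (and its context) to one dedicated worker thread and serialize requests to it through a queue. A minimal sketch, assuming a `run_engine` callable that wraps your actual `common.do_inference` call (the class and names here are illustrative, not a TensorRT API):

```python
import queue
import threading

class InferenceServer:
    """Serialize all inference through one worker thread that owns the engine."""

    def __init__(self, run_engine):
        # run_engine: callable taking one input and returning one output.
        # In a real server this worker thread would also build the engine,
        # allocate buffers, and create the CUDA context, so that everything
        # CUDA-related lives on a single thread.
        self._run_engine = run_engine
        self._requests = queue.Queue()
        self._worker = threading.Thread(target=self._loop, daemon=True)
        self._worker.start()

    def _loop(self):
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            data, done, box = item
            box.append(self._run_engine(data))
            done.set()

    def infer(self, data):
        # Safe to call from any number of client threads.
        done = threading.Event()
        box = []
        self._requests.put((data, done, box))
        done.wait()
        return box[0]

    def close(self):
        self._requests.put(None)
        self._worker.join()

# Usage with a dummy "engine" that just doubles its input.
server = InferenceServer(lambda x: x * 2)
print(server.infer(21))  # -> 42
server.close()
```

This avoids per-thread context push/pop entirely at the cost of serializing inference, which is often acceptable since a single engine executes requests sequentially on one CUDA stream anyway.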

Now I have no idea how to solve these problems. Does anyone have suggestions on how to implement TensorRT as a multi-threaded inference server? Any advice would be appreciated.


Hello qliang, I have the same problem but no solution. Marking this topic so we can help each other.

Did you solve this problem? I also hit it in a child process. That confused me!! Please help.