Running 2 models on the same GPU with TensorRT

Description

Hi,
I’m trying to run two TensorRT engines, each with a different model, on the same GPU, but I’m getting the following error:

Cask Error in checkCaskExecError<false>: 7 (Cask Convolution execution).

What is the reason for this error?

Below is my code:

import logging
import os

import numpy as np
import pycuda.driver as cuda
import tensorrt as trt

LOG = logging.getLogger(__name__)
TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

class FeatureCalcTrt(object):

    def __init__(self, device_id, params_file):
        self._device_id = device_id
        self._params_file = params_file
        self.dev = None
        self.ctx = None
        self.engine = None
        self.inputs = None
        self.outputs = None
        self.bindings = None
        self.stream = None
        self.engine_context = None

    def _init_network_impl(self):
        LOG.info("Initializing Network %s", self._device_id)
        os.environ["CUDA_VISIBLE_DEVICES"] = str(self._device_id)
        self.dev = cuda.Device(self._device_id)
        self.ctx = self.dev.make_context()

        with open(self._params_file, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())
            self.inputs, self.outputs, self.bindings, self.stream = self._allocate_trt_buffers(self.engine)
            assert len(self.inputs) == 1, "inputs = %s" % self.inputs
            assert len(self.outputs) == 1, "outputs = %s" % self.outputs
            self.engine_context = self.engine.create_execution_context()

    @staticmethod
    def _allocate_trt_buffers(engine):
        inputs = []
        outputs = []
        bindings = []
        stream = cuda.Stream()
        for binding in engine:
            size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
            dtype = trt.nptype(engine.get_binding_dtype(binding))
            # Allocate host and device buffers
            host_mem = cuda.pagelocked_empty(size, dtype)
            device_mem = cuda.mem_alloc(host_mem.nbytes)
            # Append the device buffer to device bindings.
            bindings.append(int(device_mem))
            # Append to the appropriate list.
            if engine.binding_is_input(binding):
                inputs.append((host_mem, device_mem))
            else:
                LOG.info("Allocating output %s %s", engine.get_binding_shape(binding), binding)
                outputs.append((host_mem, device_mem))
        return inputs, outputs, bindings, stream

    @staticmethod
    def _do_trt_inference(context, bindings, inputs, outputs, stream, batch_size=1):
        # Transfer input data to the GPU.
        for inp in inputs:
            LOG.debug("input %s", inp[0])
            cuda.memcpy_htod_async(inp[1], inp[0], stream)
        # Run inference.
        context.execute_async(batch_size=batch_size, bindings=bindings, stream_handle=stream.handle)
        # Transfer predictions back from the GPU.
        for out in outputs:
            cuda.memcpy_dtoh_async(out[0], out[1], stream)
        # Synchronize the stream
        stream.synchronize()
        # Return only the host outputs.
        LOG.debug("output %s", outputs[0][0])
        return [out[0] for out in outputs]

    @staticmethod
    def _process_trt_batch(patches, context, inputs, outputs, bindings, stream):
        pagelocked_buffer = inputs[0][0]
        flat_patches = np.array(patches).ravel()
        data_size = len(flat_patches)
        np.copyto(pagelocked_buffer[:data_size], flat_patches)
        [cur_features] = FeatureCalcTrt._do_trt_inference(context, bindings=bindings, inputs=inputs, outputs=outputs, stream=stream,
                                                          batch_size=len(patches))
        return cur_features.copy()

    def calc_features_one_batch(self, patches):
        return self._process_trt_batch(patches, self.engine_context, self.inputs, self.outputs, self.bindings, self.stream)


if __name__ == '__main__':
    cuda.init()
    f1 = FeatureCalcTrt(0, 'params1.trt')
    f2 = FeatureCalcTrt(0, 'params2.trt')
    f1._init_network_impl()
    f2._init_network_impl()
    patches = None  # code that reads patches
    res1 = f1.calc_features_one_batch(patches)
    res2 = f2.calc_features_one_batch(patches)

Environment

TensorRT Version: 7.0.0
GPU Type: RTX 2080 Ti
Nvidia Driver Version: 410.78
CUDA Version: 10.0
CUDNN Version: 7.4.2
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): 2.7

Hi @trillian.2020.09.01,
Please refer to the link below for the thread-safety guidelines:
https://docs.nvidia.com/deeplearning/sdk/tensorrt-archived/tensorrt-700/tensorrt-best-practices/index.html#thread-safety
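
For illustration, here is a minimal sketch of the guideline in that document: the deserialized ICudaEngine can be shared, but each thread should work with its own IExecutionContext. This is only a sketch, assuming a hypothetical serialized engine file 'model.trt'; buffer allocation and the actual execute_async call are left out.

import threading

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# 'model.trt' is a hypothetical serialized engine file; a single engine is shared.
with open('model.trt', 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
    engine = runtime.deserialize_cuda_engine(f.read())


def worker(context):
    # Per the guideline, each thread uses its own execution context.
    # ... allocate per-thread buffers and a CUDA stream here, then call
    # context.execute_async(...) as in the code above.
    pass


# One execution context per thread, all created from the shared engine.
contexts = [engine.create_execution_context() for _ in range(2)]
threads = [threading.Thread(target=worker, args=(c,)) for c in contexts]
for t in threads:
    t.start()
for t in threads:
    t.join()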

Thanks!

Thanks!
I’m not sure how those guidelines are relevant to my case, since:

  1. I run everything in a single thread.
  2. I use a single logger.

Can you please explain in more detail which guideline is not being followed here?

Hi @trillian.2020.09.01,

Thanks for your post! I am also interested in this topic!

May I ask which idea you are discussing here?

Thank you.

BR,
Chieh

Hi @trillian.2020.09.01,

Regarding the error, you can find a possible solution here.

Thanks!

Hi, I was referring to plan B from your figure in the post above.

Thanks again. As far as I understand, the link you sent discusses multi-threaded code, while, as explained above, I get this error when running in a single thread (both the CUDA initialization and the TRT calls happen on the same thread), so it doesn’t help with my problem.
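
For what it’s worth, one thing I am not sure about (this is only my assumption, not something established in this thread): each instance calls make_context() on the same device, so after both are initialized only the second context is current. If that matters, I imagine the per-instance context would have to be made current around each call, roughly like this sketch (reusing f1, f2 and patches from the code above):

# Hypothetical sketch: explicitly activate each instance's CUDA context
# around its inference call; not confirmed to be the cause of the error.
f1.ctx.push()                                   # make f1's context current
try:
    res1 = f1.calc_features_one_batch(patches)
finally:
    cuda.Context.pop()                          # restore the previous context

f2.ctx.push()                                   # make f2's context current
try:
    res2 = f2.calc_features_one_batch(patches)
finally:
    cuda.Context.pop()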

Hi @trillian.2020.09.01,
apologies for the delayed response. Are you still facing the issue?