Does TensorRT exploit parallelism in a computational graph during inference?


I have a PyTorch model that I convert to TensorRT 8.4. The forward pass looks roughly like the following:

def forward(self, x):
    # The two encoders have no data dependency on each other
    lidar_features = self.lidar_encoder(x['pointcloud'])
    camera_features = self.camera_encoder(x['images'])
    # Both encoder outputs must be ready before they are stacked
    combined_features = torch.stack((lidar_features, camera_features))
    output = self.prediction_head(combined_features)
    return output

During inference, is TensorRT smart enough to know that the lidar encoder and camera encoder can be run at the same time on the GPU, but then a sync needs to be inserted before the torch.stack? Does it do this automatically?


TensorRT Version: 8.4

Can you try running your model with the trtexec command and share the "--verbose" log if the issue persists?
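
A minimal sketch of how the verbose log could be captured from Python, assuming the model has already been exported to ONNX. The file names `model.onnx` and `trtexec_verbose.log` and both helper functions are hypothetical. Only the `--onnx` and `--verbose` options of trtexec are used.

```python
import shutil
import subprocess


def build_trtexec_command(onnx_path, verbose=True):
    """Build the trtexec argument list for an ONNX model (path is hypothetical)."""
    cmd = ["trtexec", f"--onnx={onnx_path}"]
    if verbose:
        # Ask trtexec for the detailed build and inference log
        cmd.append("--verbose")
    return cmd


def run_trtexec(onnx_path, log_path="trtexec_verbose.log"):
    """Run trtexec and write its combined stdout/stderr to log_path."""
    cmd = build_trtexec_command(onnx_path)
    if shutil.which(cmd[0]) is None:
        # trtexec ships with TensorRT; it may not be on PATH on this machine
        print("trtexec not found on PATH; would have run:", " ".join(cmd))
        return None
    with open(log_path, "w") as log_file:
        result = subprocess.run(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    return result.returncode
```

Calling `run_trtexec("model.onnx")` would then leave the log in `trtexec_verbose.log`, which can be attached to the thread.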

You can refer to the link below for the list of all supported operators. If an operator is not supported, you need to create a custom plugin to support that operation.

Also, please share your model and script, if not shared already, so that we can help you better.

Meanwhile, for some common errors and queries, please refer to the link below:


In my experience it doesn’t. You can run Nsight Systems (or another profiler of your choice) and look at the timeline. I think you can imagine that TensorRT tries to maximize GPU usage on each layer, so it wouldn’t have much compute power left for running both branches “at the same time” anyway. Also, I think you would need separate streams to achieve concurrent execution of layers, and as far as I know TensorRT executes in only one stream.

If each branch isn’t very computationally intensive, maybe using CUDA graphs will help? See the Developer Guide :: NVIDIA Deep Learning TensorRT Documentation.

You could also try to split your model into two pieces and execute them in separate streams. But again, if each branch has maximized GPU usage, there won’t be any performance difference.
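
To illustrate the two-stream idea, here is a rough sketch in plain PyTorch rather than TensorRT. It assumes the two branches are independent; the `nn.Linear` layers and tensor sizes are hypothetical stand-ins for the real encoders and head. With TensorRT you would have one engine per branch and enqueue each on its own CUDA stream, but the synchronization pattern before the stack would be the same.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical stand-ins for the real encoders and prediction head
lidar_encoder = nn.Linear(16, 8).to(device)
camera_encoder = nn.Linear(32, 8).to(device)
prediction_head = nn.Linear(8, 4).to(device)


@torch.no_grad()
def forward_two_streams(pointcloud, images):
    if not torch.cuda.is_available():
        # CPU fallback: no streams, the branches simply run one after the other
        lidar = lidar_encoder(pointcloud)
        camera = camera_encoder(images)
    else:
        current = torch.cuda.current_stream()
        lidar_stream = torch.cuda.Stream()
        camera_stream = torch.cuda.Stream()
        # The side streams must not start before the inputs are ready
        lidar_stream.wait_stream(current)
        camera_stream.wait_stream(current)
        with torch.cuda.stream(lidar_stream):
            lidar = lidar_encoder(pointcloud)
        with torch.cuda.stream(camera_stream):
            camera = camera_encoder(images)
        # Sync point: the current stream waits for both branches before the stack
        current.wait_stream(lidar_stream)
        current.wait_stream(camera_stream)
        # Tell the caching allocator these tensors are now used on `current`
        lidar.record_stream(current)
        camera.record_stream(current)
    combined = torch.stack((lidar, camera))
    return prediction_head(combined)


pointcloud = torch.randn(3, 16, device=device)
images = torch.randn(3, 32, device=device)
output = forward_two_streams(pointcloud, images)
```

Whether the two branches actually overlap on the GPU depends on how much of the device each kernel occupies, so it is worth checking the timeline in a profiler before and after.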