Description
Is it possible to divide the workload between the GPU and the host CPU when using TensorRT?
My GPU is installed in a relatively powerful system whose CPU sits mostly idle during GPU execution. Can I make use of that CPU power by running some steps on the CPU and others on the GPU, for example some kind of pre-conversion of the input data that might speed things up?
My current understanding is that no matter what precision or input format I build my TensorRT engine for, the input handed to TensorRT goes straight to the GPU without using any CPU power. Is this understanding correct?
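For concreteness, below is the kind of pipelining I have in mind: the CPU preprocesses the next input while the GPU runs inference on the current one. This is only a rough sketch, not working code. It assumes a serialized engine at a hypothetical path `model.trt` with a single fixed-shape input binding and output binding, and uses pycuda for memory handling:

```python
import numpy as np
import pycuda.autoinit  # noqa: F401 -- creates a default CUDA context
import pycuda.driver as cuda
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

# Hypothetical serialized engine; assumed to have one input binding (index 0)
# and one output binding (index 1), both with fixed shapes.
with open("model.trt", "rb") as f:
    engine = trt.Runtime(TRT_LOGGER).deserialize_cuda_engine(f.read())
context = engine.create_execution_context()

in_shape = tuple(engine.get_binding_shape(0))
out_shape = tuple(engine.get_binding_shape(1))

# Pinned (page-locked) host buffers so host<->device copies can be async.
h_in = cuda.pagelocked_empty(trt.volume(in_shape), dtype=np.float32)
h_out = cuda.pagelocked_empty(trt.volume(out_shape), dtype=np.float32)
d_in = cuda.mem_alloc(h_in.nbytes)
d_out = cuda.mem_alloc(h_out.nbytes)
stream = cuda.Stream()

def preprocess_on_cpu(frame):
    # The CPU-side step I would like to overlap with GPU execution,
    # e.g. cast + scale done by the otherwise idle host cores.
    return (frame.astype(np.float32) / 255.0).ravel()

# Dummy input stream standing in for real data.
frames = [np.random.randint(0, 256, in_shape, dtype=np.uint8) for _ in range(8)]

pending = preprocess_on_cpu(frames[0])
for i in range(len(frames)):
    np.copyto(h_in, pending)
    cuda.memcpy_htod_async(d_in, h_in, stream)                   # H2D copy
    context.execute_async_v2([int(d_in), int(d_out)], stream.handle)
    cuda.memcpy_dtoh_async(h_out, d_out, stream)                 # D2H copy
    # The enqueue calls above return immediately, so the GPU is now busy;
    # use the CPU to prepare the next input in the meantime.
    if i + 1 < len(frames):
        pending = preprocess_on_cpu(frames[i + 1])
    stream.synchronize()  # wait for inference + copies of frames[i] to finish
    # h_out now holds the result for frames[i]
```

Is this overlap something TensorRT itself can help with, or is it entirely up to the application to structure the host code this way?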
Environment
TensorRT Version: 8.0.0.3 (reported as 8003)
GPU Type: NVIDIA T4
Nvidia Driver Version: 450.51.05
CUDA Version: 11.0
CUDNN Version:
Operating System + Version: Ubuntu 18.04
Python Version (if applicable):
TensorFlow Version (if applicable):
PyTorch Version (if applicable):
Baremetal or Container (if container which image + tag):