Observation: Same inference speed when running two parallel processes vs. running two threads inside a single process
Environment
TensorRT Version: 7.2.3.4
GPU Type: GeForce MX130 (2 GB)
Nvidia Driver Version: 465.19.01
CUDA Version: V11.1.105
CUDNN Version: V8
Operating System + Version: Ubuntu 18.04
Python Version (if applicable): Python 3.6
TensorFlow Version (if applicable): NA
PyTorch Version (if applicable): 1.8.1+cu111
Baremetal or Container (if container which image + tag): NA
GPU_Utilization_with_threads.xlsx (10.0 KB)
A brief explanation of the experiments in the attached Excel sheet:
- Experiment 1 in the Excel sheet showed no surprises: loading a single TRT model took about 340 MB of GPU memory and inference took 110 ms.
- In Experiment 2 we loaded the TRT model once and ran inference on 2 threads sharing the preloaded model, to mimic 2 camera streams inside a single process (see the sketch after this list). The inference time shot up to 160 ms because of context switching, but GPU memory stayed the same.
- In Experiment 3 we ran two separate processes, each doing the camera processing on its main thread, i.e. 2 processes reading video frames in a while loop. Here GPU memory doubled because CUDA memory was allocated independently in each process, but the inference time was similar to what we saw with threading.
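For reference, below is a rough sketch of how we drive Experiment 2: two threads sharing one deserialized engine, each with its own execution context and CUDA stream. The engine path, the binding order, the static shapes and the use of PyTorch tensors as device buffers are assumptions for illustration; the real pipeline feeds decoded camera frames instead of random tensors. Experiment 3 is effectively the same script started twice as independent processes.

```python
# Sketch of Experiment 2: one deserialized engine shared by two threads,
# each thread mimicking one camera stream with its own execution context
# and CUDA stream. "model.trt", the binding order (0 = input, 1 = output)
# and the static, explicit-batch shapes are assumptions for illustration.
import threading
import time

import tensorrt as trt
import torch

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


def load_engine(path):
    with open(path, "rb") as f, trt.Runtime(TRT_LOGGER) as runtime:
        return runtime.deserialize_cuda_engine(f.read())


def camera_worker(engine, stream_id, n_frames=100):
    # Per-thread execution context and CUDA stream; the engine (weights)
    # is shared, so GPU memory is only paid once per process.
    context = engine.create_execution_context()
    stream = torch.cuda.Stream()
    inp = torch.empty(tuple(engine.get_binding_shape(0)),
                      dtype=torch.float32, device="cuda")
    out = torch.empty(tuple(engine.get_binding_shape(1)),
                      dtype=torch.float32, device="cuda")
    bindings = [inp.data_ptr(), out.data_ptr()]

    start = time.time()
    for _ in range(n_frames):
        with torch.cuda.stream(stream):
            inp.normal_()  # stand-in for a decoded camera frame
            context.execute_async_v2(bindings, stream.cuda_stream)
        stream.synchronize()
    print("stream %d: %.1f ms/frame"
          % (stream_id, (time.time() - start) / n_frames * 1000))


if __name__ == "__main__":
    engine = load_engine("model.trt")
    workers = [threading.Thread(target=camera_worker, args=(engine, i))
               for i in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    # Experiment 3 is the same script launched twice as independent
    # processes, so each process deserializes its own copy of the engine.
```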
Questions based on the above experiments:
1) Is threading the recommended approach for running inference on multiple camera channels with the same model? Our goal is to achieve maximum parallelism while keeping the CUDA memory overhead low.
2) How would the behavior in Experiment 3 change with GPU partitioning such as MIG on Ampere GPUs? I am assuming process 2 would not have to wait if it is accessing a different partition of GPU memory?
3) Is Triton Inference Server recommended in this context, say when we have to process 100 camera streams on a single physical server with 10 unique models, i.e. 10 models each serving 10 channels? Is the same model loaded once and reused across its channels to avoid the extra loading and memory overhead? (A rough client-side sketch of what we have in mind is below.)
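To make question 3 concrete, this is roughly what we picture on the client side: every camera channel is just another thread (or process) issuing requests against the same model name, and the server decides how many copies/instances of that model actually exist. The model name, tensor names and shapes below are placeholders for illustration.

```python
# Hypothetical client-side sketch: several "camera channels" sending
# requests to one Triton-hosted model. The model name ("detector_a"),
# tensor names ("input"/"output") and shapes are placeholders.
import threading

import numpy as np
import tritonclient.grpc as grpcclient


def channel_worker(channel_id, n_frames=10):
    # One client per thread; all channels target the same model name.
    client = grpcclient.InferenceServerClient(url="localhost:8001")
    for _ in range(n_frames):
        frame = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in for a camera frame
        inp = grpcclient.InferInput("input", list(frame.shape), "FP32")
        inp.set_data_from_numpy(frame)
        out = grpcclient.InferRequestedOutput("output")
        result = client.infer(model_name="detector_a", inputs=[inp], outputs=[out])
        _ = result.as_numpy("output")
    print("channel %d done" % channel_id)


if __name__ == "__main__":
    workers = [threading.Thread(target=channel_worker, args=(i,)) for i in range(10)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```

The assumption here is that all 10 channels of a model simply target the same model name, and the server-side `instance_group` setting in that model's `config.pbtxt` decides how many execution instances actually exist; whether those instances end up sharing one copy of the weights is part of what we would like confirmed.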