100% Python GIL usage when calling context.execute_async through the Python API


When running inference, the Python GIL is held 98-100% of the time, which considerably reduces the performance of any concurrent work in other threads of the same Python process, such as reading images, pre-processing, post-processing and application code.
The same model running in PyTorch holds the GIL only 20-40% of the time.

I traced the stack of the thread holding the GIL, and it points to the context.execute_async(…) method.

I suspect that the Python C binding of the context's execute_async method does not release the GIL, although I'm unsure about this, as I don't have the source code of the Python bindings.
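To illustrate the mechanism I suspect, here is a minimal sketch that does not involve TensorRT at all. It assumes a Linux system where ctypes.CDLL(None) exposes libc's usleep. The same blocking C call is made once through ctypes.CDLL (which releases the GIL during the call) and once through ctypes.PyDLL (which keeps it), while a pure-Python background thread counts how far it gets. The helper name background_progress is hypothetical, and usleep is only a stand-in for a binding that blocks in native code.

```python
import ctypes
import threading


def background_progress(libc, sleep_us=200_000):
    """Return how many iterations a pure-Python thread completes
    while the calling thread is blocked inside libc.usleep()."""
    count = 0
    stop = threading.Event()

    def worker():
        nonlocal count
        while not stop.is_set():
            count += 1  # needs the GIL for every iteration

    t = threading.Thread(target=worker)
    t.start()
    libc.usleep(ctypes.c_uint(sleep_us))  # blocking native call (stand-in)
    stop.set()
    t.join()
    return count


# CDLL releases the GIL around the foreign call; PyDLL does not.
released = background_progress(ctypes.CDLL(None))
held = background_progress(ctypes.PyDLL(None))

print(f"background iterations, GIL released during call: {released}")
print(f"background iterations, GIL held during call:     {held}")
```

If the execute_async binding behaves like the PyDLL case, other threads in the process would be starved in the same way for the duration of each call.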

Is this the problem, or is there a different issue causing this?
Can you provide a fix, or a workaround that allows using the Python API while releasing the GIL?


TensorRT Version:
GPU Type: Xavier NX
Nvidia Driver Version:
CUDA Version: 10.2.89
CUDNN Version:
Operating System + Version:
Python Version (if applicable): 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.6.0 (used only to compare)
Baremetal or Container (if container which image + tag):
Jetpack 4.4.

To reproduce:
Install the latest torch2trt (GitHub: NVIDIA-AI-IOT/torch2trt, an easy-to-use PyTorch to TensorRT converter), then run:

import time
import os
import torch
import torchvision
from torch2trt import torch2trt,TRTModule

data = torch.randn((1, 3, 224, 224)).cuda().half()
model_pytorch = torchvision.models.resnet18(pretrained=True).cuda().half().eval()
if not os.path.exists('resnet18_trt.pth'):
    model_trt = torch2trt(model_pytorch, [data], fp16_mode=True)
    output_trt = model_trt(data)
    torch.save(model_trt.state_dict(), 'resnet18_trt.pth')
model_trt = TRTModule()
# Load the saved engine; without this the TRTModule is empty
model_trt.load_state_dict(torch.load('resnet18_trt.pth'))
print('loading complete')
N = 100000
start = time.time()
for i in range(N):
    model_trt(data)  # 98-100% GIL usage
    # model_pytorch(data)  # 20-40% GIL usage
elapsed = time.time() - start
print(f'completed {N} in {elapsed:.2f} seconds, each one is {1000*elapsed/N:.2f} ms')

Replace model_trt with model_pytorch in the loop to see the difference in GIL usage (20-40% vs 98-100%).
I measured the GIL usage using py-spy, but any other GIL profiling tool can be used.
If you do want to use py-spy, you can install it on an ARM platform with:

sudo apt install curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env  # or log out and log in again so that cargo is on the PATH
cargo install py-spy

py-spy top --pid <pid>  # replace <pid> with the python process id you use for benchmark
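
For a rough cross-check without py-spy, a probe thread can measure how late it wakes up from a short sleep; extra delay is mostly time spent waiting to reacquire the GIL. This is only a sketch under my own assumptions: gil_wait_probe, busy_python and idle are hypothetical helpers, and the pure-Python busy loop is a stand-in for a workload that holds the GIL, not the TensorRT call itself.

```python
import threading
import time


def gil_wait_probe(workload, interval=0.001):
    """Run workload() in the calling thread and return the probe
    thread's mean wake-up delay (seconds) beyond `interval`."""
    delays = []
    stop = threading.Event()

    def probe():
        while not stop.is_set():
            t0 = time.perf_counter()
            time.sleep(interval)  # the GIL is released while sleeping
            # Overshoot beyond `interval`: scheduler latency plus GIL wait.
            delays.append(time.perf_counter() - t0 - interval)

    t = threading.Thread(target=probe)
    t.start()
    workload()
    stop.set()
    t.join()
    return sum(delays) / max(len(delays), 1)


def busy_python(seconds=0.3):
    """Stand-in for a workload that holds the GIL almost all the time."""
    end = time.perf_counter() + seconds
    while time.perf_counter() < end:
        pass


def idle(seconds=0.3):
    """Stand-in for a workload that releases the GIL."""
    time.sleep(seconds)


contended = gil_wait_probe(busy_python)
uncontended = gil_wait_probe(idle)
print(f"mean wake-up delay, GIL contended:   {contended * 1000:.3f} ms")
print(f"mean wake-up delay, GIL uncontended: {uncontended * 1000:.3f} ms")
```

Passing a function that runs the model_trt loop (or the model_pytorch loop) as the workload should give a comparable number for each backend.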