100% Python GIL usage when calling context.execute_async through the Python API

Description

When running inference, the Python GIL is held 98-100% of the time, which considerably slows any concurrent work in other threads of the same Python process, such as reading images, pre-processing, post-processing and application code.
The same model running in PyTorch holds the GIL only 20-40% of the time.

I traced the thread holding the GIL, and its stack trace points to the context.execute_async(…) method.

I suspect that the Python C binding of the context's execute_async method does not release the GIL, although I'm unsure about this, as I don't have the source code for the Python bindings.
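One way to test this suspicion without the binding sources is to watch how much progress a pure-Python background thread makes while the call under test runs in a loop. The following is only a minimal sketch: `count_while`, `releases_gil` and `holds_gil` are hypothetical helper names, and the two stand-in workloads (a sleep, which releases the GIL, and a `sum` over a range, which holds it) are there only to show the contrast.

```python
import threading
import time


def count_while(fn, duration=0.5):
    """Call fn() repeatedly for `duration` seconds on the calling thread and
    return how many iterations a pure-Python background thread completed."""
    stop = threading.Event()
    ticks = [0]

    def worker():
        # Pure-Python loop: it can only advance while it holds the GIL.
        while not stop.is_set():
            ticks[0] += 1

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    deadline = time.perf_counter() + duration
    while time.perf_counter() < deadline:
        fn()
    stop.set()
    thread.join()
    return ticks[0]


def releases_gil():
    # Stand-in for a call that releases the GIL while it waits.
    time.sleep(0.001)


def holds_gil():
    # Stand-in for a call that keeps the GIL for its whole duration.
    sum(range(20000))


if __name__ == "__main__":
    free = count_while(releases_gil)
    busy = count_while(holds_gil)
    print(f"background iterations, GIL released: {free}")
    print(f"background iterations, GIL held:     {busy}")
```

Passing `lambda: model_trt(data)` from the reproduction script in place of the stand-ins would give a comparable number for the TensorRT call. A count close to the `holds_gil` figure would be consistent with the binding not releasing the GIL, although it would not prove it.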

Is this the problem, or is there a different issue causing this?
Can you provide a fix, or a workaround that allows using the Python API while releasing the GIL?

Environment

TensorRT Version: 7.1.3.0
GPU Type: Xavier NX
Nvidia Driver Version:
CUDA Version: 10.2.89
CUDNN Version: 8.0.0.180
Operating System + Version:
Python Version (if applicable): 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.6.0 (used only to compare)
Baremetal or Container (if container which image + tag):
Jetpack 4.4.

Relevant Files

Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)

Steps To Reproduce

To reproduce, install the latest torch2trt (https://github.com/NVIDIA-AI-IOT/torch2trt) and run:

import time
import os
import torch
import torchvision
from torch2trt import torch2trt, TRTModule

data = torch.randn((1, 3, 224, 224)).cuda().half()
model_pytorch = torchvision.models.resnet18(pretrained=True).cuda().half().eval()
if not os.path.exists('resnet18_trt.pth'):
    model_trt = torch2trt(model_pytorch, [data], fp16_mode=True)
    output_trt = model_trt(data)
    torch.save(model_trt.state_dict(), 'resnet18_trt.pth')
model_trt = TRTModule()
model_trt.load_state_dict(torch.load('resnet18_trt.pth'))
print('loading complete')
N = 100000
start = time.time()
for i in range(N):
    model_trt(data)  # 98-100% GIL usage
    # model_pytorch(data)  # 20-40% GIL usage
torch.cuda.synchronize()  # wait for queued GPU work so the timing is accurate
elapsed = time.time() - start
print(f'completed {N} in {elapsed:.2f} seconds, {1000 * elapsed / N:.2f} ms each')

Replace model_trt with model_pytorch in the loop to see the difference in GIL usage (20-40% vs. 98-100%).
I measured the GIL usage using py-spy, but any other GIL profiling tool can be used.
If you do want to use py-spy, you can install it on an ARM platform with:

sudo apt install curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env  # or log out and log back in instead
cargo install py-spy

py-spy top --pid <pid>  # replace <pid> with the python process id you use for benchmark