Description
When running inference with TensorRT, the Python GIL is held 98-100% of the time, which considerably slows any concurrent work in other threads of the same Python process, such as reading images, pre-processing, post-processing, and application code.
The same model running in PyTorch holds the GIL only 20-40% of the time.
The stack trace of the thread holding the GIL points to the context.execute_async(…) method.
I suspect that the Python C binding of the context's execute_async method does not release the GIL, although I'm unsure about this, as I don't have the source code for the Python bindings.
Is this the problem, or is there a different issue causing this?
Can you provide a fix, or a workaround that allows using the Python API while releasing the GIL?
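For anyone who wants to check which call a thread is sitting in without an external profiler, here is a minimal sketch using only the standard library. The helper name dump_thread_stacks is my own, not part of TensorRT or torch2trt, and it shows Python-level frames only, so it can point at the Python call that entered the binding but not at what happens inside the native code.

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return {thread name: formatted Python stack} for every live thread.

    Hypothetical helper for illustration; it only sees Python frames.
    """
    # Map thread identifiers to readable names where possible.
    names = {t.ident: t.name for t in threading.enumerate()}
    stacks = {}
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, f"thread-{ident}")
        stacks[name] = "".join(traceback.format_stack(frame))
    return stacks


if __name__ == "__main__":
    for name, stack in dump_thread_stacks().items():
        print(f"--- {name} ---\n{stack}")
```

Calling this periodically from a helper thread while the inference loop runs should show the inference thread repeatedly stopped at the same call if that call dominates the run time.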
Environment
TensorRT Version: 7.1.3.0
GPU Type: Xavier NX
Nvidia Driver Version:
CUDA Version: 10.2.89
CUDNN Version: 8.0.0.180
Operating System + Version:
Python Version (if applicable): 3.6
TensorFlow Version (if applicable):
PyTorch Version (if applicable): 1.6.0 (used only to compare)
Baremetal or Container (if container which image + tag):
Jetpack 4.4.
Relevant Files
Please attach or include links to any models, data, files, or scripts necessary to reproduce your issue. (Github repo, Google Drive, Dropbox, etc.)
Steps To Reproduce
To reproduce:
Install the latest torch2trt (GitHub: NVIDIA-AI-IOT/torch2trt, an easy-to-use PyTorch to TensorRT converter), then run:
import time
import os
import torch
import torchvision
from torch2trt import torch2trt, TRTModule

data = torch.randn((1, 3, 224, 224)).cuda().half()
model_pytorch = torchvision.models.resnet18(pretrained=True).cuda().half().eval()

# Build the TensorRT engine once and cache it on disk.
if not os.path.exists('resnet18_trt.pth'):
    model_trt = torch2trt(model_pytorch, [data], fp16_mode=True)
    output_trt = model_trt(data)
    torch.save(model_trt.state_dict(), 'resnet18_trt.pth')

# Load the cached engine.
model_trt = TRTModule()
model_trt.load_state_dict(torch.load('resnet18_trt.pth'))
print('loading complete')

N = 100000
start = time.time()
for i in range(N):
    model_trt(data)  # 98-100% GIL usage
    # model_pytorch(data)  # 20-40% GIL usage
torch.cuda.synchronize()  # wait for queued GPU work before stopping the timer
elapsed = time.time() - start
print(f'completed {N} in {elapsed:.2f} seconds each one is {1000*elapsed/N:.2f} ms')
Replace model_trt with model_pytorch in the loop to see the difference in GIL usage (20-40% vs 98-100%).
I measured the GIL usage using py-spy, but any other GIL profiling tool can be used.
If you do want to use py-spy, you can install it on an ARM platform with:
sudo apt install curl
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env # or log out and log in again
cargo install py-spy
py-spy top --pid <pid> # replace <pid> with the ID of the Python process you are benchmarking
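As a profiler-independent cross-check, the effect on other threads can also be estimated with a small probe: a background thread that runs pure Python (and therefore needs the GIL) counts how many iterations it manages while the main thread runs a workload. This is only a rough sketch; ticks_per_second and the two demo workloads are names I made up for illustration, and the probe itself competes for the GIL, so the numbers are only meaningful relative to each other.

```python
import threading
import time


def ticks_per_second(workload):
    """Run workload() on the calling thread while a probe thread counts
    pure-Python loop iterations; return the probe's iterations per second.

    Hypothetical helper: a low rate suggests the workload rarely lets
    other Python threads acquire the GIL.
    """
    stop = threading.Event()
    ticks = [0]

    def probe():
        # Every iteration needs the GIL, so progress stalls while it is held elsewhere.
        while not stop.is_set():
            ticks[0] += 1

    thread = threading.Thread(target=probe, daemon=True)
    start = time.perf_counter()
    thread.start()
    try:
        workload()
    finally:
        stop.set()
        thread.join()
    elapsed = time.perf_counter() - start
    return ticks[0] / elapsed


if __name__ == "__main__":
    def busy_python():
        # Pure-Python spin loop: competes with the probe for the GIL.
        deadline = time.perf_counter() + 0.5
        while time.perf_counter() < deadline:
            pass

    def sleeping():
        # time.sleep releases the GIL, so the probe runs unhindered.
        time.sleep(0.5)

    print(f"busy workload:     {ticks_per_second(busy_python):.0f} ticks/s")
    print(f"sleeping workload: {ticks_per_second(sleeping):.0f} ticks/s")
```

In the benchmark above, the workload would be a function that calls model_trt(data) or model_pytorch(data) in a loop; if the TensorRT call does not release the GIL, I would expect its tick rate to be noticeably lower than the PyTorch one.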