I am working on a deep learning project using PyTorch with CUDA on an embedded system. I have observed that the initial inference takes approximately 4 seconds, while subsequent inferences only take about 0.1 seconds. I have tried pre-initializing CUDA with the following commands:
That is expected: the initialization loads the CUDA binary (>600 MB) into memory, which takes time.
This is improved by the lazy-loading feature introduced in CUDA 11.8.
However, CUDA 11.8 is not available for the TX2.
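For reference, on platforms where CUDA 11.8 or later is available, lazy loading can be opted into through the CUDA_MODULE_LOADING environment variable; a minimal sketch (not applicable on the TX2, which does not get CUDA 11.8):

import os

# Opt into lazy loading of CUDA modules (CUDA 11.8+).
# Must be set before CUDA is initialized, so set it before importing torch.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

import torch

# The first CUDA call now pays a much smaller one-time cost.
x = torch.zeros(1, device="cuda")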
Thanks.
Thank you very much for your previous answers, but I am also experiencing the following performance issues in my project and would appreciate your help and suggestions. Here are the specific details:
1. Long Library Loading Time
Loading the following libraries takes a total of 7.42 seconds:
import cv2
import torch
import numpy as np
from torch.nn import functional as F
from kenexs_dl.deepLearning_semseg.models.UNetv2 import UNetv2
2. Long Model Loading and Initialization Time
In the seg_create step, loading model parameters and initializing the model takes the following times:
state = torch.load(model_path, map_location=device): 8 seconds
model = UNetv2(classes_num): 2.45 seconds
Both steps take a lot of time. Are there specific methods I can use to optimize them?
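For what it is worth, one way to see where the 7.42 seconds go is to time each import individually (or to run the script with python -X importtime); a minimal sketch using a hypothetical timed_import helper:

import time

def timed_import(statement):
    # Hypothetical helper: run one import statement and report how long it took.
    start = time.time()
    exec(statement, globals())
    print(f"{statement}: {time.time() - start:.2f} s")

timed_import("import cv2")
timed_import("import torch")
timed_import("import numpy as np")
timed_import("from torch.nn import functional as F")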
Hi AastaLLL, can you tell us more about the 8 seconds spent in torch.load? What is it doing that takes so long? Can we use multithreading or similar techniques to reduce this time?
The loading call includes some initialization, e.g., importing the CUDA binary into memory.
To check this, you can try adding a simple PyTorch CUDA operation at the beginning.
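For example, a minimal warm-up sketch (assuming it runs before any model loading):

import torch

# Warm-up: force CUDA context creation and library loading up front,
# so later calls such as torch.load(..., map_location='cuda:0') do not pay this cost.
_ = torch.tensor([1., 2.], device='cuda') + torch.tensor([1., 2.], device='cuda')
torch.cuda.synchronize()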
Could you give us more details about "adding a simple PyTorch CUDA operation at the beginning"? Do you mean that if I run a CUDA computation at the start, the time cost will move there instead of to torch.load? So the "CUDA binary" you mentioned is essentially the CUDA environment being prepared (perhaps more precisely, the PyTorch CUDA environment), and that preparation is required before most CUDA code can run?
Thank you for your previous suggestions. I have a follow-up question regarding the time it takes to load the model and perform inference using PyTorch.
I used torch.cuda.init() to initialize CUDA beforehand, but I still see that the time taken for torch.load(model_path, map_location=device) and the inference time (model(img)) hasn’t significantly decreased.
Here is the code I’m using to measure these times:
import torch
import torchvision.models as models
import time
# CUDA initialization
torch.cuda.init()
torch.cuda.set_device(0)
torch.cuda.empty_cache()
print("CUDA initialization complete.")
# Measure model loading time
start_load = time.time()
model = torch.load('model_path.pth', map_location='cuda:0')
end_load = time.time()
print(f"Model loading time: {end_load - start_load:.2f} seconds")
# Set model to evaluation mode
model.eval()
# Measure inference time
start_pred = time.time()
preds = model(img)  # img: a preprocessed input tensor prepared earlier on the same device
end_pred = time.time()
print(f"Inference time: {end_pred - start_pred:.2f} seconds")
Could you tell me:
1. Which specific lines of code are responsible for the model loading process in PyTorch?
2. Do you have any examples or demos that show best practices for separating business logic from model loading and inference code? (See the sketch below for the kind of separation I mean.)
Understanding these questions is very important to me.
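For context on question 2, here is a minimal sketch of the kind of separation I mean, using a hypothetical SegmentationService wrapper (the name and structure are illustrative, not from any existing library):

import torch

class SegmentationService:
    # Hypothetical wrapper: expensive setup happens once in __init__,
    # while predict() is the cheap per-request path.
    def __init__(self, model_path, device='cuda:0'):
        self.device = torch.device(device)
        self.model = torch.load(model_path, map_location=self.device)
        self.model.eval()

    @torch.no_grad()
    def predict(self, img):
        return self.model(img.to(self.device))

# Business logic only touches the service, never the loading details:
# service = SegmentationService('model_path.pth')
# preds = service.predict(img)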
Thank you for your assistance!
We tested ResNet model loading on JetPack 6 + Orin.
The loading time decreased from 0.22 seconds to 0.16 seconds when a toy CUDA sample was run beforehand.
...
# Toy CUDA sample: trigger CUDA initialization before the model load
x = torch.tensor([1., 2.], device=torch.device('cuda'))
y = torch.tensor([1., 2.], device=torch.device('cuda'))
z = x + y
# Measure model loading time
start_load = time.time()
model = torch.load('/home/nvidia/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth', map_location='cuda:0')
end_load = time.time()
print(f"Model loading time: {end_load - start_load:.2f} seconds")
Hi AastaLLL,
Thank you for your previous suggestions. I implemented a function seg_create(model_path) that loads a model and includes several timing measurements:
CUDA Test: I perform a basic CUDA operation (tensor([1, 2], device='cuda:0') + tensor([1, 2], device='cuda:0')) to ensure that CUDA is functioning correctly, which outputs tensor([2, 4], device='cuda:0').
torch.load Time: The model file is loaded using torch.load(model_path, map_location=device), which takes approximately 0.44 s.
Total Other Time: After loading the state, additional time is spent on tasks like initializing the UNetv2 model and loading the state dictionary. This part of the process takes about 2.13 s.
seg_create Execution Time: The entire seg_create function takes around 10.08 s to complete.
While the torch.load time is relatively fast, the bulk of the time is spent on other steps within seg_create, particularly initializing the model and setting it up for inference. Is there any way to optimize these other steps to reduce the overall execution time of the function?
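For reference, a minimal sketch of how these steps typically fit together, instrumented per step so the slow part is visible; the internal structure of seg_create shown here is an assumption, since the actual code is not posted:

import time
import torch
from kenexs_dl.deepLearning_semseg.models.UNetv2 import UNetv2

def seg_create(model_path, classes_num, device='cuda:0'):
    # Assumed internal structure, for timing purposes only.
    t0 = time.time()
    state = torch.load(model_path, map_location=device)   # read weights from disk
    t1 = time.time()
    model = UNetv2(classes_num)                            # build the network object
    t2 = time.time()
    model.load_state_dict(state)                           # bind weights to the network
    model.to(device).eval()
    torch.cuda.synchronize()                               # include pending GPU copies
    t3 = time.time()
    print(f"torch.load: {t1 - t0:.2f} s, UNetv2(): {t2 - t1:.2f} s, "
          f"load_state_dict + to(device): {t3 - t2:.2f} s")
    return model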
There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks
Hi,
Could you share the “initialize the model” and “set it up for inference” code for us to check?
Is there any memory-copy operation, such as moving data from the CPU to the GPU?
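For clarity, this is the kind of CPU-to-GPU copy being asked about (shapes and names are illustrative only):

import torch

img_cpu = torch.zeros(1, 3, 512, 512)   # example tensor created in CPU memory
img_gpu = img_cpu.to('cuda:0')           # explicit host-to-device memory copy
# model.to('cuda:0') similarly copies every parameter tensor from CPU to GPU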