I am working on a deep learning project using PyTorch with CUDA on an embedded system. I have observed that the initial inference takes approximately 4 seconds, while subsequent inferences only take about 0.1 seconds. I have tried pre-initializing CUDA with the following commands:
That is expected: the initialization loads the CUDA binary (>600 MB) into memory, which takes time.
This is improved by the lazy-loading feature introduced in CUDA 11.8.
However, CUDA 11.8 is not available for the TX2.
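For reference, on platforms where CUDA 11.8 or later is available, lazy loading can be opted into through the CUDA_MODULE_LOADING environment variable; a minimal sketch (not applicable on the TX2, which does not get CUDA 11.8):

import os

# Opt into lazy loading of CUDA modules (CUDA 11.8+).
# Must be set before CUDA is initialized, so set it before importing torch.
os.environ["CUDA_MODULE_LOADING"] = "LAZY"

import torch

# The first CUDA call now pays a much smaller one-time cost.
x = torch.zeros(1, device="cuda")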
Thanks.
Thank you very much for your previous answers, but I am also experiencing the following performance issues in my project and would appreciate your help and suggestions. Here are the specific details:
1. Long Library Loading Time
Loading the following libraries takes a total of 7.42 seconds:
import cv2
import torch
import numpy as np
from torch.nn import functional as F
from kenexs_dl.deepLearning_semseg.models.UNetv2 import UNetv2
2. Long Model Loading and Initialization Time
In the seg_create step, loading model parameters and initializing the model takes the following times:
state = torch.load(model_path, map_location=device): 8 seconds
model = UNetv2(classes_num): 2.45 seconds
Both steps take a lot of time. Are there specific methods I can use to optimize them?
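For what it is worth, one way to see where the 7.42 seconds go is to time each import individually (or to run the script with python -X importtime); a minimal sketch using a hypothetical timed_import helper:

import time

def timed_import(statement):
    # Hypothetical helper: run one import statement and report how long it took.
    start = time.time()
    exec(statement, globals())
    print(f"{statement}: {time.time() - start:.2f} s")

timed_import("import cv2")
timed_import("import torch")
timed_import("import numpy as np")
timed_import("from torch.nn import functional as F")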
Hi AastaLLL, can you tell us more about the 8 seconds spent in torch.load? What is it doing that takes so long? Can we use multithreading or similar techniques to reduce this time?
The loading call includes some initialization, e.g., importing the CUDA binary into memory.
To check this, you can try adding a simple PyTorch CUDA operation at the beginning.
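For example, a minimal warm-up sketch (assuming it runs before any model loading):

import torch

# Warm-up: force CUDA context creation and library loading up front,
# so later calls such as torch.load(..., map_location='cuda:0') do not pay this cost.
_ = torch.tensor([1., 2.], device='cuda') + torch.tensor([1., 2.], device='cuda')
torch.cuda.synchronize()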
Could you give us more details about "adding a simple PyTorch CUDA operation at the beginning"? Do you mean that if I run a CUDA computation at the start, the time cost will move there instead of to torch.load? So the "CUDA binary" you mentioned is essentially the CUDA environment being prepared (perhaps more precisely, the PyTorch CUDA environment), and that preparation is required before most CUDA code can run?
Thank you for your previous suggestions. I have a follow-up question regarding the time it takes to load the model and perform inference using PyTorch.
I used torch.cuda.init() to initialize CUDA beforehand, but I still see that the time taken for torch.load(model_path, map_location=device) and the inference time (model(img)) hasn’t significantly decreased.
Here is the code I’m using to measure these times:
import torch
import torchvision.models as models
import time
# CUDA initialization
torch.cuda.init()
torch.cuda.set_device(0)
torch.cuda.empty_cache()
print("CUDA initialization complete.")
# Measure model loading time
start_load = time.time()
model = torch.load('model_path.pth', map_location='cuda:0')
end_load = time.time()
print(f"Model loading time: {end_load - start_load:.2f} seconds")
# Set model to evaluation mode
model.eval()
# Measure inference time
start_pred = time.time()
preds = model(img)  # img: a preprocessed input tensor prepared earlier on the same device
end_pred = time.time()
print(f"Inference time: {end_pred - start_pred:.2f} seconds")
Could you tell me:
1. Which specific lines of code are responsible for the model loading process in PyTorch?
2. Do you have any examples or demos that show best practices for separating business logic from model loading and inference code? (See the sketch below for the kind of separation I mean.)
Understanding these questions is very important to me.
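For context on question 2, here is a minimal sketch of the kind of separation I mean, using a hypothetical SegmentationService wrapper (the name and structure are illustrative, not from any existing library):

import torch

class SegmentationService:
    # Hypothetical wrapper: expensive setup happens once in __init__,
    # while predict() is the cheap per-request path.
    def __init__(self, model_path, device='cuda:0'):
        self.device = torch.device(device)
        self.model = torch.load(model_path, map_location=self.device)
        self.model.eval()

    @torch.no_grad()
    def predict(self, img):
        return self.model(img.to(self.device))

# Business logic only touches the service, never the loading details:
# service = SegmentationService('model_path.pth')
# preds = service.predict(img)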
Thank you for your assistance!
We tested ResNet model loading on JetPack 6 + Orin.
The loading time decreased from 0.22 seconds to 0.16 seconds when a toy CUDA sample was run beforehand.
...
# Toy CUDA sample: trigger CUDA initialization before the model load
x = torch.tensor([1., 2.], device=torch.device('cuda'))
y = torch.tensor([1., 2.], device=torch.device('cuda'))
z = x + y
# Measure model loading time
start_load = time.time()
model = torch.load('/home/nvidia/.cache/torch/hub/checkpoints/resnet50-11ad3fa6.pth', map_location='cuda:0')
end_load = time.time()
print(f"Model loading time: {end_load - start_load:.2f} seconds")
Hi AastaLLL,
Thank you for your previous suggestions. I implemented a function seg_create(model_path) that loads a model and includes several timing measurements:
CUDA Test: I perform a basic CUDA operation (tensor([1, 2], device='cuda:0') + tensor([1, 2], device='cuda:0')) to ensure that CUDA is functioning correctly, which outputs tensor([2, 4], device='cuda:0').
torch.load Time: The model file is loaded using torch.load(model_path, map_location=device), which takes approximately 0.44 s.
Total Other Time: After loading the state, additional time is spent on tasks like initializing the UNetv2 model and loading the state dictionary. This part of the process takes about 2.13 s.
seg_create Execution Time: The entire seg_create function takes around 10.08 s to complete.
While the torch.load time is relatively fast, the bulk of the time is spent on other steps within seg_create, particularly initializing the model and setting it up for inference. Is there any way to optimize these other steps to reduce the overall execution time of the function?
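For reference, a minimal sketch of how these steps typically fit together, instrumented per step so the slow part is visible; the internal structure of seg_create shown here is an assumption, since the actual code is not posted:

import time
import torch
from kenexs_dl.deepLearning_semseg.models.UNetv2 import UNetv2

def seg_create(model_path, classes_num, device='cuda:0'):
    # Assumed internal structure, for timing purposes only.
    t0 = time.time()
    state = torch.load(model_path, map_location=device)   # read weights from disk
    t1 = time.time()
    model = UNetv2(classes_num)                            # build the network object
    t2 = time.time()
    model.load_state_dict(state)                           # bind weights to the network
    model.to(device).eval()
    torch.cuda.synchronize()                               # include pending GPU copies
    t3 = time.time()
    print(f"torch.load: {t1 - t0:.2f} s, UNetv2(): {t2 - t1:.2f} s, "
          f"load_state_dict + to(device): {t3 - t2:.2f} s")
    return model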
There has been no update from you for a while, so we assume this is no longer an issue.
Hence, we are closing this topic. If you need further support, please open a new one.
Thanks
Hi,
Could you share the “initialize the model” and “set it up for inference” code for us to check?
Is there any memory-copy operation, such as moving data from the CPU to the GPU?
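For clarity, this is the kind of CPU-to-GPU copy being asked about (shapes and names are illustrative only):

import torch

img_cpu = torch.zeros(1, 3, 512, 512)   # example tensor created in CPU memory
img_gpu = img_cpu.to('cuda:0')           # explicit host-to-device memory copy
# model.to('cuda:0') similarly copies every parameter tensor from CPU to GPU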