Slow CUDA Loading & Initialisation / GPU Warmup issue

I am attempting to get various CUDA/PyTorch based inference tools running on my Jetson Orin Nano, for example the OpenAI Whisper STT models.
What I am finding is that when I first initialise something (i.e. attempt to execute a GPU-based calculation using PyTorch):

  1. It always returns an error related to NaN values in some array.
  2. It takes a long time to do the initialisation and return this error (about 2 minutes).

I have investigated this from several angles but it seems like a very low-level issue that is not arising from my code. I can set up a script that does literally the same calculation twice in a row: the first time it will fail and take >100 seconds, the second time it will take <1 second and succeed. Note that if I then run the script again, the first call takes ~10 seconds (corresponding to the time required to load the model into GPU memory), but this is much less than the 100-second “warmup” that happens the first time.
Broadly there are three possible lengths of time it can take for the same function to run:

  • ~2 minutes (the warm-up run, the first time I run a GPU function after each boot of the system), which returns an error.
  • ~10 seconds (the first time I run my function in a given Python script post warm-up; this delay is expected and I think represents loading the model into GPU memory).
  • <1 second (any subsequent time I run the function in my Python script, as the model is already in GPU memory).

My current solution is to have a dummy “warmup GPU” function at the top of my code; this performs a basic PyTorch calculation on the GPU, takes a few minutes, and raises the error, which I catch before continuing. Subsequently I can use torch with no delay (and no error) until I reboot the entire system (or restart the Docker container). So the GPU warmup appears to persist across different Python codebases (i.e. once one .py file has called the GPU, all subsequent calls work) until I restart my Docker container (or reboot the whole system), at which point I need to do the GPU warmup again.
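
For reference, the warmup is nothing more elaborate than something like this (the function name and the dummy operation are my own and arbitrary; any small CUDA calculation seems to trigger the same first-call delay):

import torch

def warmup_gpu():
    # Any small CUDA calculation triggers the long first-call initialisation.
    # On the first run after a boot this raises the NaN-related error, so catch it and carry on.
    try:
        x = torch.rand(8, 8, device="cuda")
        torch.cuda.synchronize()
        print("GPU warmup result:", (x @ x).sum().item())
    except Exception as err:
        print("GPU warmup raised an error (expected on the first run after boot):", err)

warmup_gpu()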

Does anyone have any idea why this is happening, or how I can mitigate it?

To investigate a bit further I have used cProfile/pstats to look at what function calls are actually slowing things down.
Below are the top time-consuming functions in my code. The first set is from running my code after a fresh boot. The second set is from running identical code immediately afterwards (i.e. the second run after a fresh boot). You can see the offender is load_tensor, which adds ~53 seconds. Notably this is the function that fails on the first run (when I don't catch the error), as it attempts to load a tensor of NaNs and dies; possibly the 53-second delay is how long it takes to fail and recover from that failure. So any method to fix that “first time fail” would fix my problem, or another fix might be to make it fail and recover more quickly (adjust some variety of timeout?).
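
(For reference, the profile was collected along these lines; my_script.py is a placeholder for the actual code:)

$ python3 -m cProfile -o /tmp/profile.out my_script.py

and then, to print the top entries:

import pstats
pstats.Stats("/tmp/profile.out").sort_stats("tottime").print_stats(6)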

First time code runs:
ncalls tottime percall cumtime percall filename:lineno(function)
245 57.871 0.236 57.871 0.236 {method 'copy_' of 'torch._C.StorageBase' objects}
4425 4.029 0.001 62.180 0.014 /usr/local/lib/python3.8/dist-packages/torch/serialization.py:1095(load_tensor)
6 3.882 0.647 3.882 0.647 {built-in method torch.conv1d}
871 1.587 0.002 1.587 0.002 {method 'uniform_' of 'torch._C._TensorBase' objects}
24045 1.532 0.000 1.532 0.000 {method 'to' of 'torch._C._TensorBase' objects}
225 1.187 0.005 1.187 0.005 /usr/local/lib/python3.8/dist-packages/whisper/decoding.py:430(apply)

Second time identical code runs:
ncalls tottime percall cumtime percall filename:lineno(function)
4425 4.529 0.001 6.903 0.002 /usr/local/lib/python3.8/dist-packages/torch/serialization.py:1095(load_tensor)
6 3.926 0.654 3.926 0.654 {built-in method torch.conv1d}
245 2.094 0.009 2.094 0.009 {method 'copy_' of 'torch._C.StorageBase' objects}
871 1.626 0.002 1.626 0.002 {method 'uniform_' of 'torch._C._TensorBase' objects}
24045 1.609 0.000 1.609 0.000 {method 'to' of 'torch._C._TensorBase' objects}
225 1.231 0.005 1.231 0.005 /usr/local/lib/python3.8/dist-packages/whisper/decoding.py:430(apply)

Hi,

Which PyTorch package and CUDA version are you using?

A possible cause is that the package wasn’t built for the Orin GPU architecture (sm_87).
This triggers JIT compilation on the first function call, which can take some time to finish.
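
One quick way to check is to print the architectures that the installed wheel was built for, for example:

import torch

print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_arch_list())  # 'sm_87' should appear here if no JIT compilation is needed on Orin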

Thanks.

Currently running CUDA 11.4 and torch 2.0.0a0+ec3941ad.nv23.02.
Is there an effective fix to avoid this slow compilation?

For example, some mechanism to compile and prep my Docker container during the build so it doesn’t need to be done every time we re-run the container.
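
In case it is JIT compilation, what I have in mind is something like the following: point the CUDA JIT cache at a persistent location and run the warmup once, so later container runs reuse the compiled kernels (warmup_gpu.py stands for the warmup snippet above, the cache path is a placeholder, and I have not verified whether this can be done during a docker build, which normally has no GPU access):

$ export CUDA_CACHE_PATH=/persistent/volume/cuda_cache
$ export CUDA_CACHE_MAXSIZE=2147483648
$ python3 warmup_gpu.py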

Here is a minimal working example that replicates it; note the torch.load line takes ~60 seconds the first time this code is run, and <1 second on the second run.


import torch
import torch.nn as nn

# Define and save a trivial model so there is a checkpoint to load
model = nn.Linear(10, 10)
model_path = '/tmp/checkpoint.pth'
torch.save(model.state_dict(), model_path)

device = torch.device("cuda")

# This load takes ~60 seconds the first time after boot, <1 second afterwards
with open(model_path, "rb") as fp:
    checkpoint = torch.load(fp, map_location=device)

@AastaLLL any ideas?

Hi,

Sorry for the late update.

Could you analyze the application with a profiler first (e.g. Nsight Systems)?
Our prebuilt package is built with the Orin GPU architecture, so JIT compilation is not expected.

Another possible cause of the initial latency is loading the library.
Getting a profiling output can help us narrow down the cause.
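
For example, a command along these lines will capture the first-run timeline (my_script.py is a placeholder):

$ nsys profile --trace=cuda,osrt,nvtx -o first_run_report python3 my_script.py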

Also, have you tried maximizing the device’s performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.
