Slow CUDA Loading & Initialisation / GPU Warmup issue

I am attempting to get various CUDA/PyTorch based inference tools running on my Jetson Orin Nano, for example the OpenAI Whisper STT models.
What I am finding is that when I first initialise something (i.e. attempt to execute a GPU-based calculation using PyTorch):

  1. It always returns an error related to NaN values in some array.
  2. It takes a long time to do the initialisation and return this error (about 2 minutes).

I have investigated this from several angles but it seems like a very low-level issue that is not arising from my code. I can set up a script that does literally the same calculation twice in a row: the first time it will fail and take >100 seconds, the second time it will take <1 second and succeed. Note that if I then run the script again, the first call takes ~10 seconds (corresponding to the time required to load the model into GPU memory), but this is much less than the 100-second “warmup” that happens the first time.
Broadly there are three possible lengths of time it can take for the same function to run:

  • ~2 minutes (the warm-up run, the first time I run a GPU function after each boot of the system), which returns an error.
  • ~10 seconds (the first time I run my function in a given Python script post warm-up; this delay is expected and I think represents loading the model into GPU memory).
  • <1 second (any subsequent time I run the function in my Python script, as the model is already in GPU memory).

My current solution is to have a dummy “warmup GPU” function at the top of my code; this performs a basic PyTorch calculation on the GPU, takes a few minutes, and raises the error, which I catch before continuing. Subsequently I can use torch with no delay (and no error) until I reboot the entire system (or restart the Docker container). So the GPU warmup appears to persist across different Python codebases (i.e. once one .py file has called the GPU, all subsequent calls work) until I restart my Docker container (or reboot the whole system), at which point I need to do the GPU warmup again.
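
For reference, the warmup is nothing more elaborate than something like this (the function name and the dummy operation are my own and arbitrary; any small CUDA calculation seems to trigger the same first-call delay):

import torch

def warmup_gpu():
    # Any small CUDA calculation triggers the long first-call initialisation.
    # On the first run after a boot this raises the NaN-related error, so catch it and carry on.
    try:
        x = torch.rand(8, 8, device="cuda")
        torch.cuda.synchronize()
        print("GPU warmup result:", (x @ x).sum().item())
    except Exception as err:
        print("GPU warmup raised an error (expected on the first run after boot):", err)

warmup_gpu()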

Does anyone have any idea why this is happening, or how I can mitigate it?

To investigate a bit further I have used cProfile/pstats to look at what function calls are actually slowing things down.
Below are the top time-consuming functions in my code. The first set is from running my code after a fresh boot. The second set is from running identical code immediately afterwards (i.e. the second run after a fresh boot). You can see the offender is load_tensor, which adds ~53 seconds. Notably this is the function that fails on the first run (when I don't catch the error), as it attempts to load a tensor of NaNs and dies; possibly the 53-second delay is how long it takes to fail and recover from that failure. So any method to fix that “first time fail” would fix my problem, or another fix might be to make it fail and recover more quickly (adjust some variety of timeout?).
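
(For reference, the profile was collected along these lines; my_script.py is a placeholder for the actual code:)

$ python3 -m cProfile -o /tmp/profile.out my_script.py

and then, to print the top entries:

import pstats
pstats.Stats("/tmp/profile.out").sort_stats("tottime").print_stats(6)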

First time code runs:
ncalls tottime percall cumtime percall filename:lineno(function)
245 57.871 0.236 57.871 0.236 {method 'copy_' of 'torch._C.StorageBase' objects}
4425 4.029 0.001 62.180 0.014 /usr/local/lib/python3.8/dist-packages/torch/serialization.py:1095(load_tensor)
6 3.882 0.647 3.882 0.647 {built-in method torch.conv1d}
871 1.587 0.002 1.587 0.002 {method 'uniform_' of 'torch._C._TensorBase' objects}
24045 1.532 0.000 1.532 0.000 {method 'to' of 'torch._C._TensorBase' objects}
225 1.187 0.005 1.187 0.005 /usr/local/lib/python3.8/dist-packages/whisper/decoding.py:430(apply)

Second time identical code runs:
ncalls tottime percall cumtime percall filename:lineno(function)
4425 4.529 0.001 6.903 0.002 /usr/local/lib/python3.8/dist-packages/torch/serialization.py:1095(load_tensor)
6 3.926 0.654 3.926 0.654 {built-in method torch.conv1d}
245 2.094 0.009 2.094 0.009 {method 'copy_' of 'torch._C.StorageBase' objects}
871 1.626 0.002 1.626 0.002 {method 'uniform_' of 'torch._C._TensorBase' objects}
24045 1.609 0.000 1.609 0.000 {method 'to' of 'torch._C._TensorBase' objects}
225 1.231 0.005 1.231 0.005 /usr/local/lib/python3.8/dist-packages/whisper/decoding.py:430(apply)

Hi,

Which PyTorch package and CUDA version are you using?

A possible cause is that the package wasn’t built for the Orin GPU architecture (sm_87).
This triggers JIT compilation on the first function call, which can take some time to finish.
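
One quick way to check is to print the architectures that the installed wheel was built for, for example:

import torch

print(torch.__version__, torch.version.cuda)
print(torch.cuda.get_arch_list())  # 'sm_87' should appear here if no JIT compilation is needed on Orin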

Thanks.

Currently running CUDA 11.4 and torch 2.0.0a0+ec3941ad.nv23.02.
Is there an effective fix to avoid this slow compilation?

For example, some mechanism to compile and prep my Docker container during the build so it doesn’t need to be done every time we re-run the container.
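
In case it is JIT compilation, what I have in mind is something like the following: point the CUDA JIT cache at a persistent location and run the warmup once, so later container runs reuse the compiled kernels (warmup_gpu.py stands for the warmup snippet above, the cache path is a placeholder, and I have not verified whether this can be done during a docker build, which normally has no GPU access):

$ export CUDA_CACHE_PATH=/persistent/volume/cuda_cache
$ export CUDA_CACHE_MAXSIZE=2147483648
$ python3 warmup_gpu.py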

Here is a minimal working example that replicates it; note the torch.load line takes ~60 seconds the first time this code is run, and <1 second on the second run.


import torch
import torch.nn as nn

# Define and save a trivial model so there is a checkpoint to load
model = nn.Linear(10, 10)
model_path = '/tmp/checkpoint.pth'
torch.save(model.state_dict(), model_path)

device = torch.device("cuda")

# This load takes ~60 seconds the first time after boot, <1 second afterwards
with open(model_path, "rb") as fp:
    checkpoint = torch.load(fp, map_location=device)

@AastaLLL any ideas?

Hi,

Sorry for the late update.

Could you analyze the application with a profiler first (e.g. Nsight Systems)?
Our prebuilt package is built with the Orin GPU architecture, so JIT compilation is not expected.

Another possible cause of the initial latency is loading the library.
Getting a profiling output can help us narrow down the cause.
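
For example, a command along these lines will capture the first-run timeline (my_script.py is a placeholder):

$ nsys profile --trace=cuda,osrt,nvtx -o first_run_report python3 my_script.py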

Also, have you tried maximizing the device’s performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.
