Jetson Nano slow CUDA times with PyTorch

I have a trained PyTorch model saved as a PyTorch PT file (torch.jit.compile). This model takes about 3 seconds to load from disk, then 2:30 minutes to move to the GPU. I believe this is because of some CUDA JIT compiling. I only plan to run this model on this machine; is there a way to precompile anything that might need compiling? Also, if I load a model to the GPU and then load another model and move it to the GPU, the second one takes ~2 seconds, so it's not spending 2 minutes allocating memory.


Could you run your app with the setting below and share the output with us?

$ CUDA_DISABLE_PTX_JIT=1 [command]
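
For reference, the same setting can also be applied from a small Python launcher. This is only a sketch: `run_with_ptx_jit_disabled` is a hypothetical helper name, and the only thing taken from the command above is the `CUDA_DISABLE_PTX_JIT=1` environment variable, which has to be set before the target process initializes CUDA. The child command below just echoes the variable and stands in for the real script.

```python
import os
import subprocess
import sys


def run_with_ptx_jit_disabled(command):
    """Run `command` (a list of arguments) with CUDA_DISABLE_PTX_JIT=1 set.

    Hypothetical helper: it copies the current environment, adds the
    variable, and starts a child process so the setting is in place
    before that process touches CUDA.
    """
    env = os.environ.copy()
    env["CUDA_DISABLE_PTX_JIT"] = "1"
    return subprocess.run(command, env=env, capture_output=True, text=True)


# Example: the child only echoes the variable, standing in for the real script.
result = run_with_ptx_jit_disabled(
    [sys.executable, "-c", "import os; print(os.environ['CUDA_DISABLE_PTX_JIT'])"]
)
print(result.stdout.strip())
```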




Using this command did reduce the model load time, but only by ~20 seconds. It still took 70 seconds to load the model and then move it to the GPU, and the same thing happens when I generate a random tensor and move it to the GPU. I even froze my model with PyTorch and saved it while it was loaded on the GPU.


The flag turns off JIT compilation, so the remaining time must be spent in other procedures.
Have you maximized the device performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks


I already had both of these enabled. For what it’s worth, I ran nvprof, and it showed that 99% of the time spent creating a random tensor and moving it with .cuda() (which took ~70 seconds) was spent in cudaMalloc. This makes me suspect that the problem is somewhere in the lazy initialization of CUDA. Is there anything I can do about this? For instance, I’ve read about driver persistence with CUDA, but I also read that the Jetson doesn’t support nvidia-smi.


CUDA started lazy loading support in 11.8.
So on the Nano, which uses CUDA 10.2, importing PyTorch has to load the large CUDA/cuDNN libraries, which can take time.


But should it really take 70 seconds to move a tensor to CUDA? Some people on this forum are complaining about 2 seconds. I really think there must be something I’m missing. If I haven’t clarified yet, I’m on JetPack 4.6.4 (L4T 32.7.4); I could only find wheels for 4.6.1 and 4.6.

@jinxedgrimyt does it take that long every time you perform a CUDA operation in torch, or just the first time? In my experience, the first time you perform an operation on the GPU in an app with PyTorch, it takes longer than the rest, because it’s loading the huge number of kernels that PyTorch has. It’s not actually spending that time copying memory, etc. What @AastaLLL is referring to is that the lazy loading feature should enable PyTorch to load only the kernels it needs, on demand at runtime, rather than loading every one at startup (most of which probably never get used).

Also the first time you run a model, it will be slower than the rest. For this reason it’s recommended to do a “warmup” iteration before benchmarking.
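
To see how much of the cost is one-time versus steady-state, the first call can be timed separately from later ones. The helper below is a minimal sketch, not code from this thread: `time_first_and_steady` and the stand-in `fake_inference` workload are hypothetical names, and the sleep only imitates a one-time initialization so the example runs without a GPU.

```python
import time


def time_first_and_steady(fn, warmup=1, iters=5, sync=None):
    """Return (first_call_seconds, mean_steady_seconds) for `fn`.

    `sync` is an optional callable run after `fn` and before the clock
    is stopped (for GPU work, something like torch.cuda.synchronize),
    so that queued asynchronous work is included in the measurement.
    """
    def timed_call():
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        return time.perf_counter() - start

    first = timed_call()                 # includes any one-time initialization
    for _ in range(max(warmup - 1, 0)):
        timed_call()                     # extra warmup runs, discarded
    steady = [timed_call() for _ in range(iters)]
    return first, sum(steady) / len(steady)


# Stand-in workload: slow on the first call only, like a lazy GPU init.
_state = {"ready": False}


def fake_inference():
    if not _state["ready"]:
        time.sleep(0.05)                 # pretend one-time setup
        _state["ready"] = True


first, steady = time_first_and_steady(fake_inference)
print(f"first call: {first:.3f}s, steady state: {steady:.5f}s")
```

With PyTorch, the same helper could be called as `time_first_and_steady(lambda: model(x), sync=torch.cuda.synchronize)`, assuming `model` and `x` are already on the GPU.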

I am aware of warmup runs, and I’m aware that first calls often take longer. Upon invocation of a Python script it takes ~30 - 70 seconds to load the model (it was taking ~150 before!). After this initial load it does take only fractions of a second to load another similarly sized model. At this point the warmup inferences are starting to become a problem, as they take longer than loading the model.

Here is my current code for testing times:

import os
import time
import torch
import torch.nn as nn                              #Torch neural network
from PIL import Image as PIM         #For image processing

Dir = os.path.dirname(os.path.abspath(__file__))
TrainedPT = Dir + '/'
Model = None

Dev = torch.device('cuda:0')

def WarmupModel():
    global Model
    global Dev
    print('Warming up model for inferences')

    DummyTensors = [torch.randn(1, 3, 224, 224).pin_memory().to(Dev) for _ in range(3)]  # Generate on CPU, pin, then copy to GPU

    with torch.no_grad():
        for Tensor in DummyTensors:
            Model(Tensor) #Run inference

def LoadModel(FileName, CudaInit = False):
    global Model

    TotalStart = time.time()

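    if CudaInit:
        CudaInitStartTime = time.time()
        print('Initializing Cuda... ', end='')
        torch.cuda.init()  # Assumed init call; the original post does not show what ran here
        CudaInitEndTime = time.time()
        print(' Finished')
        print('Took - ', (CudaInitEndTime - CudaInitStartTime), 's to initialize cuda')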

    print('Loading Model... ', end='')

    ModelLoadStart = time.time()
    Model = torch.jit.load(FileName, map_location=Dev) #Load pre-compiled model
    ModelLoadEnd = time.time()
    TotalEnd = time.time()


    print(f'Took - {ModelLoadEnd - ModelLoadStart}s to load model from disk')

    print('Took - ', (TotalEnd - TotalStart), 's to completely load the model\n')

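    WarmupTime = time.time()
    WarmupModel()  # Run the dummy inferences that are being timed
    torch.cuda.synchronize()  # Wait for queued GPU work before stopping the clock
    WarmupTimeEnd = time.time()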

    print(f'Took - {WarmupTimeEnd - WarmupTime}s to warmup model')
    print(f'Took - {(WarmupTimeEnd - WarmupTime) + (TotalEnd - TotalStart)}s to finish all tasks')

LoadModel(TrainedPT, True)

And the output:

Loading Model... Loaded
Took -  0.0735316276550293 s to initialize cuda
Took -  31.957139015197754 s to completely load the model

Warming up model for inferences
Took - 97.31978988647461s to warmup model
Took - 129.27692890167236s to finish all tasks

At this point the model load time is becoming more acceptable to me, but the inference warmup is still rather brutal. Also, @dusty_nv, when attempting to install the PyTorch Docker container I get the error “Repository name must be lower case”. What could be the problem?

What is the container image and command you are trying to run? Docker image (repository) names must be all lowercase.

Regarding the load/warmup times, as models continue getting bigger it’s not uncommon to have the models reside in a server process separate from the client. That way you can iterate on the development of your client faster, and serve multiple clients. You could write your own simple one, or use something like Triton Inference Server.
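
As a rough illustration of the "write your own simple one" option, here is a minimal sketch using only the Python standard library. Everything in it is hypothetical: `load_model` is a stand-in for the slow load and warmup step, the "model" just doubles its inputs, and the server runs in a background thread purely so the example is self-contained. In practice it would be a separate long-lived process.

```python
import threading
from multiprocessing.connection import Client, Listener


def load_model():
    """Stand-in for the slow load + warmup; returns a callable 'model'."""
    return lambda values: [v * 2 for v in values]


def serve(listener, max_requests):
    """Load the model once, then answer `max_requests` client requests."""
    model = load_model()                 # paid once, at server start
    for _ in range(max_requests):
        with listener.accept() as conn:  # one short-lived client per request
            conn.send(model(conn.recv()))


def ask(address, values, authkey):
    """Client side: connect, send inputs, return the server's answer."""
    with Client(address, authkey=authkey) as conn:
        conn.send(values)
        return conn.recv()


# Demo in one process: the server runs in a background thread here,
# but it would normally be a separate long-lived process.
AUTHKEY = b"demo-key"                    # hypothetical shared secret
listener = Listener(("localhost", 0), authkey=AUTHKEY)  # port 0: pick a free port
server = threading.Thread(target=serve, args=(listener, 2), daemon=True)
server.start()

reply_a = ask(listener.address, [1, 2, 3], AUTHKEY)
reply_b = ask(listener.address, [10], AUTHKEY)
server.join(timeout=5)
listener.close()
print(reply_a, reply_b)
```

The point of the design is that the load and warmup cost is paid once when the server starts, while each client connection only pays for sending its inputs and receiving the result.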

I was able to get the Docker image working, but noticed the performance was exactly the same. At this point I just want to shave some time off the inference warmup. Is there anything I can do for that? I kind of already use a ‘server’ process for my model, which starts at system boot, but we want to use this for demo purposes and it isn’t ideal to have to wait over 2 minutes for the model to load and warm up.

@jinxedgrimyt that does sound beyond the typical amount of time that I’m used to as well, but then again I don’t use torch.jit and suspect that may have something to do with it. Is it possible for you to just run your model with torch2trt instead? If it works with your model, it’s pretty drop-in with PyTorch APIs. It may take a while to generate the TensorRT engine initially, but you can save that and it will load quickly thereafter.

I also wonder how much of this is disk-I/O bound, and if you are using a slower SD card?

Sorry for the late reply. torch2trt did not give any performance gains either. I do believe it could be I/O bound, as it does seem to go over the 2 GB of memory and into swap during model loading. Regardless, it still takes 30 seconds to load a random 224x224x3 tensor every time I open my test script. This still seems too long.
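
One way to check the swap suspicion is to compare the swap fields in /proc/meminfo before and after loading the model. The snippet below is a sketch with hypothetical helper names (`parse_meminfo`, `swap_used_kb`), and the sample text is illustrative only, not measured on a Jetson.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field: kilobytes}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def swap_used_kb(meminfo):
    """Swap currently in use, from the SwapTotal and SwapFree fields."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


# Illustrative sample, not measured on a Jetson.
sample = """MemTotal:        2000000 kB
MemAvailable:     150000 kB
SwapTotal:       1000000 kB
SwapFree:         400000 kB
"""
info = parse_meminfo(sample)
print(f"swap in use: {swap_used_kb(info)} kB")
```

On the device, `parse_meminfo(open("/proc/meminfo").read())` could be called once before and once after the model load; a large jump in swap use would support the I/O-bound explanation.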

Are you on Jetson Nano 2GB? IIRC it loads >1GB of PyTorch CUDA kernels the first time you perform any operation on the GPU (typically that’s allocating a GPU tensor).
