Jetson nano slow cuda times with pytorch

jinxedgrimyt · September 11, 2023, 5:28pm

I have a trained PyTorch model saved as a PyTorch .pt file (torch.jit.compile; torch.save). The model takes about 3 seconds to load from disk, then 2 minutes 30 seconds to move to the GPU. I believe this is because of some CUDA JIT compiling. I only plan to run this model on this machine. Is there a way to precompile anything that might need to be compiled? Also, if I load a model to the GPU and then load another model and move it to the GPU, the second one takes ~2 seconds, so it’s not spending 2 minutes allocating memory.

AastaLLL · September 12, 2023, 3:16am

Hi,

Could you run your app with the configuration below and share the output with us?

$ CUDA_DISABLE_PTX_JIT=1 [command]

Ex.

$ CUDA_DISABLE_PTX_JIT=1 python3 test.py

Thanks.
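If the script is started from another Python program rather than from a shell, the same variable can be passed through the child process environment. This is a minimal sketch, not part of the original reply; `run_without_ptx_jit` is a hypothetical helper name, and the variable has to be set before the CUDA runtime initializes in the child.

```python
import os
import subprocess
import sys


def run_without_ptx_jit(script_args):
    """Run a Python child process with CUDA_DISABLE_PTX_JIT=1 in its environment.

    script_args is the argument list passed to the interpreter,
    for example ["test.py"] or ["-c", "print('hi')"].
    """
    # Copy the current environment and add the flag for the child only.
    env = dict(os.environ, CUDA_DISABLE_PTX_JIT="1")
    return subprocess.run(
        [sys.executable, *script_args],
        env=env,
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Stand-in for "python3 test.py": the child just echoes the variable back.
    result = run_without_ptx_jit(
        ["-c", "import os; print(os.environ['CUDA_DISABLE_PTX_JIT'])"]
    )
    print(result.stdout.strip())
```

Setting the variable only for the child keeps the parent process and any other CUDA programs it launches unaffected.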

jinxedgrimyt · September 12, 2023, 5:18pm

Using this command did reduce the model load time, but only by ~20 seconds. It still took 70 seconds to load the model and move it to the GPU, and the same thing happens when I generate a random tensor and move it to the GPU. I even froze my model with PyTorch and saved it while it was loaded on the GPU.

AastaLLL · September 13, 2023, 5:27am

Hi,

The flag turns off JIT compilation, so the remaining time must be spent in other procedures.
Have you maximized the device performance?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

Thanks.

jinxedgrimyt · September 13, 2023, 5:35pm

I already had both of these enabled. For what it’s worth, I ran nvprof and it showed that 99% of the time spent moving a random tensor to .cuda (which took ~70 seconds) was spent in cudaMalloc. This makes me suspect that the problem is somewhere in the lazy initialization of CUDA. Is there anything I can do about this? For instance, I’ve read about driver persistence with CUDA, but I also read that the Jetson doesn’t support nvidia-smi.

AastaLLL · September 18, 2023, 8:00am

Hi

CUDA started supporting lazy loading in 11.8.
The Nano is on CUDA 10.2, so importing PyTorch has to load the large CUDA/cuDNN libraries up front, which can take time.

Thanks.

jinxedgrimyt · September 18, 2023, 5:51pm

But should it really take 70 seconds to move a tensor to CUDA? Some people on this forum are complaining about 2 seconds. I really think there must be something I’m missing. In case I haven’t made it clear yet, I’m on JetPack 4.6.4 (L4T 32.7.4), and I could only find wheels for 4.6.1 and 4.6.

dusty_nv · September 18, 2023, 6:40pm

@jinxedgrimyt does it take that long every time you perform a CUDA operation in torch, or just the first time? In my experience, the first time you perform an operation on GPU in an app with PyTorch, it takes longer than the rest, because it’s loading the huge number of kernels that PyTorch has. It’s not actually spending that time copying memory, etc. What @AastaLLL is referring to is that the lazy loading feature should enable PyTorch to load only the kernels it needs on demand at runtime, rather than loading every one at startup (most of which probably never get used).

Also the first time you run a model, it will be slower than the rest. For this reason it’s recommended to do a “warmup” iteration before benchmarking.
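One way to check whether the delay is a one-time cost is to time the first GPU operation separately from the ones that follow. The sketch below assumes PyTorch with CUDA support is installed; `time_calls` and `gpu_roundtrip` are hypothetical helper names, not code from this thread.

```python
import time


def time_calls(fn, repeats=3):
    """Call fn() `repeats` times and return the duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def gpu_roundtrip():
    """Move a small random tensor to the GPU and wait for the work to finish."""
    import torch  # imported here so time_calls stays usable without PyTorch

    torch.randn(1, 3, 224, 224).to("cuda:0")
    torch.cuda.synchronize()  # GPU work is asynchronous; wait before stopping the clock


if __name__ == "__main__":
    try:
        first, *rest = time_calls(gpu_roundtrip, repeats=3)
        print(f"first call: {first:.3f}s")
        print(f"later calls: {[round(d, 3) for d in rest]}")
    except (ImportError, RuntimeError, AssertionError) as exc:
        # No PyTorch or no usable CUDA device on this machine.
        print(f"GPU timing skipped: {exc}")
```

If only the first duration is large, the cost is initialization (context creation and kernel loading) rather than the copy itself.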

jinxedgrimyt · September 18, 2023, 8:12pm

I am aware of warmup runs, and I’m aware that first calls often take longer. Upon invocation of a Python script it takes ~30 to 70 seconds to load the model (it was taking ~150 before!). After this initial load, it takes only fractions of a second to load another similarly sized model. At this point the warmup inferences are starting to become a problem, as they take longer than loading the model.

here is my current code for testing times:

import os
import time
import torch
import torch.nn as nn                              #Torch neural network
from PIL import Image as PIM         #For image processing

Dir = os.path.dirname(os.path.abspath(__file__))
TrainedPT = Dir + '/TrainedModelFrozen.pt'
Model = None

Dev = torch.device('cuda:0')

def WarmupModel():
    global Model
    global Dev
    print('Warming up model for inferences')

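    # Generated on the CPU, pinned, then copied to the GPU
    DummyTensors = [torch.randn(1, 3, 224, 224).pin_memory().to(Dev) for _ in range(3)]

    with torch.no_grad():
        for Tensor in DummyTensors:
            Model(Tensor) #Run inference
    torch.cuda.synchronize() #Wait for queued GPU work so the timing is accurate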


def LoadModel(FileName, CudaInit = False):
    global Model

    TotalStart = time.time()

    if CudaInit:
        CudaInitStartTime = time.time()
        print('Initializing Cuda... ', end='')
        torch.cuda.init()
        CudaInitEndTime = time.time()
        print(' Finished')
        print('Took - ', (CudaInitEndTime - CudaInitStartTime), 's to initialize cuda')

    print('Loading Model... ', end='')

    ModelLoadStart = time.time()
    Model = torch.jit.load(FileName, map_location=Dev) #Load pre-compiled model
    ModelLoadEnd = time.time()
    Model.eval()
    TotalEnd = time.time()

    print("Loaded")

    print(f'Took - {ModelLoadEnd - ModelLoadStart}s to load model from disk')

    print('Took - ', (TotalEnd - TotalStart), 's to completely load the model\n')

    WarmupTime = time.time()
    WarmupModel()
    WarmupTimeEnd = time.time()

    print(f'Took - {WarmupTimeEnd - WarmupTime}s to warmup model')
    print(f'Took - {(WarmupTimeEnd - WarmupTime) + (TotalEnd - TotalStart)}s to finish all tasks')

LoadModel(TrainedPT, True)

and the output:

Loading Model... Loaded
Took -  0.0735316276550293 s to initialize cuda
Took -  31.957139015197754 s to completely load the model

Warming up model for inferences
Took - 97.31978988647461s to warmup model
Took - 129.27692890167236s to finish all tasks

At this point the model load time is becoming more acceptable to me, but the inference warmup is still rather brutal. Also, @dusty_nv, when attempting to install the PyTorch Docker container I get the error “Repository name must be lower case”. What could be the problem?

dusty_nv · September 18, 2023, 9:12pm

What is the container image and command you are trying to run? Docker image (repository) names must be all lowercase.

Regarding the load/warmup times, as models continue getting bigger it’s not uncommon to have the models reside in a server process separate from the client. You can iterate on the development of your client faster, and serve multiple clients. You could write your own simple one, or use something like Triton Inference Server.
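A rough sketch of the "write your own simple one" option, using only the Python standard library. It is an illustration under stated assumptions, not a production design: `serve`, `query`, and `stop` are hypothetical names, and `load_model` stands in for whatever slow load and warmup step the real application performs.

```python
import threading
from multiprocessing.connection import Client, Listener


def serve(listener, load_model):
    """Load the model once, then answer requests until a None sentinel arrives."""
    model = load_model()  # the slow load/warmup happens once per server lifetime
    while True:
        with listener.accept() as conn:
            request = conn.recv()
            if request is None:  # shutdown signal from stop()
                break
            conn.send(model(request))


def query(address, request, authkey):
    """Send one request to the server and return its reply."""
    with Client(address, authkey=authkey) as conn:
        conn.send(request)
        return conn.recv()


def stop(address, authkey):
    """Ask the server loop to exit."""
    with Client(address, authkey=authkey) as conn:
        conn.send(None)


if __name__ == "__main__":
    key = b"demo-key"  # placeholder secret for this example only
    listener = Listener(("localhost", 0), authkey=key)  # port 0 picks a free port

    # Placeholder "model": doubles its input instead of running inference.
    server = threading.Thread(target=serve, args=(listener, lambda: (lambda x: x * 2)))
    server.start()

    print(query(listener.address, 21, key))
    stop(listener.address, key)
    server.join()
    listener.close()
```

In a real deployment the server would run as its own long-lived process, so restarting the client never pays the load and warmup cost again.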

jinxedgrimyt · September 18, 2023, 9:33pm

I was able to get the Docker image working, but noticed the performance was exactly the same. At this point I just want to shave some time off the inference warmup. Is there anything I can do for that? I kind of already use a ‘server’ process for my model, which starts at system boot, but we want to use this for demo purposes and it isn’t ideal to have to wait over 2 minutes for the model to load and warm up.

dusty_nv · September 19, 2023, 1:44pm

@jinxedgrimyt that does sound beyond the typical amount of time that I’m used to as well, but then again I don’t use torch.jit and suspect that may have something to do with it. Is it possible for you to just run your model with torch2trt instead? If it works with your model, it’s pretty drop-in with PyTorch APIs. It may take a while to generate the TensorRT engine initially, but you can save that and it will load quickly thereafter.

I also wonder how much of this is disk-I/O bound, and if you are using a slower SD card?

jinxedgrimyt · September 26, 2023, 5:42pm

Sorry for the late reply. torch2trt did not give any performance gains either. I do believe it could be I/O bound, as it does seem to go over the 2 GB of memory and into swap during model loading. Regardless, it still takes 30 seconds to load a random 224x224x3 tensor every time I open my test script. This still seems too long.
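To confirm whether loading spills into swap, the swap usage can be sampled from /proc/meminfo before and after the load. This is a minimal sketch assuming a Linux system; `parse_meminfo` and `swap_used_kib` are hypothetical helper names.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Fields that carry a unit are reported by the kernel in kB.
    """
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[key.strip()] = int(parts[0])
    return info


def swap_used_kib(info):
    """Return the amount of swap currently in use."""
    return info["SwapTotal"] - info["SwapFree"]


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as f:
            current = parse_meminfo(f.read())
        print(f"swap in use: {swap_used_kib(current)} kB")
    except OSError as exc:
        # /proc/meminfo is only available on Linux.
        print(f"could not read /proc/meminfo: {exc}")
```

Comparing the value right after interpreter start with the value after the first GPU operation shows how much of the initialization was pushed out to swap.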

dusty_nv · September 27, 2023, 12:56am

Are you on Jetson Nano 2GB? IIRC it loads >1GB of PyTorch CUDA kernels the first time you perform any operation on GPU (typically that’s allocating a GPU tensor).

system · October 11, 2023, 12:56am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
