Running PyTorch CUDA

import torch
from torch import nn
import torchvision

import time

torch.cuda.empty_cache()

DEVICE = torch.device("cuda")
print(DEVICE, torch.__version__, sep=" | ")

model = torchvision.models.resnet18(pretrained=True).to(DEVICE)
model.eval()

inp = torch.rand(1, 3, 224, 224).to(DEVICE)

start = time.time()
out = model(inp)
stop = time.time() - start

print(out.shape, stop, sep=" ")

I tried running this code on both CPU and GPU. On the CPU it takes about 2 seconds, but on the GPU it is much slower (about 12 seconds). Do you know why it is so slow on the GPU, or do you have a solution?
P.S. I use a Jetson Nano Developer Kit 4GB with JetPack 4.6.1.

Hi @vovinsa, after starting a PyTorch program, the first time you allocate/transfer a PyTorch tensor to the GPU or run a model on the GPU, it will take extra time to initialize CUDA and load all the shared libraries like cuDNN/cuBLAS/etc.

When benchmarking it’s recommended to conduct multiple runs and to ignore the first timing iteration. You can also run the jetson_clocks script beforehand to disable dynamic frequency scaling and to stabilize the timing.
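The advice above (several runs, first iteration ignored) can be sketched as a small timing helper. This is only an illustration: the name `benchmark` and its `warmup`, `iters`, and `sync` parameters are hypothetical and not part of PyTorch. The optional `sync` hook is an addition not discussed in this thread; it exists because CUDA kernel launches are asynchronous, so a timer can otherwise stop before the GPU has finished.

```python
import statistics
import time


def benchmark(fn, warmup=1, iters=20, sync=None):
    """Time fn() over several runs, discarding warm-up calls.

    sync is an optional zero-argument callable (hypothetical hook) that is
    invoked before each clock read, e.g. torch.cuda.synchronize for GPU work.
    """
    def wait():
        if sync is not None:
            sync()

    # Warm-up calls absorb one-time costs such as CUDA initialization
    # and shared-library loading; their timings are not recorded.
    for _ in range(warmup):
        fn()
    wait()

    timings = []
    for _ in range(iters):
        wait()
        start = time.perf_counter()
        fn()
        wait()
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    # Stand-in CPU workload so the sketch runs without a GPU.
    times = benchmark(lambda: sum(range(10_000)), warmup=2, iters=10)
    print(f"median {statistics.median(times):.6f}s over {len(times)} runs")
```

With the script from this thread, the call would presumably look like `benchmark(lambda: model(inp), sync=torch.cuda.synchronize)`. Reporting the median rather than the minimum keeps a single lucky run from dominating the result.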

I did several runs on the GPU: the best time was 9 seconds, versus 1.5 seconds on the CPU. What is jetson_clocks?

After running sudo jetson_clocks, the time on the GPU is 6-7 seconds, but I think that is still too long.

When I run this slightly modified version of your script on a Nano using the l4t-pytorch:r32.5.0-pth1.7-py3 container:

import torch
from torch import nn
import torchvision

import time

torch.cuda.empty_cache()

DEVICE = torch.device('cuda')
print(DEVICE, torch.__version__, sep=" | ")

model = torchvision.models.resnet18(pretrained=True).to(DEVICE)
model.eval()

inp = torch.rand(1, 3, 224, 224).to(DEVICE)

for i in range(50):
    start = time.time()
    out = model(inp)
    stop = time.time() - start
    print(out.shape, stop, sep=" ")

this is what I get:

cuda | 1.7.0
torch.Size([1, 1000]) 6.532195329666138
torch.Size([1, 1000]) 0.04456067085266113
torch.Size([1, 1000]) 0.028676748275756836
torch.Size([1, 1000]) 0.032257795333862305
torch.Size([1, 1000]) 0.03734779357910156
torch.Size([1, 1000]) 0.021605968475341797
torch.Size([1, 1000]) 0.02182316780090332
torch.Size([1, 1000]) 0.022113561630249023
torch.Size([1, 1000]) 0.021986007690429688
torch.Size([1, 1000]) 0.02157902717590332
torch.Size([1, 1000]) 0.02269721031188965
torch.Size([1, 1000]) 0.02176690101623535
torch.Size([1, 1000]) 0.021875619888305664
torch.Size([1, 1000]) 0.021648883819580078
torch.Size([1, 1000]) 0.021985530853271484
torch.Size([1, 1000]) 0.021978378295898438
torch.Size([1, 1000]) 0.023291587829589844
torch.Size([1, 1000]) 0.02150726318359375
torch.Size([1, 1000]) 0.022224903106689453
torch.Size([1, 1000]) 0.021436214447021484
torch.Size([1, 1000]) 0.02248215675354004
torch.Size([1, 1000]) 0.021712541580200195
torch.Size([1, 1000]) 0.021942615509033203
torch.Size([1, 1000]) 0.02127361297607422
torch.Size([1, 1000]) 0.02282261848449707
torch.Size([1, 1000]) 0.0217134952545166
torch.Size([1, 1000]) 0.021761655807495117
torch.Size([1, 1000]) 0.021343708038330078
torch.Size([1, 1000]) 0.02223944664001465
torch.Size([1, 1000]) 0.022092103958129883
torch.Size([1, 1000]) 0.03938460350036621
torch.Size([1, 1000]) 0.034061431884765625
torch.Size([1, 1000]) 0.03392529487609863
torch.Size([1, 1000]) 0.03402352333068848
torch.Size([1, 1000]) 0.0338597297668457
torch.Size([1, 1000]) 0.03396439552307129
torch.Size([1, 1000]) 0.03467988967895508
torch.Size([1, 1000]) 0.03416728973388672
torch.Size([1, 1000]) 0.03405308723449707
torch.Size([1, 1000]) 0.03409099578857422
torch.Size([1, 1000]) 0.02503824234008789
torch.Size([1, 1000]) 0.04316258430480957
torch.Size([1, 1000]) 0.03392148017883301
torch.Size([1, 1000]) 0.03392672538757324
torch.Size([1, 1000]) 0.033841609954833984
torch.Size([1, 1000]) 0.03403735160827637
torch.Size([1, 1000]) 0.034006357192993164
torch.Size([1, 1000]) 0.0339808464050293
torch.Size([1, 1000]) 0.03402423858642578
torch.Size([1, 1000]) 0.034269094467163086

As expected, the first iteration takes longer, but then the time quickly drops. Are you using the 5W or 10W (MAX-N) power profile?

$ sudo nvpmodel -q
NVPM WARN: fan mode is not set!
NV Power Mode: MAXN
0

I use MAXN mode, and I installed PyTorch by following this tutorial: Install PyTorch on Jetson Nano - Q-engineering
Yesterday the first iteration took 5 seconds, and after 5 iterations the time was 3.8 seconds. Should I run more iterations to get a better time?

I don’t know how that article installs it, but I would recommend trying the l4t-pytorch container that matches the version of JetPack-L4T you are running and seeing whether the performance is different. Those images come with PyTorch/torchvision already installed and tested, using the wheels from this post.

I solved the problem. Thanks for the help!