Jetson Nano slow CUDA performance vs CPU

I have a Jetson Nano 4GB with a 32GB SD card, running a vanilla OS install and a 65W micro-USB power supply. I installed the CUDA (10.2) builds of PyTorch (1.10.0a0+git36449ea) and transformers (4.18.0). I also installed jtop so I can see the GPU bar move when I generate an inference.

I am loading Flan-T5 small and getting OK speeds on simple inputs. For example, I ask “who landed on the moon” and after 31 seconds it returns “astronaut.”

device = 'cuda:0'
inputs = tokenizer("Who landed on the moon?", return_tensors="pt").to(device)

However, the problem is that the Jetson Nano CPU is faster. When I change the device to ‘cpu’ (i.e. device = 'cpu'), it speeds up to about 27 seconds.

I then decided to test the same script on a bare-bones 4GB Intel Z3580 and got times of about 12 seconds. That board is a third of the price of a Jetson Nano.

My question is: what am I doing wrong, if anything? I already tried some light quantization with torch_dtype=torch.float16, but that did little. Please see my simple code below:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import time
import torch

flan_version = "google/flan-t5-small"
device = 'cuda:0'
#device = 'cpu'

start = time.time()

# load the model (fp16) and tokenizer, and move the model to the chosen device
model = AutoModelForSeq2SeqLM.from_pretrained(flan_version, torch_dtype=torch.float16).to(device)
tokenizer = AutoTokenizer.from_pretrained(flan_version)

# tokenize the prompt, generate on the same device, and decode
inputs = tokenizer("Who landed on the moon?", return_tensors="pt").to(device)
outputs = model.generate(inputs.input_ids)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(response)
print(f"elapsed: {time.time() - start:.2f} s")

Hi @robnewport, can you try timing this in a loop, running the model for multiple iterations? The very first run after the process starts typically takes longer because the CUDA kernels get loaded, memory gets allocated, etc., so that first run is usually discarded as a ‘warmup’.
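Something along these lines should do it (untested sketch, reusing the model / tokenizer / inputs names from your script above):

import time
import torch

# throwaway warmup pass; the first generate() pays the one-time CUDA init cost
model.generate(inputs.input_ids)
torch.cuda.synchronize()

# timed runs
for i in range(10):
    start = time.time()
    outputs = model.generate(inputs.input_ids)
    torch.cuda.synchronize()  # wait for the GPU to finish before reading the clock
    print(f"run {i}: {time.time() - start:.2f} s")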

Hey @dusty_nv, thank you for that tip. I haven't tried that yet, and now that you mention it I can see how it would make sense. I'll follow up with some timings when I get back to the office.

I also had another thought: the transformers version on the Jetson Nano is 4.18.0. I can see there are more quantization options in more recent transformers versions (e.g. 4.37.2). Would this make a difference? Is it worth the headache of updating Python/PyTorch/transformers to squeeze more speed out of CUDA? Thank you for your help.

I recall that versions of transformers after 4.18 dropped support for Python 3.6, which is why it's the last version that appears on pip/PyPI while on JetPack 4. While you could attempt to install Python 3.8 from a PPA (like “deadsnakes”), newer versions of transformers also require newer versions of PyTorch (which in turn don't tend to support older CUDA/cuDNN versions for long), so those would need to be identified and compiled too. So I would say “not worth the headache” if it can be avoided. And yes, PyTorch has always taken extra time the first time the GPU is used.

Hey @dusty_nv I just ran the numbers on the Jetson Nano (4GB) versus the Intel Z3580 (4GB) and you were totally correct, the first run is way off. Interestingly, it takes about 10 runs before the Jetson GPU reaches its top speed. I switched the test from flan-t5-small to flan-t5-base and ran the same question “Who landed on the moon?” with the following response times (in seconds):

Jetson Nano (4GB): 35.84, 1.47, 1.30, 1.22, 1.19, 1.16, 1.23, 1.14, 1.15, 1.15
Z3580 (4GB): 2.51, 2.46, 2.46, 2.46, 2.47, 2.46, 2.46, 2.46, 2.46, 2.47

In hindsight it totally makes sense that the GPU needs to “prime” itself to reach top speeds whereas the CPU is more consistent. Thank you for the nudge in the right direction and I’ll mark this query as solved.
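(Side note for anyone else who lands here: a simple way to hide that first-run cost in a long-running script is a single throwaway generate() right after loading the model. A sketch, using the names from my script above:)

_ = model.generate(tokenizer("warmup", return_tensors="pt").input_ids.to(device))
torch.cuda.synchronize()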

If you have any pointers or tips on how to optimise flan-t5-base further with quantisation that's compatible with transformers 4.18.0, I'd appreciate your advice.

OK great, glad to hear it @robnewport! Regarding the quantization, I put flan-t5 in the same bucket as other LLMs like Llama, etc., and the quantization tools that support those (like AutoGPTQ, AWQ, llama.cpp, exllama, etc.) tend to require newer GPUs, because they typically use optimized CUDA kernels built around newer instructions/intrinsics.
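One option that doesn't depend on those GPU kernels (and should work with transformers 4.18.0 / PyTorch 1.10) is PyTorch's built-in dynamic INT8 quantization of the Linear layers. Note it runs on the CPU, so it's an alternative to the GPU path rather than a speed-up for it. A minimal, untested sketch:

import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# torch.backends.quantized.engine = 'qnnpack'  # may be needed on ARM if fbgemm isn't available

model_name = "google/flan-t5-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)  # fp32, stays on CPU

# swap the nn.Linear layers for dynamically-quantized int8 versions
quantized = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

inputs = tokenizer("Who landed on the moon?", return_tensors="pt")
outputs = quantized.generate(inputs.input_ids)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))

Whether that actually beats fp16 on the Nano's GPU you'd have to benchmark; it's mainly useful if you end up running on the CPU anyway.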

