Jetson Nano slow CUDA performance vs CPU

I have a Jetson Nano 4GB with a 32GB SD card running a vanilla OS install and a 65 watt micro USB power supply. I installed CUDA (10.2) versions of PyTorch (1.10.0a0+git36449ea) and transformers (4.18.0). I also installed jtop to see the GPU bar move when I run an inference.

I am loading T5 Flan small and getting OK speeds running simple inputs. For example, I would ask “who landed on the moon” and in 31 seconds, it returns “astronaut.”

device = 'cuda:0'
inputs = tokenizer("Who landed on the moon?", return_tensors="pt").to(device)

However, the problem is that the Jetson Nano CPU is faster. When I change the device to ‘cpu’ (i.e., device = 'cpu'), it speeds up to about 27 seconds.

I then decided to test the same script on a bare bones 4GB Z3580 and was able to get speeds of about 12 seconds. This board is 1/3 the price of a Jetson Nano.

My question is, what am I doing wrong, if anything? I already tried some light quantization with torch_dtype=torch.float16 but that did little. Please see my simple code below:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
import time
import torch

flan_version = "google/flan-t5-small"
device = 'cuda:0'
#device = 'cpu'

start = time.time()

model = AutoModelForSeq2SeqLM.from_pretrained(flan_version,torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(flan_version)

inputs = tokenizer("Who landed on the moon?", return_tensors="pt").to(device)
# Move the model to the same device as the inputs before generating.
outputs = model.to(device).generate(**inputs)
response = tokenizer.batch_decode(outputs, skip_special_tokens=True)

Hi @robnewport, can you try timing this in a loop, running the model for multiple iterations? The very first run after the process starts typically takes longer because it has to load all the CUDA kernels, allocate memory, etc., so that first run is usually discarded as a ‘warmup’.
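A minimal sketch of the kind of timing loop suggested above, for anyone following along. It is not the code anyone in this thread actually ran: `time_runs`, `summarize`, and `fake_inference` are hypothetical names, and the workload is a CPU stand-in so the snippet runs without a GPU. On the Jetson, the assumption is that `fn` would wrap the `model.generate(...)` call and `sync` would be `torch.cuda.synchronize`, since CUDA kernel launches are asynchronous and synchronizing before reading the clock is a common precaution.

```python
import time


def time_runs(fn, n_runs=10, sync=None):
    """Call fn() n_runs times and return the wall-clock seconds of each call.

    sync is an optional zero-argument callable invoked before each clock read.
    Assumption: on a GPU this would be torch.cuda.synchronize.
    """
    timings = []
    for _ in range(n_runs):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings, n_warmup=1):
    """Split off the first n_warmup runs and average the remaining ones."""
    steady = timings[n_warmup:]
    if not steady:
        raise ValueError("need more runs than warmup iterations")
    return {
        "warmup": timings[:n_warmup],
        "steady_mean": sum(steady) / len(steady),
        "steady_min": min(steady),
    }


def fake_inference():
    # Hypothetical stand-in workload; replace with the real generate() call.
    sum(i * i for i in range(10000))


timings = time_runs(fake_inference, n_runs=5)
print(summarize(timings))
```

Keeping the model load outside the timed function matters too: in the original script the clock starts before `from_pretrained`, so model loading is included in the reported time.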

Hey @dusty_nv thank you for that tip, I have not tried that yet and now that you mention it, I can see how it would make sense. I’ll follow up with some posted times when I get back into the office.

I also had another thought – the transformers version on the Jetson Nano is 4.18.0. I can see there are more quantization options on more recent transformers versions (e.g. 4.37.2). Would this make a difference? Is it worth the headache to update python/pytorch/transformers to squeeze more speed out of CUDA? Thank you for your help.

I recall that versions of transformers after 4.18 dropped support for Python 3.6, which is why it’s the last version that appears in pip/PyPI while on JetPack 4. While you could attempt to install Python 3.8 from a PPA (like “deadsnakes”), newer versions of transformers also require newer versions of PyTorch (which in turn don’t tend to support older CUDA/cuDNN versions very long), so a compatible PyTorch build would need to be identified and compiled too. So I would say “not worth the headache” if it can be avoided. It seems PyTorch has always taken extra time the first time the GPU is used.

Hey @dusty_nv I just ran the numbers on Jetson Nano (4GB) versus Intel Z3580 (4GB) and you were totally correct, the first run is way off. Interestingly, about 10 runs are needed before the Jetson GPU reaches its top speed. I switched the test from small to flan-t5-base and ran the same question “Who landed on the moon?” with the following response times (in seconds):

Jetson Nano (4GB): 35.84, 1.47, 1.30, 1.22, 1.19, 1.16, 1.23, 1.14, 1.15, 1.15
Z3580 (4GB): 2.51, 2.46, 2.46, 2.46, 2.47, 2.46, 2.46, 2.46, 2.46, 2.47
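As a rough worked summary of the figures above, here is a short sketch that drops the first run from each list as warmup and compares the averages of the rest. Treating exactly one run as warmup is an assumption made for illustration, and `steady_mean` is a hypothetical helper name.

```python
# Response times in seconds, copied from the two lines above.
jetson = [35.84, 1.47, 1.30, 1.22, 1.19, 1.16, 1.23, 1.14, 1.15, 1.15]
z3580 = [2.51, 2.46, 2.46, 2.46, 2.47, 2.46, 2.46, 2.46, 2.46, 2.47]


def steady_mean(times, n_warmup=1):
    """Average the runs that remain after discarding the first n_warmup."""
    steady = times[n_warmup:]
    return sum(steady) / len(steady)


jetson_mean = steady_mean(jetson)
z3580_mean = steady_mean(z3580)
speedup = z3580_mean / jetson_mean

print(f"Jetson Nano steady-state mean: {jetson_mean:.2f} s")
print(f"Z3580 steady-state mean: {z3580_mean:.2f} s")
print(f"Jetson GPU speedup over Z3580: {speedup:.2f}x")
```

With the first run excluded, this works out to roughly 1.22 s for the Jetson Nano against 2.46 s for the Z3580, so the GPU is about twice as fast once warmed up.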

In hindsight it totally makes sense that the GPU needs to “prime” itself to reach top speeds whereas the CPU is more consistent. Thank you for the nudge in the right direction and I’ll mark this query as solved.

If you have any pointers or tips on how to optimise flan-t5-base further with quantisation compatible with transformers 4.18.0 I’d appreciate your advice.

OK great, glad to hear it @robnewport! Regarding quantization, I put flan-t5 in the same bucket as other LLMs like Llama, and the quantization tools that support those (like AutoGPTQ, AWQ, llama.cpp, exllama, etc.) also seem to require newer GPUs, because they typically use optimized CUDA kernels built on newer instructions/intrinsics.

