Jetson nano sometimes extremely slow with GPU

yuziyi · October 21, 2023, 12:22pm

Hello there!

Recently I have been training a very small network in PyTorch:

import torch.nn as nn
from torch import Tensor


class Net(nn.Module):
    """Simple CNN adapted from 'PyTorch: A 60 Minute Blitz'."""

    def __init__(self) -> None:
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.bn1 = nn.BatchNorm2d(6)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(16)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(120)
        self.fc2 = nn.Linear(120, 84)
        self.bn4 = nn.BatchNorm1d(84)
        self.fc3 = nn.Linear(84, 10)
        self.relu = nn.ReLU()

    # pylint: disable=arguments-differ,invalid-name
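    def forward(self, x: Tensor) -> Tensor:
        """Compute forward pass."""
        # Two conv -> batch norm -> ReLU -> max-pool stages
        x = self.pool(self.relu(self.bn1(self.conv1(x))))
        x = self.pool(self.relu(self.bn2(self.conv2(x))))
        # Flatten the 16 feature maps of size 5x5 for the linear layers
        x = x.view(-1, 16 * 5 * 5)
        x = self.relu(self.bn3(self.fc1(x)))
        x = self.relu(self.bn4(self.fc2(x)))
        # Final layer outputs the 10 class logits
        x = self.fc3(x)
        return x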

The dataset is CIFAR10 with batch size 64. (In fact I am doing a federated learning task, so the dataset is split, and each Nano device has ~15-20 batches per epoch.)

There are two issues.

The start of training is slow: it takes around 60-90 s for the Nano to do the initial pass, as follows:

# Perform a single forward pass to properly initialize BatchNorm
_ = model(next(iter(trainloader))[0].to(DEVICE))

while on a desktop (using GTX 1660 / 2060) the initial pass finishes within seconds.

Training is occasionally extremely slow: it is acceptable that the Nano is slower than desktops (10 epochs at 15-20 batches/epoch take around 15-18 s, versus ~2 s on desktops). But sometimes the Nano is extremely slow and takes ~500-900 s to finish 10 epochs. A simple reboot may resolve this, but the problem comes back all of a sudden.
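
One way to tell the one-off start-up cost apart from the intermittent slowdown is to time each step separately. The following is a minimal sketch, not code from the original post: `timed_call` is a hypothetical helper, and its `sync` argument is assumed to be something like `torch.cuda.synchronize`, so that asynchronous GPU work has finished before the clock stops.

```python
import time
from typing import Any, Callable, Optional, Tuple


def timed_call(
    fn: Callable[[], Any],
    sync: Optional[Callable[[], None]] = None,
) -> Tuple[Any, float]:
    """Run fn() and return (result, elapsed seconds).

    sync is an optional callback (assumed here: torch.cuda.synchronize)
    that blocks until queued GPU work is done, so the timing covers it.
    """
    if sync is not None:
        sync()  # drain earlier work so it is not billed to this call
    start = time.perf_counter()
    result = fn()
    if sync is not None:
        sync()  # wait for the work launched by fn() to finish
    return result, time.perf_counter() - start


# Stand-in workload so the sketch runs without a GPU.
result, seconds = timed_call(lambda: sum(range(100_000)))
print(f"result={result}, took {seconds:.4f} s")
```

On the Nano this could be called as `timed_call(lambda: model(batch), sync=torch.cuda.synchronize)` for the first pass and again around each epoch, to log which runs fall into the slow state.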

I have tried changing the power mode with

sudo nvpmodel -m 0
sudo jetson_clocks

but I don’t think this works.

Please help me with this issue.
Many thanks.

AastaLLL · October 23, 2023, 3:45am

Hi,

If a reboot improves the performance, the slowness might be related to memory usage.
Could you monitor the system memory with tegrastats and share the log with us?

Thanks.

yuziyi · October 23, 2023, 6:56am

Yes.

I have monitored several Nanos (I have 10 of them).

Before the Python program starts, tegrastats reports the following:

RAM 1119/3964MB (lfb 7x4MB) SWAP 496/1982MB (cached 21MB) CPU [1%@102,2%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@31C CPU@34C iwlwifi@37C PMIC@50C GPU@35C AO@39C thermal@34.5C

A device that is running well reports the following:

RAM 3641/3964MB (lfb 4x1MB) SWAP 1982/1982MB (cached 6MB) CPU [14%@307,20%@307,33%@307,16%@307] EMC_FREQ 0% GR3D_FREQ 0% PLL@34.5C CPU@36.5C iwlwifi@43C PMIC@50C GPU@38C AO@43C thermal@37C

A device that is running extremely slowly reports the following:

RAM 3605/3964MB (lfb 2x2MB) SWAP 1982/1982MB (cached 3MB) CPU [17%@710,24%@710,11%@710,12%@710] EMC_FREQ 0% GR3D_FREQ 0% PLL@39.5C CPU@42.5C iwlwifi@40C PMIC@50C GPU@42C AO@48C thermal@42.25C

A device that works well in one run may be very slow in the next run of the Python program. But if a run starts fast, it stays fast for the whole run and does not slow down.
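
To compare readings like these across ten devices, the RAM and SWAP fields can be pulled out of each tegrastats line. The following is a rough sketch that assumes the line format shown above; `parse_tegrastats` is a hypothetical helper, not part of any NVIDIA tool.

```python
import re
from typing import Dict

# Matches the "RAM used/totalMB" and "SWAP used/totalMB" fields of a tegrastats line.
_FIELD = re.compile(r"\b(RAM|SWAP) (\d+)/(\d+)MB")


def parse_tegrastats(line: str) -> Dict[str, float]:
    """Return used/total MB and the used fraction for RAM and SWAP."""
    stats: Dict[str, float] = {}
    for name, used, total in _FIELD.findall(line):
        key = name.lower()
        stats[f"{key}_used_mb"] = int(used)
        stats[f"{key}_total_mb"] = int(total)
        stats[f"{key}_fraction"] = int(used) / int(total)
    return stats


# The three readings quoted above, shortened to the fields that matter here.
samples = {
    "idle": "RAM 1119/3964MB (lfb 7x4MB) SWAP 496/1982MB (cached 21MB)",
    "fast": "RAM 3641/3964MB (lfb 4x1MB) SWAP 1982/1982MB (cached 6MB)",
    "slow": "RAM 3605/3964MB (lfb 2x2MB) SWAP 1982/1982MB (cached 3MB)",
}
for label, line in samples.items():
    s = parse_tegrastats(line)
    print(f"{label}: RAM {s['ram_fraction']:.0%}, SWAP {s['swap_fraction']:.0%}")
```

For the readings above, this shows RAM above 90% and swap completely full in both the fast and the slow run, so these two snapshots alone do not separate the two states.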

AastaLLL · October 24, 2023, 6:26am

Hi,

Do you keep allocating buffers during inference?
In an inference pipeline, the same buffer should be reused for different input data.

Thanks.

yuziyi · October 24, 2023, 6:50am

Sorry, I don’t quite understand what you mean.

For example, in PyTorch, do you mean overuse of .to('cuda:0')?

Since I am doing both training and inference, do you mean that I repeatedly use .to('cuda:0') to send the model and dataset to the GPU during the process?

AastaLLL · October 25, 2023, 6:55am

Hi,

The slowness seems to happen when the memory reaches its maximum.
So you can try to keep the memory usage within a certain amount to avoid hitting the perf issue.

Thanks.

yuziyi · November 3, 2023, 7:38am

Thank you!

This problem was solved by:

Expanding the swap allocation (from 2 GB to 4 GB), following the steps in Link.
Limiting the memory allocation size during training. I’m using PyTorch, so I use the following code:

torch.cuda.set_per_process_memory_fraction(fraction=0.5, device="cuda:0")

I think you are right. Unlike a PC or server, the Jetson Nano shares its memory between the GPU and CPU, so over-allocating for the GPU may significantly slow down the main process.
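
As a rough worked example of what that fraction means on this board: PyTorch documents the cap as the device's total visible memory multiplied by the fraction. Assuming CUDA reports about the same 3964 MB total that tegrastats shows (an assumption; the actual figure may differ slightly), the arithmetic can be sketched as follows. `allocator_cap_mb` is a hypothetical helper for illustration, not a PyTorch function.

```python
def allocator_cap_mb(total_mb: float, fraction: float) -> float:
    """Approximate cap on PyTorch's CUDA caching allocator, in MB."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return total_mb * fraction


TOTAL_MB = 3964  # total RAM from the tegrastats output; shared by CPU and GPU
cap = allocator_cap_mb(TOTAL_MB, 0.5)
print(f"GPU allocator cap: {cap:.0f} MB, left for CPU side: {TOTAL_MB - cap:.0f} MB")
```

With a fraction of 0.5 the allocator is limited to about 1982 MB, which leaves roughly the other half of the shared memory for the Python process, the data loader, and the OS. Allocations beyond the cap raise an out-of-memory error in PyTorch rather than growing further.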

system · November 17, 2023, 7:38am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
