Jetson Nano sometimes extremely slow with GPU

Hello there!

I have recently been training a very small network in PyTorch:

import torch.nn as nn
from torch import Tensor


class Net(nn.Module):
    """Simple CNN adapted from 'PyTorch: A 60 Minute Blitz'."""

    def __init__(self) -> None:
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.bn1 = nn.BatchNorm2d(6)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(16)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(120)
        self.fc2 = nn.Linear(120, 84)
        self.bn4 = nn.BatchNorm1d(84)
        self.fc3 = nn.Linear(84, 10)
        self.relu = nn.ReLU()

    # pylint: disable=arguments-differ,invalid-name
    def forward(self, x: Tensor) -> Tensor:
        """Compute forward pass."""
        x = self.pool(self.relu(self.bn1(self.conv1(x))))
        x = self.pool(self.relu(self.bn2(self.conv2(x))))
        x = x.view(-1, 16 * 5 * 5)
        x = self.relu(self.bn3(self.fc1(x)))
        x = self.relu(self.bn4(self.fc2(x)))
        x = self.fc3(x)
        return x

The dataset is CIFAR10, batch size 64. (In fact I am doing a federated learning task, so the dataset is split, and each Nano device has ~15-20 batches per epoch.)

There are two issues.

  1. The start of training is slow: it takes around 60-90 s for the Nano to do the initial pass, like the following:

# Perform a single forward pass to properly initialize BatchNorm
_ = model(next(iter(trainloader))[0].to(DEVICE))

while on a desktop (using a GTX 1660 / 2060) the initial pass finishes within seconds.

  2. Occasionally the training process is extremely slow: it is acceptable that the Nano is slower than desktops (10 epochs of 15-20 batches each take around 15-18 s, versus ~2 s on desktops). But sometimes the Nano is extremely slow and takes ~500-900 s to finish 10 epochs. A reboot may resolve this, but the problem comes back all of a sudden.

I have tried setting the power mode with

sudo nvpmodel -m 0
sudo jetson_clocks

but I don’t think this works.

Please help me with this issue.
Many thanks.


If a reboot improves the performance, the slowness might be related to memory usage.
Could you monitor the system memory with tegrastats and share the log with us?



I have monitored several Nanos (I have 10 of them).

Before the Python program starts, tegrastats gives the following:

RAM 1119/3964MB (lfb 7x4MB) SWAP 496/1982MB (cached 21MB) CPU [1%@102,2%@102,1%@102,0%@102] EMC_FREQ 0% GR3D_FREQ 0% PLL@31C CPU@34C iwlwifi@37C PMIC@50C GPU@35C AO@39C thermal@34.5C

A device that is running well gives the following:

RAM 3641/3964MB (lfb 4x1MB) SWAP 1982/1982MB (cached 6MB) CPU [14%@307,20%@307,33%@307,16%@307] EMC_FREQ 0% GR3D_FREQ 0% PLL@34.5C CPU@36.5C iwlwifi@43C PMIC@50C GPU@38C AO@43C thermal@37C

A device that is running extremely slowly gives the following:

RAM 3605/3964MB (lfb 2x2MB) SWAP 1982/1982MB (cached 3MB) CPU [17%@710,24%@710,11%@710,12%@710] EMC_FREQ 0% GR3D_FREQ 0% PLL@39.5C CPU@42.5C iwlwifi@40C PMIC@50C GPU@42C AO@48C thermal@42.25C

A device that works well may become very slow on the next run of the Python program. But if a run starts out fast, it stays fast for the whole run.
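To compare boards over time rather than by eye, the RAM and SWAP fields of each tegrastats line can be pulled out with a small parser. This is only a sketch: the function name `parse_tegrastats` is made up for illustration, and it assumes the `RAM used/totalMB` and `SWAP used/totalMB` layout seen in the lines above.

```python
import re

# Matches the "RAM used/totalMB" and "SWAP used/totalMB" fields of a tegrastats line.
_FIELD = re.compile(r"\b(RAM|SWAP) (\d+)/(\d+)MB")


def parse_tegrastats(line: str) -> dict:
    """Return used MB, total MB and used fraction for the RAM and SWAP fields."""
    stats = {}
    for name, used, total in _FIELD.findall(line):
        used_mb, total_mb = int(used), int(total)
        stats[name] = {
            "used_mb": used_mb,
            "total_mb": total_mb,
            # Guard against a board with swap disabled (total of 0 MB).
            "fraction": used_mb / total_mb if total_mb else 0.0,
        }
    return stats


# Shortened copies of the idle and slow lines from this post.
idle = "RAM 1119/3964MB (lfb 7x4MB) SWAP 496/1982MB (cached 21MB)"
slow = "RAM 3605/3964MB (lfb 2x2MB) SWAP 1982/1982MB (cached 3MB)"

for label, line in (("idle", idle), ("slow", slow)):
    s = parse_tegrastats(line)
    print(label, f"RAM {s['RAM']['fraction']:.0%}", f"SWAP {s['SWAP']['fraction']:.0%}")
```

In the two loaded snapshots above, swap is fully used in both the fast and the slow case, so a single line may not separate them; logging these values across a whole run might show more.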


Do you keep allocating buffers during inference?
For an inference pipeline, the buffer should be reused for the different input data.
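In PyTorch terms, buffer reuse could look roughly like the sketch below: one input tensor is allocated on the device up front and each batch is copied into it with `copy_`, instead of creating a new device tensor with `.to(...)` every iteration. The names `input_buffer`, `load_batch` and `fake_batches` are made up for illustration. PyTorch's caching allocator already recycles freed blocks, so this may change little in practice.

```python
import torch

# Fall back to the CPU so the sketch also runs on a machine without CUDA.
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
BATCH_SIZE = 64  # batch size mentioned in this thread

# Allocate the input buffer once; CIFAR10 images are 3 x 32 x 32.
input_buffer = torch.empty(BATCH_SIZE, 3, 32, 32, device=DEVICE)


def load_batch(batch: torch.Tensor) -> torch.Tensor:
    """Copy a host batch into the preallocated buffer and return a view of it."""
    n = batch.shape[0]  # the last batch of an epoch may be smaller
    view = input_buffer[:n]
    view.copy_(batch)  # no new device allocation for the input
    return view


# Hypothetical stand-in for batches coming from a DataLoader.
fake_batches = [torch.randn(64, 3, 32, 32), torch.randn(20, 3, 32, 32)]
for batch in fake_batches:
    x = load_batch(batch)
    # ... the forward pass on x would go here ...
```

When training, the buffer should only be overwritten after `backward()` for the previous batch has finished, since the convolution layers keep a reference to their input for the gradient computation.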


Sorry, I don’t quite follow.

For example in PyTorch, do you mean overuse of .to('cuda:0')?

Since I am doing both training and inference, do you mean that I repeatedly call .to('cuda:0') to send the model and dataset to the GPU during the process?


The slowness seems to happen when the memory reaches its maximum.
So you can try to keep the memory usage within a certain amount to avoid hitting the perf issue.
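One way to act on this is to check how much memory is still available before starting a training round. The sketch below parses the text of `/proc/meminfo` (a standard Linux interface) and compares `MemAvailable` against a threshold. The helper names, the 512 MB threshold and the sample values are assumptions for illustration, not measurements from a Nano.

```python
def parse_meminfo(text: str) -> dict:
    """Map each /proc/meminfo field reported in kB to its integer value."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        # Keep only "<number> kB" entries; some fields are plain counts.
        if len(parts) == 2 and parts[1] == "kB" and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def has_headroom(info: dict, min_available_mb: int = 512) -> bool:
    """True if MemAvailable is at least min_available_mb (hypothetical threshold)."""
    return info.get("MemAvailable", 0) // 1024 >= min_available_mb


def read_meminfo(path: str = "/proc/meminfo") -> dict:
    """Read and parse the live values on a Linux system."""
    with open(path) as f:
        return parse_meminfo(f.read())


# Illustrative sample text, not real measurements.
sample = """MemTotal:        4059136 kB
MemAvailable:     262144 kB
SwapTotal:       2029568 kB
SwapFree:              0 kB
HugePages_Total:       0
"""

info = parse_meminfo(sample)
print("enough headroom:", has_headroom(info))
```

On the device itself, `has_headroom(read_meminfo())` could gate the start of each training round; the right threshold would have to be found by experiment.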


Thank you!

This problem was solved by:

  1. Expanding the swap allocation (from 2 GB to 4 GB), following the steps in Link.

  2. Limiting the memory allocation size during training. I’m using PyTorch, so I use the following code:

torch.cuda.set_per_process_memory_fraction(fraction=0.5, device="cuda:0")

I think you are right. Unlike a PC or server, the Jetson Nano shares memory between the GPU and CPU, so over-allocation for the GPU may significantly slow down the main process.
