Recently I have been running a very small net in PyTorch:
import torch.nn as nn
from torch import Tensor

class Net(nn.Module):
    """Simple CNN adapted from 'PyTorch: A 60 Minute Blitz'."""

    def __init__(self) -> None:
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.bn1 = nn.BatchNorm2d(6)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(16)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(120)
        self.fc2 = nn.Linear(120, 84)
        self.bn4 = nn.BatchNorm1d(84)
        self.fc3 = nn.Linear(84, 10)
        self.relu = nn.ReLU()

    # pylint: disable=arguments-differ,invalid-name
    def forward(self, x: Tensor) -> Tensor:
        """Compute forward pass."""
        x = self.pool(self.relu(self.bn1(self.conv1(x))))
        x = self.pool(self.relu(self.bn2(self.conv2(x))))
        x = x.view(-1, 16 * 5 * 5)
        x = self.relu(self.bn3(self.fc1(x)))
        x = self.relu(self.bn4(self.fc2(x)))
        x = self.fc3(x)
        return x
The dataset is CIFAR10, batch size 64. (In fact I am doing a federated learning task, so the dataset is split across devices, and each Nano device has ~15-20 batches per epoch.)
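For context, the per-device loaders are built roughly along these lines (a sketch only: the client count and the plain IID random split are illustrative assumptions, not my exact federated partitioning):

import torch
import torchvision
import torchvision.transforms as transforms

# Illustrative partitioning: NUM_CLIENTS and the IID random split are assumptions.
NUM_CLIENTS = 40          # 50000 / (40 * 64) ≈ 19 batches per client per epoch
BATCH_SIZE = 64

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transform)

# Split the 50k training images evenly across the clients.
lengths = [len(trainset) // NUM_CLIENTS] * NUM_CLIENTS
lengths[-1] += len(trainset) - sum(lengths)
client_sets = torch.utils.data.random_split(trainset, lengths)

# Each Nano device trains on one of these loaders (~15-20 batches per epoch).
trainloader = torch.utils.data.DataLoader(
    client_sets[0], batch_size=BATCH_SIZE, shuffle=True, num_workers=2)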
There are two issues.
The start of training is slow: it takes around 60-90 s for the Nano device to do the initial pass, which looks like the following:
# Perform a single forward pass to properly initialize BatchNorm
_ = model(next(iter(trainloader))[0].to(DEVICE))
while on a desktop (using GTX 1660 / 2060) the initial pass finishes within seconds.
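For reference, here is a minimal sketch of how that warm-up pass can be timed in isolation (model, trainloader and DEVICE as above; the torch.cuda.synchronize() calls are there so the wall-clock number includes CUDA context and cuDNN setup rather than just the kernel launch):

import time
import torch

# Time the first forward pass in isolation (sketch; model, trainloader, DEVICE as above).
images, _ = next(iter(trainloader))
if torch.cuda.is_available():
    torch.cuda.synchronize()
start = time.perf_counter()
_ = model(images.to(DEVICE))
if torch.cuda.is_available():
    torch.cuda.synchronize()
print(f"warm-up pass: {time.perf_counter() - start:.1f} s")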
Occasionally an extremely slow training process: it is acceptable that the Nano is slower than desktops (for 10 epochs at 15-20 batches/epoch it takes around 15-18 s, while desktops take ~2 s). But sometimes the Nano device is extremely slow and takes ~500-900 s to finish 10 epochs. This can be resolved by simply rebooting, but it comes back all of a sudden.
If a reboot improves the performance, the slowness might be related to memory usage.
Could you monitor the system memory with tegrastats and share the log with us?
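Something like the sketch below can log the RAM figures from tegrastats alongside the training run (tegrastats may need sudo depending on your JetPack version, and the exact output format can vary, so the regex is an assumption):

import re
import subprocess

# Sketch: run tegrastats and log the RAM "used/total" figures once per second.
# tegrastats may require sudo on some JetPack versions, and the output format
# can differ slightly, so the regex below is an assumption. Stop with Ctrl-C.
proc = subprocess.Popen(
    ["tegrastats", "--interval", "1000"],
    stdout=subprocess.PIPE, text=True)

with open("tegrastats_ram.log", "w") as log:
    for line in proc.stdout:
        m = re.search(r"RAM (\d+)/(\d+)MB", line)
        if m:
            used, total = map(int, m.groups())
            log.write(f"{used}/{total} MB\n")
            log.flush()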
A device that works well may run very slowly on the next run of the Python program. But if a run starts out fast, it stays fast for that whole run (it does not become slow partway through).
The slowness seems to happen when memory usage reaches its maximum.
So you could try to keep memory usage below a certain limit to avoid hitting the performance issue.
I think you are right. Unlike a PC or server, the Jetson Nano uses memory shared between the GPU and CPU, so over-allocation on the GPU side may significantly slow down the main process.
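As a first mitigation I will try capping the CUDA caching allocator so the GPU side cannot consume all of the shared memory. A minimal sketch of the idea (the 0.5 fraction is an arbitrary starting point, not a recommendation, and torch.cuda.set_per_process_memory_fraction requires PyTorch >= 1.8):

import torch

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

if DEVICE.type == "cuda":
    # Cap the caching allocator at half of the memory it sees; 0.5 is an
    # arbitrary starting point to leave headroom for the CPU side.
    torch.cuda.set_per_process_memory_fraction(0.5, DEVICE)

model = Net().to(DEVICE)  # Net as defined above

# ... training loop as usual ...

if DEVICE.type == "cuda":
    # Release cached blocks between federated rounds and check the high-water mark.
    torch.cuda.empty_cache()
    print(f"peak allocated: {torch.cuda.max_memory_allocated(DEVICE) / 2**20:.0f} MB")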