CUDA runtime error

Hi, I'm building a JetBot with the SparkFun Jetson Nano 2GB kit. The V01-00 image works, but model training isn't working properly. It seems to have issues with CUDA. These are the errors I have been getting:
RuntimeError: CUDA error: device-side assert triggered (on one of my attempts to run the program)
RuntimeError: cuda runtime error (59) : device-side assert triggered at /media/nvidia/WD_BLUE_2.5_1TB/pytorch-v1.1.0/aten/src/THC/generic/THCTensorMath.cu:16 (on another attempt to run the code)
I would appreciate your help so that I can proceed with my project.

Hi,

It seems that you are hitting an issue similar to the one below:

Could you try the suggestion to see if it works first?
Thanks.

Thanks, that did seem to solve that problem, but I still can't run the following code in the train_model notebook:

import torch
import torch.nn.functional as F
import torch.optim as optim

# model, device, train_loader, test_loader and test_dataset are defined in earlier notebook cells

NUM_EPOCHS = 30
BEST_MODEL_PATH = 'best_model.pth'
best_accuracy = 0.0

optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

for epoch in range(NUM_EPOCHS):

    # training pass over the whole training set
    for images, labels in iter(train_loader):
        images = images.to(device)
        labels = labels.to(device)
        optimizer.zero_grad()
        outputs = model(images)
        loss = F.cross_entropy(outputs, labels)
        loss.backward()
        optimizer.step()

    # evaluation pass: count misclassified test images
    test_error_count = 0.0
    for images, labels in iter(test_loader):
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        test_error_count += float(torch.sum(torch.abs(labels - outputs.argmax(1))))

    # keep the weights of the best-scoring epoch
    test_accuracy = 1.0 - float(test_error_count) / float(len(test_dataset))
    print('%d: %f' % (epoch, test_accuracy))
    if test_accuracy > best_accuracy:
        torch.save(model.state_dict(), BEST_MODEL_PATH)
        best_accuracy = test_accuracy

I have been getting two different errors every time I try to run this code:

  1. Server Connection Error - A connection to the Jupyter server could not be established. JupyterLab will continue trying to reconnect. Check your network connection or Jupyter server configuration.
  2. Kernel Restarting - The kernel for Notebooks/collision_avoidance/train_model.ipynb appears to have died. It will restart automatically.

I believe that, as a result, the 'best_model.pth' file shown in the left-hand panel doesn't get updated, which causes the following error when I try to run the live demo:
IncompatibleKeys(missing_keys=, unexpected_keys=)
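One way to confirm whether the checkpoint file was actually rewritten is to compare its modification time with the time the training cell was started. The following is only a minimal sketch using the Python standard library; the helper name checkpoint_status and its return format are hypothetical, and it assumes the notebook's working directory is the one where best_model.pth is saved.

```python
import os
import time


def checkpoint_status(path, run_started_at):
    """Report whether `path` exists and was modified at or after `run_started_at`.

    `run_started_at` is a timestamp in seconds since the epoch (time.time()).
    """
    if not os.path.exists(path):
        return {"exists": False, "size_bytes": 0, "updated": False}
    stat = os.stat(path)
    return {
        "exists": True,
        "size_bytes": stat.st_size,
        "updated": stat.st_mtime >= run_started_at,
    }


if __name__ == "__main__":
    # Record this just before starting the training cell.
    run_started_at = time.time()
    # ... the training loop would run here ...
    print(checkpoint_status("best_model.pth", run_started_at))
```

If "updated" is False after a run, the kernel most likely died before any epoch reached the torch.save call.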

I would be grateful for further support.

Thanks

Hi,

Since the Nano 2GB has relatively limited memory, this may cause unexpected system behavior.

Would you mind monitoring the memory status with tegrastats while the training runs?
This will show whether memory runs short when the connection or kernel issue occurs.

$ sudo tegrastats
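
As a complement to tegrastats, memory can also be sampled from inside the notebook, for example between epochs. The sketch below reads /proc/meminfo, the standard Linux memory interface; it is not part of JetBot or tegrastats, and the function names parse_meminfo and memory_summary are hypothetical.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field name: integer value}.

    Most fields are reported in kB; the unit suffix is dropped.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_summary(text):
    """Return total, available and free-swap memory in whole megabytes."""
    info = parse_meminfo(text)
    return {
        "total_mb": info.get("MemTotal", 0) // 1024,
        "available_mb": info.get("MemAvailable", 0) // 1024,
        "swap_free_mb": info.get("SwapFree", 0) // 1024,
    }


if __name__ == "__main__":
    # /proc/meminfo only exists on Linux systems such as the Jetson image.
    if os.path.exists("/proc/meminfo"):
        with open("/proc/meminfo") as f:
            print(memory_summary(f.read()))
```

If available_mb and swap_free_mb both approach zero shortly before the kernel dies, a memory shortage is the likely cause.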

Thanks.