Torch/torchvision on Orin NX 16GB Segfault

When I load a model using PyTorch, I hit a segmentation fault. I've tried various versions of PyTorch (1.11, 1.14, 2.0), torchvision (0.12, 0.14), and CUDA (11.4, 12.0), as well as the NGC container nvcr.io/nvidia/l4t-pytorch:r35.2.1-pth2.0-py3, and they all result in a segfault. Any advice on how to get PyTorch working?

Depending on the version I use, I sometimes hit a segfault just importing torch into Python. Using GDB, I can see the segfault usually hits here:
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x0000ffff27291e30 in ?? () from /lib/aarch64-linux-gnu/libcudnn_cnn_infer.so.8
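
For reference, this is a minimal sketch of how I capture that backtrace (just running the import under gdb; adjust the Python command to whatever triggers the crash for you):

# Run the failing import under gdb and dump a backtrace at the crash
gdb -batch -ex run -ex bt --args python3 -c "import torch"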

Machine Versions:
jetpack: 5.1
torch: 1.11
torchvision: 0.12
numpy: 1.20.3
JetsonUtilities/JetsonInfo.py:
NVIDIA NVIDIA Orin NX Developer Kit
L4T 35.2.1 [ JetPack UNKNOWN ]
Ubuntu 20.04.5 LTS
Kernel Version: 5.10.104-tegra
CUDA 11.4.315
CUDA Architecture: 8.7
OpenCV version: 4.5.4
OpenCV Cuda: NO
CUDNN: 8.6.0.166
TensorRT: 8.5.2.2

Hi @mispicer, are you able to import tensorrt or run the cuDNN examples? For example:

cd /usr/src/cudnn_samples_v8/conv_sample
sudo make
./conv_sample
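
And for the TensorRT import, a quick one-liner like this should be enough (printing the version just confirms the Python bindings load):

python3 -c "import tensorrt; print(tensorrt.__version__)"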

I'm curious if this is just a PyTorch issue you are encountering, or if in fact it's the environment on your device, since even the pre-built PyTorch container is doing the same thing. It sounds like you've installed multiple versions of CUDA, and I might recommend re-flashing the device to get it back to a known baseline first.

Thanks for the response @dusty_nv! I'm able to import tensorrt, but I do get a segfault when I run the cuDNN sample:

Executing: conv_sample
Using format CUDNN_TENSOR_NCHW (for INT8x4 and INT8x32 tests use CUDNN_TENSOR_NCHW_VECT_C)
Testing single precision
====USER DIMENSIONS====
input dims are 1, 32, 4, 4
filter dims are 32, 32, 1, 1
output dims are 1, 32, 4, 4
====PADDING DIMENSIONS====
padded input dims are 1, 32, 4, 4
padded filter dims are 32, 32, 1, 1
padded output dims are 1, 32, 4, 4
Segmentation fault (core dumped)

In case this helps as well: I am now also receiving a segmentation fault when importing torch (instead of only when I try to load a model). And when I run "python3 detect.py --help" from the Ultralytics YOLOv5 repository, I hit a segmentation fault here:
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x0000ffffdcdc4720 in ?? () from /lib/aarch64-linux-gnu/libstdc++.so.6
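
In case it helps, I also checked which libstdc++/cuDNN the torch CUDA library resolves to. This is a rough sketch: it avoids importing torch (since the import itself segfaults), and it assumes the library is named libtorch_cuda.so under torch/lib, which may differ per build:

# Locate torch's install dir without importing it, then inspect its linked libraries
# NOTE: the libtorch_cuda.so name/location is an assumption and may differ per build
TORCH_DIR=$(python3 -c "import importlib.util; print(importlib.util.find_spec('torch').submodule_search_locations[0])")
ldd "$TORCH_DIR/lib/libtorch_cuda.so" | grep -E "stdc\+\+|cudnn|cudart"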

After I first encountered this problem, I did install CUDA 12.0 and the cuda-toolkit package. I've been switching the symlink /usr/local/cuda between /usr/local/cuda-11.4 and /usr/local/cuda-12.0 to switch between CUDA versions.
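
For reference, this is roughly how I've been inspecting and switching that link (assuming the default install locations):

# Show where /usr/local/cuda currently points
ls -l /usr/local/cuda
# Point it back at the JetPack CUDA 11.4 toolkit
sudo ln -sfn /usr/local/cuda-11.4 /usr/local/cuda
# Confirm which toolkit is picked up (if /usr/local/cuda/bin is on the PATH)
nvcc --version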

It seems strange that you're getting all these segfaults from various core libraries - I would probably recommend reflashing your device, TBH, and seeing if the issues persist with a clean install.

You could also try this if it's related to a numpy dependency issue, but I think it's probably unrelated:

I re-flashed the device (a number of times) and discovered I recreate the error by:

  1. flashing the device with the following command:
    sudo ADDITIONAL_DTB_OVERLAY_OPT="BootOrderUsb.dtbo" ./tools/kernel_flash/l4t_initrd_flash.sh --network usb0 --external-device sda -p"-c bootloader/t186ref/cfg/flash_t234_qspi.xml --no-systemimg" -c tools/kernel_flash/flash_l4t_external-original.xml p3509-a02+p3767-0000 external
  2. Running 'sudo apt update' and then 'sudo apt install nvidia-jetpack'
  3. sudo reboot

After installing nvidia-jetpack (step 2), /usr/src/cudnn_samples_v8/conv_sample/conv_sample works perfectly (and installing PyTorch/torchvision works too), but if I reboot the device and immediately try the sample again (after step 3), I hit a segmentation fault. Should I not be installing nvidia-jetpack via apt? I took the command from the Orin AGX Getting Started guide and thought it would translate to the Orin NX.
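
In case it's useful, this is the kind of check I can run right before and right after the reboot to see whether the loader resolves different cuDNN/CUDA libraries afterwards (just a sketch, nothing JetPack-specific):

# Which cuDNN / CUDA runtime libraries does the dynamic linker currently see?
ldconfig -p | grep -E "libcudnn|libcudart"
# Which libraries does the sample binary itself pull in?
ldd /usr/src/cudnn_samples_v8/conv_sample/conv_sample | grep -E "cudnn|cudart"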

Hi @mispicer, let me ask someone from the JetPack-L4T team who is more familiar with the lower-level mechanisms of what's happening to look into this with you. In the meantime, I might suggest installing just the packages you need from apt (i.e. cuda-toolkit-11-4, libcudnn8-dev, and libcudnn8-samples) and seeing if the error still happens then.


Thanks @dusty_nv, I appreciate the help. I'll keep playing with it, but so far the same behavior continues with individual packages: I can apt remove and reinstall libcudnn8 & libcudnn8-samples and the sample works fine, but after I reboot, the same sample hits a 'segmentation fault (core dumped)'.
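
To rule out the installed files themselves changing across the reboot, I can also compare checksums of the cuDNN libraries before and after. A rough sketch (the library paths/versions may differ on other installs):

# Record checksums of the installed cuDNN libraries, then compare after rebooting
sha256sum /usr/lib/aarch64-linux-gnu/libcudnn*.so.8* > ~/cudnn_before.txt
sudo reboot
# ...after the reboot:
sha256sum -c ~/cudnn_before.txt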

@mispicer does this behavior occur without your device tree overlays or if you boot from eMMC in a stock configuration?

@dusty_nv unfortunately I don’t have an eMMC to test the stock configuration. I also reflashed the device without the additional device tree overlay:

sudo ./tools/kernel_flash/l4t_initrd_flash.sh --network usb0 --external-device sda -p"-c bootloader/t186ref/cfg/flash_t234_qspi.xml --no-systemimg" -c tools/kernel_flash/flash_l4t_external-original.xml p3509-a02+p3767-0000 external

and still ran into the same issue. It works until I reboot the device.

We're currently restricted to this configuration (our target carrier board does not support eMMC). Is there anything else I can try, or do I need to find a more standard configuration?

@dusty_nv wanted to follow up in case you had any more suggestions. I'm curious whether anyone on your team is able to replicate this issue or if it's perhaps something on my end. If necessary, I can buy eMMC if we think it's an issue with the hardware configuration.

One thing I would like to try is flashing to a microSD card rather than the USB flash drive (which is actually just a microSD card in a USB-A adapter). But because the carrier board with microSD support is missing its micro-B USB port, I would need to do it over an Ethernet (RJ45) cable. The l4t_initrd_flash README suggests this is supported, but does not provide much documentation on how to accomplish it. Could you provide any advice or resources on how to flash over Ethernet rather than USB? Do you think this could be a step toward solving the issue?

Hi,

We want to reproduce this issue in our environment as well.
Could you share the detailed steps with us?

Also, is there any CUDA library installed in your environment (in the testing after reflashing)?

If there are two CUDA toolkits, maybe the incorrect CUDA is linked after reboot.
cuDNN needs the CUDA 11.4 that comes with the same JetPack.
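
You can check which CUDA toolkits are installed and which one is currently selected with something like below (assuming the default JetPack install paths):

# Check the installed CUDA toolkits and where the /usr/local/cuda link points
dpkg -l | grep cuda-toolkit
ls -l /usr/local/cuda*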

Thanks.

Hi @AastaLLL, appreciate you looking into this.

The steps I’ve taken to reproduce the error:

  1. Flash the Orin NX (with a USB dongle attached as storage… this is a microSD card in a microSD-to-USB-A adapter, if that offers anything) with the commands given above.
  2. sudo apt update
  3. sudo apt install libcudnn8-dev libcudnn8-samples cuda-toolkit-11-4
  • I've tried various combinations of JetPack libraries, including the entire nvidia-jetpack meta-package, and all of them have resulted in the same error. I've tried with one CUDA toolkit and with two, and both times I've gotten a segfault (the quick checks I run after step 3 are below).
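
After step 3, this is the quick check sequence I run to confirm the breakage (the sample path comes from libcudnn8-samples; the torch line only applies if PyTorch is installed from the NVIDIA wheel):

# Quick sanity checks after the reboot
cd /usr/src/cudnn_samples_v8/conv_sample && sudo make && ./conv_sample
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"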

Please let me know if I can specify anything more.

Thanks.

Hi,

Thanks for the detailed steps.
We are testing this internally and will share more information with you later.

Hi,

We cannot reproduce this issue in our environment.
The sample can pass after reboot.

Could you try it with another USB storage device to see if it helps?
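
You can also check the kernel log after a reboot for I/O errors on the USB storage, for example:

# Look for USB / block-device I/O errors that could point to flaky storage
sudo dmesg | grep -iE "i/o error|sd[a-z]"
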
Thanks.
