Good day,
I am using a custom carrier board with an AGX Xavier module.
The flash configuration is based on xavier-devkit, with adaptations in the device tree.
After flashing, I run:
sudo apt update
sudo apt install nvidia-jetpack
After that, I can compile and run a CUDA sample successfully:
cd /usr/local/cuda-10.2/samples/0_Simple/matrixMul/
sudo make
./matrixMul
Then I install PyTorch 1.6 from "PyTorch for Jetson" with:
sudo apt install python3-pip
python3 -m pip install Cython
python3 -m pip install torch-1.6.0-cp36-cp36m-linux_aarch64.whl
At this point, CUDA seems to work and I can execute the following in Python:
import torch
torch.randn(10)
torch.randn(10).cuda()
/usr/local/cuda-10.2/samples/0_Simple/matrixMul/matrixMul also still works.
Then I reboot with:
sudo reboot now
After the reboot, both the PyTorch cuda() call and matrixMul hang. matrixMul prints only the following and never returns:
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Xavier" with compute capability 7.2
MatrixA(320,320), MatrixB(640,320)
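To check for the hang from a script after each reboot, rather than waiting on a stuck terminal, a timeout wrapper along these lines could be used. This is only a minimal sketch: the helper name `run_with_timeout` and the timeout values are assumptions for illustration, not part of any NVIDIA tooling.

```python
import subprocess

def run_with_timeout(cmd, timeout_s):
    """Run cmd and return (hung, returncode).

    hung is True if the process did not exit within timeout_s seconds.
    """
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        return False, proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()
        try:
            # Give the kernel a moment to reap the process.
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            # A process blocked inside a driver ioctl may not be killable;
            # do not wait for it forever.
            pass
        return True, None
```

For example, `run_with_timeout(["/usr/local/cuda-10.2/samples/0_Simple/matrixMul/matrixMul"], 60)` would return `(True, None)` if the sample is still running after 60 seconds.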
dmesg shows the following errors:
[ 68.445577] Call trace:
[ 68.445976] [<ffffff8000fcf370>] nvgpu_mem_wr_n+0xd0/0xe0 [nvgpu]
[ 68.446381] [<ffffff8000ffdcdc>] gr_gk20a_load_golden_ctx_image+0x8c/0x2a0 [nvgpu]
[ 68.446792] [<ffffff8000ffffcc>] gk20a_alloc_obj_ctx+0x6b4/0xac0 [nvgpu]
[ 68.447183] [<ffffff8000fa12d8>] gk20a_channel_ioctl+0xaf8/0x1320 [nvgpu]
[ 68.447195] [<ffffff80082724a8>] do_vfs_ioctl+0xb0/0x8d8
[ 68.447202] [<ffffff8008272d5c>] SyS_ioctl+0x8c/0xa8
[ 68.447212] [<ffffff8008083900>] el0_svc_naked+0x34/0x38
[ 68.481120] nvgpu: 17000000.gv11b gk20a_gr_handle_fecs_error:5298 [ERR] ctxsw intr0 set by ucode, error_code: 0x00000015
[ 68.481374] ---- mlocks ----
...
...
...
[ 358.016065] pcieport 0004:00:00.0: AER: Corrected error received: id=0000
[ 358.016091] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0000(Transmitter ID)
[ 358.016298] pcieport 0004:00:00.0: device [10de:1ad1] error status/mask=00001000/0000e000
[ 358.016448] pcieport 0004:00:00.0: [12] Replay Timer Timeout
[ 953.074887] pcieport 0004:00:00.0: AER: Corrected error received: id=0000
[ 953.074914] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0000(Transmitter ID)
[ 953.075134] pcieport 0004:00:00.0: device [10de:1ad1] error status/mask=00001000/0000e000
[ 953.075279] pcieport 0004:00:00.0: [12] Replay Timer Timeout
[ 1017.072592] pcieport 0004:00:00.0: AER: Corrected error received: id=0000
[ 1017.072617] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0000(Transmitter ID)
[ 1017.072824] pcieport 0004:00:00.0: device [10de:1ad1] error status/mask=00001000/0000e000
[ 1017.072999] pcieport 0004:00:00.0: [12] Replay Timer Timeout
[ 2224.086816] pcieport 0004:00:00.0: AER: Corrected error received: id=0000
[ 2224.086841] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0000(Transmitter ID)
[ 2224.087047] pcieport 0004:00:00.0: device [10de:1ad1] error status/mask=00001000/0000e000
[ 2224.087189] pcieport 0004:00:00.0: [12] Replay Timer Timeout
[ 2362.063088] pcieport 0004:00:00.0: AER: Corrected error received: id=0000
[ 2362.063114] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0000(Transmitter ID)
[ 2362.063321] pcieport 0004:00:00.0: device [10de:1ad1] error status/mask=00001000/0000e000
[ 2362.063471] pcieport 0004:00:00.0: [12] Replay Timer Timeout
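As a possible aid for reading the pcieport lines above, the `error status/mask` values can be decoded bit by bit. The sketch below assumes the standard bit layout of the PCIe AER Correctable Error Status register; the table `CORRECTABLE_BITS` and the helpers `decode_bits` and `decode_line` are hypothetical names introduced here for illustration.

```python
import re

# Assumed bit positions of the PCIe AER Correctable Error Status register
# (names follow the PCIe specification; the kernel prints similar strings).
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}

STATUS_RE = re.compile(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})")

def decode_bits(value):
    """Return the names of all known bits set in value, lowest bit first."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if value & (1 << bit)]

def decode_line(line):
    """Return (reported, masked) bit names for one dmesg line, or None."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    status = int(m.group(1), 16)
    mask = int(m.group(2), 16)
    return decode_bits(status), decode_bits(mask)

# One of the lines from the log above.
line = ("[ 358.016298] pcieport 0004:00:00.0: device [10de:1ad1] "
        "error status/mask=00001000/0000e000")
reported, masked = decode_line(line)
print("reported:", reported)
print("masked:  ", masked)
```

Under that assumption, status `00001000` has only bit 12 set, which matches the `[12] Replay Timer Timeout` line that the kernel prints right after it, and mask `0000e000` covers bits 13 to 15.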
Update: CUDA also hangs after installing nvidia-jetpack and rebooting, with no PyTorch involved.
Could you give me some direction on where to look for the cause, and on how to interpret the dmesg output in this post?