CUDA hangs after installation of JetPack and reboot

Good day,

I use a custom carrier board with an AGX Xavier.
The flashing is based on the xavier-devkit configuration, with adaptations in the device tree.
After flashing, I run

sudo apt update
sudo apt install nvidia-jetpack
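
(To confirm what got installed, the metapackage version can be checked with standard Debian tooling; shown here only as a cross-check.)

apt show nvidia-jetpack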

I can compile and run the CUDA matrixMul sample successfully:

cd  /usr/local/cuda-10.2/samples/0_Simple/matrixMul/
sudo make 
./matrixMul
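
(For anyone reproducing this: deviceQuery from the same samples tree is another quick sanity check; the path below assumes the default CUDA 10.2 samples layout.)

cd /usr/local/cuda-10.2/samples/1_Utilities/deviceQuery/
sudo make
./deviceQuery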

Then I install PyTorch 1.6 from the "PyTorch for Jetson - version 1.7.0 now available" thread with:

sudo apt install python3-pip

python3 -m pip install Cython
python3 -m pip install torch-1.6.0-cp36-cp36m-linux_aarch64.whl

At this point, CUDA seems to work and I can execute:

import torch
torch.randn(10)
torch.randn(10).cuda()

And /usr/local/cuda-10.2/samples/0_Simple/matrixMul/matrixMul is still working
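
(A quick, non-allocating cross-check that the GPU is still visible from Python, using only standard PyTorch APIs:)

python3 -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"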

Then I reboot with

sudo reboot now

After that, both PyTorch cuda() and matrixMul hang:

[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "Xavier" with compute capability 7.2

MatrixA(320,320), MatrixB(640,320)

Dmesg gives the following errors:

[   68.445577] Call trace:                                                                                                                    
[   68.445976] [<ffffff8000fcf370>] nvgpu_mem_wr_n+0xd0/0xe0 [nvgpu]                                                                          
[   68.446381] [<ffffff8000ffdcdc>] gr_gk20a_load_golden_ctx_image+0x8c/0x2a0 [nvgpu]                                                         
[   68.446792] [<ffffff8000ffffcc>] gk20a_alloc_obj_ctx+0x6b4/0xac0 [nvgpu]                                                                   
[   68.447183] [<ffffff8000fa12d8>] gk20a_channel_ioctl+0xaf8/0x1320 [nvgpu]                                                                  
[   68.447195] [<ffffff80082724a8>] do_vfs_ioctl+0xb0/0x8d8                                                                                   
[   68.447202] [<ffffff8008272d5c>] SyS_ioctl+0x8c/0xa8                                                                                       
[   68.447212] [<ffffff8008083900>] el0_svc_naked+0x34/0x38                                                                                   
[   68.481120] nvgpu: 17000000.gv11b        gk20a_gr_handle_fecs_error:5298 [ERR]  ctxsw intr0 set by ucode, error_code: 0x00000015           
[   68.481374] ---- mlocks ----  
...
...
...
[  358.016065] pcieport 0004:00:00.0: AER: Corrected error received: id=0000
[  358.016091] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0000(Transmitter ID)
[  358.016298] pcieport 0004:00:00.0:   device [10de:1ad1] error status/mask=00001000/0000e000
[  358.016448] pcieport 0004:00:00.0:    [12] Replay Timer Timeout  
[  953.074887] pcieport 0004:00:00.0: AER: Corrected error received: id=0000
[  953.074914] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0000(Transmitter ID)
[  953.075134] pcieport 0004:00:00.0:   device [10de:1ad1] error status/mask=00001000/0000e000
[  953.075279] pcieport 0004:00:00.0:    [12] Replay Timer Timeout  
[ 1017.072592] pcieport 0004:00:00.0: AER: Corrected error received: id=0000
[ 1017.072617] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0000(Transmitter ID)
[ 1017.072824] pcieport 0004:00:00.0:   device [10de:1ad1] error status/mask=00001000/0000e000
[ 1017.072999] pcieport 0004:00:00.0:    [12] Replay Timer Timeout  
[ 2224.086816] pcieport 0004:00:00.0: AER: Corrected error received: id=0000
[ 2224.086841] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0000(Transmitter ID)
[ 2224.087047] pcieport 0004:00:00.0:   device [10de:1ad1] error status/mask=00001000/0000e000
[ 2224.087189] pcieport 0004:00:00.0:    [12] Replay Timer Timeout  
[ 2362.063088] pcieport 0004:00:00.0: AER: Corrected error received: id=0000
[ 2362.063114] pcieport 0004:00:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0000(Transmitter ID)
[ 2362.063321] pcieport 0004:00:00.0:   device [10de:1ad1] error status/mask=00001000/0000e000
[ 2362.063471] pcieport 0004:00:00.0:    [12] Replay Timer Timeout 

Update: CUDA hangs after installation of nvidia-jetpack + reboot alone, with no PyTorch involved.
Can you give me some directions on where to search for the cause and how to interpret the dmesg output at the end of the post?
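
(If more information helps, I can also collect the current power model and clock state; nvpmodel and jetson_clocks ship with JetPack, I am only assuming their output is relevant here.)

sudo nvpmodel -q
sudo jetson_clocks --show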

I tried to turn off ASPM, following the "How to edit kernel's command line" thread.
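
(On Jetson the kernel command line normally comes from /boot/extlinux/extlinux.conf; the sketch below assumes the stock layout and the standard pcie_aspm=off parameter, so treat it as a sketch rather than my exact APPEND line.)

sudo nano /boot/extlinux/extlinux.conf
# add pcie_aspm=off at the end of the existing APPEND line
sudo reboot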

CUDA still hangs after reboot. There is no AER error anymore, but there are other errors.
dmesg_aspmoff.txt (159.2 KB)
Example:

[   14.913957] nvgpu: 17000000.gv11b        gk20a_gr_handle_fecs_error:5281 [ERR]  fecs watchdog triggered for channel 511, cannot ctxsw anymore !!
[   14.914220] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:129  [ERR]  gr_fecs_os_r : 0
[   14.914376] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:131  [ERR]  gr_fecs_cpuctl_r : 0x40
[   14.914563] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:133  [ERR]  gr_fecs_idlestate_r : 0x1
[   14.914736] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:135  [ERR]  gr_fecs_mailbox0_r : 0x0
[   14.914904] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:137  [ERR]  gr_fecs_mailbox1_r : 0x0
[   14.915069] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:139  [ERR]  gr_fecs_irqstat_r : 0x0
[   14.915258] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:141  [ERR]  gr_fecs_irqmode_r : 0x4
[   14.915987] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:143  [ERR]  gr_fecs_irqmask_r : 0x8705
[   14.916713] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:145  [ERR]  gr_fecs_irqdest_r : 0x0
[   14.920287] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:147  [ERR]  gr_fecs_debug1_r : 0x40
[   14.929731] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:149  [ERR]  gr_fecs_debuginfo_r : 0x0
[   14.939351] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:151  [ERR]  gr_fecs_ctxsw_status_1_r : 0x980
[   14.949336] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(0) : 0x1
[   14.959504] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(1) : 0x0
[   14.970013] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(2) : 0x90009
[   14.980887] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(3) : 0x0
[   14.990739] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(4) : 0x1ffda0
[   15.001736] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(5) : 0x0
[   15.011899] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(6) : 0x15
[   15.022146] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(7) : 0x0
[   15.032613] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(8) : 0x0
[   15.042818] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(9) : 0x0
[   15.052955] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(10) : 0x0
[   15.063170] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(11) : 0x0
[   15.073830] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(12) : 0x0
[   15.084174] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(13) : 0x3fffffff
[   15.095230] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(14) : 0x0
[   15.105461] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:155  [ERR]  gr_fecs_ctxsw_mailbox_r(15) : 0x0
[   15.116023] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:159  [ERR]  gr_fecs_engctl_r : 0x0
[   15.125412] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:161  [ERR]  gr_fecs_curctx_r : 0x0
[   15.134966] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:163  [ERR]  gr_fecs_nxtctx_r : 0x0
[   15.144218] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:169  [ERR]  FECS_FALCON_REG_IMB : 0x0
[   15.153811] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:175  [ERR]  FECS_FALCON_REG_DMB : 0x0
[   15.163165] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:181  [ERR]  FECS_FALCON_REG_CSW : 0x110800
[   15.173197] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:187  [ERR]  FECS_FALCON_REG_CTX : 0x0
[   15.182716] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:193  [ERR]  FECS_FALCON_REG_EXCI : 0x0
[   15.192729] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0x51c4
[   15.202499] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0x1f44
[   15.212171] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0x51c8
[   15.222287] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0x1f48
[   15.232109] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0x62
[   15.241738] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0x1f48
[   15.251516] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:200  [ERR]  FECS_FALCON_REG_PC : 0x51c4
[   15.261290] nvgpu: 17000000.gv11b      gk20a_fecs_dump_falcon_stats:206  [ERR]  FECS_FALCON_REG_SP : 0x1f48
[   17.287905] nvgpu: 17000000.gv11b   nvgpu_set_error_notifier_locked:137  [ERR]  error notifier set to 8 for ch 511
[   17.288168] nvgpu: 17000000.gv11b   gv11b_fifo_handle_ctxsw_timeout:1611 [ERR]  ctxsw timeout error: active engine id =0, tsg=0, info: awaiting ack ms=3100
[   17.288548] ---- mlocks ----

Hi,

We are going to reproduce this issue.
We will update with more information later.

Thanks.

We contacted the distributor of this custom board.

They sent us two images:

  1. raw
  2. with CUDA 10.2

The raw image reproduces the same problem.
The image with preinstalled CUDA works fine.

Thanks for the update.

Good to know this.
