I tried turning off ASPM, following "How to edit kernel's command line - #7 by linuxdev".
CUDA still hangs after reboot. There is no AER error now, but there are other errors:
dmesg_aspmoff.txt (159.2 KB)
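To rule out the boot argument simply not being applied, here is a minimal sketch that checks the running kernel's command line. It assumes ASPM was disabled with the standard `pcie_aspm=off` kernel parameter (the post above does not name the exact argument), and the helper name `aspm_disabled_on_cmdline` is made up for illustration.

```python
from pathlib import Path


def aspm_disabled_on_cmdline(cmdline: str) -> bool:
    """Return True if the boot arguments contain the exact token pcie_aspm=off.

    Assumption: ASPM was turned off via the pcie_aspm=off kernel parameter.
    """
    return "pcie_aspm=off" in cmdline.split()


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")  # boot arguments of the running kernel
    if proc_cmdline.exists():
        found = aspm_disabled_on_cmdline(proc_cmdline.read_text())
        print("pcie_aspm=off present:", found)
```

If this reports that the token is missing, the edited command line probably never reached the kernel, and the ASPM test would need to be repeated.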
Example from the log:
[ 14.913957] nvgpu: 17000000.gv11b gk20a_gr_handle_fecs_error:5281 [ERR] fecs watchdog triggered for channel 511, cannot ctxsw anymore !!
[ 14.914220] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:129 [ERR] gr_fecs_os_r : 0
[ 14.914376] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:131 [ERR] gr_fecs_cpuctl_r : 0x40
[ 14.914563] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:133 [ERR] gr_fecs_idlestate_r : 0x1
[ 14.914736] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:135 [ERR] gr_fecs_mailbox0_r : 0x0
[ 14.914904] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:137 [ERR] gr_fecs_mailbox1_r : 0x0
[ 14.915069] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:139 [ERR] gr_fecs_irqstat_r : 0x0
[ 14.915258] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:141 [ERR] gr_fecs_irqmode_r : 0x4
[ 14.915987] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:143 [ERR] gr_fecs_irqmask_r : 0x8705
[ 14.916713] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:145 [ERR] gr_fecs_irqdest_r : 0x0
[ 14.920287] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:147 [ERR] gr_fecs_debug1_r : 0x40
[ 14.929731] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:149 [ERR] gr_fecs_debuginfo_r : 0x0
[ 14.939351] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:151 [ERR] gr_fecs_ctxsw_status_1_r : 0x980
[ 14.949336] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(0) : 0x1
[ 14.959504] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(1) : 0x0
[ 14.970013] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(2) : 0x90009
[ 14.980887] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(3) : 0x0
[ 14.990739] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(4) : 0x1ffda0
[ 15.001736] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(5) : 0x0
[ 15.011899] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(6) : 0x15
[ 15.022146] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(7) : 0x0
[ 15.032613] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(8) : 0x0
[ 15.042818] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(9) : 0x0
[ 15.052955] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(10) : 0x0
[ 15.063170] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(11) : 0x0
[ 15.073830] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(12) : 0x0
[ 15.084174] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(13) : 0x3fffffff
[ 15.095230] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(14) : 0x0
[ 15.105461] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:155 [ERR] gr_fecs_ctxsw_mailbox_r(15) : 0x0
[ 15.116023] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:159 [ERR] gr_fecs_engctl_r : 0x0
[ 15.125412] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:161 [ERR] gr_fecs_curctx_r : 0x0
[ 15.134966] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:163 [ERR] gr_fecs_nxtctx_r : 0x0
[ 15.144218] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:169 [ERR] FECS_FALCON_REG_IMB : 0x0
[ 15.153811] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:175 [ERR] FECS_FALCON_REG_DMB : 0x0
[ 15.163165] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:181 [ERR] FECS_FALCON_REG_CSW : 0x110800
[ 15.173197] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:187 [ERR] FECS_FALCON_REG_CTX : 0x0
[ 15.182716] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:193 [ERR] FECS_FALCON_REG_EXCI : 0x0
[ 15.192729] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:200 [ERR] FECS_FALCON_REG_PC : 0x51c4
[ 15.202499] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:206 [ERR] FECS_FALCON_REG_SP : 0x1f44
[ 15.212171] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:200 [ERR] FECS_FALCON_REG_PC : 0x51c8
[ 15.222287] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:206 [ERR] FECS_FALCON_REG_SP : 0x1f48
[ 15.232109] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:200 [ERR] FECS_FALCON_REG_PC : 0x62
[ 15.241738] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:206 [ERR] FECS_FALCON_REG_SP : 0x1f48
[ 15.251516] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:200 [ERR] FECS_FALCON_REG_PC : 0x51c4
[ 15.261290] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:206 [ERR] FECS_FALCON_REG_SP : 0x1f48
[ 17.287905] nvgpu: 17000000.gv11b nvgpu_set_error_notifier_locked:137 [ERR] error notifier set to 8 for ch 511
[ 17.288168] nvgpu: 17000000.gv11b gv11b_fifo_handle_ctxsw_timeout:1611 [ERR] ctxsw timeout error: active engine id =0, tsg=0, info: awaiting ack ms=3100
[ 17.288548] ---- mlocks ----
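Since the full log is long, a small filter can make the FECS dump easier to compare between boots. The following is a rough sketch, assuming every line follows the `[ time] nvgpu: <device> <function>:<line> [ERR] <message>` layout seen in the excerpt above. The function names `parse_nvgpu_errors` and `nonzero_registers` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re

# Matches lines shaped like the excerpt above, e.g.
# [ 14.914376] nvgpu: 17000000.gv11b gk20a_fecs_dump_falcon_stats:131 [ERR] gr_fecs_cpuctl_r : 0x40
LINE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+nvgpu:\s+\S+\s+"
    r"(?P<func>\w+):(?P<line>\d+)\s+\[ERR\]\s+(?P<msg>.*)"
)
# A register-dump message is "<name> : <hex or decimal value>".
REG_RE = re.compile(r"^(?P<name>[\w()]+)\s*:\s*(?P<value>0x[0-9a-fA-F]+|\d+)$")


def parse_nvgpu_errors(text):
    """Yield (timestamp, function, message) for each nvgpu [ERR] line."""
    for raw in text.splitlines():
        m = LINE_RE.search(raw)
        if m:
            yield float(m.group("ts")), m.group("func"), m.group("msg").strip()


def nonzero_registers(text):
    """Return (name, value) pairs for FECS dump lines whose value is not zero."""
    result = []
    for _, func, msg in parse_nvgpu_errors(text):
        reg = REG_RE.match(msg)
        if func != "gk20a_fecs_dump_falcon_stats" or reg is None:
            continue
        raw_value = reg.group("value")
        base = 16 if raw_value.lower().startswith("0x") else 10
        value = int(raw_value, base)
        if value:
            result.append((reg.group("name"), value))
    return result
```

For instance, calling `nonzero_registers(open("dmesg_aspmoff.txt").read())` on the attached log would list only the registers that hold a nonzero value, which is easier to diff against a boot where CUDA works.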