POWER8 Minsky (S822LC): NVIDIA stalls and kernel panic

Hello.

At first I was seeing stalls, and now I just get kernel panics, apparently at random, when running some CUDA programs together with nvidia-smi.

When it happens, it usually follows this sequence:

  • Running an executable written in CUDA;
  • Running nvidia-smi (with watch or -l 1).
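
For anyone who wants to try the same sequence, here is a minimal sketch of it in Python. The executable path is a placeholder passed on the command line (our real binaries are not included here), and the helper names build_commands and run are made up for this example. It only starts the CUDA program and polls nvidia-smi with -l 1 alongside it; I cannot promise it triggers the panic on another machine.

```python
import subprocess
import sys


def build_commands(cuda_executable, interval_s=1):
    """Return the two command lines from the sequence above.

    cuda_executable is a placeholder path to any CUDA program;
    it is an assumption of this sketch, not a specific binary.
    """
    workload = [cuda_executable]
    # `nvidia-smi -l N` re-queries the GPUs every N seconds.
    monitor = ["nvidia-smi", "-l", str(interval_s)]
    return workload, monitor


def run(cuda_executable, interval_s=1):
    """Start the CUDA program, poll nvidia-smi alongside it, and wait."""
    workload_cmd, monitor_cmd = build_commands(cuda_executable, interval_s)
    workload = subprocess.Popen(workload_cmd)
    monitor = subprocess.Popen(monitor_cmd)
    try:
        # Block until the CUDA program exits (or the machine goes down).
        return workload.wait()
    finally:
        # Stop the polling loop once the workload has finished.
        monitor.terminate()
        monitor.wait()


if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(run(sys.argv[1]))
```

Saved as, say, repro.py, it would be run as: python3 repro.py ./my_cuda_app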

Then, this happens:

[ 1284.229748] Kernel panic - not syncing: corrupted stack end detected inside scheduler
[ 1284.229748] 
[ 1284.229751] CPU: 48 PID: 9527 Comm: raft_gauss_opti Tainted: P           OE   4.10.0-37-generic #41~16.04.1-Ubuntu
[ 1284.229752] Call Trace:
[ 1284.229756] [c000003f5ebda110] [c000000000bb4d5c] dump_stack+0xb0/0xf0 (unreliable)
[ 1284.229758] [c000003f5ebda150] [c000000000bb0f10] panic+0x144/0x310
[ 1284.229760] [c000003f5ebda1e0] [c000000000ba1ebc] __schedule+0x96c/0x970
[ 1284.229762] [c000003f5ebda2c0] [c000000000ba2328] _cond_resched+0x58/0x80
[ 1284.229764] [c000003f5ebda2f0] [c0000000003229d4] __kmalloc+0x1c4/0x3a0
[ 1284.229863] [c000003f5ebda350] [d00000003fa7e668] os_alloc_mem+0x128/0x170 [nvidia]
[ 1284.229989] [c000003f5ebda380] [d0000000402adb78] _nv006022rm+0x48/0x70 [nvidia]
[ 1284.230116] [c000003f5ebda3b0] [d0000000402ad7fc] _nv006026rm+0x1c/0x40 [nvidia]
[ 1284.230242] [c000003f5ebda3d0] [d0000000402ad1f4] _nv006024rm+0x64/0x160 [nvidia]
[ 1284.230403] [c000003f5ebda400] [d0000000400b6b1c] _nv026745rm+0x65c/0xab0 [nvidia]
[ 1284.230563] [c000003f5ebda530] [d0000000400b1410] _nv005899rm+0x240/0x280 [nvidia]

The trace below was gathered from kdump.

It was not always like this. In the beginning, what we usually saw was the trace below, with one CPU usually stuck in a loop. When we then tried to run nvidia-smi, it hung, and so did lsmod.

I would appreciate any hints on this.

[ 3619.116726] Kernel panic - not syncing: corrupted stack end detected inside scheduler

[ 3619.116729] CPU: 40 PID: 13291 Comm: raft_gauss_filt Tainted: P           OE   4.10.0-37-generic #41~16.04.1-Ubuntu
[ 3619.116730] Call Trace:
[ 3619.116736] [c00000792e769de0] [c000000000bb4d5c] dump_stack+0xb0/0xf0 (unreliable)
[ 3619.116740] [c00000792e769e20] [c000000000bb0f10] panic+0x144/0x310
[ 3619.116743] [c00000792e769eb0] [c000000000ba1ebc] __schedule+0x96c/0x970
[ 3619.116746] [c00000792e769f90] [c000000000ba2328] _cond_resched+0x58/0x80
[ 3619.116749] [c00000792e769fc0] [c0000000002f1ecc] vunmap+0x3c/0x90
[ 3619.116941] [c00000792e769ff0] [d00000003ff5dd74] nv_vm_unmap_pages+0x94/0xb0 [nvidia]
[ 3619.117128] [c00000792e76a050] [d00000003ff54510] nv_free_kernel_mapping+0x40/0x80 [nvidia]
[ 3619.117325] [c00000792e76a070] [d0000000408b7594] _nv026434rm+0xa4/0x120 [nvidia]
[ 3619.117643] [c00000792e76a0c0] [d0000000404f5c88] _nv023168rm+0x158/0x200 [nvidia]
[ 3619.117873] [c00000792e76a120] [d0000000407fd60c] _nv006241rm+0x2cc/0x6c0 [nvidia]
[ 3619.118105] [c00000792e76a1d0] [d0000000407fe250] _nv001112rm+0x380/0x450 [nvidia]
[ 3619.118337] [c00000792e76a2f0] [d0000000407e5064] _nv000092rm+0x3b4/0x7d0 [nvidia]
[ 3619.118567] [c00000792e76a610] [d00000004080ccd0] _nv023173rm+0x230/0x610 [nvidia]
[ 3619.118810] [c00000792e76a6f0] [d00000004079dc74] _nv003244rm+0x24/0x60 [nvidia]
[ 3619.119053] [c00000792e76a720] [d0000000407962a8] _nv003164rm+0x38/0x60 [nvidia]
[ 3619.119296] [c00000792e76a740] [d000000040795e5c] _nv003451rm+0x3c/0xe0 [nvidia]
[ 3619.119540] [c00000792e76a770] [d000000040793268] _nv008791rm+0x658/0x810 [nvidia]
[ 3619.119784] [c00000792e76a940] [d000000040793528] _nv008790rm+0x108/0x180 [nvidia]
[ 3619.120028] [c00000792e76aa50] [d000000040790a20] _nv030087rm+0x90/0xe0 [nvidia]
[ 3619.120257] [c00000792e76aa70] [d00000004080c7e0] _nv006231rm+0x1e0/0x4a0 [nvidia]
[ 3619.120487] [c00000792e76ab40] [d00000004080c3bc] _nv006229rm+0x31c/0x3b0 [nvidia]
[ 3619.120781] [c00000792e76ac40] [d00000004042415c] _nv010537rm+0xfc/0x3a0 [nvidia]
[ 3619.121018] [c00000792e76ada0] [d0000000407c4890] _nv003176rm+0x20/0x50 [nvidia]
[ 3619.121260] [c00000792e76add0] [d0000000407962a8] _nv003164rm+0x38/0x60 [nvidia]
[ 3619.121503] [c00000792e76adf0] [d000000040795e5c] _nv003451rm+0x3c/0xe0 [nvidia]
[ 3619.121747] [c00000792e76ae20] [d000000040793268] _nv008791rm+0x658/0x810 [nvidia]
[ 3619.121991] [c00000792e76aff0] [d000000040793528] _nv008790rm+0x108/0x180 [nvidia]
[ 3619.122236] [c00000792e76b100] [d000000040790a20] _nv030087rm+0x90/0xe0 [nvidia]
[ 3619.122464] [c00000792e76b120] [d00000004080c7e0] _nv006231rm+0x1e0/0x4a0 [nvidia]
[ 3619.122693] [c00000792e76b1f0] [d00000004080c3bc] _nv006229rm+0x31c/0x3b0 [nvidia]
[ 3619.122926] [c00000792e76b2f0] [d0000000407e0a24] _nv008769rm+0x84/0x2f0 [nvidia]
[ 3619.123164] [c00000792e76b430] [d0000000407c1c24] _nv003916rm+0x34/0x90 [nvidia]
[ 3619.123409] [c00000792e76b450] [d00000004078fd3c] _nv006298rm+0x5c/0x120 [nvidia]
[ 3619.123653] [c00000792e76b4d0] [d0000000407907a0] _nv030084rm+0xf0/0x120 [nvidia]
[ 3619.123881] [c00000792e76b520] [d00000004080c5b8] _nv006230rm+0x58/0xa0 [nvidia]
[ 3619.124109] [c00000792e76b560] [d00000004080c3bc] _nv006229rm+0x31c/0x3b0 [nvidia]
[ 3619.124377] [c00000792e76b660] [d00000004031af10] _nv029431rm+0xf0/0x1a0 [nvidia]
[ 3619.124652] [c00000792e76b700] [d000000040361038] _nv005397rm+0x128/0x150 [nvidia]
[ 3619.124932] [c00000792e76b750] [d0000000403986d0] _nv016044rm+0x30/0x70 [nvidia]
[ 3619.125214] [c00000792e76b770] [d0000000403ab0f0] _nv016015rm+0xb0/0x300 [nvidia]
[ 3619.125484] [c00000792e76b7c0] [d00000004074920c] _nv004654rm+0x2c/0x50 [nvidia]
[ 3619.125776] [c00000792e76b7e0] [d00000004041948c] _nv018038rm+0xec/0x1d0 [nvidia]
[ 3619.126068] [c00000792e76b870] [d000000040419dd8] _nv018039rm+0x7e8/0x8b0 [nvidia]
[ 3619.126261] [c00000792e76b930] [d0000000408cf54c] _nv000975rm+0x42c/0x510 [nvidia]
[ 3619.126456] [c00000792e76b9c0] [d0000000408c28bc] rm_disable_adapter+0x5c/0xe0 [nvidia]
[ 3619.126642] [c00000792e76bab0] [d00000003ff519fc] nv_shutdown_adapter+0x3c/0xe0 [nvidia]
[ 3619.126828] [c00000792e76baf0] [d00000003ff51b70] nv_close_device+0xd0/0x230 [nvidia]
[ 3619.127015] [c00000792e76bb70] [d00000003ff57264] nvidia_close+0x114/0x470 [nvidia]
[ 3619.127202] [c00000792e76bc20] [d00000003ff50670] nvidia_frontend_close+0x60/0xa0 [nvidia]
[ 3619.127206] [c00000792e76bc50] [c000000000362968] __fput+0xe8/0x310
[ 3619.127209] [c00000792e76bcb0] [c000000000115de0] task_work_run+0x140/0x1a0
[ 3619.127211] [c00000792e76bd00] [c0000000000f03ec] do_exit+0x3ac/0xc80
[ 3619.127213] [c00000792e76bdd0] [c0000000000f0d94] do_group_exit+0x64/0x100
[ 3619.127216] [c00000792e76be10] [c0000000000f0e58] SyS_exit_group+0x28/0x30
[ 3619.127219] [c00000792e76be30] [c00000000000b184] system_call+0x38/0xe0

Edit 1: FWIW, the machine has 4 Tesla P100 GPUs.
Edit 2: Included complete call trace from kernel panic.