Tesla K20C: I2C communication + CUDA resulting in system breakdown ("NVRM: os_schedule: Attempt

[Reported this as a bug, but also posting it here in the case that others can shed some light on the issue]

uname -a
Linux moria 3.13.0-35-generic #62-Ubuntu SMP Fri Aug 15 01:58:42 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.1 LTS
Release: 14.04
Codename: trusty

I am trying to communicate with an I2C device on a Tesla GPU (specifically, an INA3221 power rail sensor) to get fast readings of the power consumption of the card. For this, I am using the i2c userspace interface provided in Ubuntu’s package repositories (package libi2c-dev). I am constantly reading the sensors (tested with and without 10 ms delays between the readings) through the function i2c_smbus_read_word_data(…). This produces a lot of output, but occasionally, the system seems to hang. For example, see the following dmesg log output (please read the comments on the side of the log):

[ 917.503276] I2C_algo_smbus_xfer: rw 1, command 3, addr 40 // The I2C_* are debug printouts I added to the NVIDIA frontend module, file nv-i2c.c
[ 917.503277] I2C_SMBUS_WORD_DATA
[ 917.503338] NVRM: ioctl(0x57, 0xddf88af0, 0x38) // I set the loglevel in the frontend sources to “DBG_LEVEL_INFO”
[ 917.503664] I2C_algo_smbus_xfer: rw 1, command 5, addr 40
[ 917.503673] I2C_SMBUS_WORD_DATA
[ 917.503720] NVRM: ioctl(0x4a, 0xddf88cd0, 0xa0)
[ 921.501054] rmStatus 0x38 (56) // This is a debug print at the end of the algo_smbus_transfer function. It prints on error from rm_i2c_transfer. NOTICE that it has spent 3-4 seconds in this function - obviously, something in the “hidden” nvidia code (function rm_i2c_transfer) has taken a very long time compared to the other function calls to algo_smbus_transfer. rmStatus is set to RM_ERR_TIMEOUT
[ 921.501148] NVRM: ioctl(0x57, 0xddf88b80, 0x38) // Other driver communication starts here, don’t know if this has anything to do with the issue. This is your playground… ;)
[ 921.502287] NVRM: ioctl(0x2a, 0xddf88b80, 0x20)
[ 921.502311] NVRM: ioctl(0x57, 0xddf88af0, 0x38)
[ 921.502401] NVRM: ioctl(0x58, 0xddf88dd0, 0x28)
[ 921.502439] NVRM: ioctl(0x58, 0xddf88da0, 0x28)
[ 921.502462] NVRM: ioctl(0x29, 0xddf88dc0, 0x10)
[ 921.503503] NVRM: ioctl(0x2a, 0xddf88490, 0x20)
[ 921.503540] NVRM: ioctl(0x41, 0xddf88580, 0x38)
[ 921.506597] NVRM: ioctl(0x58, 0xddf88dd0, 0x28)
[ 921.506634] NVRM: ioctl(0x58, 0xddf88da0, 0x28)
[ 921.506658] NVRM: ioctl(0x29, 0xddf88dc0, 0x10)
[ 921.506781] NVRM: ioctl(0x58, 0xe7dc1b60, 0x28)
[ 921.506816] NVRM: ioctl(0x29, 0xe7dc1b10, 0x10)
[ 921.506847] NVRM: ioctl(0x34, 0xe7dc1b50, 0x1c)
[ 921.506874] NVRM: ioctl(0x57, 0xe7dc1b70, 0x38)
[ 921.507240] NVRM: ioctl(0x58, 0xddf88dd0, 0x28)

The issue happens very often, usually within the first 20 seconds after I start the printing. It seems that the system recovers from this condition by itself.

The worst thing happens when I try to run a CUDA kernel WHILE I am also reading the power consumption through I2C as described previously. In this case, the system CAN hang completely (happens most of the time, but not always) and doesn’t recover, and I get new errors in dmesg (again, refer to my comments, I will try to explain my observations):

Sep 5 09:55:12 moria kernel: [ 211.483759] NVRM: ioctl(0x2b, 0xf0a11590, 0x20)
Sep 5 09:55:12 moria kernel: [ 211.484122] I2C_algo_smbus_xfer: rw 1, command 1, addr 40 // My I2C C program is polling the INA3221 through NVIDIA’s I2C interface. This works perfectly for some time…
Sep 5 09:55:12 moria kernel: [ 211.484123] I2C_SMBUS_WORD_DATA
Sep 5 09:55:12 moria kernel: [ 211.484170] NVRM: ioctl(0x2b, 0xf0a11590, 0x20)
Sep 5 09:55:12 moria kernel: [ 211.484593] I2C_algo_smbus_xfer: rw 1, command 3, addr 40 // Notice here that it takes a loong time until the next I2C_algo_smbus_xfer… From my angle it seems like the timeout error above… In fact, the rm_i2c_transfer doesn’t return the timeout error before time 241.456928 below. But so many other interesting things happen before that now.
Sep 5 09:55:12 moria kernel: [ 211.484594] I2C_SMBUS_WORD_DATA
Sep 5 09:55:12 moria kernel: [ 211.790468] NVRM: ioctl(0x52, 0xe77e4c50, 0x10)
Sep 5 09:55:13 moria kernel: [ 212.297783] NVRM: ioctl(0x4a, 0x1b6e1880, 0xa0)
Sep 5 09:55:13 moria kernel: [ 212.483760] NVRM: ioctl(0x2a, 0xf0a11660, 0x20)
Sep 5 09:55:14 moria kernel: [ 213.483162] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:16 moria kernel: [ 215.481212] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:18 moria kernel: [ 217.479262] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:20 moria kernel: [ 219.477312] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:22 moria kernel: [ 221.475360] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:24 moria kernel: [ 223.473409] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:26 moria kernel: [ 225.471458] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:27 moria kernel: [ 226.967543] usb 3-13: USB disconnect, device number 4
Sep 5 09:55:28 moria kernel: [ 227.469508] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:30 moria kernel: [ 229.467556] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:32 moria kernel: [ 231.465605] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:34 moria kernel: [ 233.463655] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:36 moria kernel: [ 235.461705] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:38 moria kernel: [ 237.459753] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:40 moria kernel: [ 239.457802] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:42 moria kernel: [ 241.456928] rmStatus 0x38 (56)
Sep 5 09:55:42 moria kernel: [ 241.457173] NVRM: ioctl(0x57, 0x1b6e1730, 0x38)
Sep 5 09:55:42 moria colord: device removed: sysfs-(null)
Sep 5 09:55:44 moria kernel: [ 243.454294] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:46 moria kernel: [ 245.452343] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:48 moria kernel: [ 247.450391] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:50 moria kernel: [ 249.448439] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:52 moria kernel: [ 251.446486] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:54 moria kernel: [ 253.444533] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:56 moria kernel: [ 255.442581] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:55:58 moria kernel: [ 257.440628] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:56:00 moria kernel: [ 259.438676] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:56:00 moria colord: device removed: sysfs-SAMSUNG-SAMSUNG_Android
Sep 5 09:56:02 moria kernel: [ 261.436724] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:56:04 moria kernel: [ 263.434771] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:56:04 moria kernel: [ 263.831736] BUG: soft lockup - CPU#0 stuck for 21s! [systemd-udevd:4379]
Sep 5 09:56:04 moria kernel: [ 263.831737] Modules linked in: nvidia_uvm(POF) btrfs raid6_pq xor ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c i2c_dev hid_generic hid_logitech_dj usbhid hid ch341 usbserial bnep rfcomm bluetooth binfmt_misc snd_hda_codec_hdmi nls_iso8859_1 nvidia(POF) x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel snd_hda_codec_realtek kvm snd_hda_intel snd_hda_codec crct10dif_pclmul crc32_pclmul snd_hwdep nouveau snd_pcm ghash_clmulni_intel snd_page_alloc snd_seq_midi aesni_intel snd_seq_midi_event aes_x86_64 snd_rawmidi lrw snd_seq gf128mul glue_helper ttm ablk_helper drm_kms_helper snd_seq_device cryptd snd_timer drm serio_raw mxm_wmi i2c_algo_bit snd soundcore parport_pc shpchp ppdev video mei_me wmi intel_smartconnect acpi_pad mei lp mac_hid parport ahci psmouse r8169 libahci mii
Sep 5 09:56:04 moria kernel: [ 263.831773] CPU: 0 PID: 4379 Comm: systemd-udevd Tainted: PF O 3.13.0-35-generic #62-Ubuntu
Sep 5 09:56:04 moria kernel: [ 263.831774] Hardware name: MSI MS-7921/Z97-G55 SLI (MS-7921), BIOS V1.0 04/21/2014
Sep 5 09:56:04 moria kernel: [ 263.831776] task: ffff88022f40e000 ti: ffff8800a1a30000 task.ti: ffff8800a1a30000
Sep 5 09:56:04 moria kernel: [ 263.831777] RIP: 0010:[] [] _nv013406rm+0x13/0x40 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.831827] RSP: 0018:ffff88023ec03c28 EFLAGS: 00000202
Sep 5 09:56:04 moria kernel: [ 263.831828] RAX: 0000000000000040 RBX: 0000003d7ca20c42 RCX: 0000000000000040
Sep 5 09:56:04 moria kernel: [ 263.831829] RDX: ffffc90008000000 RSI: ffff8800baec0008 RDI: ffff88023269c808
Sep 5 09:56:04 moria kernel: [ 263.831830] RBP: ffff8800b985dd40 R08: 0000000000000246 R09: ffff8800b985dd54
Sep 5 09:56:04 moria kernel: [ 263.831831] R10: 0000000000000000 R11: ffffffffa1136bc0 R12: ffff88023ec03b98
Sep 5 09:56:04 moria kernel: [ 263.831832] R13: ffffffff8172fadd R14: ffff8800b985dd40 R15: 0000000000100000
Sep 5 09:56:04 moria kernel: [ 263.831834] FS: 00007f5ddb0b7880(0000) GS:ffff88023ec00000(0000) knlGS:0000000000000000
Sep 5 09:56:04 moria kernel: [ 263.831835] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 5 09:56:04 moria kernel: [ 263.831836] CR2: 00007f5dd9902230 CR3: 00000000a1a24000 CR4: 00000000001407f0
Sep 5 09:56:04 moria kernel: [ 263.831837] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Sep 5 09:56:04 moria kernel: [ 263.831838] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Sep 5 09:56:04 moria kernel: [ 263.831838] Stack:
Sep 5 09:56:04 moria kernel: [ 263.831839] ffff8800baec0008 ffffffffa1136cb6 ffff8800baec0008 0000000000404e1e
Sep 5 09:56:04 moria kernel: [ 263.831842] 0000000000000100 0000000000000002 ffff8800bb1b4008 ffffffffa113709d
Sep 5 09:56:04 moria kernel: [ 263.831844] ffff8800baec0008 ffff8800b9eea008 0000000000000000 ffffffffa11f1225
Sep 5 09:56:04 moria kernel: [ 263.831845] Call Trace:
Sep 5 09:56:04 moria kernel: [ 263.831847]
Sep 5 09:56:04 moria kernel: [ 263.831848] [] ? _nv008899rm+0x1896/0x5c10 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.832032] [] ? _nv008899rm+0x1c7d/0x5c10 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.832119] [] ? _nv011935rm+0x55/0x130 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.832203] [] ? _nv012108rm+0xc6/0x300 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.832281] [] ? _nv003798rm+0x46d/0x20a0 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.832359] [] ? _nv003798rm+0x8ee/0x20a0 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.832442] [] ? _nv012194rm+0x1cd/0x5f0 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.832529] [] ? _nv012144rm+0x9e/0x160 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.832617] [] ? _nv011815rm+0x159/0x1b0 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.832704] [] ? _nv012184rm+0xe4d/0x1040 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.832791] [] ? _nv012184rm+0xab2/0x1040 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.832847] [] ? _nv016850rm+0x227/0x3a0 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.832902] [] ? _nv016853rm+0x28/0x80 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.832986] [] ? _nv001301rm+0x194/0x240 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.833074] [] ? _nv012100rm+0x47/0x70 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.833110] [] ? _nv000792rm+0x105/0x140 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.833147] [] ? rm_isr_bh+0x23/0x70 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.833183] [] ? nvidia_isr_bh+0x3b/0x60 [nvidia]
Sep 5 09:56:04 moria kernel: [ 263.833187] [] ? tasklet_action+0xf6/0x100
Sep 5 09:56:04 moria kernel: [ 263.833189] [] ? __do_softirq+0xec/0x2c0
Sep 5 09:56:04 moria kernel: [ 263.833192] [] ? do_softirq_own_stack+0x1c/0x30
Sep 5 09:56:04 moria kernel: [ 263.833192]
Sep 5 09:56:04 moria kernel: [ 263.833193] [] ? do_softirq+0x55/0x60
Sep 5 09:56:04 moria kernel: [ 263.833197] [] ? local_bh_enable_ip+0x84/0x90
Sep 5 09:56:04 moria kernel: [ 263.833199] [] ? _raw_spin_unlock_bh+0x1b/0x40
Sep 5 09:56:04 moria kernel: [ 263.833203] [] ? netlink_poll+0xf5/0x1e0
Sep 5 09:56:04 moria kernel: [ 263.833205] [] ? sock_poll+0x4c/0x100
Sep 5 09:56:04 moria kernel: [ 263.833208] [] ? SyS_epoll_ctl+0x53f/0x9e0
Sep 5 09:56:04 moria kernel: [ 263.833209] [] ? ep_poll_wakeup_proc+0x30/0x30
Sep 5 09:56:04 moria kernel: [ 263.833212] [] ? system_call_fastpath+0x1a/0x1f
Sep 5 09:56:04 moria kernel: [ 263.833212] Code: ff e8 92 b3 fc ff 0f b7 c3 5b c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 53 31 db 39 4a 10 76 0f 48 8b 12 c1 e9 02 89 c8 8b 1c 82 <89> d8 5b c3 31 ff e8 22 12 00 00 be 01 00 00 00 48 89 c2 31 ff
Sep 5 09:56:06 moria kernel: [ 265.432819] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:56:08 moria kernel: [ 267.430868] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:56:10 moria kernel: [ 269.428916] NVRM: os_schedule: Attempted to yield the CPU while in atomic or interrupt context
Sep 5 09:56:12 moria kernel: [ 271.427740] NVRM: ioctl(0x4a, 0xf0a11330, 0xa0)
Sep 5 09:56:12 moria kernel: [ 271.427822] NVRM: ioctl(0x2a, 0x1b6e1730, 0x20)
Sep 5 09:56:12 moria kernel: [ 271.428487] NVRM: VM: nv_alloc_pages: 256 pages
Sep 5 09:56:12 moria kernel: [ 271.428491] NVRM: VM: contig 0 cache_type 0
Sep 5 09:56:12 moria kernel: [ 271.428524] NVRM: VM: 2706796096: 256 pages
Sep 5 09:56:12 moria kernel: [ 271.428802] NVRM: VM: nv_alloc_pages:2244: 0xffff88021771ec00, 256 page(s), count = 1, flags = 0x00000011, page_table = 0xffff8802175bb008
Sep 5 09:56:12 moria kernel: [ 271.428848] NVRM: ioctl(0x4e, 0xf0a113a0, 0x30)
Sep 5 09:56:12 moria kernel: [ 271.428936] NVRM: ioctl(0x57, 0x1b6e16a0, 0x38)
Sep 5 09:56:12 moria kernel: [ 271.431143] NVRM: VM: nvidia_mmap:231: 0x204100000 - 0x204200000, 0x00100000 bytes @ 0x000000020ec9d000, 0x (null), 0xffff880036181500
Sep 5 09:56:12 moria kernel: [ 271.432085] NVRM: VM: nvidia_mmap:363: 0xffff88021771ec00, 256 page(s), count = 2, flags = 0x00000011, page_table = 0xffff8802175bb008
Sep 5 09:56:12 moria kernel: [ 271.432107] NVRM: ioctl(0x57, 0xf0a11280, 0x38)
Sep 5 09:56:12 moria kernel: [ 271.432337] NVRM: ioctl(0x4a, 0xf0a11330, 0xa0)
Sep 5 09:56:12 moria kernel: [ 271.432442] NVRM: VM: nv_alloc_pages: 256 pages
Sep 5 09:56:12 moria kernel: [ 271.432445] NVRM: VM: contig 0 cache_type 2
Sep 5 09:56:12 moria kernel: [ 271.432479] NVRM: VM: 2706796096: 256 pages
Sep 5 09:56:12 moria kernel: [ 271.432839] NVRM: ioctl(0x58, 0x1b6e1980, 0x28)
Sep 5 09:56:12 moria kernel: [ 271.436506] NVRM: VM: nv_alloc_pages:2244: 0xffff88021771e500, 256 page(s), count = 1, flags = 0x00020011, page_table = 0xffff8802175b9008
Sep 5 09:56:12 moria kernel: [ 271.436564] NVRM: ioctl(0x4e, 0xf0a113a0, 0x30)
Sep 5 09:56:12 moria kernel: [ 271.436633] NVRM: ioctl(0x58, 0x1b6e1950, 0x28)
Sep 5 09:56:12 moria kernel: [ 271.436718] NVRM: ioctl(0x29, 0x1b6e1970, 0x10)
Sep 5 09:56:12 moria kernel: [ 271.436741] NVRM: VM: nvidia_mmap:231: 0x7fb6a021d000 - 0x7fb6a031d000, 0x00100000 bytes @ 0x0000000209370000, 0x (null), 0xffff880036181500
Sep 5 09:56:12 moria kernel: [ 271.436935] NVRM: VM: nvidia_mmap:363: 0xffff88021771e500, 256 page(s), count = 2, flags = 0x00020011, page_table = 0xffff8802175b9008
Sep 5 09:56:12 moria kernel: [ 271.436943] NVRM: ioctl(0x57, 0xf0a11280, 0x38)
Sep 5 09:56:12 moria kernel: [ 271.437062] NVRM: ioctl(0x4a, 0xf0a11330, 0xa0)

Of course I have tried to run the nvidia driver augmented (the one I currently run has some debug statements in it). Will try to run the nvidia-bug-report script by ssh’ing into the machine when it happens (if I can)…
nvidia-bug-report.log.1.gz (307 KB)

Attached nvidia-bug-report to the above post