Soft lockup on CUDA 4.0 RC2

While running one of my programs under development, a CUDA process got stuck. After a while I found a lot of messages like the one below.

Is it worth a bug report?

BUG: soft lockup - CPU#1 stuck for 16s! [double_queue:1524]

CPU 1:

Modules linked in: nvidia(PU) nfs fscache nfs_acl autofs4 lockd sunrpc uio iw_cxgb3 cxgb3 cpufreq_ondemand acpi_cpufreq freq_table ib_srp rds ib_sdp ib_ipoib ipoib_helper rdma_ucm rdma_cm ib_ucm ib_uverbs ib_umad ib_cm iw_cm ib_addr ib_sa loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac ipv6 xfrm_nalgo crypto_api parport_pc lp parport mlx4_ib ib_mad ib_core mlx4_en joydev i2c_i801 igb 8021q mlx4_core serio_raw pcspkr shpchp dca i2c_core sg dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 1524, comm: double_queue Tainted: P      2.6.18-194.32.1.el5 #1

RIP: 0010:[<ffffffff896e10b3>]  [<ffffffff896e10b3>] :nvidia:_nv022936rm+0x20/0x22

RSP: 0018:ffff810433c638d0  EFLAGS: 00000202

RAX: 00000000ffffffff RBX: ffff8107846b7330 RCX: 0000000000000040

RDX: ffffc20011680000 RSI: ffff8107a89ee000 RDI: ffff8107fdf56000

RBP: ffff810433c63918 R08: 0000000000000050 R09: ffff81043d7b2b80

R10: ffff81043d7b2b40 R11: 0000000000000050 R12: ffff81083e9ac840

R13: 0000000010008040 R14: ffff81083e9ac840 R15: ffff81083e9ac840

FS:  00002ae37b7895e0(0000) GS:ffff81010ee99440(0000) knlGS:0000000000000000

CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b

CR2: 0000000000406000 CR3: 0000000000201000 CR4: 00000000000006e0

Call Trace:

 [<ffffffff89505f78>] :nvidia:_nv015002rm+0x148/0x190

 [<ffffffff89505fe1>] :nvidia:_nv003211rm+0x21/0x52

 [<ffffffff8950ce4d>] :nvidia:_nv015051rm+0x9e/0x14a

 [<ffffffff89506203>] :nvidia:_nv003209rm+0x72/0x119

 [<ffffffff89506306>] :nvidia:_nv014374rm+0x5c/0x71

 [<ffffffff895bdbf1>] :nvidia:_nv020040rm+0x243/0x9df

 [<ffffffff895a3a52>] :nvidia:_nv020037rm+0x4e/0x8c

 [<ffffffff894ef700>] :nvidia:_nv013071rm+0xfd/0x1d2

 [<ffffffff894ef5d6>] :nvidia:_nv013074rm+0x76/0xa3

 [<ffffffff894c4ea4>] :nvidia:_nv013077rm+0xd44/0x10cd

 [<ffffffff89206396>] :nvidia:_nv002388rm+0x404/0x485

 [<ffffffff89203f3c>] :nvidia:_nv003713rm+0x1cd/0x770

 [<ffffffff89203ee2>] :nvidia:_nv003713rm+0x173/0x770

 [<ffffffff89202bcf>] :nvidia:_nv003711rm+0xc7/0xef

 [<ffffffff89202c18>] :nvidia:_nv025316rm+0xe/0x13

 [<ffffffff892030d5>] :nvidia:_nv003722rm+0x111/0x49d

 [<ffffffff89202bcf>] :nvidia:_nv003711rm+0xc7/0xef

 [<ffffffff89202c18>] :nvidia:_nv025316rm+0xe/0x13

 [<ffffffff89202e4c>] :nvidia:_nv003717rm+0x1a8/0x320

 [<ffffffff89202bcf>] :nvidia:_nv003711rm+0xc7/0xef

 [<ffffffff89202c05>] :nvidia:_nv025318rm+0xe/0x13

 [<ffffffff89646778>] :nvidia:_nv025098rm+0x58/0x7b

 [<ffffffff896e3f3d>] :nvidia:_nv002329rm+0x144/0x18a

 [<ffffffff896e962c>] :nvidia:rm_disable_adapter+0x8b/0xdf

 [<ffffffff8970754b>] :nvidia:nv_kern_close+0x26b/0x410

 [<ffffffff80012ad9>] __fput+0xd3/0x1bd

 [<ffffffff80023c39>] filp_close+0x5c/0x64

 [<ffffffff80038f19>] put_files_struct+0x63/0xae

 [<ffffffff80015860>] do_exit+0x31c/0x911

 [<ffffffff800491a7>] cpuset_exit+0x0/0x88

 [<ffffffff8002b2ed>] get_signal_to_deliver+0x465/0x494

 [<ffffffff8005ada1>] do_notify_resume+0x9c/0x7af

 [<ffffffff89704a6a>] :nvidia:nv_kern_ioctl+0x382/0x393

 [<ffffffff80066b88>] do_page_fault+0x4fe/0x874

 [<ffffffff80062ff8>] thread_return+0x62/0xfe

 [<ffffffff800421d7>] do_ioctl+0x21/0x6b

 [<ffffffff8005d6dc>] retint_signal+0x3d/0x79

Yes, please file a bug and attach a repro if possible.
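When attaching logs to the report, it may help to say how many times the lockup was reported and for how long. The following is only a minimal sketch, assuming the kernel messages have been saved as text and use the same `BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]` format as the line quoted above. The function name `summarize_soft_lockups` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as:
#   BUG: soft lockup - CPU#1 stuck for 16s! [double_queue:1524]
# Assumption: other kernels may word this message differently.
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(log_text):
    """Return {(cpu, comm, pid): (report_count, longest_stall_seconds)}."""
    counts = Counter()
    longest = {}
    for match in LOCKUP_RE.finditer(log_text):
        key = (int(match.group("cpu")), match.group("comm"), int(match.group("pid")))
        counts[key] += 1
        longest[key] = max(longest.get(key, 0), int(match.group("secs")))
    return {key: (counts[key], longest[key]) for key in counts}


if __name__ == "__main__":
    sample = "BUG: soft lockup - CPU#1 stuck for 16s! [double_queue:1524]\n"
    for (cpu, comm, pid), (n, secs) in summarize_soft_lockups(sample).items():
        print(f"CPU#{cpu} {comm}[{pid}]: {n} report(s), longest stall {secs}s")
```

Feeding it the saved kernel log as a single string gives one entry per stuck process, which can be pasted into the bug report next to the full trace.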

We encountered the same bug on CentOS 5.5 (64-bit) with kernel 2.6.18-194.el5.

I haven’t been able to reproduce it yet… sigh

My setup is:

cat /proc/driver/nvidia/version

NVRM version: NVIDIA UNIX x86_64 Kernel Module 270.40 Sat Mar 26 13:00:34 PDT 2011

GCC version: gcc version 4.1.2 20080704 (Red Hat 4.1.2-50)

nvcc -V

nvcc: NVIDIA (R) Cuda compiler driver

Copyright (c) 2005-2011 NVIDIA Corporation

Built on Sun_Mar_20_16:45:27_PDT_2011

Cuda compilation tools, release 4.0, V0.2.1221

cat /etc/redhat-release

CentOS release 5.6 (Final)

uname -a

Linux agape2 2.6.18-238.5.1.el5 #1 SMP Fri Apr 1 18:41:58 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
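The version details above could be gathered in one go for a bug report. This is a rough sketch, assuming a Python 3 interpreter is available on the machine. It runs only the commands already quoted in this thread, and the helper name `collect_environment` is hypothetical.

```python
import subprocess

# The same commands quoted in this thread.
COMMANDS = [
    ["cat", "/proc/driver/nvidia/version"],
    ["nvcc", "-V"],
    ["cat", "/etc/redhat-release"],
    ["uname", "-a"],
]


def collect_environment(commands=COMMANDS):
    """Run each command and return {command line: combined output or error note}."""
    report = {}
    for argv in commands:
        label = " ".join(argv)
        try:
            result = subprocess.run(
                argv,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,  # keep error text with the output
                universal_newlines=True,
            )
            report[label] = result.stdout.strip()
        except OSError as exc:
            # The command itself is not installed or not on PATH.
            report[label] = f"<not available: {exc}>"
    return report


if __name__ == "__main__":
    for label, output in collect_environment().items():
        print(f"$ {label}\n{output}\n")
```

A missing tool (for example `nvcc` on a machine without the toolkit) is recorded as unavailable rather than aborting the run, so the remaining details are still collected.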