Multiple simultaneous CUDA applications (system crash on 100.14.11)

Hi everyone,

According to the CUDA FAQ (question #26), it should be possible to run multiple CUDA applications at the same time (e.g., on a multiuser machine where several users may wish to use the GPU).

However, I’ve experienced crashes when attempting to do so. For example, the following:

for i in `seq 1 10`; do ./MonteCarlo -noprompt & done

reproducibly crashes our machine (MonteCarlo is one of the SDK examples).

The error written out on the console upon crash is (and I hope I copied it down right):

NVRM: Xid (0001:00): 13, 0004 00000000 000050c0 00000368 00000000 00000100

This is a Core 2 Quad Q6700 system with 4 GB RAM and an nForce 680i chipset (an Alienware board, but it looks like a rebranded NVIDIA design as far as I can tell), with two 8800 GTX Ultra (768 MB RAM) cards, running 32-bit RHEL5:

[root@cylon release]# uname -a
Linux cylon.sns.ias.edu 2.6.18-8.1.10.el5 #1 SMP Thu Sep 13 12:38:04 EDT 2007 i686 i686 i386 GNU/Linux
[root@cylon release]# dmesg | grep NVRM
NVRM: loading NVIDIA UNIX x86 Kernel Module  100.14.11  Wed Jun 13 18:21:22 PDT 2007

Relevant grub.conf kernel settings:

uppermem 524288
kernel /vmlinuz-2.6.18-8.1.10.el5 ro root=LABEL=/ rhgb quiet vmalloc=256MB pci=nommconf iommu=soft

Is this a known issue, or have we stumbled upon something that should not be happening?

Cheers,

Mario

Could you try to repeat the test with no X11 running?
Just do an “init 3”.

I tried turning off X, and the outcome was the same in 100% of cases: the above error message, followed by a system lockup. The error message is always nearly identical; the only difference is the second four-digit number, which varies between 0001 and 0004 (is it the CPU/core ID?).

I’ve also tried with a PAE-enabled kernel (2.6.18-8.1.10.el5PAE), with the same result.

No, you should never experience an OS hang under normal circumstances. I have a few questions:

  0. When you say the system is crashing, do you mean the entire OS locks up, or just that X hangs?
  1. Does this problem also occur if you upgrade to the 100.14.19 display driver:
    http://www.nvidia.com/object/unix.html
  2. Does this problem persist if you remove these options from your GRUB configuration (and reboot):
    rhgb quiet iommu=soft
  3. Please verify that you’re using the latest motherboard BIOS.
  4. Please generate and attach an nvidia-bug-report.log

thanks,
Lonni

0) Yes – a hard OS lockup (the computer has to be restarted using the power button).

1) Yes, it also occurs with 100.14.19. Also, after installing this version of the driver, I get the following warnings when running CUDA applications (but they seem to run fine):

NVRM: API mismatch: the client has the version 100.14.11, but
NVRM: this kernel module has the version 100.14.19.  Please
NVRM: make sure that this kernel module and all NVIDIA driver
NVRM: components have the same version.

2) Yes, the problem persists with those options removed.

3) We checked with Alienware this morning – it is their latest BIOS (dated July 3rd, 2007).

4) Attached. X was running when it was generated, but this bug occurs irrespective of whether the CUDA app is run under X, and irrespective of whether X was ever started (e.g., I rebooted the system into runlevel 3 without rhgb and reproduced the same lockup).

PS: Also, maybe related and maybe not, yesterday when I ran a CUDA app I got this message on the console:

BUG: soft lockup detected on CPU#0!
 [<c044a05f>] softlockup_tick+0x98/0xa6
 [<c042ccd4>] update_process_times+0x39/0x5c
 [<c04176ec>] smp_apic_timer_interrupt+0x5c/0x64
 [<c04049bf>] apic_timer_interrupt+0x1f/0x24
 [<f12c0ac9>] _nv000310rm+0x71/0x9a [nvidia]
 [<f12c0848>] _nv004103rm+0x165/0x1ad [nvidia]
 [<f12b531e>] _nv003568rm+0x8a/0x2a9 [nvidia]
 [<f12ba6f2>] _nv002666rm+0x258/0x450 [nvidia]
 [<f12bad79>] _nv002802rm+0x37f/0x59f [nvidia]
 [<f12b7e4d>] _nv002746rm+0x97/0xe7 [nvidia]
 [<f12bc19f>] rm_disable_adapter+0x7b/0xc9 [nvidia]
 [<f15ab834>] nv_kern_close+0x1ed/0x347 [nvidia]
 [<c046b607>] __fput+0x9c/0x167
 [<c04690f3>] filp_close+0x4e/0x54
 [<c042512c>] put_files_struct+0x65/0xa7
 [<c04260da>] do_exit+0x229/0x746
 [<c042666d>] sys_exit_group+0x0/0xd
 [<c0403eff>] syscall_call+0x7/0xb
 =======================

but it didn’t lock up the system. I couldn’t reproduce it later on.
nvidia_bug_report.log.gz (44.1 KB)

Specifically which model of Alienware system is this?

Also, does the problem persist if you have less than 4GB of RAM in the system or only one GeForce 8800 ?

thanks,
Lonni

  • Specifically which model of Alienware system is this?

This is Area-51 7500-R5. And the motherboard is ‘EVGA NForce 680i SLI MB Rev D’.

  • Also, does the problem persist if you have less than 4GB of RAM in the system or only one GeForce 8800 ?

Yes. I tried 2 GB RAM + two cards, and 4 GB RAM + a single card; both combinations crashed.

However, when I disabled three out of four CPU cores (in the BIOS), the lockups stopped! When I run my simultaneous-applications test as described in the opening post, I still get failures and errors as before, but no lockups.

I’m attaching the bug-report.log, and the output (with FAILUREs) of the MonteCarlo example.
failed_single_card_single_core.txt.gz (883 Bytes)
nvidia_bug_report_failed_in_X_single_core_card.log.gz (32.9 KB)

What about your power supply?

Can you try to reduce the number of samples in the MonteCarlo example (right now it is using about 400 MB for each instance)?

Unfortunately, I don’t have one of these Alienware systems here, or I’d attempt to reproduce this. I can state that this failure does not occur on any of the other systems I have here (an assortment of workstations from different vendors). This leads me to believe that it might be a motherboard BIOS issue; however, without being able to replicate the failure, I’m just guessing.

Do you have any other systems where you could test this besides the one Alienware model?

Hi,

Sorry for the delay; it took me some time to find a different machine with a power supply strong enough to take a GTX card.

Anyway, I’ve now tried the same experiment on a Gigabyte GA-P35-DS3P motherboard with a single 8800 GTX Ultra card and a 650 W power supply, with everything else the same as before (the OS, kernel, hard drives, etc.). The result is the same – the system locks up. Furthermore, when I tried the same test using a low-end GeForce 8400 GS, the same kind of lockup happened. So this doesn’t seem tied to a specific card, nor to a specific chipset/motherboard/BIOS.

Also, when I changed the script to launch 10 simultaneous convolutionFFT2D examples, just before the lockup the programs started failing with the “cufft: ERROR: CUFFT_ALLOC_FAILED” error message (at config.cu, line 239).

So my best guess so far is that this bug is triggered when a) there are simultaneous CUDA apps running, b) device memory is exhausted, and c) the CPU has multiple cores. The only other thing I can think of trying is to install SUSE and see if the problem is specific to the RHEL<->NVIDIA combination, but I currently don’t have the resources (hard drives & time) to do so.
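For reference, a minimal standalone sketch of the kind of check that should make an instance fail cleanly when device memory is exhausted, rather than pressing on. This is illustrative code only, not taken from the SDK samples, and the 400 MB figure is just the rough per-instance footprint mentioned above:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    /* Rough per-instance footprint quoted in this thread; illustrative only. */
    const size_t bytes = 400UL * 1024UL * 1024UL;
    float *d_buf = NULL;

    cudaError_t err = cudaMalloc((void **)&d_buf, bytes);
    if (err != cudaSuccess) {
        /* Device memory exhausted (or another failure): exit cleanly. */
        fprintf(stderr, "cudaMalloc of %lu bytes failed: %s\n",
                (unsigned long)bytes, cudaGetErrorString(err));
        return EXIT_FAILURE;
    }

    /* ... the actual Monte Carlo work would go here ... */

    cudaFree(d_buf);
    return EXIT_SUCCESS;
}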

Once again, these tests involved running the following shell script

#!/bin/bash

for i in `seq 1 $1`; do ./MonteCarlo -noprompt & done;

and an example output (which ends when the machine locked up) is:

[root@r116239 release]# ./go.sh 10
[root@r116239 release]# Generating random options...
Generating random options...
Generating random options...
Generating random options...
Generating random options...
Generating random options...
Generating random options...
Generating random options...
Generating random options...
Generating random options...
Data init done.
Loading GPU twisters configurations...
Data init done.
Loading GPU twisters configurations...
Data init done.
Loading GPU twisters configurations...
Data init done.
Loading GPU twisters configurations...
Data init done.
Loading GPU twisters configurations...
Data init done.
Loading GPU twisters configurations...
Data init done.
Loading GPU twisters configurations...
RandomGPU()...
Generated samples : 80003072
RandomGPU() time  : 547.806030
Samples per second: 1.460427E+08
BoxMullerGPU()...
Transformed samples : 80003072
BoxMullerGPU() time : 0.066000
Samples per second  : 1.212168E+12
Starting Monte-Carlo simulation...
Options count      : 128
Simulation paths   : 80000000
Total GPU time     : 9.436000
Options per second : 1.356507E+04
L1 norm: 1.000000e+00
Average reserve: 0.000000
TEST FAILED
Shutting down...
Data init done.
Loading GPU twisters configurations...
RandomGPU()...

I hope this helps in reproducing the problem.

PS: Regarding the original machine – it has a 1 kW, SLI-ready power supply.

Can you try to decrease the option count? It is 80M, I would try 8M.
Right now this example allocates more than 400MB of memory on the GPU.
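As a rough estimate of where that footprint comes from (assuming one single-precision float per simulation path, which may not be exactly how the sample lays out its buffers): 80,000,000 paths × 4 bytes ≈ 320 MB for the sample array alone, before counting any intermediate buffers. Cutting the path count to 8M should bring that down to roughly 32 MB per instance.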

I believe that I’ve reproduced this problem here. I had initially misunderstood what you were reporting as running the same test multiple times in serial, rather than in parallel. In my testing, running 3 or 4 simultaneous iterations does not result in any instability, however with 5 or more, my system experiences an MCE (Machine Check Exception) and hangs.

However, it would be helpful if you were able to set up a serial console or netconsole to capture the kernel messages at the time of the crash, so that we can be confident we are both hitting the exact same crash. Additionally, can you confirm the minimum number of simultaneous instances needed to cause the crash on your system?

I’ve opened bug 354801.

thanks,
Lonni

Hi,
On our machine, two simultaneous applications do not cause a lockup (I did 10 tests, all 10 were successful), but three or more do. With three, on the first try MonteCarlo gave me a TEST FAILED message and the NVRM console message but did not lock up the machine; on the second try, however, the machine locked up. With four, the machine locked up on the first try.

I tried kernel logging with netconsole, but got no additional messages at lockup compared to what was printed on the physical console (i.e., I didn't get an MCE message). Do I need to increase the kernel log level (compared to RHEL's default) to capture these?

I also tried reducing the option count to 8M. That didn't crash the machine with 10 simultaneous instances (all of them ran successfully, i.e., TEST PASSED), but when I ran 15, the machine locked up.

Cheers,
Mario

Well, I would like to raise this topic again, because I experienced the same problem a few days ago.

I am not sure if this is a kernel problem or a CUDA driver problem. My setup is two Opteron 2216 CPUs, an S2915 motherboard, and four 8600 GTS cards, with a Toughpower 1200W PSU.

I'm running Ubuntu 7.04 with the default kernel (does it need some tuning?) and the CUDA 1.0 SDK/drivers.

When I run bandwidthTest on cards 1 and 3 and scanLargeArray on cards 2 and 4 simultaneously, it crashes the system almost immediately. (I modified the source code to set the device ID at start-up, after CUDA_INIT(), and wrote a small script to launch bandwidthTest/scanLargeArray a given number of times on a given GPU.)
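Roughly, the device-selection change looks like this – a simplified sketch, not the actual SDK source, and the command-line handling is only illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    /* GPU index passed in by the launch script (illustrative). */
    int dev = (argc > 1) ? atoi(argv[1]) : 0;
    int count = 0;

    cudaGetDeviceCount(&count);
    if (dev < 0 || dev >= count) {
        fprintf(stderr, "Invalid device %d (only %d present)\n", dev, count);
        return EXIT_FAILURE;
    }

    /* Must be called before the first allocation or kernel launch,
       i.e., before a context gets created on the default device. */
    cudaSetDevice(dev);

    /* ... the rest of bandwidthTest / scanLargeArray would follow ... */
    return EXIT_SUCCESS;
}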

bandwidthTest even gave erroneous results (unreasonable bandwidth numbers).

Any comments on this? Has anyone tried a similar setup and a similar test on 64-bit Linux?

(Or maybe it's a hardware problem?)

Thanks!

Best,

MuChi

I don’t have your exact setup, but I’ve had trouble with the S2915 motherboard when using the PCIe x16 and x8 slots that are connected to CPU1. When I use both slots simultaneously, the system often locks up. Using them one at a time seems to be okay.

Perhaps you could avoid one of those slots. Run your tests with 2-3 GPUs to see what you get.