Hi,
Sorry for the delay, it took me some time to find a different machine with a power supply strong enough to take a GTX card.
Anyway, I’ve now tried the same experiment on a Gigabyte GA-P35-DS3P motherboard with a single 8800 GTX Ultra card and a 650W PS, with everything else being the same as before (the OS, kernel, hard drives, etc.). The result is the same – the system locks up. Furthermore, when I ran the same test on a low-end GeForce 8400 GS, the same kind of lockup happened. So this doesn’t seem tied to a specific card, or to a specific chipset/motherboard/BIOS.
Also, when I changed the script to launch 10 simultaneous convolutionFFT2D examples, the programs started failing just before the lockup with a “cufft: ERROR: CUFFT_ALLOC_FAILED” error message (at config.cu, line 239).
So my best guess so far is that this bug is triggered when a) there are simultaneous CUDA apps running, b) device memory is exhausted and c) the CPU has multiple cores. The only other thing I can think of trying is to install SUSE and see if the problem is specific to the RHEL<->NVIDIA combination, but I currently don’t have the resources (hard drives & time) to do so.
Once again, these tests involved running the following shell script
#!/bin/bash
for i in $(seq 1 "$1"); do ./MonteCarlo -noprompt & done
and an example output (which ends when the machine locked up) is:
[root@r116239 release]# ./go.sh 10
[root@r116239 release]# Generating random options...
Generating random options...
Generating random options...
Generating random options...
Generating random options...
Generating random options...
Generating random options...
Generating random options...
Generating random options...
Generating random options...
Data init done.
Loading GPU twisters configurations...
Data init done.
Loading GPU twisters configurations...
Data init done.
Loading GPU twisters configurations...
Data init done.
Loading GPU twisters configurations...
Data init done.
Loading GPU twisters configurations...
Data init done.
Loading GPU twisters configurations...
Data init done.
Loading GPU twisters configurations...
RandomGPU()...
Generated samples : 80003072
RandomGPU() time : 547.806030
Samples per second: 1.460427E+08
BoxMullerGPU()...
Transformed samples : 80003072
BoxMullerGPU() time : 0.066000
Samples per second : 1.212168E+12
Starting Monte-Carlo simulation...
Options count : 128
Simulation paths : 80000000
Total GPU time : 9.436000
Options per second : 1.356507E+04
L1 norm: 1.000000e+00
Average reserve: 0.000000
TEST FAILED
Shutting down...
Data init done.
Loading GPU twisters configurations...
RandomGPU()...
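Since the output of the ten instances is interleaved above, a variant of the launcher that keeps one log file per instance might make it easier to see how far each process got. This is only a rough sketch: the function name launch_parallel, the log directory argument and the run_N.log file names are my own inventions, not part of the SDK. Also note that output redirected to a file is usually block-buffered, so after a hard lockup the logs may be incomplete or missing.

```shell
#!/bin/bash
# Hypothetical helper (not part of the CUDA SDK): start N copies of a command
# in parallel, give each copy its own log file, then report how many failed.
# Usage: launch_parallel N LOGDIR CMD [ARGS...]
launch_parallel() {
    local n=$1 logdir=$2
    shift 2
    local pids=() failed=0 i

    mkdir -p "$logdir" || return 1

    for ((i = 1; i <= n; i++)); do
        # One log per instance, so the output is no longer interleaved.
        "$@" > "$logdir/run_$i.log" 2>&1 &
        pids+=("$!")
    done

    for i in "${!pids[@]}"; do
        # wait PID returns the exit status of that background process.
        if ! wait "${pids[$i]}"; then
            failed=$((failed + 1))
            echo "instance $((i + 1)) failed"
        fi
    done

    echo "$failed of $n instances failed"
}
```

With this in place, the test above would be started as: launch_parallel 10 logs ./MonteCarlo -noprompt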
I hope this helps in reproducing the problem.
PS: Regarding the original machine – it has a 1kW, SLI-ready power supply.