Investigating Linux Kernel Crashes under 260.19.21

Hi,

I am observing a pretty weird behavior on my computational server.

The server is a newly built box with a single GTX470 card. The Linux kernel is 2.6.32-25 (Ubuntu 10.04.1), x86_64, nvidia driver 260.19.21, CUDA-3.2. I believe I followed some popular instructions on how to install CUDA on top of this distro, which included removing and blacklisting the nvidia stuff that came with the distro. I do not need to run X on this machine, as I mostly ssh into it.

The enclosure is huge and there are 7 fans in it, besides the one within the GPU. The CPU temperature is 50C and below, the GPU temperature is 77C, GPU fan speed is 48% (under load), as reported by nvidia-smi. All the components are brand new, MB runs the latest version of BIOS. I don’t see any hardware problem with my box.

The box is stable even under high load (a couple of yes>/dev/null, many postgres connections, etc), unless I run my GPU code on it.

My GPU code runs essentially one kernel a great many times on a large number of blocks. This kernel is rather complex. There’s no watchdog timer issue, since I’m observing the same symptoms no matter whether X is running or stopped. The code does a very limited number of cudaMalloc()s, essentially only at startup, and then reuses this device memory. The code is fully user-space, doesn’t interact with any non-standard hardware besides the GPU, and its interaction with the GPU is limited to cudaMalloc, cudaFree and kernel launches (I think).

The interesting thing happens when I start two instances of my executable in parallel. This saves me about 20% of the wall clock time due to better utilization of the GPU. The two instances definitely do fit within the GPU RAM, and they run happily for some time, typically a few hours, producing correct results. Then the box just reboots. There is no record of anything suspicious within /var/log/kern.log, syslog or the like prior to the reboot. I installed mcelog to try to find something, but MCE doesn’t report anything whatsoever.

The box also reboots even when a single instance of my GPU executable is running, but this happens very infrequently, once in several days. This typically happens when the box is under heavy load in addition to running my GPU code. Again, the same or greater load without any GPU code has not caused a crash yet.

Obviously, I’d like to debug this issue somehow, and I don’t know how to go about it. I could surround each kernel launch with a pair of debug messages, but that looks like overkill: I’d be flooded with log messages, and I cannot be sure the messages find their way to disk before the crash.
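
A low-overhead alternative (a minimal sketch, not the author’s code; the names are illustrative) is to query cudaGetLastError() after each launch and only log when something is actually wrong, so nothing is printed during normal operation:

#include <cstdio>
#include <cuda_runtime.h>

// Report a kernel-launch error, if any. cudaGetLastError() catches
// launch-time failures; errors during kernel execution only surface at the
// next synchronizing call (e.g. cudaThreadSynchronize() on CUDA 3.x).
#define CHECK_LAUNCH(msg)                                                   \
    do {                                                                    \
        cudaError_t e = cudaGetLastError();                                 \
        if (e != cudaSuccess) {                                             \
            fprintf(stderr, "%s: %s\n", (msg), cudaGetErrorString(e));      \
            fflush(stderr);  /* flush stdio; a hard reboot may still lose it */ \
        }                                                                   \
    } while (0)

// Usage (myKernel, grid and block are placeholders):
//     myKernel<<<grid, block>>>(/* args */);
//     CHECK_LAUNCH("myKernel launch");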

Is there any way to ask the nvidia kernel driver to dump some kind of panic message in case of a crash? Can I install an instrumented version of the driver, which might provide some clues?

Any clues would help.

Thanks!

EDIT: the GPU code also uses constant memory and textures, which Sarnath suggests might be relevant to the issue.

Oops… That’s very bad.

My production level code that works fine on CUDA 2.3 does not work correctly with CUDA 3.2.
As far as I could find out, it is a “driver issue”.
Many others are facing various issues with this driver as well.

Is your code using textures?

The best bet is to move back to 256.40 (CUDA 3.1) or earlier drivers; CUDA 3.2 requires 260.xx.
So you may need to downgrade your CUDA version, possibly to CUDA 3.0, and try it out.
For me, a downgrade to the 190.53 driver (CUDA 2.3) worked fine. I never tried 3.0.
HTH,

Ha!

CUDA-2.3 is my toolset on a different box equipped with a GTX280, and I have no issues with my GPU code at all on that box. Same code, different nvcc command line (no -gencode for Fermi, more registers per thread).

I don’t think I can use CUDA-2.3 on my Fermi card, though.

I would prefer to gather some proof that it is in fact the nVidia driver causing the issue. I don’t know how to collect this kind of evidence, as I have almost no kernel hacking experience, hence my post. At the moment all I have is a suspicion that the nVidia driver is to blame, but no proof.

If I fail to collect the evidence, I’ll have to downgrade to 3.1, as you’re suggesting.

How did you guess? Yes, it does.

Do you mind describing the issues that you’ve discovered with this driver? Any pointers to other people’s issues you’ve mentioned?

Thanks!

Hi there, Cudesnick!

I am running a genetic algorithm. On the CUDA 3.2 driver, I get only zeroes as output for all my runs. I use textures extensively. On CUDA 2.3, it used to run solid. I tested it by changing the toolkit alone to CUDA 2.3 and letting the driver stay the latest. No change in behavior! Only when I downgraded the driver did I start getting correct results.

A colleague of mine had a problem with the driver too. He had a Matlab installation on his personal laptop and complained that code that used to work before did not work with CUDA 3.2. He said that when he changed the texture reference to shared memory, it started giving good results. I have asked him to characterize the problem and post a repro to NVIDIA, hopefully before the end of the week.

There are many other complaints about incorrect and slow behavior with this driver in other parts of the forum. NVIDIA is still waiting for a good repro case; someone has promised to post one, and hopefully that will happen soon.

That’s all I have.
Good luck,
Best Regards,
Sarnath

Sarnath,

This is very helpful. I’ll consider getting out of the texture business in my code; it doesn’t look like too big of a deal. I’ll also consider switching back to CUDA-3.1 and an earlier driver.

I’ll post here if there’s any progress.

Sure, thanks! Let us see if that helps.

I think I eliminated all the texture stuff from my code. As far as I can tell, the code doesn’t have any texture declarations in it.

My box still crashes…

I have a couple of theories and am investigating. I’m still not 100% sure it’s the nVidia driver that’s crashing my box. Unfortunately, I haven’t heard from nVidia whether there’s a way to instrument the driver.

Oops… Sorry if that stole your time.

Could it be that the machine is taking an NMI or something? Is the RAM plugged in properly?

Good luck,

Best Regards,
Sarnath

Most certainly not. Moreover, I learned something new about my code: a small, frequently used lookup table behaves well on Fermi even if stored in device RAM, as it’s apparently being cached.
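
A minimal sketch of that observation (illustrative names, not the author’s code): a small lookup table left in ordinary device memory. On Fermi (sm_20) global loads go through the L1/L2 caches, so a hot table that fits in cache performs well without the texture path.

#include <cuda_runtime.h>

__device__ float d_lut[256];                 // small lookup table in device RAM

__global__ void applyLut(const unsigned char *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = d_lut[in[i]];               // cached global load on Fermi
}

// Host side, once at startup (h_lut is a 256-element host array):
//     cudaMemcpyToSymbol(d_lut, h_lut, 256 * sizeof(float));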

My current working hypothesis is that the crash happens when i) my cuda executable runs on a multi-core linux machine and ii) 20 postgres connections are opened at once to a database server that resides on that machine. Those can be either local connections, or they can originate from a remote host. In postgres world each new connection causes a fork of a dedicated process to serve that connection. Hence, in this setup 20 new processes start literally simultaneously. Similar connections running similar queries don’t crash my box, when my cuda executable isn’t running on it.

I’ve come up with a workaround, which seems to work so far: just sleep for a second between opening these connections. I’ll report how it goes.

Thanks! Good luck!

Certainly looks like a driver issue to my naive eyes…

So the status of my Linux kernel crash issue is as follows.

I’ve commented out all the texture-related stuff from my code. In addition to this I’m throttling heavy operations (such as database queries), so that there are fewer bursts of load on the machine.

The above actions seem to have helped quantitatively, but not qualitatively. The machine still crashes once in a while, but it doesn’t happen that often: maybe once in a few days rather than once in a few hours. I can run two instances of my code on my machine now, and two more instances on a different machine, which talk to the “bad” machine as a database server. Roughly speaking, the project is unstuck, but a light headache still remains.

Regarding textures: I have a feeling that commenting the textures out was an important step towards making my machine more stable (thanks, Sarnath). The code seemed to crash more frequently when the textures were in there. Unfortunately, I don’t have enough time and resources right now to make accurate measurements of the texture effect (say, measure the frequency of crashes with and without the texture code and identify which part of the texture code was causing the issue).

I intend to step my EVGA 470 up to a 570 (almost free, given the price I bought my 470 for) and will report whether the issue goes away with the new card.

Long ago (at least in CUDA time), I was warned against multiple programs using the same CUDA device due to driver bugs. Those bugs have been fixed (and I never had the problem to begin with), but I find it curious that your initial problem report seems to depend on multiple contexts running the card at the same time. Is one instance of your program completely stable, or just less likely to crash than two instances? If you run one instance, and then hammer the database, is it still OK?

When I run a single instance of my code and hammer the database server on the same machine real hard from a different machine, the code crashes.

As far as running two instances is concerned, I’ve seen the machine crash when one of the instances hammers the database real hard. When I run just one instance of my code across my whole network, the database does not get hammered while my cuda kernel is running.

To work around the problem I throttled the hammering and removed the textures (reported in greater detail in my previous email).

I typically run two instances of very similar code on my other server under cuda-2.3, with the old driver and a GTX280 card. My code never crashes on that server, and the server itself reboots only when I want it to. The only differences in the code itself are the compile flags (architecture, maxregcount) and the use of textures in the 2.3 code, as opposed to direct memory access in the 3.2 code.

When CPU load affects machine stability, I immediately think of hardware problems, not drivers or toolkits or code.

And I especially think PSU. What are your machine specs, what is your exact PSU brand name and rating, and how many watts do you measure from the wall socket at full load?

Hi There, Cudesnick!

Ironically, the person who reported the texture problem to me told me today that it was a “programming mistake” on his part.
He was assuming that the first argument (the offset argument) to cudaBindTexture() was zero (his code gets called from a 3rd-party library).
But something changed with the driver update: the 3rd-party software no longer calls him with an offset of 0.
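
For reference, a minimal sketch of that pitfall (illustrative names, assuming the CUDA runtime texture API): cudaBindTexture() returns a byte offset of the texture relative to the supplied pointer, and that offset is not guaranteed to be zero, so it should be captured and applied when fetching.

#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> texRef;

__global__ void copyViaTexture(float *out, int offsetElems, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, offsetElems + i);
}

void launch(const float *d_in, float *d_out, int n)
{
    size_t offsetBytes = 0;
    // The returned offset may be non-zero if d_in is not suitably aligned.
    cudaBindTexture(&offsetBytes, texRef, d_in, n * sizeof(float));
    int offsetElems = (int)(offsetBytes / sizeof(float));
    copyViaTexture<<<(n + 127) / 128, 128>>>(d_out, offsetElems, n);
    cudaUnbindTexture(texRef);
}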

Moreover, my own problems with 3.2 were because of “invalid” cudaMemcpy() parameters. My code was reading invalid GPU memory locations – which was tolerated by CUDA 2.3 but not by CUDA 3.2.
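
And a minimal sketch of that second failure mode (illustrative, not Sarnath’s actual code): a cudaMemcpy() that reads past the end of a device allocation. The older runtime reportedly tolerated such calls, so checking every return value is what surfaces the bug on 3.2.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main()
{
    const size_t n = 1000;
    float *d_buf = 0;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    float *h_buf = (float *)malloc(2 * n * sizeof(float));

    // Copies twice as many bytes as were allocated on the device.
    cudaError_t err = cudaMemcpy(h_buf, d_buf, 2 * n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaMemcpy: %s\n", cudaGetErrorString(err));

    free(h_buf);
    cudaFree(d_buf);
    return 0;
}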

:-(

Best Regards,
Sarnath

The PSU in my machine is a brand-new 1200W Thermaltake. I was thinking about having it drive 2-3 more GPUs, and I have just one now. I haven’t done wall power measurements.

The MB is an EVGA 4-way SLI Classified, the CPU is an Intel i7, and the memory is Kingston HyperX.

The machine is stable when I run 8 instances of yes>/dev/null on its 4 cores. I didn’t have a chance to rerun this smoke test overnight to be sure, but I guess I will at some point.

That’s about the best PSU you can use! Excellent choice; I have settled exclusively on Toughpower as well for my 3-GPU systems.

That still doesn’t mean it’s not a hardware problem…

An overnight test with nbody running on the GPU and some CPU load would be something to test with. I don’t know if “yes > /dev/null” is harsh on the CPU… but you could try some other tiny program. Two other GPU programs for burn-in testing would be Furmark (which is graphics, but that’s good for high power/heat) and CUDAmemcheck.

Fully agree. I’ve bought the GPU and the motherboard from the same vendor (EVGA) in order to avoid a finger-pointing situation if the system happens to be unstable.

I haven’t overclocked anything and have kept every BIOS setting at its default.

I’d like to do more stability tests and perhaps roll back to an older driver, but simply don’t have time now: I want to keep my code running even though infrequent (roughly once a week) reboots are a bit of an inconvenience.

Ok, I have finally run a rather scientific experiment to confirm my theory.

I have run two tests:

Test 1:

  • The machine that I have seen crashing a lot runs NO GPU code, but does run 8 instances of yes>/dev/null and a postgres server. This is, I would say, a fairly standard way of loading the CPU on a Unix box: this program keeps printing character “y” to an output stream, which discards everything it gets.

  • Another machine opens many (up to about 60) postgres connections to the first machine. The connections run fairly different queries. Most of them insert a few rows into tables, some of them query for about a thousand rows at a time and insert those into the other tables.

This test ran for about five hours. The CPU temperature went up to 72 °C, the load average was above 10 most of the time, and the machine was perfectly stable.

Test 2:

Same as test 1, but the machine I’ve seen crashing runs my GPU code and no yes>/dev/null.

The machine rebooted within five minutes after I started test 2.

I think I have rather solid evidence that it is my GPU code that causes the machine to crash, while a heavier load exerted by other kinds of software does NOT. I don’t know whether it’s a hardware or a driver issue, but I’m now rather certain it’s related to the GPU.

OK, I found a very simple way of rebooting my box with a piece of CUDA code.

I’m running an example from the SDK in a bash loop, and this causes my box to reboot. Besides this example code, the box is running standard Ubuntu-10.04 stuff and none of my custom code. The postgres server is not being queried.

I’ve added three printouts to the SDK example '~/CUDA-SDK-3_2/C/src/BlackScholes/BlackScholes.cu', in order to see which part of the executable causes the reboot. My additional code is easy to spot: it’s not indented, and the printouts start with “CUDESNICK:”.

shrLog("CUDESNICK: Starting iterations\n");
        for(i = 0; i < NUM_ITERATIONS; i++){
            BlackScholesGPU<<<480, 128>>>(
                d_CallResult,
                d_PutResult,
                d_StockPrice,
                d_OptionStrike,
                d_OptionYears,
                RISKFREE,
                VOLATILITY,
                OPT_N
            );
            cutilCheckMsg("BlackScholesGPU() execution failed\n");
        }
shrLog("CUDESNICK: Loop completed\n");
        cutilSafeCall( cudaThreadSynchronize() );
shrLog("CUDESNICK: Synchronized\n");

I then compiled the example by running 'make' within the directory '~/CUDA-SDK-3_2/C/src/BlackScholes/'.

One can see from the following printout that the system crashes while the code is within cudaThreadSynchronize().

~/CUDA-SDK-3_2/C/bin/linux/release>for i in `seq 1 100`; do echo "CUDESNICK-SHELL: Invoking BlackSholes for the ${i}th time"; ./BlackScholes --noprompt; done

CUDESNICK-SHELL: Invoking BlackSholes for the 1th time
[BlackScholes]
./BlackScholes Starting...
Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.
Executing Black-Scholes GPU kernel (512 iterations)...
CUDESNICK: Starting iterations
CUDESNICK: Loop completed
CUDESNICK: Synchronized
Options count             : 8000000
BlackScholesGPU() time    : 4.237635 msec
Effective memory bandwidth: 18.878456 GB/s
Gigaoptions per second    : 1.887846
BlackScholes, Throughput = 1.8878 GOptions/s, Time = 0.00424 s, Size = 8000000 options, NumDevsUsed = 1, Workgroup = 128
Reading back GPU results...
Checking the results...
...running CPU calculations.
Comparing the results...
L1 norm: 1.691779E-07
Max absolute error: 1.239777E-05
Shutting down...
...releasing GPU memory.
...releasing CPU memory.
Shutdown done.
[BlackScholes] - Test Summary
PASSED
./BlackScholes Exiting...
-----------------------------------------------------------

[Invocations 2 through 8 complete successfully, with output essentially identical to the first one.]

CUDESNICK-SHELL: Invoking BlackSholes for the 9th time
[BlackScholes]
./BlackScholes Starting...
Initializing data...
...allocating CPU memory for options.
...allocating GPU memory for options.
...generating input data in CPU mem.
...copying input data to GPU mem.
Data init done.
Executing Black-Scholes GPU kernel (512 iterations)...
CUDESNICK: Starting iterations
CUDESNICK: Loop completed

The box reboots right after the last message of the above log.

The number of invocations of the SDK example before the system reboot varies, but I haven’t seen it go above twelve.