CUDA 3.2 Driver Broken? Oops...

Hi There!

I fear the worst – the CUDA 3.2 driver is “broken”. Platform: Linux, 32-bit, Ubuntu 9.04.

I am running a complex genetic algorithm on CUDA. We finished the project a long time back, in the CUDA 2.3 days…

So, today I re-compiled it on a CUDA 3.2 installation and found that it no longer returns correct results.
All results were zeroes… OMG! Unbelievable…
The same code ran rock solid on CUDA 2.3.

So, keeping the driver at 260.19 (the one that comes with CUDA 3.2 for 32-bit Linux), I just changed the toolkit back to CUDA 2.3.
No change! The problem persisted.

So, I downgraded the system to the 190.53 driver (the one that comes with CUDA 2.3 for Linux) and then everything works!!

I think this problem could be related to what was posted in http://forums.nvidia.com/index.php?showtopic=186015

Is NVIDIA aware of this problem?
It would be difficult for me to put together a bug report on this. For one, genetic algorithms are very complex to control and debug; for another, my bandwidth is limited. It would take many hours to produce a repro case.

Any help, guys?

Thanks,
Best Regards,
Sarnath

That driver version most certainly isn’t broken. I have it in production with both the CUDA 2.3 and 3.2 toolkits on our cluster on a mixture of GT200 and GF100 cards and it works perfectly, including “legacy” code written in the pre-Fermi, pre-3.0 toolkit era.

Are you sure it isn’t just execution parameters? Could it be that by recompiling with the newer toolkit and compiler, the kernel register consumption has changed and kernels are no longer launching?
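To make that concrete, this is the kind of minimal check I mean (the kernel and buffer names below are placeholders, not your code); compiling with nvcc --ptxas-options=-v will also tell you whether the per-kernel register count changed between toolkits:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Stand-in kernel; the point is the checks after the launch, not the kernel.
    __global__ void myKernel(float *data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] *= 2.0f;
    }

    int main()
    {
        const int n = 1 << 20;
        float *d_data = 0;
        cudaMalloc((void **)&d_data, n * sizeof(float));

        myKernel<<<n / 256, 256>>>(d_data, n);

        // A launch that asks for more registers or shared memory than the
        // device can supply fails silently unless you ask for the error.
        cudaError_t err = cudaGetLastError();
        if (err != cudaSuccess)
            fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

        // Errors raised while the kernel runs only surface at the next sync.
        err = cudaThreadSynchronize();
        if (err != cudaSuccess)
            fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));

        cudaFree(d_data);
        return 0;
    }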

Not really. I compiled with the older toolkit and ran the executable on the latest driver. It failed. That is how I ruled out the toolkit.

I just changed the driver and then re-ran the same executable and it ran fine…

We check errors for all the calls. In any case, I will recheck what you said. Thanks a LOT for your hints on a Friday evening here!

FYI, the driver seems semi-broken to me too. I developed an application involving prime numbers, PSieve-CUDA, and since the 260 drivers came out I’ve gotten more and more reports of problems with them. I’m using them myself, and the app fails for me in some cases but not others. This app uses integers exclusively, so there’s no floating-point issue. It’s also embarrassingly parallel, so it’s easy to adjust runtime parameters. It uses almost no memory (a few MB at most), and only registers and constants are accessed in the inner loop. And I always compile with the 2.3 SDK, so that version isn’t an issue.

In the first ranges I tested, the driver wasn’t a problem. I did notice a problem when I overloaded the card with about eight times as many CUDA threads as the GPU should need. For some reason, this makes the app run a bit faster. I have a couple of versions of the app, and for BOINC I have to disable PThreads, which otherwise allow running the app on multiple GPUs at the same time. When using PThreads I didn’t notice the problem either, and got the small speedup I was looking for. But without PThreads and with that overload I got computation errors (which means the GPU miscalculated something, and the CPU caught it).

I’m going to try running the new range with PThreads tomorrow and see if it helps. But for now you might try decreasing your CUDA thread count, if you can. You might also try breaking up your computation into smaller (or perhaps larger) pieces per kernel run. I suspect that might be the cause of the newest errors, and I’m going to explore that tomorrow too.
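To be clear about what I mean by breaking it up: replace one huge launch with several smaller ones. This is only a generic sketch with a dummy kernel, not PSieve-CUDA’s actual code:

    #include <cuda_runtime.h>

    // Dummy kernel standing in for the real per-candidate work.
    __global__ void workKernel(unsigned long long *out,
                               unsigned long long start, int count)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < count)
            out[i] = start + i;
    }

    // Cover the same range with several short launches instead of one long one.
    // chunkSize and threadsPerBlock are the knobs to experiment with.
    void runRangeInChunks(unsigned long long *d_out,
                          unsigned long long rangeStart,
                          int totalCount, int chunkSize, int threadsPerBlock)
    {
        for (int done = 0; done < totalCount; done += chunkSize) {
            int count  = (totalCount - done < chunkSize) ? totalCount - done
                                                         : chunkSize;
            int blocks = (count + threadsPerBlock - 1) / threadsPerBlock;
            workKernel<<<blocks, threadsPerBlock>>>(d_out + done,
                                                    rangeStart + done, count);
            cudaThreadSynchronize();   // keep each piece separately checkable
        }
    }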

I need actual repro cases before I can dig any deeper. If the driver was just “broken,” it never would have passed internal QA. Obviously, we can’t test everything, so maybe we missed something, but just saying that it doesn’t work doesn’t do any good. I need code that works with one driver but not another.

Just as a side note, I experienced problems as well (my app ran stably with the previous drivers but crashed with the newer ones), but it turned out that the cause was a bug in the program which, for some reason, just didn’t have any effect with the older drivers.

Cheers
Ceearem

I experienced lower performance on my homebrew FFT kernel with this driver too; I rolled back to 3.0 and things went back to normal.

Ceearem, could you give me an idea of what your bug was? It might help others if anyone else has the same problem.

As for my app, it seems that when it runs inside a separate PThread, there’s no problem. But this is very inconvenient in certain situations. Also, when the bug does appear, it’s not always in the same step of the calculation. This makes me suspect some kind of race condition.

I’ll post back when I have a verified case that passes on older drivers.

OK, I have a test case for you:

  1. Get TPSieve-CUDA. The source code is on that GitHub link I posted earlier, on the redc branch.
  2. On 64-bit Linux, run “./tpsieve-cuda-boinc-x86_64-linux -p420700e9 -P420701000e6 -k 1201 -K 9999 -N 3000000 -c 60 -M 2 -T -m 64 --device 0”. If it completes correctly it should print that it found 208 factors. If it fails, which it does on 260.19.* drivers, it won’t print that and will print a “computation error” message to stderr.txt.

I haven’t had errors in calculations, but I’ve noticed a serious decrease in performance using CUSP compared to version 3.0. I describe the problem here:

http://forums.nvidia.com/index.php?showtopic=184785

The good news is that my fears were wrong!

Apologies to NVIDIA…

It was a “cudaMemcpy” bug that CUDA 2.3 did not complain about – meaning there was silent memory corruption!
The software did not have error checks on that memcpy, and at a few other places as well (as avidday rightly pointed out).
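For anyone else who lands here: the fix boiled down to wrapping the runtime calls so nothing can fail silently. Roughly this kind of thing (a sketch with made-up buffer names, not our actual code):

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Abort loudly on any runtime-API error instead of ignoring the return code.
    #define CUDA_CHECK(call)                                            \
        do {                                                            \
            cudaError_t e = (call);                                     \
            if (e != cudaSuccess) {                                     \
                fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                        cudaGetErrorString(e));                         \
                exit(EXIT_FAILURE);                                     \
            }                                                           \
        } while (0)

    int main()
    {
        const size_t n = 1024;
        float  h_buf[1024] = {0};
        float *d_buf = 0;

        CUDA_CHECK(cudaMalloc((void **)&d_buf, n * sizeof(float)));
        CUDA_CHECK(cudaMemcpy(d_buf, h_buf, n * sizeof(float),
                              cudaMemcpyHostToDevice));
        CUDA_CHECK(cudaFree(d_buf));
        return 0;
    }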

After fixing the bug, the results of the complex simulation are the same across driver versions!

Sorry about that Tim,

Best Regards,
Sarnath

What kind of bug was it?

Several things, but mostly faulty memory accesses (e.g. a memcpy with a larger-than-allocated size, etc.) which worked in earlier CUDA versions [and I routinely run stuff on 3-4 different Linux systems]. I think I also remember that there were out-of-bounds shared memory accesses. Btw, does anyone know a good toolchain (or can point me to a post about one) on Linux for finding memory access errors and memory leaks in an MPI-based multi-GPU code? I mean, there are bugs which might only occur when running more than 27 (3x3x3 grid) MPI processes (each with a separate GPU hooked to it).

Cheers

Ceearem