We have been experiencing problems with the S1070 GPUs in our cluster since we installed CUDA 2.2. The problem is easily reproducible with the attached code. The code launches read and write kernels back to back in the same stream (a minimal sketch of this pattern is included after the list below). Each time, a kernel reads some memory locations, compares them with the expected pattern, records an error if they do not match, and writes the complement of the pattern back to those locations. This program will put the GPU into a bad state if it is killed in the middle of execution (e.g. Ctrl+C). By bad state, I mean
either the next GPU program will hang on the cudaMalloc() call,
or the order of kernel execution in the stream appears to be broken. For example, if we rerun the test program, we get lots of memory errors. A careful examination shows that the expected value and the value in memory are exact bitwise opposites at those erroneous locations, indicating that the write from the previous kernel has not reached memory yet.
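For reference, here is a minimal sketch of the kind of test involved. It is not the actual attachment: the kernel and variable names are illustrative, and the separate read and write kernels of the real test are fused into one kernel here for brevity. Each launch checks the pattern left by the previous launch in the same stream and then writes its complement, so any out-of-order execution shows up as a value that is the exact opposite of what was expected. Compile with -arch=sm_13 (or sm_11+) for atomicAdd.

#include <cstdio>
#include <cuda_runtime.h>

// Check that every word still holds the pattern written by the previous
// kernel, count any mismatches, then flip the buffer to the complement so
// the next kernel in the stream has a new expected value.
__global__ void checkAndFlip(unsigned int *buf, unsigned int expected,
                             unsigned int *errors, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        if (buf[i] != expected)
            atomicAdd(errors, 1u);
        buf[i] = ~expected;
    }
}

int main()
{
    const size_t n = 1 << 20;
    const unsigned int pattern = 0xAAAAAAAAu;   // 0xAA in every byte

    unsigned int *d_buf, *d_errors;
    cudaMalloc(&d_buf, n * sizeof(unsigned int));
    cudaMalloc(&d_errors, sizeof(unsigned int));
    cudaMemset(d_buf, 0xAA, n * sizeof(unsigned int));  // seed with the pattern
    cudaMemset(d_errors, 0, sizeof(unsigned int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Back-to-back launches in the same stream; each launch expects the
    // complement of what the previous one expected.
    unsigned int expected = pattern;
    for (int iter = 0; iter < 10000; ++iter) {
        checkAndFlip<<<(unsigned int)((n + 255) / 256), 256, 0, stream>>>(
            d_buf, expected, d_errors, n);
        expected = ~expected;
    }
    cudaStreamSynchronize(stream);

    unsigned int h_errors = 0;
    cudaMemcpy(&h_errors, d_errors, sizeof(h_errors), cudaMemcpyDeviceToHost);
    printf("mismatches: %u\n", h_errors);

    cudaStreamDestroy(stream);
    cudaFree(d_errors);
    cudaFree(d_buf);
    return 0;
}

Killing a program like this mid-run (Ctrl+C) is what leaves the GPU in the bad state described above.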
Previously we had CUDA 2.1 in our cluster and it was fine; we only saw this problem with CUDA 2.2. We found that a reboot or reloading the nvidia kernel module would fix the problem and put the GPUs back into a clean state.
We are using Fedora 9, kernel 2.6.27.21-78.2.41.fc9.x86_64, with driver kernel module 185.18.08.
I will be happy to provide more information if needed.
thanks
-gshi
UPDATE: we tested the old driver 180.51 with CUDA 2.2 and that seems to make the problem go away. So the “bug” appears to be in the driver.
That’s interesting. I’m getting very strange behavior having to do with the same type of operations. I couldn’t isolate a simple case, so I can’t post an example, but now that I’ve read your post I will try downgrading my driver to the one that came out with CUDA 2.2 or the 2.2 beta.
Tim, I know it’s your job, but man, you’re awesome at jumping onto issues like this. Thanks from all us users! It definitely helps keep the CUDA environment easier to swim in knowing there’s active and responsive support like yours. Thanks!
Is there any update here? For the full duration of the 185-series driver, I’ve been checking the GPU health state after each of my users runs an application. For some applications this isn’t a problem at all; for others it’s intermittent. And some applications consistently leave the GPU in a bad state, meaning I have to tell those users that none of their results can be trusted. I’d just fall back to 2.1/180.51, but some users already depend on the 2.2 feature set.
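In case it helps others, here is a simplified sketch of the kind of check I mean (not my exact script): it just round-trips a known pattern through device memory and exits non-zero on any failure, so a wrapper can flag the node. On a wedged board cudaMalloc() itself may hang, so the probe needs to be run under an external timeout.

#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Minimal post-job health probe: allocate device memory, copy a known
// pattern in and back out, and report non-zero if anything fails or
// the data comes back changed.
int main()
{
    const size_t n = 1 << 20;
    unsigned int *d_buf = NULL;

    // On a GPU left in the bad state, this is typically the call that hangs,
    // so run the whole probe under an external timeout.
    if (cudaMalloc(&d_buf, n * sizeof(unsigned int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: GPU unhealthy\n");
        return 1;
    }

    unsigned int *h_src = new unsigned int[n];
    unsigned int *h_dst = new unsigned int[n];
    for (size_t i = 0; i < n; ++i)
        h_src[i] = 0xDEADBEEFu ^ (unsigned int)i;

    cudaMemcpy(d_buf, h_src, n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(h_dst, d_buf, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    int bad = (memcmp(h_src, h_dst, n * sizeof(unsigned int)) != 0);
    if (bad)
        fprintf(stderr, "pattern mismatch: GPU unhealthy\n");

    delete[] h_src;
    delete[] h_dst;
    cudaFree(d_buf);
    return bad;
}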
I believe the NVIDIA employees watching this forum are overwhelmed and unable to reply. At the same time we are lost, not knowing whether the problem present in 2.2 is solved in 2.3, or when 2.3 will be out. As another example of the load they are under: I have been trying to register as a developer since before 2.2, but nothing yet.
FYI: the CUDA 2.3/190.09 combination suffers from this same issue. I’d hate to see a 190-series release without this fixed. It would be best, of course, to see both the 185 and 190 series fixed.
When you say “next release”, you don’t mean 2.4, do you? I was hoping to see a driver update that resolved this; it’s plaguing us terribly. We’re having to tell lots of HPC users that their results can’t be trusted.
Looks like 190.xx is still officially beta for Linux. The first “stable” release should fix this (I’ll double-check to make sure that everything made it into 190).