cuda 2.2 bug?

Hi,

We have been experiencing a problem with the S1070 GPUs in our cluster since we installed cuda 2.2. The problem is easily reproducible with the attached code. The code launches read and write kernels back to back in the same stream. Each time, the kernel reads a memory location, compares it with the expected pattern, records an error if they do not match, and writes the complement of the pattern to that memory location. This program will put the GPU into a bad state if it is killed in the middle of execution (e.g. ctrl + c). By bad state, I mean

  1. either the next GPU program will hang on the cudaMalloc() function call,
  2. or the order of kernel execution in the stream seems to be broken. For example, if we ran the test program, we would get lots of memory errors. A careful examination shows that the expected value and the memory value are exact complements of each other in those erroneous memory locations, indicating that the write of the previous kernel has not reached the memory yet.
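
For readers without the attachment, here is a minimal sketch of the kind of test described above. It is not the attached program. The kernel name `check_and_flip`, the helper `run_passes`, the `CHECK` macro, the pattern `0x5A5A5A5A`, the buffer size, and the launch configuration are all assumptions made for illustration. It uses only standard CUDA runtime calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a message if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Read each word, compare it with the expected pattern, count a mismatch,
// then write the complement so the next pass expects ~expected.
// atomicAdd on global memory needs compute capability 1.1 or higher.
__global__ void check_and_flip(unsigned int *buf, size_t n,
                               unsigned int expected, unsigned int *errors)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        if (buf[i] != expected)
            atomicAdd(errors, 1u);
        buf[i] = ~expected;
    }
}

// Launch `passes` kernels back to back in one stream; return mismatch count.
unsigned int run_passes(int passes)
{
    const size_t n = 1 << 20;              // assumed buffer size (words)
    unsigned int pattern = 0x5A5A5A5Au;    // assumed test pattern
    unsigned int *buf = NULL, *d_errors = NULL, h_errors = 0;
    cudaStream_t stream;

    CHECK(cudaStreamCreate(&stream));
    CHECK(cudaMalloc((void **)&buf, n * sizeof(unsigned int)));
    CHECK(cudaMalloc((void **)&d_errors, sizeof(unsigned int)));
    CHECK(cudaMemset(buf, 0x5A, n * sizeof(unsigned int)));  // every byte 0x5A
    CHECK(cudaMemset(d_errors, 0, sizeof(unsigned int)));

    for (int p = 0; p < passes; ++p) {
        // Same stream, so pass p+1 must observe the writes of pass p.
        check_and_flip<<<64, 256, 0, stream>>>(buf, n, pattern, d_errors);
        pattern = ~pattern;
    }
    CHECK(cudaGetLastError());
    CHECK(cudaStreamSynchronize(stream));
    CHECK(cudaMemcpy(&h_errors, d_errors, sizeof(h_errors),
                     cudaMemcpyDeviceToHost));

    CHECK(cudaFree(d_errors));
    CHECK(cudaFree(buf));
    CHECK(cudaStreamDestroy(stream));
    return h_errors;
}

int main(void)
{
    printf("mismatches: %u\n", run_passes(100));
    return 0;
}
```

Because every pass writes the complement of what it expected, a word that still holds the previous pattern shows up as the exact complement of the expected value, which is the symptom described in item 2. On a healthy device this sketch should report zero mismatches.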

Previously we had cuda 2.1 in our cluster and it was fine. We only saw this problem with cuda 2.2. We found that a reboot or reloading the nvidia module fixes the problem and puts the GPUs back into a clean state.

We are using Fedora 9 with kernel 2.6.27.21-78.2.41.fc9.x86_64 and driver kernel module 185.18.08.
I will be happy to provide more information if needed.

thanks
-gshi

UPDATE: we tested the old driver 180.51 with cuda 2.2 and that seems to make the problem go away. So the “bug” seems to be in the driver.

That’s interesting, I’m getting very funny behavior having to do with the same type of operations. I couldn’t isolate a simple case so I can’t post an example … but now, reading your post, I will try downgrading my driver to the one that came out with cuda 2.2 or 2.2 beta.

erdooom, thanks for replying.

We have users reporting that they got wrong answers on GPU nodes in the bad state.

I would appreciate it if Nvidia could address this issue.

thanks

-gshi

Looking into this now–looks like I can replicate it, trying to figure out what’s going on.

Tim, I know it’s your job, but man, you’re awesome at jumping onto issues like this. Thanks from all us users! It definitely helps keep the CUDA environment easier to swim in knowing there’s active and responsive support like yours. Thanks!

That’s great. Thanks for looking into this.

-gshi

Is there any update here? For the full duration of the 185 series driver, I’ve been checking the health state after each of my users runs an application. For some applications, this isn’t a problem at all. For others, intermittent. And some applications consistently leave the GPU in a bad state, meaning that I have to tell those users that all their application results cannot be trusted. I’d just back off to 2.1/180.51 but some users are already dependent on the 2.2 feature set.

When can we expect a fix here? I’m desperate.

thx-

Jeremy Enos

NCSA

I believe the NVIDIA employees watching this forum are overwhelmed and not able to reply. At the same time we are lost, not knowing whether the problem present in 2.2 is solved in 2.3 or not, or when 2.3 will be out. Another example of the load they are under is the fact that I have been trying to register as a developer since before 2.2, with nothing yet.

FYI- CUDA 2.3/190.09 combination suffers this same issue. I’d hate to see a 190 series release w/o this fixed. It would be best of course to see both 185 and 190 series get fixed.

some update:

with the newly released cuda 2.3 and driver 190.16, the issue remains the same

Yeah, I know–we’ve found the problem, should have a fix out in the next release. (I tested it today on your app)

Thanks Tim, I am looking forward to the new release

-gshi

Thanks Tim-

When you say “next release”, you don’t mean 2.4 do you? I was hoping to see a driver update that resolved this… it’s plaguing us terribly. We’re having to tell lots of HPC users that their results can’t be trusted as a result.

Jeremy

There should be another 190.xx (and maybe another 185.xx) release that contains the fix.

some update:

We found that driver 185.18.31 with cuda 2.2 fixed the problem.

http://www.nvidia.com/object/linux_display…_185.18.31.html

Thanks a lot for the driver fix.

However, the latest driver with cuda 2.3 is still broken (we have postponed 2.3 deployment in our cluster for this reason).

It would be great if the problem can be fixed in the latest driver too.

Looks like 190.xx is still officially beta for Linux. The first “stable” release should fix this (I’ll double check to make sure that everything made it into 190).

Do you have an ETA for this driver being pushed out of beta?

Looking for one.

oh, I’m a dope, apparently it came out four days ago

ftp://download.nvidia.com/XFree86/Linux-x…190.25-pkg0.run
ftp://download.nvidia.com/XFree86/Linux-x…190.25-pkg0.run

(top is x86, bottom is x86-64)

I’m so glad that the people who make such a nice tool (CUDA) possible for us are not the same people who manage the NVIDIA site, because otherwise we would still be using CPUs instead of GPUs.

A poor newbie user clicking on GET CUDA will still download old beta drivers.

At that location (for x86-64 at least) there are 3 files: pkg0/pkg1/pkg2. Which one is good for RH5.3?

Thank you for the links.