We have been experiencing problems with the S1070 GPUs in our cluster since we installed CUDA 2.2. The problem is easily reproducible with the attached code. The code launches read and write kernels back to back in the same stream (a minimal sketch of this pattern is included after the list below). Each time, a kernel reads some memory locations, compares them with the expected pattern, records an error if they do not match, and writes the complement of the pattern back to those locations. This program will put the GPU into a bad state if it is killed in the middle of execution (e.g. Ctrl+C). By bad state, I mean
either the next GPU program will hang on the cudaMalloc() call,
or the order of kernel execution in the stream appears to be broken. For example, if we rerun the test program, we get lots of memory errors. A careful examination shows that the expected value and the value in memory are exact bitwise opposites at those erroneous locations, indicating that the write from the previous kernel has not reached memory yet.
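For reference, here is a minimal sketch of the kind of test involved. It is not the actual attachment: the kernel and variable names are illustrative, and the separate read and write kernels of the real test are fused into one kernel here for brevity. Each launch checks the pattern left by the previous launch in the same stream and then writes its complement, so any out-of-order execution shows up as a value that is the exact opposite of what was expected. Compile with -arch=sm_13 (or sm_11+) for atomicAdd.

#include <cstdio>
#include <cuda_runtime.h>

// Check that every word still holds the pattern written by the previous
// kernel, count any mismatches, then flip the buffer to the complement so
// the next kernel in the stream has a new expected value.
__global__ void checkAndFlip(unsigned int *buf, unsigned int expected,
                             unsigned int *errors, size_t n)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) {
        if (buf[i] != expected)
            atomicAdd(errors, 1u);
        buf[i] = ~expected;
    }
}

int main()
{
    const size_t n = 1 << 20;
    const unsigned int pattern = 0xAAAAAAAAu;   // 0xAA in every byte

    unsigned int *d_buf, *d_errors;
    cudaMalloc(&d_buf, n * sizeof(unsigned int));
    cudaMalloc(&d_errors, sizeof(unsigned int));
    cudaMemset(d_buf, 0xAA, n * sizeof(unsigned int));  // seed with the pattern
    cudaMemset(d_errors, 0, sizeof(unsigned int));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Back-to-back launches in the same stream; each launch expects the
    // complement of what the previous one expected.
    unsigned int expected = pattern;
    for (int iter = 0; iter < 10000; ++iter) {
        checkAndFlip<<<(unsigned int)((n + 255) / 256), 256, 0, stream>>>(
            d_buf, expected, d_errors, n);
        expected = ~expected;
    }
    cudaStreamSynchronize(stream);

    unsigned int h_errors = 0;
    cudaMemcpy(&h_errors, d_errors, sizeof(h_errors), cudaMemcpyDeviceToHost);
    printf("mismatches: %u\n", h_errors);

    cudaStreamDestroy(stream);
    cudaFree(d_errors);
    cudaFree(d_buf);
    return 0;
}

Killing a program like this mid-run (Ctrl+C) is what leaves the GPU in the bad state described above.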
Previously we had CUDA 2.1 in our cluster and it was fine; we only saw this problem with CUDA 2.2. We found that a reboot or reloading the nvidia kernel module would fix the problem and put the GPUs back into a clean state.
We are using Fedora 9, kernel 2.6.27.21-78.2.41.fc9.x86_64, with driver kernel module 185.18.08.
I will be happy to provide more information if needed.
thanks
-gshi
UPDATE: we tested the old driver 180.51 with CUDA 2.2 and that seems to make the problem go away. So the “bug” appears to be in the driver.
That’s interesting. I’m getting very strange behavior having to do with the same type of operations. I couldn’t isolate a simple case, so I can’t post an example, but now that I’ve read your post I will try downgrading my driver to the one that came out with CUDA 2.2 or the 2.2 beta.
Tim, I know it’s your job, but man, you’re awesome at jumping onto issues like this. Thanks from all us users! It definitely helps keep the CUDA environment easier to swim in knowing there’s active and responsive support like yours. Thanks!
Is there any update here? For the full duration of the 185-series driver, I’ve been checking the GPU health state after each of my users runs an application. For some applications this isn’t a problem at all; for others it’s intermittent. And some applications consistently leave the GPU in a bad state, meaning I have to tell those users that none of their results can be trusted. I’d just fall back to 2.1/180.51, but some users already depend on the 2.2 feature set.
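In case it helps others, here is a simplified sketch of the kind of check I mean (not my exact script): it just round-trips a known pattern through device memory and exits non-zero on any failure, so a wrapper can flag the node. On a wedged board cudaMalloc() itself may hang, so the probe needs to be run under an external timeout.

#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

// Minimal post-job health probe: allocate device memory, copy a known
// pattern in and back out, and report non-zero if anything fails or
// the data comes back changed.
int main()
{
    const size_t n = 1 << 20;
    unsigned int *d_buf = NULL;

    // On a GPU left in the bad state, this is typically the call that hangs,
    // so run the whole probe under an external timeout.
    if (cudaMalloc(&d_buf, n * sizeof(unsigned int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: GPU unhealthy\n");
        return 1;
    }

    unsigned int *h_src = new unsigned int[n];
    unsigned int *h_dst = new unsigned int[n];
    for (size_t i = 0; i < n; ++i)
        h_src[i] = 0xDEADBEEFu ^ (unsigned int)i;

    cudaMemcpy(d_buf, h_src, n * sizeof(unsigned int), cudaMemcpyHostToDevice);
    cudaMemcpy(h_dst, d_buf, n * sizeof(unsigned int), cudaMemcpyDeviceToHost);

    int bad = (memcmp(h_src, h_dst, n * sizeof(unsigned int)) != 0);
    if (bad)
        fprintf(stderr, "pattern mismatch: GPU unhealthy\n");

    delete[] h_src;
    delete[] h_dst;
    cudaFree(d_buf);
    return bad;
}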
I believe the NVIDIA employees watching this forum are overwhelmed and unable to reply. At the same time we are lost, not knowing whether the problem present in 2.2 is solved in 2.3, or when 2.3 will be out. As another example of the load they are under: I have been trying to register as a developer since before 2.2, but nothing yet.
FYI: the CUDA 2.3/190.09 combination suffers from this same issue. I’d hate to see a 190-series release without this fixed. It would be best, of course, to see both the 185 and 190 series fixed.
When you say “next release”, you don’t mean 2.4, do you? I was hoping to see a driver update that resolved this; it’s plaguing us terribly. We’re having to tell lots of HPC users that their results can’t be trusted.
Looks like 190.xx is still officially beta for Linux. The first “stable” release should fix this (I’ll double-check to make sure that everything made it into 190).