Well, I don’t expect much support, as I know I won’t be able to provide enough information in this thread, but it is worth a (free) try! Unfortunately I cannot share the code for this problem.
Here is the situation… I’ve been working on this project for around 2 years and there are some parts of it that I have not touched in about a year. When I switched over to driver version 270.81 (shipping with CUDA 4.0), I started getting random “unknown error” failures in these parts.
They are random in this sense: I run through a series of 2 kernels around 300 times, and the run # at which it crashes is not always the same. Which of the 2 kernels fails within that run is also not always the same.
I should also say that these ~300 runs are independent of one another and they are not “additive”.
This is on a GTX 480 with Windows 7 64-bit, the 270.81 driver and the 32-bit CUDA toolkit 3.2. With the same setup but driver 260.93, everything works fine. I’ve also tried the combination of 270.81 and CUDA 4.0, with the same bad result.
So if anyone has any insight on what might differ between these two driver versions… I’m all ears!
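For anyone trying to narrow down a failure like this, one option is to check for errors after every launch and every synchronization, so that the failing run and kernel are reported as soon as they occur. The following is only a rough sketch: `kernelA`, `kernelB`, `cudaOk` and `firstFailingRun` are made-up placeholder names, not the code from this project, and `cudaDeviceSynchronize` assumes toolkit 4.0 or later (on 3.2 the equivalent call is `cudaThreadSynchronize`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints a labelled message and returns false on any CUDA error.
static bool cudaOk(cudaError_t err, const char *what, int run)
{
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "run %d, %s: %s\n", run, what, cudaGetErrorString(err));
    return false;
}

// Placeholder kernels standing in for the two real ones.
__global__ void kernelA(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f;
}

__global__ void kernelB(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] + 1.0f;
}

// Runs both kernels `runs` times, synchronizing after each launch.
// Returns the index of the first failing run, or -1 if every run succeeded.
// A failed allocation is reported as a failure of run 0.
int firstFailingRun(int runs, int n)
{
    float *d = NULL;
    if (!cudaOk(cudaMalloc(&d, n * sizeof(float)), "cudaMalloc", 0)) return 0;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    int failed = -1;

    for (int run = 0; run < runs && failed < 0; ++run) {
        bool ok = cudaOk(cudaMemset(d, 0, n * sizeof(float)), "cudaMemset", run);
        if (ok) {
            kernelA<<<blocks, threads>>>(d, n);
            // Launch errors show up in cudaGetLastError, execution errors at the sync.
            ok = cudaOk(cudaGetLastError(), "kernelA launch", run) &&
                 cudaOk(cudaDeviceSynchronize(), "kernelA execution", run);
        }
        if (ok) {
            kernelB<<<blocks, threads>>>(d, n);
            ok = cudaOk(cudaGetLastError(), "kernelB launch", run) &&
                 cudaOk(cudaDeviceSynchronize(), "kernelB execution", run);
        }
        if (!ok) failed = run;
    }

    cudaFree(d);
    return failed;
}
```

Calling something like `firstFailingRun(300, 1 << 20)` from `main` would then print which run and which kernel reported the error first. Synchronizing after every launch slows things down, so this is meant for debugging only.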
It is all single precision (unless I forgot an ‘f’ somewhere) and compiled with -use_fast_math (or whatever the correct spelling is!).
Out of curiosity, what could go wrong with double precision? I seem to recall a thread a month ago about random failures on GeForce hardware but not on Teslas; is that what you are referring to?
Yeah, that’s what I was thinking of. Perhaps it’s a hardware issue triggered by some bad driver behaviour. Apart from this, I have no other ideas.
I have also been having issues with CUDA Toolkit 4.0 final and the new developer driver 270.81. If I run my program on the 4.0 prerelease with driver version 270.61, I have no problems. However, with the new driver, I have been seeing random failures in my program with increasing frequency, and I know it is not the program itself. Also, when I run this program and get these errors, I get a message that “Display Driver 270.81 has stopped responding and has successfully recovered” or something like that. I am running this program on a Tesla C2050, so it is also happening on the Tesla series. I am also running this on Windows 7 64-bit. I think it is a driver issue because we are seeing the same problem while using different toolkit versions. I have to use this driver because, at the moment, no other driver works with Toolkit 4.0 final: with any other driver version, my devices are not found. Is Nvidia aware of these issues? If not, I can start a request after work today. If anyone has found a solution other than switching toolkit versions and drivers, please let me know.
Edit: I would also like to add that I am not using this card as a graphics card and my code is single-precision.
@Rose
Maybe you should first ensure that the timeout mechanism is disabled. You can check that using the Visual Profiler (View->device), or you could just run deviceQuery.
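If you would rather check this from your own code, the runtime API exposes the same flag that deviceQuery reports as the run time limit on kernels. A minimal sketch, assuming a toolkit where `cudaDeviceProp` has the `kernelExecTimeoutEnabled` field (`printTimeoutStatus` is just an illustrative name):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints, for every CUDA device, whether the run-time limit on kernels is active.
// Returns the number of devices found, or -1 if a query failed.
int printTimeoutStatus()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return -1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties(%d): %s\n",
                         dev, cudaGetErrorString(err));
            return -1;
        }
        // Non-zero means the display watchdog can abort long-running kernels.
        std::printf("Device %d (%s): kernel execution timeout %s\n",
                    dev, prop.name,
                    prop.kernelExecTimeoutEnabled ? "enabled" : "disabled");
    }
    return count;
}
```

Call `printTimeoutStatus()` once at startup. If the timeout shows as enabled on the card running the kernels, a long kernel can be killed by the watchdog, which would fit the “display driver has stopped responding” message.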
Thanks for responding. I checked and it is enabled. I have never had to disable this feature before, but I will try disabling this on Monday and post my results. The errors I get are intermittent, so I will not be able to tell whether or not it has fixed the issue until I run the code several times.