— RESOLVED! It is not a hardware problem, it is due to the driver —
I have found an unexpected behaviour while working on an S1070.
I have reduced my program to a minimal test case where:
- an FFT is performed and all the resources it used are released
- a host->device->device->host copy is performed
- the initial vector on the host is compared with the vector copied back
The comparison does not always return the expected results, and it looks as if the wrong memory segments are being copied.
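For reference, a minimal sketch of that kind of test is below; the buffer size, variable names and the lack of error checking are my own assumptions, not the original code. Compile with nvcc and link against cufft.

// Sketch only: run and tear down a cuFFT transform, then do a
// host->device->device->host round trip and compare with the original data.
#include <cufft.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

int main(void)
{
    /* Step 1: an FFT whose resources are released afterwards
       (the input contents are irrelevant for this test) */
    cufftHandle plan;
    cufftComplex *fft_buf;
    cudaMalloc((void **)&fft_buf, N * sizeof(cufftComplex));
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    cufftExecC2C(plan, fft_buf, fft_buf, CUFFT_FORWARD);
    cufftDestroy(plan);
    cudaFree(fft_buf);

    /* Step 2: host -> device -> device -> host copy */
    float *h_src = (float *)malloc(N * sizeof(float));
    float *h_dst = (float *)malloc(N * sizeof(float));
    for (int i = 0; i < N; ++i) h_src[i] = (float)i;

    float *d_a, *d_b;
    cudaMalloc((void **)&d_a, N * sizeof(float));
    cudaMalloc((void **)&d_b, N * sizeof(float));
    cudaMemcpy(d_a, h_src, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, d_a, N * sizeof(float), cudaMemcpyDeviceToDevice);
    cudaMemcpy(h_dst, d_b, N * sizeof(float), cudaMemcpyDeviceToHost);

    /* Step 3: compare the round-tripped vector with the original */
    int mismatches = 0;
    for (int i = 0; i < N; ++i)
        if (h_src[i] != h_dst[i]) ++mismatches;
    printf("%d mismatching elements\n", mismatches);

    cudaFree(d_a);
    cudaFree(d_b);
    free(h_src);
    free(h_dst);
    return mismatches ? 1 : 0;
}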
This unexpected behaviour appears only on device 0 of the S1070, not on the remaining devices. The same test has also been run on a GeForce 8800, where no unexpected behaviour occurred. It seems to be a problem of that one device only.
The test is minimal: for example, if the FFT is removed, the problem disappears.
However, although it is not shown in this test, the problem seems to persist if another kernel is run instead of the FFT, so it should not be a problem of cufft itself.
We have been able to flush out errors that memtestG80 cannot detect.
You can try running test 10 continuously, since it is the most intensive one:
./cuda_memtest --disable_all --enable_test 10 -exit_on_error
gshi,
thanks for the memtest and for the advice. You were right: your memtest fails where the other one succeeds.
It seems the problem lies in the driver: we are now using 185.18.14. After installing it the problem apparently disappeared, but later it appeared again (your test started to fail). The error goes away for a while if we remove the driver with rmmod and load it again with insmod.
By the way, after using your test we encountered the error on devices 2, 3 and 4 too.
You can find it via the “Display Driver Archive Link” in the upper left of the download page for the current driver. Or, as a shortcut, here are the x86_64 drivers (not sure what you’re using):
So 180.51 does not seem to have the same susceptibility to errors as the 185 series drivers. However, I think 185 was made specifically for CUDA 2.2 feature support (that’s a guess). I don’t know what pieces of CUDA 2.2 break with 180.51; basic tests seem to work OK, but I haven’t tested exhaustively. I’d rather just see a 185 series fix.
When it is in the clean state, no errors occur (for a while). After cleaning the state once more, the tests ran for 10 hours last night without any error. Your ansatz (that the errors are related to unusual termination of programs, i.e. ctrl+c) seems to be right.
This is one of the error messages:
06/08/2009 17:26:51 cuda_memtest errors found in tesla[0]
Yeap, these error messages come from the CUDA 2.2 bug, not from actual hardware errors.
We have run the test on 500+ GPUs. The slowest case for finding a hardware error was 12 hours, i.e. the error appeared about once every 12 hours.
Since you have run it for 10+ hours without a problem, I think the chance of detecting an error is small, although you could run it longer (24 hours) to see what happens.