S1070 device 0 broken (test case provided)

— RESOLVED! It is not a hardware problem; it is caused by the driver —

I have found some unexpected behaviour while working on an S1070.

I have reduced my program to a minimal test case where:

- an FFT is performed and all the resources it used are released

- a copy host -> device -> device -> host is performed

- the initial vector on the host is compared with the copied vector

The comparison does not always return the expected results, and it seems that the wrong memory segments are being compared.
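In outline, the reduced test does something like the following. This is only a sketch with hypothetical sizes and names (and it needs a CUDA-capable GPU to run); the full version is in the attached CUDATestCaseFFT.cpp:

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>
#include <cufft.h>

#define N 1024  // hypothetical vector length

int main() {
    cufftComplex h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) { h_in[i].x = (float)i; h_in[i].y = 0.0f; }

    // 1) Run an FFT and release all the resources it used.
    cufftComplex* d_fft;
    cudaMalloc((void**)&d_fft, N * sizeof(cufftComplex));
    cudaMemcpy(d_fft, h_in, N * sizeof(cufftComplex), cudaMemcpyHostToDevice);
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_fft, d_fft, CUFFT_FORWARD);
    cufftDestroy(plan);
    cudaFree(d_fft);

    // 2) Copy host -> device -> device -> host.
    cufftComplex *d_a, *d_b;
    cudaMalloc((void**)&d_a, N * sizeof(cufftComplex));
    cudaMalloc((void**)&d_b, N * sizeof(cufftComplex));
    cudaMemcpy(d_a, h_in, N * sizeof(cufftComplex), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, d_a, N * sizeof(cufftComplex), cudaMemcpyDeviceToDevice);
    cudaMemcpy(h_out, d_b, N * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

    // 3) Compare the round-tripped vector with the original.
    if (memcmp(h_in, h_out, N * sizeof(cufftComplex)) != 0)
        printf("MISMATCH: round-trip copy corrupted the data\n");
    else
        printf("OK\n");
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```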

This unexpected behaviour shows up only on device 0 of the S1070 and not on the remaining devices. The same test has been run on a GeForce 8800 and no unexpected behaviour occurred, so it seems to be a problem with that device only.

The test is minimal: for example, if the FFT is removed the problem disappears.

Although it is not shown in the test, it seems to me that the problem persists if another kernel is executed instead of the FFT, so it should not be a problem with cuFFT.

I’ve run the memtest provided at https://simtk.org/home/memtest and no errors occurred.

I’ve attached the test case showing the errors. Does anybody have an idea of what is going on on my machine?

Thanks,

Francesco

CUDATestCaseFFT.cpp (5.35 KB)

fbasile,

Maybe you can try our memory test:
https://sourceforge.net/projects/cudagpumemtest/

We have been able to flush out errors that memtestG80 cannot detect.
You can run test 10 continuously, since it is the most intensive one:
./cuda_memtest --disable_all --enable_test 10 -exit_on_error

If you are using CUDA 2.2, be aware that it might trigger a bug that we reported earlier:
http://forums.nvidia.com/index.php?showtopic=97379

Good luck
-gshi

gshi,
thanks for the memtest and for the advice. You were right: your memtest fails where the other one succeeds.
It seems the problem lies in the driver: we are now using 185.18.14. After installing it the problem apparently disappeared, but later it appeared again (your test started to fail). The error disappears for a while if we remove the driver with rmmod and load it again with insmod.
By the way, after using your test we encountered the error on devices 2, 3 and 4 too.
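For reference, the reload sequence is just the following (requires root, and no X server or CUDA application may be using the GPUs; the module path is hypothetical and depends on the install):

```shell
# Return the GPUs to a clean state by unloading and reloading
# the NVIDIA kernel module.
rmmod nvidia
insmod /lib/modules/$(uname -r)/kernel/drivers/video/nvidia.ko
```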

Could you tell me where to find the driver 180.51 you refer to in http://forums.nvidia.com/index.php?showtopic=97379 ?

Thanks,
Francesco

You can find it from the “Display Driver Archive Link” that is on the upper left of the download page for the current driver. Or shortcut here to x86_64 drivers (not sure what you’re using):

http://www.nvidia.com/object/linux_amd64_d…ay_archive.html

So 180.51 does not seem to have the same susceptibility to errors as the 185-series drivers. However, I think 185 was made specifically for CUDA 2.2 feature support (that's a guess), and I don't know which pieces of CUDA 2.2 break with 180.51. Basic tests seem to work OK, but I haven't tested exhaustively. I'd rather just see a 185-series fix.

Jeremy

That’s a good guess! cuFFT 2.2 doesn’t work with 180.51.

Thanks for the link.

Francesco

Francesco,

While the GPUs are in a clean state (after reloading the nvidia driver), did you see errors coming from GPU 0?

Usually hardware errors are single-bit flips, but in the errors caused by the driver all bits are wrong.

It would be interesting if you could post the error message.

-gshi

gshi

When it is in the clean state no errors occur (for a while). After cleaning the state once more, the tests ran for 10 hours last night without any error. Your hypothesis (that the errors are related to unusual termination of programs, e.g. Ctrl+C) seems to be right.

This is one of the error messages:

06/08/2009 17:26:51 cuda_memtest errors found in tesla[0]

Unreported errors since last email: 0

ERROR: NVRM version: NVIDIA UNIX x86_64 Kernel Module 185.18.14 Wed May 27 01:23:47 PDT 2009

ERROR: The unit serial number is 325008000392

ERROR: (test10[Memory stress test]) 188 errors found in block 0

ERROR: the last 10 error addresses are: 0xb782c1f0 0xb782c1f8 0xb2a6db0 0xb2a6db8 0x188f0970 0x188f0978 0x8bb35af0 0x8bb35af8 0x4c831970 0x4c831978

ERROR: 0th error, expected value=0xa27ff0169942cefa, current value=0x5d800fe966bd3105, diff=0xffffffffffffffff (sencond_read=0x5d800fe966bd3105, diff with expected value=0xffffffff

ERROR: 1th error, expected value=0xa27ff0169942cefa, current value=0x5d800fe966bd3105, diff=0xffffffffffffffff (sencond_read=0x5d800fe966bd3105, diff with expected value=0xffffffff

ERROR: 2th error, expected value=0xa27ff0169942cefa, current value=0x5d800fe966bd3105, diff=0xffffffffffffffff (sencond_read=0x5d800fe966bd3105, diff with expected value=0xffffffff

ERROR: 3th error, expected value=0xa27ff0169942cefa, current value=0x5d800fe966bd3105, diff=0xffffffffffffffff (sencond_read=0x5d800fe966bd3105, diff with expected value=0xffffffff

ERROR: 4th error, expected value=0x5d800fe966bd3105, current value=0xa27ff0169942cefa, diff=0xffffffffffffffff (sencond_read=0xa27ff0169942cefa, diff with expected value=0xffffffff

ERROR: 5th error, expected value=0x5d800fe966bd3105, current value=0xa27ff0169942cefa, diff=0xffffffffffffffff (sencond_read=0xa27ff0169942cefa, diff with expected value=0xffffffff

ERROR: 6th error, expected value=0x5d800fe966bd3105, current value=0xa27ff0169942cefa, diff=0xffffffffffffffff (sencond_read=0xa27ff0169942cefa, diff with expected value=0xffffffff

ERROR: 7th error, expected value=0x5d800fe966bd3105, current value=0xa27ff0169942cefa, diff=0xffffffffffffffff (sencond_read=0xa27ff0169942cefa, diff with expected value=0xffffffff

ERROR: 8th error, expected value=0x5d800fe966bd3105, current value=0xa27ff0169942cefa, diff=0xffffffffffffffff (sencond_read=0xa27ff0169942cefa, diff with expected value=0xffffffff

ERROR: 9th error, expected value=0x5d800fe966bd3105, current value=0xa27ff0169942cefa, diff=0xffffffffffffffff (sencond_read=0xa27ff0169942cefa, diff with expected value=0xffffffff

Thanks

Francesco

I have tried your test on 32-bit Linux with the ION architecture and I had two problems:

a) tests.cu line 226: I believe it assumes the architecture is 64-bit; indeed the compiler complains that the shift count is too large:

gputest@ion-32:~/cudagpumemtest$ make

nvcc -c -arch sm_13 -DSM_13 -O3 -I. -I/usr/local/cuda/include -I/usr/local/cuda/sdk/common/inc/ -I/home/gputest/NVIDIA_CUDA_SDK/common/inc -o cuda_memtest.o cuda_memtest.cu

nvcc -c -arch sm_13 -DSM_13 -O3 -I. -I/usr/local/cuda/include -I/usr/local/cuda/sdk/common/inc/ -I/home/gputest/NVIDIA_CUDA_SDK/common/inc -o tests.o tests.cu

tests.cu(226): warning: shift count is too large

./tests.cu(443): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(475): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(479): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(479): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(519): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(556): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(560): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(560): Advisory: Cannot tell what pointer points to, assuming global memory space

tests.cu(226): warning: shift count is too large

tests.cu: In function ‘long unsigned int get_random_num_long()’:

tests.cu:226: warning: left shift count >= width of type

nvcc -c -arch sm_13 -DSM_13 -O3 -I. -I/usr/local/cuda/include -I/usr/local/cuda/sdk/common/inc/ -I/home/gputest/NVIDIA_CUDA_SDK/common/inc -o misc.o misc.cpp

nvcc -o cuda_memtest cuda_memtest.o tests.o misc.o -L/usr/local/cuda/lib -lcuda -lcudart

b) Running the tests I got:

[06/09/2009 10:50:54][ion][0]:Warning: Gettin serial number failed

[06/09/2009 10:50:54][ion][0]:NVRM version: NVIDIA UNIX x86 Kernel Module 185.18.14 Wed May 27 02:23:13 PDT 2009

[06/09/2009 10:50:54][ion][0]:num_gpus=1

[06/09/2009 10:50:54][ion][0]:Device name=ION, global memory size=534446080

[06/09/2009 10:50:54][ion][0]:major=1, minor=1

[06/09/2009 10:50:54][ion][0]:Allocated 333 blocks

[06/09/2009 10:50:54][ion][0]:Test0 [Walking 1 bit]

[06/09/2009 10:50:54][ion][0]:ERROR: CUDA error: invalid device function , line 580

I have installed the CUDA 2.2.

Gaetano

ION does not support compute capability 1.3.

Yes, those error messages come from the CUDA 2.2 bug, not from actual hardware errors.

We have run the test on over 500 GPUs. The slowest case for finding a hardware error was 12 hours, i.e. the error appeared around once every 12 hours.

Since you have run it for 10+ hours without problems, I think the chance of detecting an error is small, although you may run it longer (24 hours) to see what happens.

-gshi

We did not test with 32 bit. We did run it with a Quadro FX 5600.

Try

%make cuda_memtest_sm10

and see if it works for you.

-gshi