S1070 device 0 broken (test case provided)

— RESOLVED! It is not a hardware problem; it is caused by the driver —

I have found some unexpected behaviour while working on an S1070.

I have reduced my program to a minimal test case where:

- an FFT is performed and all the resources it used are released

- a copy host -> device -> device -> host is performed

- the initial vector on the host is compared with the copied vector

The comparison does not always return the expected results, and it seems that the wrong memory segments are being compared.
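In outline, the reduced test does something like the following. This is only a sketch with hypothetical sizes and names (and it needs a CUDA-capable GPU to run); the full version is in the attached CUDATestCaseFFT.cpp:

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>
#include <cufft.h>

#define N 1024  // hypothetical vector length

int main() {
    cufftComplex h_in[N], h_out[N];
    for (int i = 0; i < N; ++i) { h_in[i].x = (float)i; h_in[i].y = 0.0f; }

    // 1) Run an FFT and release all the resources it used.
    cufftComplex* d_fft;
    cudaMalloc((void**)&d_fft, N * sizeof(cufftComplex));
    cudaMemcpy(d_fft, h_in, N * sizeof(cufftComplex), cudaMemcpyHostToDevice);
    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_fft, d_fft, CUFFT_FORWARD);
    cufftDestroy(plan);
    cudaFree(d_fft);

    // 2) Copy host -> device -> device -> host.
    cufftComplex *d_a, *d_b;
    cudaMalloc((void**)&d_a, N * sizeof(cufftComplex));
    cudaMalloc((void**)&d_b, N * sizeof(cufftComplex));
    cudaMemcpy(d_a, h_in, N * sizeof(cufftComplex), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, d_a, N * sizeof(cufftComplex), cudaMemcpyDeviceToDevice);
    cudaMemcpy(h_out, d_b, N * sizeof(cufftComplex), cudaMemcpyDeviceToHost);

    // 3) Compare the round-tripped vector with the original.
    if (memcmp(h_in, h_out, N * sizeof(cufftComplex)) != 0)
        printf("MISMATCH: round-trip copy corrupted the data\n");
    else
        printf("OK\n");
    cudaFree(d_a); cudaFree(d_b);
    return 0;
}
```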

This unexpected behaviour shows up only on device 0 of the S1070 and not on the remaining devices. The same test has been run on a GeForce 8800 and no unexpected behaviour occurred, so it seems to be a problem with that device only.

The test is minimal: for example, if the FFT is removed the problem disappears.

Although it is not shown in the test, it seems to me that the problem persists if another kernel is executed instead of the FFT, so it should not be a problem with cuFFT.

I’ve run the memtest provided at https://simtk.org/home/memtest and no errors occurred.

I’ve attached the test case showing the errors. Does anybody have an idea of what is going on on my machine?

Thanks,

Francesco

CUDATestCaseFFT.cpp (5.35 KB)

fbasile,

Maybe you can try our memory test:
https://sourceforge.net/projects/cudagpumemtest/

We have been able to flush out errors that memtestG80 cannot detect.
You can run test 10 continuously, since it is the most intensive one:
./cuda_memtest --disable_all --enable_test 10 -exit_on_error

If you are using CUDA 2.2, be aware that it might trigger a bug that we reported earlier:
http://forums.nvidia.com/index.php?showtopic=97379

Good luck
-gshi

gshi,
thanks for the memtest and for the advice. You were right: your memtest fails where the other one succeeds.
It seems the problem lies in the driver: we are now using 185.18.14. After installing it the problem apparently disappeared, but later it appeared again (your test started to fail). The error disappears for a while if we remove the driver with rmmod and load it again with insmod.
By the way, after using your test we encountered the error on devices 2, 3 and 4 too.
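For reference, the reload sequence is just the following (requires root, and no X server or CUDA application may be using the GPUs; the module path is hypothetical and depends on the install):

```shell
# Return the GPUs to a clean state by unloading and reloading
# the NVIDIA kernel module.
rmmod nvidia
insmod /lib/modules/$(uname -r)/kernel/drivers/video/nvidia.ko
```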

Could you tell me where to find the driver 180.51 you refer to in http://forums.nvidia.com/index.php?showtopic=97379 ?

Thanks,
Francesco

You can find it from the “Display Driver Archive Link” that is on the upper left of the download page for the current driver. Or shortcut here to x86_64 drivers (not sure what you’re using):

http://www.nvidia.com/object/linux_amd64_d…ay_archive.html

So 180.51 does not seem to have the same susceptibility to errors as the 185-series drivers. However, I think 185 was made specifically for CUDA 2.2 feature support (that's a guess), and I don't know which pieces of CUDA 2.2 break with 180.51. Basic tests seem to work OK, but I haven't tested exhaustively. I'd rather just see a 185-series fix.

Jeremy

That’s a good guess! cuFFT 2.2 doesn’t work with 180.51.

Thanks for the link.

Francesco

Francesco,

While the GPUs are in a clean state (after reloading the nvidia driver), did you see errors coming from GPU 0?

Usually hardware errors are single-bit flips, but in the errors caused by the driver all bits are wrong.

It would be interesting if you could post the error message.

-gshi

gshi

When it is in the clean state no errors occur (for a while). After cleaning the state once more, the tests ran for 10 hours last night without any error. Your hypothesis (that the errors are related to unusual termination of programs, e.g. Ctrl+C) seems to be right.

This is one of the error messages:

06/08/2009 17:26:51 cuda_memtest errors found in tesla[0]

Unreported errors since last email: 0

ERROR: NVRM version: NVIDIA UNIX x86_64 Kernel Module 185.18.14 Wed May 27 01:23:47 PDT 2009

ERROR: The unit serial number is 325008000392

ERROR: (test10[Memory stress test]) 188 errors found in block 0

ERROR: the last 10 error addresses are: 0xb782c1f0 0xb782c1f8 0xb2a6db0 0xb2a6db8 0x188f0970 0x188f0978 0x8bb35af0 0x8bb35af8 0x4c831970 0x4c831978

ERROR: 0th error, expected value=0xa27ff0169942cefa, current value=0x5d800fe966bd3105, diff=0xffffffffffffffff (sencond_read=0x5d800fe966bd3105, diff with expected value=0xffffffff

ERROR: 1th error, expected value=0xa27ff0169942cefa, current value=0x5d800fe966bd3105, diff=0xffffffffffffffff (sencond_read=0x5d800fe966bd3105, diff with expected value=0xffffffff

ERROR: 2th error, expected value=0xa27ff0169942cefa, current value=0x5d800fe966bd3105, diff=0xffffffffffffffff (sencond_read=0x5d800fe966bd3105, diff with expected value=0xffffffff

ERROR: 3th error, expected value=0xa27ff0169942cefa, current value=0x5d800fe966bd3105, diff=0xffffffffffffffff (sencond_read=0x5d800fe966bd3105, diff with expected value=0xffffffff

ERROR: 4th error, expected value=0x5d800fe966bd3105, current value=0xa27ff0169942cefa, diff=0xffffffffffffffff (sencond_read=0xa27ff0169942cefa, diff with expected value=0xffffffff

ERROR: 5th error, expected value=0x5d800fe966bd3105, current value=0xa27ff0169942cefa, diff=0xffffffffffffffff (sencond_read=0xa27ff0169942cefa, diff with expected value=0xffffffff

ERROR: 6th error, expected value=0x5d800fe966bd3105, current value=0xa27ff0169942cefa, diff=0xffffffffffffffff (sencond_read=0xa27ff0169942cefa, diff with expected value=0xffffffff

ERROR: 7th error, expected value=0x5d800fe966bd3105, current value=0xa27ff0169942cefa, diff=0xffffffffffffffff (sencond_read=0xa27ff0169942cefa, diff with expected value=0xffffffff

ERROR: 8th error, expected value=0x5d800fe966bd3105, current value=0xa27ff0169942cefa, diff=0xffffffffffffffff (sencond_read=0xa27ff0169942cefa, diff with expected value=0xffffffff

ERROR: 9th error, expected value=0x5d800fe966bd3105, current value=0xa27ff0169942cefa, diff=0xffffffffffffffff (sencond_read=0xa27ff0169942cefa, diff with expected value=0xffffffff

Thanks

Francesco

I have tried your test on 32-bit Linux with the ION architecture and I had two problems:

a) tests.cu line 226: I believe it assumes the architecture is 64-bit; indeed the compiler complains that the shift count is too large:

gputest@ion-32:~/cudagpumemtest$ make

nvcc -c -arch sm_13 -DSM_13 -O3 -I. -I/usr/local/cuda/include -I/usr/local/cuda/sdk/common/inc/ -I/home/gputest/NVIDIA_CUDA_SDK/common/inc -o cuda_memtest.o cuda_memtest.cu

nvcc -c -arch sm_13 -DSM_13 -O3 -I. -I/usr/local/cuda/include -I/usr/local/cuda/sdk/common/inc/ -I/home/gputest/NVIDIA_CUDA_SDK/common/inc -o tests.o tests.cu

tests.cu(226): warning: shift count is too large

./tests.cu(443): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(475): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(479): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(479): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(519): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(556): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(560): Advisory: Cannot tell what pointer points to, assuming global memory space

./tests.cu(560): Advisory: Cannot tell what pointer points to, assuming global memory space

tests.cu(226): warning: shift count is too large

tests.cu: In function ‘long unsigned int get_random_num_long()’:

tests.cu:226: warning: left shift count >= width of type

nvcc -c -arch sm_13 -DSM_13 -O3 -I. -I/usr/local/cuda/include -I/usr/local/cuda/sdk/common/inc/ -I/home/gputest/NVIDIA_CUDA_SDK/common/inc -o misc.o misc.cpp

nvcc -o cuda_memtest cuda_memtest.o tests.o misc.o -L/usr/local/cuda/lib -lcuda -lcudart

b) Running the tests I got:

[06/09/2009 10:50:54][ion][0]:Warning: Gettin serial number failed

[06/09/2009 10:50:54][ion][0]:NVRM version: NVIDIA UNIX x86 Kernel Module 185.18.14 Wed May 27 02:23:13 PDT 2009

[06/09/2009 10:50:54][ion][0]:num_gpus=1

[06/09/2009 10:50:54][ion][0]:Device name=ION, global memory size=534446080

[06/09/2009 10:50:54][ion][0]:major=1, minor=1

[06/09/2009 10:50:54][ion][0]:Allocated 333 blocks

[06/09/2009 10:50:54][ion][0]:Test0 [Walking 1 bit]

[06/09/2009 10:50:54][ion][0]:ERROR: CUDA error: invalid device function , line 580

I have installed the CUDA 2.2.

Gaetano

ION does not support compute capability 1.3.

Yes, those error messages come from the CUDA 2.2 bug, not from actual hardware errors.

We have run the test on over 500 GPUs. The slowest case for finding a hardware error was 12 hours, i.e. the error appeared around once every 12 hours.

Since you have run it for 10+ hours without problems, I think the chance of detecting an error is small, although you may run it longer (24 hours) to see what happens.

-gshi

We did not test with 32 bit. We did run it with a Quadro FX 5600.

Try

%make cuda_memtest_sm10

and see if it works for you.

-gshi