DGEMM-based burn-in test

I have compiled this test using the latest CUDA 2.1 and reinstalled the driver just to be sure. (I complained elsewhere about instability; since no one else reported anything similar, I did a fresh reinstall, and 2.1 seems to work OK.)

For me, on my SUSE 10.3 Linux system with its single GTX 260, the test seems to fail immediately after the screen blinks a bit. I use "init 3" to get out of X before running the test. The usual calculations I run on this card seem to work OK.

So:

What does failing the test mean?

How can I “fix” my system so that it passes this test?

How worried should I be that it doesn't pass this burn-in test?

Thx,

B.C.

ADDED LATER: Ah! I have it! Not only does X have to be turned off, but the Linux framebuffer has to be turned off as well. Like Lady Galadriel, my GTX 260 passes the test, although I don't think my GTX 260 will pass into the West.

For reference:

[codebox]
./deviceQuery

There is 1 device supporting CUDA

Device 0: "GeForce GTX 260"
  Major revision number:                         1
  Minor revision number:                         3
  Total amount of global memory:                 939196416 bytes
  Number of multiprocessors:                     24
  Number of cores:                               192
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes

Test PASSED
[/codebox]

Has anyone tried this on Vista 64-bit?

Is this only supported on Linux, or does it work on Windows as well?

What's the normal or average running time with default settings?

Thanks

A really long time: a few hours.

I ported this to Windows at some point, but for the life of me I can't find where I put the source. Either way, you can probably just compile what I've put up here with a Windows port of pthreads and it should work fine.
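For context, the reason pthreads matters at all is presumably that the test drives each GPU from its own host thread (one thread per device, each bound with cudaSetDevice), which is also why the multi-device log below interleaves output from devices 0, 1, and 2. Here is a rough, hypothetical sketch of that structure; the function names and layout are my own guesses, not the actual source:

[codebox]
/* Hypothetical sketch of the per-device threading (not the actual source):
 * one pthread per CUDA device, each thread binds to its device with
 * cudaSetDevice() and then runs its sweep independently. A Windows build
 * would need a pthreads port (or native threads) for this part. */
#include <stdio.h>
#include <pthread.h>
#include <cuda_runtime.h>

static void *device_worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);                 /* bind this host thread to one GPU */
    /* run_sweep(dev); ... placeholder for the actual DGEMM sweep ... */
    printf("Device %d completed\n", dev);
    return NULL;
}

int main(void)
{
    int count = 0, d;
    cudaGetDeviceCount(&count);

    pthread_t threads[16];
    int ids[16];
    for (d = 0; d < count && d < 16; ++d) {
        ids[d] = d;
        pthread_create(&threads[d], NULL, device_worker, &ids[d]);
    }
    for (d = 0; d < count && d < 16; ++d)
        pthread_join(threads[d], NULL);
    return 0;
}
[/codebox]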

I really don't know how to compile this, but I found a Windows XP build of the test. I am currently running it with 3 Tesla C1060 cards; it started at 9:55 am and is still running now at 3:00 pm. This is on Windows XP Pro 64-bit. So far so good.

Thanks

Just a question: what does it mean if a device has failed after some iterations?
Edit:
This is how I am running it:
./bin/dgemmSweep 0 25

Here is an example result:

[codebox]
sizeof(void*) == 8
Testing device 0: Tesla C1060
Testing device 1: Tesla C1060
Testing device 2: Tesla C1060
iterSize = 13024
Performing 1 iterations with increment size 32 on device 0...
Device 0, iteration 0: i = 128
Device 0, iteration 0: i = 160
Device 0, iteration 0: i = 192
Device 0, iteration 0: i = 224
...
Device 0, iteration 0: i = 12928
Device 1, iteration 0: i = 12928
Device 2, iteration 0: i = 12960
Device 0, iteration 0: i = 12960
Device 1, iteration 0: i = 12960
Device 2, iteration 0: i = 12992
Device 0, iteration 0: i = 12992
Device 1, iteration 0: i = 12992
Finished iteration 0
Device 2 completed successfully
Finished iteration 0
Device 0 completed successfully
Finished iteration 0
Device 1 completed successfully
dgemmSweep PASSED.
[/codebox]

What does "i = 12992" stand for? And what is "iterSize"?
Thanks.

i is the dimension of the current DGEMM test.
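A rough, hypothetical sketch of what each step boils down to is below; this is not the actual dgemmSweep source, and the fill values, the iterSize placeholder, and the exact check are my own assumptions. Each step runs an i x i double-precision matrix multiply through CUBLAS and checks the result against the value it must mathematically have; a mismatch (or a CUBLAS error) is treated as a device failure.

[codebox]
/* Hypothetical sketch of one dgemmSweep step (not the actual source).
 * A is filled with 1.0 and B with 2.0, so every element of C = A*B
 * must equal exactly 2.0 * i; any other value means the device failed. */
#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>

static int dgemm_step(int i)
{
    size_t elems = (size_t)i * i;
    double *hA = (double *)malloc(elems * sizeof(double));
    double *hB = (double *)malloc(elems * sizeof(double));
    double *hC = (double *)malloc(elems * sizeof(double));
    double *dA, *dB, *dC;
    size_t k;
    int ok = 1;

    for (k = 0; k < elems; ++k) { hA[k] = 1.0; hB[k] = 2.0; hC[k] = 0.0; }

    cublasAlloc(elems, sizeof(double), (void **)&dA);
    cublasAlloc(elems, sizeof(double), (void **)&dB);
    cublasAlloc(elems, sizeof(double), (void **)&dC);
    cublasSetVector(elems, sizeof(double), hA, 1, dA, 1);
    cublasSetVector(elems, sizeof(double), hB, 1, dB, 1);
    cublasSetVector(elems, sizeof(double), hC, 1, dC, 1);

    /* C = 1.0 * A * B + 0.0 * C, all matrices i x i, column-major */
    cublasDgemm('n', 'n', i, i, i, 1.0, dA, i, dB, i, 0.0, dC, i);
    if (cublasGetError() != CUBLAS_STATUS_SUCCESS) ok = 0;

    cublasGetVector(elems, sizeof(double), dC, 1, hC, 1);
    for (k = 0; k < elems && ok; ++k)
        if (hC[k] != 2.0 * i) ok = 0;  /* exact compare is safe: small integer values */

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    free(hA); free(hB); free(hC);
    return ok;
}

int main(void)
{
    /* iterSize in the real test appears to depend on free device memory;
     * 4096 here is just a placeholder. */
    int i, iterSize = 4096;
    cublasInit();
    for (i = 128; i < iterSize; i += 32) {  /* the "i = ..." lines in the log */
        if (!dgemm_step(i)) { printf("Device FAILED at i = %d\n", i); return 1; }
    }
    cublasShutdown();
    printf("dgemmSweep PASSED.\n");
    return 0;
}
[/codebox]

From the log above, iterSize looks like the upper bound of that sweep, which the real test presumably derives from the amount of free device memory.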

Is there a test like this for the C2050/C2070? I'm very interested to see how long the test would take to complete on these cards. It would also serve as a good burn-in test! :)

Should just work?

No, it doesn't work with this card. I think it's because of the new architecture/design of the card. It gives errors when running the test. I was trying to run the test on a C1060 and a C2050.

What errors does it return?

Another burn-in test using FFTs forward and backward, checking for bit errors.
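A rough guess at what such a test boils down to is sketched below (not the posted tool's actual source; the transform length, iteration count, and comparison are my own assumptions). It repeats the same forward + inverse FFT and compares every round trip bit-for-bit against the first one; since the computation is deterministic, any difference points to a hardware (or driver) problem rather than rounding.

[codebox]
/* Hypothetical sketch of an FFT round-trip burn-in (not the posted tool's source):
 * run the same forward+inverse C2C FFT repeatedly and compare each result
 * bit-for-bit against the first one. The operation is deterministic, so any
 * difference indicates a hardware (or driver) error. Normalization doesn't
 * matter because we never compare against the original input. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>
#include <cufft.h>

int main(void)
{
    const int N = 1 << 20;          /* transform length: an assumption */
    const int ITERATIONS = 1000;    /* also an assumption */
    size_t bytes = N * sizeof(cufftComplex);
    int i, iter;

    cufftComplex *h_in  = (cufftComplex *)malloc(bytes);
    cufftComplex *h_ref = (cufftComplex *)malloc(bytes);
    cufftComplex *h_out = (cufftComplex *)malloc(bytes);
    for (i = 0; i < N; ++i) { h_in[i].x = (float)(i % 257); h_in[i].y = 0.0f; }

    cufftComplex *d_data;
    cudaMalloc((void **)&d_data, bytes);

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);

    for (iter = 0; iter < ITERATIONS; ++iter) {
        cudaMemcpy(d_data, h_in, bytes, cudaMemcpyHostToDevice);
        cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
        cufftExecC2C(plan, d_data, d_data, CUFFT_INVERSE);
        cudaMemcpy(iter == 0 ? h_ref : h_out, d_data, bytes, cudaMemcpyDeviceToHost);
        if (iter > 0 && memcmp(h_ref, h_out, bytes) != 0) {
            printf("Bit error detected on iteration %d\n", iter);
            return 1;
        }
    }
    printf("FFT burn-in PASSED (%d iterations)\n", ITERATIONS);

    cufftDestroy(plan);
    cudaFree(d_data);
    free(h_in); free(h_ref); free(h_out);
    return 0;
}
[/codebox]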

Haven't used this yet, but eventually I'll set it up alongside Tim's DGEMM test and a dumb little script I wrote that runs many of the SDK examples (most act as at least crude validity tests by printing SUCCESS). That should be a nice set to run on every new card, especially combined with one of the memtest tools.

A link to another thread that seemingly has a GPU-specific stability issue; if it ends up being hardware-related, that code would also be useful as a stability checker.

Maybe of interest to some of you:

Gentoo ebuild also available at:

(along with cuda_memtest)