DGEMM-based burn-in test

As part of my continuing effort to make more of my internal system-testing tools available to you guys, here’s a burn-in test I wrote for GT200-based systems. It performs DGEMMs on every capable device simultaneously until device memory is filled, and can repeat the sweep as many times as you want. It also checks the result of each individual DGEMM to help you track down general stability problems. Time to completion varies widely with the options, so take a look at them before running.

It requires CUDA 2.1, because it uses the ability to poll for an active watchdog timer (you can guess who the major proponent of this was). Like most of what I do, it’s Linux only for the moment, although I’m in the process of porting it to Windows. Compile with

nvcc -o dgemmSweep -arch sm_13 dgemmSweep.cu -lcublas
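
In case it’s useful, the watchdog poll is just a device-property query that CUDA 2.1 added. A minimal sketch (my illustration, not the tool’s actual code) of skipping watchdog-enabled or non-double-capable devices:

[codebox]// Sketch: enumerate devices, skip any without double precision (sm_13)
// or with an active watchdog timer. Requires CUDA 2.1 for
// cudaDeviceProp::kernelExecTimeoutEnabled.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        if (prop.major == 1 && prop.minor < 3)
            printf("Device %d (%s): no double precision, skipping\n", dev, prop.name);
        else if (prop.kernelExecTimeoutEnabled)
            printf("Device %d (%s): watchdog timer active, skipping\n", dev, prop.name);
        else
            printf("Device %d (%s): will be tested\n", dev, prop.name);
    }
    return 0;
}[/codebox]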

Feedback is welcome.

stealing this post again for a changelog:

1.0: initial release, Linux only.
1.1: still Linux only, fixed a stupid bug with launching threads on mixed-GPU machines.

Thanks for another useful tool.

Unfortunately, I am having trouble getting it compiled on a fresh Ubuntu 8.04/CUDA 2.1 install with GTX 280 hardware:

[codebox]hpc-user@gpu-hpc:~$ nvcc -o dgemmSweep -arch sm_13 dgemmSweep.cu -lcublas

dgemmSweep.cu(196): error: class "cudaDeviceProp" has no member "kernelExecTimeoutEnabled"

1 error detected in the compilation of "/tmp/tmpxft_000012fc_00000000-4_dgemmSweep.cpp1.ii".[/codebox]

Any hints on what the problem is?

Are you sure that’s 2.1 final and not 2.1 beta? It has to be 2.1 final.

Yes, it is 2.1 beta. Is 2.1 final available to the general public for Debian/Ubuntu? I would appreciate a link if possible.

Also, will these tools (dgemm burn-in, concBandwidthTest…) be making an appearance in the toolkit? I think they would be great additions.

Thanks

2.1 final is out (STILL probably not on the website, but check the CUDA announcements forum for a link). These will eventually be included somewhere; I’m just trying to figure out the right place for that.

Found the new driver (180.22) and toolkit. Compilation goes without issue now.

Thanks.

Tim, excellent tool!
I had thought about writing a burn-in test myself, but I am very lazy and never got around to it.

Do you think DGEMM has good cascading behavior, so that one small error in memory or compute gets magnified enough to be obvious?
I thought I might use an FFT as a basis, since a single-sample error on input is a delta function, which propagates to all frequencies of the FFT. (Hmm, but that wouldn’t magnify the magnitude of the error; ideally there would be some feedback that makes it grow.)
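
For what it’s worth, a DGEMM with constant-filled inputs is trivially self-checking. A minimal sketch of that style of verification (my own illustration, not necessarily what dgemmSweep actually does) using the CUBLAS API of that era:

[codebox]// Sketch: run C = A*B with A and B filled with 1.0, so every element
// of C should equal exactly N (sums of N ones are exact in double for
// any realistic N), then compare elementwise.
#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>

int main(void)
{
    const int N = 1024;  // matrix dimension, chosen here just for illustration
    double *hA = (double *)malloc((size_t)N * N * sizeof(double));
    double *hC = (double *)malloc((size_t)N * N * sizeof(double));
    for (int i = 0; i < N * N; ++i) hA[i] = 1.0;

    cublasInit();
    double *dA, *dB, *dC;
    cublasAlloc(N * N, sizeof(double), (void **)&dA);
    cublasAlloc(N * N, sizeof(double), (void **)&dB);
    cublasAlloc(N * N, sizeof(double), (void **)&dC);
    cublasSetMatrix(N, N, sizeof(double), hA, N, dA, N);
    cublasSetMatrix(N, N, sizeof(double), hA, N, dB, N);

    // C = 1.0 * A * B + 0.0 * C
    cublasDgemm('n', 'n', N, N, N, 1.0, dA, N, dB, N, 0.0, dC, N);
    cublasGetMatrix(N, N, sizeof(double), dC, N, hC, N);

    int errors = 0;
    for (int i = 0; i < N * N; ++i)
        if (hC[i] != (double)N) ++errors;
    printf("%s: %d bad elements\n", errors ? "FAIL" : "PASS", errors);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    free(hA); free(hC);
    return 0;
}[/codebox]

With all-ones inputs, a single corrupted element of A shifts an entire row of C away from N, so the error doesn’t have to grow to be caught; the elementwise compare sees it directly.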

Big extra points to anyone who whips up a script that iterates over various memory and shader clocks and uses this test to make a Shmoo plot of your card’s stability regions.

bump: an updated version that isn’t stupid about launching threads on mixed-GPU machines

Hi,

The new version does not see one of the three capable devices in the system (a third GTX 280):

[codebox]hpc-user@gpu-hpc:~$ ./dgemmSweep11 1

Testing device 1: GeForce GTX 280

Testing device 2: GeForce GTX 280

device = 0

device = 0

iterSize = 5952

Device 1: i = 128

[/codebox]

Are you using that card for display? If so, it’s not considered capable.

Does deviceQuery from the SDK see all 3? Which driver are you using?

a bit of clarification because I think I made netllama all worried:

dgemmSweep will not use cards that have a watchdog timer enabled because large DGEMMs will trigger the watchdog.

[codebox]hpc-user@gpu-hpc:~$ deviceQuery

There are 3 devices supporting CUDA

…[/codebox]

Driver is 180.22 for CUDA 2.1 on 64-bit Linux (Ubuntu 8.04.2).

No attached monitor.

The system runs HOOMD very well on all three GPUs.

Is it booting into gdm?

xdm

So it’s running X on one card, which therefore has a watchdog timer enabled, meaning that card won’t be used by this test.

This raises the question: is there a way to install CUDA on Linux without an X installation on the system? The NVIDIA driver installer insists on it by default. Is there a switch to override that? There is often no reason for a headless compute server to run X.

Change the default runlevel in /etc/inittab from 5 to 3. Then xdm won’t start. Since X also creates the /dev/nvidia* devices for you, you’ll have to use the script in the Release Notes to create these device files at boot time.
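
If you don’t have the release notes handy, the script is along these lines (paraphrased here, so check it against your copy):

[codebox]#!/bin/bash
# Load the NVIDIA kernel module and create the /dev/nvidia* device
# nodes at boot, when X isn't around to do it for you.
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
    # One node per NVIDIA controller found on the PCI bus...
    N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
    NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`
    N=`expr $N3D + $NVGA - 1`
    for i in `seq 0 $N`; do
        mknod -m 666 /dev/nvidia$i c 195 $i
    done
    # ...plus the control node.
    mknod -m 666 /dev/nvidiactl c 195 255
else
    exit 1
fi[/codebox]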

No, it doesn’t. I’ve installed the stock NVIDIA driver dozens of times on boxes without X installed.

It asks if you want to update some OpenGL library, and it doesn’t really matter whether you say yes or no; the library can be installed even if no one can use it.