DGEMM-based burn-in test

As part of my continuing effort to make more of my internal system-testing tools available to you guys, here’s a burn-in test I wrote for GT200-based systems. It performs DGEMMs on every capable device simultaneously until device memory is filled, and can repeat the sweep as many times as you want. It also checks the result of each individual DGEMM to help you track down general stability problems. Time to completion varies widely with the options, so take a look at them before running.

It requires CUDA 2.1, because it uses the ability to poll for an active watchdog timer (you can guess who the major proponent of this was). Like most of what I do, it’s Linux only for the moment, although I’m in the process of porting it to Windows. Compile with

nvcc -o dgemmSweep -arch sm_13 dgemmSweep.cu -lcublas
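
In case it’s useful, the watchdog poll is just a device-property query that CUDA 2.1 added. A minimal sketch (my illustration, not the tool’s actual code) of skipping watchdog-enabled or non-double-capable devices:

[codebox]// Sketch: enumerate devices, skip any without double precision (sm_13)
// or with an active watchdog timer. Requires CUDA 2.1 for
// cudaDeviceProp::kernelExecTimeoutEnabled.
#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        if (prop.major == 1 && prop.minor < 3)
            printf("Device %d (%s): no double precision, skipping\n", dev, prop.name);
        else if (prop.kernelExecTimeoutEnabled)
            printf("Device %d (%s): watchdog timer active, skipping\n", dev, prop.name);
        else
            printf("Device %d (%s): will be tested\n", dev, prop.name);
    }
    return 0;
}[/codebox]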

Feedback is welcome.

stealing this post again for a changelog:

1.0: initial release, Linux only.
1.1: still Linux only, fixed a stupid bug with launching threads on mixed-GPU machines.

Thanks for another useful tool.

Unfortunately, I am having trouble getting it compiled on a fresh Ubuntu 8.04/CUDA 2.1 install with GTX 280 hardware:

[codebox]hpc-user@gpu-hpc:~$ nvcc -o dgemmSweep -arch sm_13 dgemmSweep.cu -lcublas

dgemmSweep.cu(196): error: class "cudaDeviceProp" has no member "kernelExecTimeoutEnabled"

1 error detected in the compilation of "/tmp/tmpxft_000012fc_00000000-4_dgemmSweep.cpp1.ii".[/codebox]

Any hints on what the problem is?

Are you sure that’s 2.1 final and not 2.1 beta? It has to be 2.1 final.

Yes, it is 2.1 beta. Is 2.1 final available to the general public for Debian/Ubuntu? I would appreciate a link if possible.

Also, will these tools (dgemm burn-in, concBandwidthTest…) be making an appearance in the toolkit? I think they would be great additions.

Thanks

2.1 final is out (STILL probably not on the website, but check the CUDA announcements forum for a link). These will eventually be included somewhere; I’m just trying to figure out the right place for that.

Found the new driver (180.22) and toolkit. Compilation goes without issue now.

Thanks.

Tim, excellent tool!
I had thought about writing a burn-in test myself, but I am very lazy and never got around to it.

Do you think DGEMM has good cascading behavior, so that one small error in memory or compute gets magnified enough to be obvious?
I thought I might use an FFT as a basis, since a single-sample error on input is a delta function, which propagates to all frequencies of the FFT. (Hmm, but that wouldn’t magnify the magnitude of the error; ideally there would be some feedback that makes it grow.)
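
For what it’s worth, a DGEMM with constant-filled inputs is trivially self-checking. A minimal sketch of that style of verification (my own illustration, not necessarily what dgemmSweep actually does) using the CUBLAS API of that era:

[codebox]// Sketch: run C = A*B with A and B filled with 1.0, so every element
// of C should equal exactly N (sums of N ones are exact in double for
// any realistic N), then compare elementwise.
#include <stdio.h>
#include <stdlib.h>
#include <cublas.h>

int main(void)
{
    const int N = 1024;  // matrix dimension, chosen here just for illustration
    double *hA = (double *)malloc((size_t)N * N * sizeof(double));
    double *hC = (double *)malloc((size_t)N * N * sizeof(double));
    for (int i = 0; i < N * N; ++i) hA[i] = 1.0;

    cublasInit();
    double *dA, *dB, *dC;
    cublasAlloc(N * N, sizeof(double), (void **)&dA);
    cublasAlloc(N * N, sizeof(double), (void **)&dB);
    cublasAlloc(N * N, sizeof(double), (void **)&dC);
    cublasSetMatrix(N, N, sizeof(double), hA, N, dA, N);
    cublasSetMatrix(N, N, sizeof(double), hA, N, dB, N);

    // C = 1.0 * A * B + 0.0 * C
    cublasDgemm('n', 'n', N, N, N, 1.0, dA, N, dB, N, 0.0, dC, N);
    cublasGetMatrix(N, N, sizeof(double), dC, N, hC, N);

    int errors = 0;
    for (int i = 0; i < N * N; ++i)
        if (hC[i] != (double)N) ++errors;
    printf("%s: %d bad elements\n", errors ? "FAIL" : "PASS", errors);

    cublasFree(dA); cublasFree(dB); cublasFree(dC);
    cublasShutdown();
    free(hA); free(hC);
    return 0;
}[/codebox]

With all-ones inputs, a single corrupted element of A shifts an entire row of C away from N, so the error doesn’t have to grow to be caught; the elementwise compare sees it directly.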

Big extra points to anyone who whips up a script that iterates over various memory and shader clocks and uses this test to make a Shmoo plot of your card’s stability regions.

bump: an updated version that isn’t stupid about launching threads on mixed-GPU machines

Hi,

The new version does not see one of the three capable devices in the system (a third GTX 280):

[codebox]hpc-user@gpu-hpc:~$ ./dgemmSweep11 1

Testing device 1: GeForce GTX 280

Testing device 2: GeForce GTX 280

device = 0

device = 0

iterSize = 5952

Device 1: i = 128

[/codebox]

Are you using that card for display? If so, it’s not considered capable.

Does deviceQuery from the SDK see all 3? Which driver are you using?

a bit of clarification because I think I made netllama all worried:

dgemmSweep will not use cards that have a watchdog timer enabled because large DGEMMs will trigger the watchdog.

[codebox]hpc-user@gpu-hpc:~$ deviceQuery

There are 3 devices supporting CUDA

…[/codebox]

Driver is 180.22 for CUDA 2.1 on 64-bit Linux (Ubuntu 8.04.2).

No attached monitor.

The system runs HOOMD very well on all three GPUs.

Is it booting into gdm?

xdm

So it’s running X on one card, which therefore has a watchdog timer enabled, meaning that card won’t be used by this test.

This raises the question: is there a way to install CUDA on Linux without an X installation on the system? The NVIDIA driver installer insists on it by default. Is there a switch to override that? There is often no reason for a headless compute server to run X.

Change the default runlevel in /etc/inittab from 5 to 3. Then xdm won’t start. Since X also creates the /dev/nvidia* devices for you, you’ll have to use the script in the Release Notes to create these device files at boot time.
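
If you don’t have the release notes handy, the script is along these lines (paraphrased here, so check it against your copy):

[codebox]#!/bin/bash
# Load the NVIDIA kernel module and create the /dev/nvidia* device
# nodes at boot, when X isn't around to do it for you.
/sbin/modprobe nvidia
if [ "$?" -eq 0 ]; then
    # One node per NVIDIA controller found on the PCI bus...
    N3D=`/sbin/lspci | grep -i NVIDIA | grep "3D controller" | wc -l`
    NVGA=`/sbin/lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l`
    N=`expr $N3D + $NVGA - 1`
    for i in `seq 0 $N`; do
        mknod -m 666 /dev/nvidia$i c 195 $i
    done
    # ...plus the control node.
    mknod -m 666 /dev/nvidiactl c 195 255
else
    exit 1
fi[/codebox]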

No, it doesn’t. I’ve installed the stock NVIDIA driver dozens of times on boxes without X installed.

It asks if you want to update some OpenGL library, and it doesn’t really matter whether you say yes or no; the library can be installed even if no one can use it.