GPU diagnostics: How to test a GPU

kuisma · August 13, 2008, 9:57am

Hi folks. I’m using a GTX 280 (factory OC), and every few hours of runtime my application freezes on the GPU. Either it is an application error or a GPU hardware malfunction. An application error would of course be the most likely, were it not that it has never hung on any of my other GPUs (8800 and 9800 series). “kill -9” does not terminate the frozen task, and it’s not possible to start any other program (e.g. deviceQuery) either. I need to reboot the system to get it working again.

Is there any way I can determine if this is due to a faulty GPU? Some diagnostics program or such?

Regards,

Kuisma

HannibalxLecter · August 14, 2008, 7:35am

Looks like no one has shot you a response yet, but I also need to find a method to test my GPU, a GeForce 8800GT.

The closest I have seen is GPU-Z, which only gives a lot of detailed information about your card but doesn’t run any diagnostics.

Please update us if you find anything suitable in the meantime. Good Luck!!!

MisterAnderson42 · August 14, 2008, 2:47pm

I haven’t heard of any GPU diagnostic programs either. All I can suggest is to run a known stable CUDA app and see if you get the same behavior.

I can suggest my own app, which is known to be stable for week+ runs on G80 and G92 (unfortunately, I don’t have a G200 to test…). In fact, I just posted a request for a benchmark: http://forums.nvidia.com/index.php?showtopic=74971&view=findpost&p=425624

If you modify the benchmark script with run(20e6) instead of run(50000), it will run for many hours.

edit: scratch that, theMarix nicely ran the benchmark and found that HOOMD isn’t stable on GTX 280 :(

theMarix · August 14, 2008, 3:08pm

That problem sounds somewhat familiar to me. I actually found a pretty basic application that crashes in the same way, but it doesn’t have to run for hours (link: The Official NVIDIA Forums | NVIDIA).

kuisma · August 15, 2008, 6:38am

I have used the GTX 280 as a dedicated CUDA GPU. When I run X on it simultaneously, I get the same behavior as you, ending with “the launch timed out and was terminated”. I find it quite interesting that the failing call is a cudaMemcpy() and not a kernel of my own.

theMatrix, what GPU are you using? I’m using the ASUS GTX 280 TOP.

Edit: Whether I get a lockup or a “launch timed out” seems more random than related to whether I run the X server or not. Right now my application froze, and so did the X server. Both host processes are at 100% CPU. I really think this is a faulty GPU, and I am getting it replaced.

– Kuisma

theMarix · August 15, 2008, 7:40am

Are you sure it is actually the memcpy failing? Did you do a cudaThreadSynchronize() before that? I was actually looking for a memcpy (and even free) bug for a long time until I thought of the fact that cublas calls might not be synchronized.

Actually it’s Marix, not Matrix :P The name was way before the movie and is from a completely different universe, but no offense taken ;). To answer the question, it’s a Gainward GTX 280. AFAIK all currently available cards are reference designs, so that shouldn’t make much of a difference.

If that’s actually the cause, then it looks like a lot of people are having faulty GPUs … :’(
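The synchronization point raised above can be sketched in a few lines. Kernel launches return before the kernel has finished, so a fault or timeout inside the kernel is often reported by the next blocking call, typically a cudaMemcpy(). Synchronizing and checking the error right after the launch attributes the failure to the right call. This is only a minimal sketch: the `scale` kernel, the `check` helper and `run_scale_test` are hypothetical stand-ins, not code from anyone in this thread, and it uses cudaDeviceSynchronize(), which replaced the now-deprecated cudaThreadSynchronize() in later CUDA releases.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: report which call failed, then stop.
static void check(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Stand-in for the real workload: doubles every element.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Runs the kernel on n ones and returns the first result element.
float run_scale_test(int n) {
    const size_t bytes = n * sizeof(float);
    float *host = static_cast<float *>(std::malloc(bytes));
    if (!host) std::exit(EXIT_FAILURE);
    for (int i = 0; i < n; ++i) host[i] = 1.0f;

    float *dev = nullptr;
    check(cudaMalloc(&dev, bytes), "cudaMalloc");
    check(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice), "cudaMemcpy H2D");

    scale<<<(n + 255) / 256, 256>>>(dev, n);
    // Catches launch problems such as an invalid configuration.
    check(cudaGetLastError(), "kernel launch");
    // Blocks until the kernel is done; faults and launch timeouts show up
    // here instead of at the cudaMemcpy() below.
    check(cudaDeviceSynchronize(), "kernel execution");

    check(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost), "cudaMemcpy D2H");
    check(cudaFree(dev), "cudaFree");
    float first = host[0];
    std::free(host);
    return first;
}

int main() {
    std::printf("host[0] = %.1f\n", run_scale_test(1 << 20));
    return 0;
}
```

If the program stops at "kernel execution" rather than at "cudaMemcpy D2H", the kernel (or a library call that launched one) is the more likely culprit. The extra synchronization costs performance, so it is best kept to debugging builds.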

MatrixWarrior · May 20, 2017, 4:29pm

Use MatrixWarrior. It is available free at

http://www.fmslib.com/mkt/MatrixWarrior.shtml

njuffa · May 20, 2017, 6:26pm

I think it would be beneficial to forum participants to add a short description of MatrixWarrior and how exactly it can be used for hardware validation. I realize the purpose of the post is likely to entice readers to play with MatrixWarrior, but a few introductory sentences would seem useful.

Best I can tell, MatrixWarrior is a demo application for FMSlib (an out-of-core solver with impressive performance) that can be used as a benchmarking and/or test app. It is easy to find out how to use it for benchmarking purposes, but I have yet to find information on how to use it for hardware validation.

How well can MatrixWarrior isolate GPU issues specifically? As an out-of-core solver it exercises GPUs, CPUs, system memory, PCIe interconnect, and mass storage, so I can see its utility for whole-system validation and/or stress testing (burn-in). However, if validation fails, any of the enumerated components could be the source of the problem. How would one pinpoint that the failing subsystem is GPUs in general and a specific GPU in particular?
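One way to narrow a failure down to a specific GPU is a test that stays on the device: write a pattern into device memory, verify it with a second kernel, and repeat per GPU. The sketch below illustrates that idea only. It says nothing about how MatrixWarrior works, and the names (`fill`, `verify`, `test_device`), the 256 MiB buffer and the 10 passes are arbitrary assumptions. It checks one fixed-size buffer, so it is far from a thorough memory test, and it cannot report a GPU that hangs outright.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns false (after printing the error) if a CUDA call failed.
static bool ok(cudaError_t err, const char *what) {
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Writes a position-dependent pattern using a grid-stride loop.
__global__ void fill(unsigned *buf, size_t n, unsigned seed) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        buf[i] = seed ^ (unsigned)i;
}

// Re-reads the pattern on the device and counts mismatches.
__global__ void verify(const unsigned *buf, size_t n, unsigned seed,
                       unsigned long long *errors) {
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x; i < n; i += stride)
        if (buf[i] != (seed ^ (unsigned)i)) atomicAdd(errors, 1ULL);
}

// Hypothetical helper: mismatch count on device `dev`, or -1 on a CUDA error.
long long test_device(int dev, size_t bytes, int passes) {
    unsigned *buf = nullptr;
    unsigned long long *d_errors = nullptr, h_errors = 0;
    const size_t n = bytes / sizeof(unsigned);
    if (!ok(cudaSetDevice(dev), "cudaSetDevice")) return -1;
    if (!ok(cudaMalloc(&buf, bytes), "cudaMalloc buffer")) return -1;
    if (!ok(cudaMalloc(&d_errors, sizeof(*d_errors)), "cudaMalloc counter")) {
        cudaFree(buf);
        return -1;
    }
    bool good = ok(cudaMemset(d_errors, 0, sizeof(*d_errors)), "cudaMemset");
    for (int p = 0; good && p < passes; ++p) {
        unsigned seed = 0x9E3779B9u * (p + 1);  // different pattern per pass
        fill<<<256, 256>>>(buf, n, seed);
        verify<<<256, 256>>>(buf, n, seed, d_errors);
        good = ok(cudaGetLastError(), "kernel launch") &&
               ok(cudaDeviceSynchronize(), "kernel execution");
    }
    good = good && ok(cudaMemcpy(&h_errors, d_errors, sizeof(h_errors),
                                 cudaMemcpyDeviceToHost), "cudaMemcpy");
    cudaFree(d_errors);
    cudaFree(buf);
    return good ? (long long)h_errors : -1;
}

int main() {
    int count = 0;
    if (!ok(cudaGetDeviceCount(&count), "cudaGetDeviceCount")) return 1;
    int failed = 0;
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (!ok(cudaGetDeviceProperties(&prop, dev), "cudaGetDeviceProperties")) return 1;
        long long bad = test_device(dev, (size_t)256 << 20, 10);
        std::printf("GPU %d (%s): %s\n", dev, prop.name, bad == 0 ? "pass" : "FAIL");
        if (bad != 0) ++failed;
    }
    return failed ? 1 : 0;
}
```

Because the pattern is generated and checked on the device, only a single 8-byte counter crosses PCIe, which keeps host memory and storage out of the picture. A failure on one GPU but not on its neighbors in the same machine points at that card rather than at the rest of the system.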

rer · May 24, 2017, 3:59pm

Posting here because it causes less clutter than starting a new topic…

I’m looking for a tool to do a quick test of a GPU’s health. Found references to healthmon and tried to find out more. Discovered that it is part of GPU Deployment Kit. Found a page with links for downloads, but it shows the most recent is February 2016. The same page says it is now part of CUDA Toolkit 8, which I already have installed. Looked all through the toolkit and its documentation, but couldn’t find anything about healthmon or GPU Deployment Kit. Found documentation for healthmon that tells how to use it - “once unpackaged.” Also says it’s deprecated - we should be using Nvidia Validation Suite instead. Found a site with lots of info on nvvs, but nothing about where to get it or how to install it.

Can someone help?

vacaloca · May 24, 2017, 5:24pm

For Linux environments, I would suggest gpu-burn.

There is no specific Windows equivalent for the above that I am aware of, but either FurMark or any of UNIGINE’s benchmarks are suitable.
