GPU diagnostics: How to test a GPU

Hi folks. I’m using a GTX 280 (factory OC), and every few hours of runtime my application freezes on the GPU. Either it is an application error or a GPU hardware malfunction. An application error would of course be the most likely, if it weren’t for the fact that it has never hung on any of my other GPUs (8800- and 9800-series). “kill -9” does not terminate the frozen task, and it’s not possible to start any other program (e.g. deviceQuery) either. I need to reboot the system to get it working again.

Is there any way I can determine if this is due to a faulty GPU? Some diagnostics program or such?

Regards,

– Kuisma

Looks like no one has shot you a response yet, but I also need a way to test my GPU, a GeForce 8800GT.

The closest I have seen is GPU-Z, which gives a lot of detailed information about your card but doesn’t run any diagnostics.

Please update us if you find anything suitable in the meantime. Good Luck!!!

I haven’t heard of any GPU diagnostic programs either. All I can suggest is to run a known stable CUDA app and see if you get the same behavior.

I can suggest my own app which is known to be stable for week+ runs on G80 and G92 (unfortunately, I don’t have a G200 to test…). In fact, I just posted a request for a benchmark: http://forums.nvidia.com/index.php?showtop…ndpost&p=425624

If you modify the benchmark script with run(20e6) instead of run(50000), it will run for many hours.

edit: scratch that, theMarix nicely ran the benchmark and found that HOOMD isn’t stable on GTX 280 :(
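
The “run a known stable CUDA app and see if you get the same behavior” suggestion can be automated with a small watchdog. What follows is only a minimal sketch in Python, not an established diagnostic: the function name soak_test, its parameters, and the placeholder command are all made up for illustration, and you would substitute a CUDA program you trust. It runs the command repeatedly, stops at the first non-zero exit or timeout, and separately reports the case described earlier in the thread where the process does not go away even after SIGKILL.

```python
import subprocess
import sys
import time


def soak_test(cmd, runs, timeout_s, kill_grace_s=5.0):
    """Run cmd repeatedly; stop at the first failure or hang.

    Returns a (status, clean_runs) tuple, where status is one of
    "ok", "failed", "hung", or "unkillable". All names are illustrative.
    """
    for i in range(runs):
        proc = subprocess.Popen(cmd)
        try:
            rc = proc.wait(timeout=timeout_s)
        except subprocess.TimeoutExpired:
            proc.kill()  # SIGKILL on POSIX, i.e. the same as "kill -9"
            try:
                proc.wait(timeout=kill_grace_s)
            except subprocess.TimeoutExpired:
                # Still alive after SIGKILL: the symptom reported in this thread.
                return "unkillable", i
            return "hung", i
        if rc != 0:
            return "failed", i
    return "ok", runs


if __name__ == "__main__":
    # Placeholder command (assumption): replace with a CUDA program you trust.
    status, clean = soak_test([sys.executable, "-c", "pass"], runs=3, timeout_s=30)
    print(f"{time.strftime('%H:%M:%S')} status={status} after {clean} clean runs")
```

A clean result on its own says little. What is suggestive of a hardware problem is a failure on one card that does not reproduce with the same binary on another.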

That problem sounds somewhat familiar to me. I actually found a pretty basic application that crashes in the same way, but it doesn’t have to run for hours: http://forums.nvidia.com/index.php?showtopic=74853

I have used the GTX 280 as a dedicated CUDA GPU. When I run X on it simultaneously, I get the same behavior as you, ending with a “the launch timed out and was terminated”. I find it quite interesting that the failing launch is a cudaMemcpy() and not a kernel of my own.

theMatrix, what GPU are you using? I’m using the ASUS GTX 280 TOP.

Edit: Whether I get a lockup or a “launch timed out” seems more random than related to whether I run the X server or not. Right now my application froze, and so did the X server. Both host processes are at 100% CPU. I really think this is a faulty GPU, and I am getting it replaced.

– Kuisma

Are you sure it is actually the memcpy failing? Did you do a cudaThreadSynchronize() before that? I was actually looking for a memcpy (and even free) bug for a long time until I thought of the fact that cublas calls might not be synchronized.

Actually it’s Marix, not Matrix :P The name was way before the movie and is from a completely different universe, but no offense taken ;). To answer the question, it’s a Gainward GTX 280. AFAIK all currently available cards are reference designs, so that shouldn’t make much of a difference.

If that’s actually the cause, then it looks like a lot of people are having faulty GPUs … :'(

Use MatrixWarrior. It is available free at

http://www.fmslib.com/mkt/MatrixWarrior.shtml

I think it would be beneficial to forum participants to add a short description of MatrixWarrior and how exactly it can be used for hardware validation. I realize the purpose of the post is likely to entice readers to play with MatrixWarrior, but a few introductory sentences would seem useful.

As best I can tell, MatrixWarrior is a demo application for FMSlib (an out-of-core solver with impressive performance) that can be used as a benchmarking and/or test app. It is easy to find out how to use it for benchmarking purposes, but I have yet to find information on how to use it for hardware validation.

How well can MatrixWarrior isolate GPU issues specifically? As an out-of-core solver it exercises GPUs, CPUs, system memory, PCIe interconnect, and mass storage, so I can see its utility for whole-system validation and/or stress testing (burn-in). However, if validation fails, any of the enumerated components could be the source of the problem. How would one pinpoint that the failing subsystem is the GPUs in general and a specific GPU in particular?
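
One partial way to narrow things down to a specific GPU, offered as a sketch under assumptions rather than a validated procedure: run the same test once per device, with the CUDA_VISIBLE_DEVICES environment variable restricting the process to a single GPU each time. This assumes a CUDA version that honors CUDA_VISIBLE_DEVICES and a test command (hypothetical here) that exits non-zero on failure. The name run_per_gpu and its parameters are illustrative only.

```python
import os
import subprocess
import sys


def run_per_gpu(cmd, device_ids, timeout_s):
    """Run cmd once per device ID, exposing only that device to CUDA.

    Returns {device_id: "ok" | "failed" | "hung"}. Names are illustrative.
    """
    results = {}
    for dev in device_ids:
        # Copy the current environment and restrict CUDA to one device.
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(dev))
        try:
            # Caveat: on timeout, subprocess.run kills the child and then
            # waits for it, so a process that ignores SIGKILL blocks here.
            rc = subprocess.run(cmd, env=env, timeout=timeout_s).returncode
            results[dev] = "ok" if rc == 0 else "failed"
        except subprocess.TimeoutExpired:
            results[dev] = "hung"
    return results
```

This can separate one GPU from another, but by itself it cannot distinguish a bad GPU from a bad PCIe slot or power feed. Swapping the suspect card into a different slot or machine and repeating the run would be the next step.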

Posting here because it causes less clutter than starting a new topic…

I’m looking for a tool to do a quick test of a GPU’s health. Found references to healthmon and tried to find out more. Discovered that it is part of GPU Deployment Kit. Found a page with links for downloads, but it shows the most recent is February 2016. The same page says it is now part of CUDA Toolkit 8, which I already have installed. Looked all through the toolkit and its documentation, but couldn’t find anything about healthmon or GPU Deployment Kit. Found documentation for healthmon that tells how to use it - “once unpackaged.” Also says it’s deprecated - we should be using Nvidia Validation Suite instead. Found a site with lots of info on nvvs, but nothing about where to get it or how to install it.

Can someone help?

For Linux environments, I would suggest gpu-burn.

There is no specific Windows equivalent for the above that I am aware of, but either FurMark or any of UNIGINE’s benchmarks are suitable.