Hi folks. I’m using a GTX 280 (factory OC), and every few hours of runtime my application freezes on the GPU. This is either an application error or a GPU hardware malfunction. An application error would of course be the most likely explanation, except that it has never hung on any of my other GPUs (8800 and 9800 series). “kill -9” does not terminate the frozen task, and it’s not possible to start any other program (e.g. deviceQuery) either. I need to reboot the system to get it working again.
Is there any way I can determine if this is due to a faulty GPU? Some diagnostics program or such?
I have used the GTX 280 as a dedicated CUDA GPU. When I run X on it simultaneously, I get the same behavior as you, ending with “the launch timed out and was terminated”. I find it quite interesting that the failing launch is a cudaMemcpy() and not a kernel of my own.
theMatrix, what GPU are you using? I’m using the ASUS GTX 280 TOP.
Edit: Whether I get a lockup or a “launch timed out” seems to be random rather than related to whether I run the X server. Right now my application froze, and so did the X server, with both host processes at 100% CPU. I really think this is a faulty GPU, and I am getting it replaced.
Are you sure it is actually the memcpy failing? Did you do a cudaThreadSynchronize() before that? I was actually looking for a memcpy (and even free) bug for a long time until I thought of the fact that cublas calls might not be synchronized.
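To illustrate the point about synchronization, here is a minimal sketch of how a failure might be attributed to the right step. The kernel `scale` and the helpers `check` and `scale_and_copy_back` are hypothetical names made up for this example, not from anyone's application. It uses cudaDeviceSynchronize(), which is the current name for the older cudaThreadSynchronize() mentioned above; on an old toolkit you would call the latter instead.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for any asynchronous GPU work
// (a kernel of your own, or a library call that returns early).
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: print which step failed and return false on error.
static bool check(cudaError_t err, const char *step)
{
    if (err != cudaSuccess) {
        fprintf(stderr, "%s failed: %s\n", step, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// Copies host data to the GPU, scales it, and copies it back.
// Returns true only if every step succeeded.
bool scale_and_copy_back(float *host, int n, float factor)
{
    float *dev = NULL;
    size_t bytes = n * sizeof(float);
    if (!check(cudaMalloc((void **)&dev, bytes), "cudaMalloc")) return false;

    bool ok = check(cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice),
                    "copy to device");
    if (ok) {
        scale<<<(n + 255) / 256, 256>>>(dev, n, factor);
        ok = check(cudaGetLastError(), "kernel launch");
    }
    // Wait for the kernel here, so that an execution error is reported
    // against the kernel rather than against the memcpy that follows.
    if (ok) ok = check(cudaDeviceSynchronize(), "kernel execution");
    if (ok) ok = check(cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost),
                       "copy to host");
    cudaFree(dev);
    return ok;
}
```

Without the explicit synchronize, the blocking cudaMemcpy() is the first call that waits for the kernel, so a kernel timeout or fault would typically show up as a memcpy error. This does not rule out a hardware problem; it only narrows down which call is really failing.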
Actually it’s Marix, not Matrix :P The name was way before the movie and is from a completely different universe, but no offense taken ;). To answer the question, it’s a Gainward GTX 280. AFAIK all currently available cards are reference designs so that shouldn’t make much of a difference.
If that’s actually the cause, then it looks like a lot of people are having faulty GPUs … :'(
I think it would be beneficial to forum participants to add a short description of MatrixWarrior and how exactly it can be used for hardware validation. I realize the purpose of the post is likely to entice readers to play with MatrixWarrior, but a few introductory sentences would seem useful.
As best I can tell, MatrixWarrior is a demo application for FMSlib (an out-of-core solver with impressive performance) that can be used as a benchmarking and/or test app. It is easy to find out how to use it for benchmarking purposes, but I have yet to find information on how to use it for hardware validation.
How well can MatrixWarrior isolate GPU issues specifically? As an out-of-core solver it exercises GPUs, CPUs, system memory, PCIe interconnect, and mass storage, so I can see its utility for whole-system validation and/or stress testing (burn-in). However, if validation fails, any of the enumerated components could be the source of the problem. How would one pinpoint that the failing subsystem is the GPUs in general and a specific GPU in particular?
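As a rough illustration of what per-GPU isolation could look like (this is not how MatrixWarrior works, which I don't know), here is a minimal sketch that runs the same write/read-back pattern test on each visible device in turn. The function names `pattern_mismatches` and `test_all_devices` and the test pattern are made up for this example. It only exercises device memory and the PCIe path for that one GPU, not the compute units, so a pass would not prove a card healthy and it would not catch the kind of hang described earlier in the thread.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Writes a known pattern to one device, reads it back, and returns the
// number of mismatched words, or -1 if any CUDA call fails.
long pattern_mismatches(int device, size_t words)
{
    std::vector<unsigned int> out(words), in(words, 0u);
    for (size_t i = 0; i < words; ++i)
        out[i] = 0xA5A5A5A5u ^ (unsigned int)i;   // arbitrary test pattern

    unsigned int *dev = NULL;
    size_t bytes = words * sizeof(unsigned int);
    if (cudaSetDevice(device) != cudaSuccess) return -1;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return -1;

    bool ok =
        cudaMemcpy(dev, out.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(in.data(), dev, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);
    if (!ok) return -1;

    long bad = 0;
    for (size_t i = 0; i < words; ++i)
        if (in[i] != out[i]) ++bad;
    return bad;
}

// Runs the pattern test on every visible device, prints one line per GPU,
// and returns the number of devices that did not pass (-1 if enumeration fails).
int test_all_devices(size_t words)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess) return -1;

    int failing = 0;
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) {
            printf("device %d: could not query properties\n", d);
            ++failing;
            continue;
        }
        long bad = pattern_mismatches(d, words);
        printf("device %d (%s): %ld mismatched words\n", d, prop.name, bad);
        if (bad != 0) ++failing;
    }
    return failing;
}
```

Calling `test_all_devices(1 << 20)` would test 4 MB per GPU. Because each device is tested on its own, a failure on only one index at least points at that card or its slot rather than at the CPU, system memory, or storage.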
Posting here because it causes less clutter than starting a new topic…
I’m looking for a tool to do a quick test of a GPU’s health. Found references to healthmon and tried to find out more. Discovered that it is part of GPU Deployment Kit. Found a page with links for downloads, but it shows the most recent is February 2016. The same page says it is now part of CUDA Toolkit 8, which I already have installed. Looked all through the toolkit and its documentation, but couldn’t find anything about healthmon or GPU Deployment Kit. Found documentation for healthmon that tells how to use it - “once unpackaged.” Also says it’s deprecated - we should be using Nvidia Validation Suite instead. Found a site with lots of info on nvvs, but nothing about where to get it or how to install it.