Memory Test on a PC GPU

ToolTech · May 20, 2019, 8:48am

I am having major problems running CUDA and OpenCL programs on Windows 10, and I suspect a hardware failure on the GPU. Does anyone know of software that can validate the memory on an NVIDIA GPU?

njuffa · May 20, 2019, 12:58pm

There used to be a program called MemtestG80. No idea whether this is still a thing, or whether it works properly with modern GPUs. I have never used it.

From monitoring nightly test runs on a large-ish cluster of machines with hundreds of GPUs I know that memory failures on GPUs do occur, but they are fairly rare events: about as frequent as memory failures in the system memory of the host machines, and less frequent than power supply failures in the host machines. I assume you have carefully eliminated all other potential hypotheses regarding the root cause of the observed problems.

Incorrect handling of a GPU (e.g. lack of ESD protection during installation) could lead to permanent damage, but I haven’t personally observed such a case.

ToolTech · May 21, 2019, 8:55am

Ok. So GPU memory faults are not so common.

When I run the Furmark stress test I get lots of stray pixels, like flakes or "mirra", and eventually the application freezes with larger blocks of pixels in the wrong colour.

Do you think this could be a memory problem on the GPU, or some other error?

njuffa · May 21, 2019, 1:20pm

It’s probably been ten years since I last ran Furmark. Not a CUDA application, I would note, so I am not sure how this question wound up in a CUDA sub-forum. I do recall that it causes very high power draw and heats up the GPU quickly. Based on that I would say double check power supply and cooling for your GPU.

In my experience, DRAM failures on GPUs occur pretty much exclusively with old hardware that has been in use for a long time, say 5+ years. What GPU are you running, and how old is it?

Visual artifacts in 3D graphics can be due to any number of software and hardware related causes.

ToolTech · May 21, 2019, 3:04pm

It started with errors in my OpenCL application: the OpenCL error CL_OUT_OF_RESOURCES.

The hardware is an RTX 2080 Ti, and it has plenty of memory for my OpenCL application.

The OpenCL library just suddenly returns that error, and I also get garbage in the returned memory.

The temperature under Furmark is still rather low, 55-60, when the problem starts to get worse. I still think it's a memory issue or something broken on the GPU, but I still have no proper hardware-testing software.

njuffa · May 21, 2019, 3:21pm

Have you checked whether MemtestG80 is still around and might be suitable for testing recent hardware? The techniques used for memory checking via software haven’t really changed since I was in school (in the 1980s), so any competently written CUDA program for GPU memory testing should theoretically still be serviceable today.
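For readers unfamiliar with those techniques, here is a minimal sketch of one classic pattern test, "moving inversions". It is only an illustration on a host-side buffer in Python, not the code of MemtestG80 or any other tool; a GPU tester would presumably run equivalent write/verify passes in a kernel over device memory. All names here (`Memory`, `FaultyMemory`, `moving_inversions`, `run_test`) are hypothetical, and `FaultyMemory` merely simulates a single stuck bit so that the test has something to detect.

```python
# Sketch of a "moving inversions" memory test on a simulated memory.
# Assumption: 32-bit words; all class and function names are hypothetical.

WORD_MASK = 0xFFFFFFFF


class Memory:
    """Plain word-addressable memory of 32-bit words."""

    def __init__(self, n_words):
        self.words = [0] * n_words

    def write(self, addr, value):
        self.words[addr] = value & WORD_MASK

    def read(self, addr):
        return self.words[addr]


class FaultyMemory(Memory):
    """Memory with one bit stuck at 0, simulating a hardware defect."""

    def __init__(self, n_words, bad_addr, bad_bit):
        super().__init__(n_words)
        self.bad_addr = bad_addr
        self.keep_mask = ~(1 << bad_bit) & WORD_MASK

    def read(self, addr):
        value = super().read(addr)
        return value & self.keep_mask if addr == self.bad_addr else value


def moving_inversions(mem, n_words, pattern):
    """Run one pattern/inverse cycle; return the number of bad read-backs."""
    inverse = ~pattern & WORD_MASK
    errors = 0
    # Pass 1: fill the whole region with the pattern.
    for addr in range(n_words):
        mem.write(addr, pattern)
    # Pass 2: ascending addresses, verify the pattern, write its inverse.
    for addr in range(n_words):
        if mem.read(addr) != pattern:
            errors += 1
        mem.write(addr, inverse)
    # Pass 3: descending addresses, verify the inverse, restore the pattern.
    for addr in reversed(range(n_words)):
        if mem.read(addr) != inverse:
            errors += 1
        mem.write(addr, pattern)
    return errors


def run_test(mem, n_words, iterations=2):
    """Repeat the test with a few patterns; return the total error count."""
    patterns = [0x00000000, 0xAAAAAAAA, 0x55555555]
    total = 0
    for _ in range(iterations):
        for pattern in patterns:
            total += moving_inversions(mem, n_words, pattern)
    return total


if __name__ == "__main__":
    n = 1024
    print("healthy memory errors:", run_test(Memory(n), n))
    print("faulty memory errors: ", run_test(FaultyMemory(n, 10, 3), n))
```

Writing each pattern and then its inverse means every bit is checked in both states, which is why a bit stuck at either 0 or 1 shows up as a nonzero error count. A single clean pass does not prove the hardware is good, so such tests are normally run for many iterations.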

Depending on where you live, it might be pretty easy and straightforward to RMA recently purchased and potentially defective hardware, so that might be an alternative resolution to consider.

I know nothing about OpenCL other than that it exists, and I’ll point out that this is a CUDA forum.

ToolTech · May 21, 2019, 5:49pm

Yes, I found a version.

Final error count after 50 iterations over 1024 MiB of GPU memory: 6967 errors

And BTW, I don't think NVIDIA has any other forum for their OpenCL implementation.

ToolTech · May 22, 2019, 10:20am

I have also found out that my USB3 doesn't work exactly as it should. Could all of these errors be related to hardware issues in the motherboard?
