Memory Test on a PC GPU

I am having major problems running CUDA and OpenCL programs on Windows 10, and I suspect a hardware failure on the GPU. Does anyone know of software that can validate the memory on an NVIDIA GPU?

There used to be a program called MemtestG80. No idea whether this is still a thing, or whether it works properly with modern GPUs. I have never used it.

From monitoring nightly test runs on a large-ish cluster of machines with hundreds of GPUs I know that memory failures on GPUs do occur, but they are fairly rare events: about as frequent as memory failures in the system memory of the host machines, and less frequent than power supply failures in the host machines. I assume you have carefully eliminated all other potential hypotheses regarding the root cause of the observed problems.

Incorrect handling of a GPU (e.g. lack of ESD protection during installation) could lead to permanent damage, but I haven’t personally observed such a case.

Ok. So GPU memory faults are not so common.

I ran the FurMark stress test and got lots of stray pixels that look like flakes or noise, and eventually the application froze with larger blocks of pixels in the wrong colour.

Do you think this could be a memory problem on the GPU, or some other error?

It’s probably been ten years since I last ran Furmark. Not a CUDA application, I would note, so I am not sure how this question wound up in a CUDA sub-forum. I do recall that it causes very high power draw and heats up the GPU quickly. Based on that I would say double check power supply and cooling for your GPU.

In my experience, DRAM failures on GPUs occur pretty much exclusively with old hardware that has been in use for a long time, say 5+ years. What GPU are you running, and how old is it?

Visual artifacts in 3D graphics can be due to any number of software and hardware related causes.

It started with errors in my OpenCL application: the OpenCL error CL_OUT_OF_RESOURCES.

The hardware is an RTX 2080 Ti, and it has plenty of memory for my OpenCL application.

The OpenCL library just suddenly returns that error, and I also get garbage in the returned memory.

The GPU temperature under FurMark is still rather low, 55-60, when the problem starts to get worse. I still think it's a memory issue or something broken on the GPU, but I still have no proper hardware-testing software.

Have you checked whether MemtestG80 is still around and might be suitable for testing recent hardware? The techniques used for memory checking via software haven’t really changed since I was in school (in the 1980s), so any competently written CUDA program for GPU memory testing should theoretically still be serviceable today.
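To illustrate the kind of technique meant here, below is a minimal sketch of a classic pattern/complement test: write a known bit pattern to every word, read it back and compare, then repeat with the inverted pattern. It is written in Python against a simulated buffer purely to show the logic. The names SimulatedMemory and pattern_test are hypothetical and not taken from MemtestG80 or any other tool; a real GPU tester would presumably run the same write and verify passes in a CUDA kernel over device memory.

```python
from typing import Dict, Optional

WORD_MASK = 0xFFFFFFFF  # treat memory as an array of 32-bit words


class SimulatedMemory:
    """Hypothetical stand-in for a device buffer.

    stuck_bits maps a word index to a mask of bits that always read as 1,
    which models a simple stuck-at-1 hardware fault.
    """

    def __init__(self, n_words: int, stuck_bits: Optional[Dict[int, int]] = None):
        self.words = [0] * n_words
        self.stuck_bits = stuck_bits or {}

    def write(self, index: int, value: int) -> None:
        # A faulty cell forces its stuck bits to 1 regardless of what is written.
        self.words[index] = (value | self.stuck_bits.get(index, 0)) & WORD_MASK

    def read(self, index: int) -> int:
        return self.words[index]


def pattern_test(mem: SimulatedMemory, n_words: int, pattern: int) -> int:
    """Write pattern, verify, then write its complement and verify.

    Returns the number of words that read back a value different from
    the one written.
    """
    errors = 0
    for value in (pattern & WORD_MASK, ~pattern & WORD_MASK):
        for i in range(n_words):      # write pass
            mem.write(i, value)
        for i in range(n_words):      # verify pass
            if mem.read(i) != value:
                errors += 1
    return errors


if __name__ == "__main__":
    healthy = SimulatedMemory(1024)
    faulty = SimulatedMemory(1024, stuck_bits={17: 0x4})  # bit 2 of word 17 stuck at 1
    print("healthy:", pattern_test(healthy, 1024, 0x55555555))
    print("faulty: ", pattern_test(faulty, 1024, 0x55555555))
```

Testing both a pattern and its complement matters because a stuck bit only shows up when the written value disagrees with it: in the simulated faulty buffer above, the fault is invisible under 0x55555555 and is caught only by the inverted pass.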

Depending on where you live, it might be pretty easy and straightforward to RMA recently purchased and potentially defective hardware, so that might be an alternative resolution to consider.

I know nothing about OpenCL other than that it exists, and I’ll point out that this is a CUDA forum.

Yes, I found a version.

Final error count after 50 iterations over 1024 MiB of GPU memory: 6967 errors

And BTW, I don't think NVIDIA has any other forum for their OpenCL implementation.

I have also found out that my USB3 doesn't work exactly as it should. Could all of these errors be related to hardware issues in the motherboard?