PNY RTX A5000 artifacts / black screens / restarts

Hi there,

I’m having issues with the A5000. I sometimes get random artifacts, and they don’t appear at any particular time. The screen often turns black and then returns to normal (after having crashed my Stable Diffusion Automatic1111), or a complete restart seems to happen when I push the graphics card. However, no more than 8.6GB of memory is ever in use when this happens.

When running Stable Diffusion I am getting this error a lot:
CUDA error: an illegal memory access was encountered

I’m concerned that my hardware might be faulty given the number of strange things that are happening. Unfortunately I bought the card sealed ‘as new’ off eBay, so there is no possibility of a return.

I have tried fully removing all graphics card drivers and updating to the latest version, but this has not helped. Now I’m at a bit of a loss for what to try next. I am currently running Windows 11.

Any help would be much appreciated, as I’ve wasted days updating Windows, changing software settings, and reading endless blog posts, all to no avail.

Hi there @kia.coates, welcome to the NVIDIA developer forums.

There are a couple of things to try, but to be honest, from your description it does indeed sound like a hardware issue. If the SD algorithm you are using is known to work correctly on other GPU hardware, then the CUDA error might very well be a hardware memory fault.

Please forgive me if you have tried some of these already, but I cannot know that beforehand of course.

  • Try re-seating the GPU in your mainboard. A bad PCIe connection can cause this.
  • Try a different PCIe slot.
  • Try the card in a different PC and/or with a different OS like Ubuntu.
  • Make sure the GPU power connectors are properly connected.
  • Make sure the GPU gets sufficient cooling. The card can run quite hot depending on the workload, and the single fan might not be sufficient if the case itself does not have good airflow.
  • Check temperatures! MSI Afterburner is a helpful tool here that will also show other errors and temp/power throttling in its graphs. You can also use nvidia-smi dmon for this; check out nvidia-smi dmon -h for details on usage.
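
As a rough illustration of the temperature check, here is a minimal sketch that polls nvidia-smi once per second and prints the readings, so the last values before a black screen or restart are on record. It assumes Python is installed and nvidia-smi is on the PATH. The function names parse_line and read_sample are made up for this example, and the query fields are the ones listed by nvidia-smi --help-query-gpu.

```python
import subprocess
import time

# Query fields as listed by `nvidia-smi --help-query-gpu`.
FIELDS = ["temperature.gpu", "power.draw", "utilization.gpu", "memory.used"]


def parse_line(line):
    """Turn one CSV line from nvidia-smi into a dict of floats (None if not numeric)."""
    values = [v.strip() for v in line.split(",")]
    sample = {}
    for name, value in zip(FIELDS, values):
        try:
            sample[name] = float(value)
        except ValueError:
            sample[name] = None  # e.g. "[N/A]" when a field is not supported
    return sample


def read_sample():
    """Run nvidia-smi once and return the readings for the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_line(out.splitlines()[0])


if __name__ == "__main__":
    # Log once per second while the workload runs; stop with Ctrl+C.
    while True:
        print(time.strftime("%H:%M:%S"), read_sample(), flush=True)
        time.sleep(1)
```

Running this in a second terminal while Stable Diffusion is generating, with the output redirected to a file, should show whether temperature or power draw climbs right before the crash. If the values stay unremarkable, that would point more towards the memory or PCIe checks above.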

That’s it. I sincerely hope one of these leads you to a solution!