GTX 780 Ti issue

Hello,

We frequently hit an error when running CUDA programs (the MD packages HOOMD or GALAMOST) on machines with GTX 780 Ti cards.
The error thrown is:
RuntimeError: an illegal memory access was encountered
CUDA Error
or:
***Error! unspecified launch failure
CUDA Error

We are quite sure this problem comes from the GTX 780 Ti.
All 10 machines (20 GTX 780 Ti cards in total), each with a different configuration, suffer from this problem.
When other cards (such as the GTX 780 or GTX 680) were installed in the same machines, the problem disappeared.
We have tested every version of the driver and the CUDA toolkit. The GTX 780 Ti cards are made by Leadtek.

Do you have the 780 Ti or the 780 Ti OC Triple Fan (Leadtek makes two versions of the product)?
Windows or Linux? (It seems like Linux; GALAMOST appears to have only a Linux version.)

The 780 Ti requires 250W, so with 2 in a system that is 500W just for the GPUs. If you have anything less than an 850W PSU, you are probably asking for trouble. Even 850W may be borderline, especially if there are 2 CPUs in the system. Cards like the GTX 680 draw significantly less power. To isolate this, try removing one of the two 780 Ti cards from a system and see if the problem still occurs.
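The power estimate above can be written out as a rough budget. This is only a sketch: the 250W per-card figure is the one quoted above, while the CPU and "everything else" wattages are placeholder assumptions that should be replaced with the numbers for the actual hardware. The function names are illustrative.

```python
GPU_WATTS = 250     # per-card figure quoted above for the 780 Ti
CPU_WATTS = 130     # ASSUMPTION: placeholder per-CPU draw, not from this thread
OTHER_WATTS = 100   # ASSUMPTION: placeholder for board, drives and fans


def estimated_load(n_gpus, n_cpus):
    """Rough sustained system draw in watts under the assumptions above."""
    return n_gpus * GPU_WATTS + n_cpus * CPU_WATTS + OTHER_WATTS


def headroom(psu_watts, n_gpus, n_cpus):
    """Rated PSU output minus the estimated load; negative means overcommitted."""
    return psu_watts - estimated_load(n_gpus, n_cpus)


if __name__ == "__main__":
    for cpus in (1, 2):
        print("2 GPUs, %d CPU(s), 850W PSU: headroom %d W"
              % (cpus, headroom(850, 2, cpus)))
```

With these placeholder values, two cards and two CPUs already exceed an 850W supply, which is consistent with calling 850W borderline. Short load spikes are not modeled here, so a small positive headroom should not be read as safe.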

You might also want to check with Leadtek to see if there are any VBIOS updates for your card, although I didn’t see any with a quick look at the Leadtek website.

You should also make sure you have adequate cooling. A simple way to check this is to run nvidia-smi in a loop while you are running your test, and monitor the GPU temperatures.

Thank you very much for your professional reply.
We have the 780 Ti OC Triple Fan (each card has three fans, so I assume it is the 780 Ti OC Triple Fan version). The OS is Linux and there is only 1 CPU in the system. The rated output power of the PSU is 1100W. I don't think it is overheating: the GPU temperature stays at about 60-70 C while the program is running.

Would it be possible for you to provide a set of instructions that would allow someone else to reproduce the issue you are seeing? For example:

  1. download hoomd-blue from here
  2. build it like this
  3. use this input deck
  4. execute it with this command line

Or something like that. It is fine if it is GALAMOST or some other application, as well.

I would also encourage you to check with Leadtek from time to time for a VBIOS update.

I suggest reproducing the issue with HOOMD, which is more widely used.

  1. download hoomd-blue from http://codeblue.umich.edu/hoomd-blue/download.html; v0.11.3 is recommended.

  2. build it according to the guide http://codeblue.umich.edu/hoomd-blue/doc/page_install_guide.html.

  3. you can use an arbitrary script to run hoomd; however, here is the one we used recently: testdpd0.92-refine.hoomd (on GitHub)

  4. run the script with the command “hoomd testdpd0.92-refine.hoomd --gpu=0&”
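Because the failure is intermittent, it may help to repeat step 4 several times and count how often the reported error strings show up. The following is a minimal sketch, assuming the `hoomd` executable is on the PATH and the script sits in the current directory. The helper names and the run count are hypothetical, and the job is run in the foreground (without the trailing `&`) so that its output can be captured.

```python
import subprocess

# Error strings quoted in the original report
SIGNATURES = ("an illegal memory access was encountered",
              "unspecified launch failure")


def find_cuda_errors(text):
    """Return the known error signatures that appear in text."""
    return [s for s in SIGNATURES if s in text]


def run_once(script="testdpd0.92-refine.hoomd", gpu=0):
    """Run hoomd once and return (exit_code, matched_error_strings)."""
    cmd = ["hoomd", script, "--gpu=%d" % gpu]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    output, _ = proc.communicate()
    return proc.returncode, find_cuda_errors(output)


def repeat(n_runs=10, gpu=0):
    """Run the job n_runs times on one GPU and return the failure count."""
    failures = 0
    for i in range(n_runs):
        code, errors = run_once(gpu=gpu)
        if code != 0 or errors:
            failures += 1
            print("run %d on GPU %d failed: exit=%s errors=%s"
                  % (i, gpu, code, errors))
    return failures
```

Running `repeat(gpu=0)` and then `repeat(gpu=1)` on a two-card machine would show whether both cards fail at a similar rate or whether one of them is worse.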

We are checking the VBIOS issue with Leadtek. We appreciate your suggestions very much.