I have been trying to run simulations using the program NAMD (http://www.ks.uiuc.edu/Research/namd/).
When I run the simulations using only CPUs, they work fine. But when I run them using 4 CPUs and 1 or 2 GPUs, the simulations lose contact with the GPU, which effectively stops them. NAMD stops printing the output it is configured to write at regular intervals, but the process does not exit; it keeps polling the GPUs, which it had found correctly when the simulation started. The time before the simulation hangs seems to be random: it stalls at a different time step each time I restart the simulation from scratch. Because the program never exits and just keeps polling the GPUs, no error log is created.
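Since nothing is logged when the hang occurs, one way to at least record when it happens is to watch the modification time of the file that NAMD's output is redirected to. The following is only a minimal sketch: the function names, the log file name, and the thresholds are all illustrative assumptions, not anything NAMD provides.

```python
import os
import time


def seconds_since_last_write(log_path, now=None):
    """Return how many seconds have passed since log_path was last modified."""
    if now is None:
        now = time.time()
    return now - os.path.getmtime(log_path)


def is_stalled(log_path, max_idle_seconds, now=None):
    """True if the log has not been written to for longer than max_idle_seconds."""
    return seconds_since_last_write(log_path, now) > max_idle_seconds


def watch(log_path, max_idle_seconds=600, poll_seconds=30):
    """Block until the log stops being updated, then return the idle time in seconds."""
    while True:
        idle = seconds_since_last_write(log_path)
        if idle > max_idle_seconds:
            return idle
        time.sleep(poll_seconds)
```

For example, if the simulation output is redirected to a file called `namd.log` (a hypothetical name), calling `watch("namd.log")` returns once the file has gone ten minutes without an update, which gives a rough timestamp for the stall. The idle threshold should be longer than the interval at which NAMD is set to print output, or a healthy run will be flagged as hung.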
I have sent my files to the NAMD developers and, to my knowledge, they have not been able to reproduce the problem, which suggests that the input files are not the cause. There is nothing wrong with my hardware (two GeForce GTX 580 cards). Both the program and the OS I use are 64-bit. I have reproduced the problem on both Fedora 12 and Ubuntu 10.10.
I have tried using both the 260.19.36 and 260.19.29 drivers.
I wanted to see if anyone here has any idea what the problem could be and how I should troubleshoot it. Something seems to interrupt the communication between the program and the GPUs. Is there a program I can disable, blacklist, or uninstall that would help? How can I ensure that communication between the program and the GPUs is maintained throughout the simulation? Is it a driver issue?
Any help would be greatly appreciated!