Are there special considerations that should be made, or other best practices to observe, when using several GPUs?
I have a computer with a GTX260 and a GTX295.
If I run kernels on all three of them, sooner rather than later I get a segfault, and the graphics freeze. One needs to ssh in and reboot to get back up and running.
I can run on each card independently.
The general pattern is that the segfault appears to occur on the second card that finishes a kernel. Sometimes it works for several kernel executions, but mostly the problem shows up really quickly.
No, the computer in question does not run X or any other GUI.
Even if your symptoms are similar, it doesn’t sound like the cause is the same, since I don’t run a GUI.
Yep, been at it for two days. Have no idea what’s up. I was hoping someone could help me in narrowing it down, since it only happens when I use several cards.
What is probably worth noting is that I’m not really doing “multi-gpu” programming. I launch three separate processes, and each one uses only one card.
Which has me very confused. I would have thought it absolutely impossible for the processes to affect each other.
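For anyone trying to reproduce this setup, here is a minimal sketch of how one process might bind itself to a single card. It is not the code from the original post; the convention of passing the device index as the first command-line argument is an assumption made for illustration. It only uses the runtime calls cudaGetDeviceCount, cudaSetDevice and cudaGetDeviceProperties.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    // Assumption: the device index is given as the first argument,
    // e.g. "./app 0", "./app 1", "./app 2" for the three processes.
    int requested = (argc > 1) ? std::atoi(argv[1]) : 0;

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    if (requested < 0 || requested >= count) {
        std::fprintf(stderr, "device %d requested, but only %d device(s) found\n",
                     requested, count);
        return 1;
    }

    // Bind this process to the requested device before any other CUDA work.
    err = cudaSetDevice(requested);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaSetDevice: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Print the device name so it is clear which card each process got.
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, requested);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    std::printf("process bound to device %d: %s\n", requested, prop.name);
    return 0;
}
```

Printing the device name at startup makes it easy to confirm that no two processes ended up on the same GPU, which is worth ruling out since a GTX295 shows up as two devices.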
Just shooting wildly here:
When an unspecified launch error occurs, it’s always on the first memcpy device → host.
Does that ring a bell with anyone? Any suggestions on how to better locate the problem?
Sounds to me like the GPU equivalent of a segfault. An unspecified launch error pretty much every time means you are accessing memory outside of the allocated area, especially if it is reported on the first call after the kernel. You might want to try running your code through valgrind in emulation mode.
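Kernel launches are asynchronous, so a fault inside the kernel is usually reported by the next blocking call, which is why it shows up on the memcpy. A rough sketch of how to pin the error on the kernel itself follows. The kernel `scale` and the macro `CUDA_CHECK` are hypothetical names made up for this example, and cudaDeviceSynchronize is the current spelling of the call (older toolkits used cudaThreadSynchronize).

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file and line if a runtime call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            std::fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,   \
                         cudaGetErrorString(err_));                   \
            std::exit(EXIT_FAILURE);                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel. The bounds check is the important part: without it,
// threads in the last block would write past the end of the allocation.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1000;  // deliberately not a multiple of the block size
    float host[n];
    for (int i = 0; i < n; ++i)
        host[i] = 1.0f;

    float *dev = NULL;
    CUDA_CHECK(cudaMalloc(&dev, n * sizeof(float)));
    CUDA_CHECK(cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dev, n, 2.0f);

    CUDA_CHECK(cudaGetLastError());       // catches invalid launch configurations
    CUDA_CHECK(cudaDeviceSynchronize());  // waits for the kernel; execution faults surface here

    CUDA_CHECK(cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev));

    std::printf("host[0] = %f\n", host[0]);
    return 0;
}
```

If the synchronize call fails rather than the memcpy, the problem is inside the kernel, and out-of-range indexing in the last block is the first thing to check.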