Multi-GPU programming

Letharion · October 1, 2009, 7:26am

Are there special considerations to take into account, or other best practices to observe, when using several GPUs?

My problem:
I have a computer with a GTX260 and a GTX295.
If I run kernels on all three GPUs (the GTX295 is a dual-GPU card), sooner rather than later I get a segfault and the graphics freeze. One needs to ssh in and reboot to get back up and running.
I can run on each card independently.
The general pattern is that the segfault appears to occur on the second card that finishes a kernel. Sometimes it works for several kernel executions, but mostly the problem shows up really quickly.

I’m trying to determine the cause of this.

CaLu · October 1, 2009, 10:47am

Are you running in a graphical environment?

luca

biebo · October 1, 2009, 11:16am

I have a similar problem: when I increase the size of the data stored in GPU memory, the graphical user interface crashes.

Maybe my GPU memory gets fully consumed.

Sarnath · October 1, 2009, 11:20am

A code review (I see you are already doing that) should fix this.

Letharion · October 1, 2009, 12:46pm

@Calu
No, the computer in question does not run X, or other GUIs.

@Biebo
Even if your symptoms are similar, it doesn’t sound like the cause is the same, since I don’t run a GUI.

@Sarnath
Yep, been at it for two days. Have no idea what’s up. I was hoping someone could help me in narrowing it down, since it only happens when I use several cards.

What is probably worth noting is that I’m not really doing “multi-gpu” programming. I launch three separate processes, and each one uses only one card.
Which has me very confused. I would have thought it absolutely impossible for the processes to affect each other.
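
For reference, a one-process-per-card setup like the one described above can be sketched roughly as follows. This is an illustration only, not the actual code behind this thread; taking the device index from the first command-line argument is an assumed convention, and the fallback to device 0 is arbitrary.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    // Assumed convention: device index is the first command-line argument.
    int requested = (argc > 1) ? std::atoi(argv[1]) : 0;

    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    if (requested < 0 || requested >= count) {
        fprintf(stderr, "device %d requested, but only %d device(s) found\n",
                requested, count);
        return 1;
    }

    // Bind this process to one GPU before any allocation or kernel launch.
    err = cudaSetDevice(requested);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Print the device name so each process logs which GPU it really got.
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, requested);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("process bound to device %d: %s\n", requested, prop.name);
    return 0;
}
```

Logging the device name per process makes it easy to confirm that no two processes ended up on the same GPU by accident.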

Letharion · October 1, 2009, 3:20pm

Just shooting wildly here:
When an unspecified launch error occurs, it’s always on the first memcpy device → host.
Does that ring a bell with anyone? Any suggestions on how to better locate the problem?
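
One way to narrow this down: kernel launches are asynchronous, so an error raised while the kernel runs is typically reported by the next call that synchronizes, which is often the first device → host memcpy. Checking explicitly right after the launch separates kernel failures from copy failures. A minimal sketch, assuming a current toolkit (`cudaDeviceSynchronize`; toolkits from 2009 used `cudaThreadSynchronize` for the same purpose). The `CHECK` macro and the `fill` kernel are hypothetical names used only for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file and line if a runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e_ = (call);                                      \
        if (e_ != cudaSuccess) {                                      \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e_));                          \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative kernel: one thread per element, launched with exactly
// as many threads as there are elements.
__global__ void fill(float *data)
{
    data[threadIdx.x] = (float)threadIdx.x;
}

int main()
{
    const int n = 256;
    float host[n];
    float *dev = NULL;

    CHECK(cudaMalloc((void **)&dev, n * sizeof(float)));

    fill<<<1, n>>>(dev);
    CHECK(cudaGetLastError());       // reports invalid launch configurations
    CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    // If execution reaches this point, the kernel itself completed cleanly.
    CHECK(cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost));
    printf("host[%d] = %.1f\n", n - 1, host[n - 1]);

    CHECK(cudaFree(dev));
    return 0;
}
```

If the failure moves from the memcpy to the synchronize call, the kernel is the culprit rather than the copy.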

CaLu · October 2, 2009, 6:44am

Maybe you are running into the 5-second limit on kernel run time… Try running outside the graphical interface.

luca

theMarix · October 20, 2009, 1:43pm

Sounds to me like the GPU equivalent of a segfault. An unspecified launch error almost always means you are accessing memory outside of the allocated area, especially if it is reported on the first call after the kernel. You might want to try running your code through valgrind in emulation mode.
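
A common source of such out-of-bounds accesses is a grid that is rounded up to a whole number of blocks, so that more threads run than there are elements. The sketch below is a made-up example of that pattern with an index guard, not a diagnosis of the code in this thread; the kernel name `addOne` and the sizes are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void addOne(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Without this guard, the surplus threads in the last block
    // would write past the end of the allocation.
    if (i < n)
        data[i] += 1;
}

int main()
{
    const int n = 1000;                              // not a multiple of the block size
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;  // rounds up to 4 blocks

    int *dev = NULL;
    if (cudaMalloc((void **)&dev, n * sizeof(int)) != cudaSuccess)
        return 1;
    if (cudaMemset(dev, 0, n * sizeof(int)) != cudaSuccess)
        return 1;

    addOne<<<blocks, threads>>>(dev, n);
    cudaError_t err = cudaDeviceSynchronize();       // surfaces errors from the kernel

    printf("launched %d threads for %d elements: %s\n",
           blocks * threads, n, cudaGetErrorString(err));

    cudaFree(dev);
    return err == cudaSuccess ? 0 : 1;
}
```

Here 4 blocks of 256 threads give 1024 threads for 1000 elements, so 24 threads rely on the guard to stay inside the buffer.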