CUDA with SLI

So my question has two parts. First, let me give you some constraints:

  1. I need multi-GPU support
  2. I don’t have much money to spend

So with that out there, first I am wondering what kind of performance benefits SLI gives you. How do you specifically utilize SLI through the CUDA API? Here is my case:

Say I have an array “float *devvalues1” that I allocate on GPU device 1, and I want to copy it to a device-allocated array “float *devvalues2” on device 2. If I call cudaMemcpy(devvalues2, devvalues1, size, cudaMemcpyDeviceToDevice), will that automatically use the SLI interface?
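To make that concrete, here is roughly what I have in mind (just a sketch of what I’m asking about - N is a placeholder size and I’ve left out error checking):

  // What I want to do: copy N floats from the card I call "GPU 1" to the card I call "GPU 2".
  float *devvalues1, *devvalues2;
  size_t bytes = N * sizeof(float);

  cudaSetDevice(0);                         // "GPU 1"
  cudaMalloc((void **)&devvalues1, bytes);

  cudaSetDevice(1);                         // "GPU 2"
  cudaMalloc((void **)&devvalues2, bytes);

  // Is a plain device-to-device copy like this routed over the SLI bridge?
  cudaMemcpy(devvalues2, devvalues1, bytes, cudaMemcpyDeviceToDevice);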

So now for the second question:

I’ve heard a lot about CPU bottlenecks with multiple GPUs:

  1. Will I run into CPU bottlenecks if I drive each GPU’s kernels from a separate CPU process (using MPI, roughly as sketched below)?
  2. Does SLI bypass the CPU to avoid bottlenecking?
  3. How can I reduce CPU bottlenecking (and is this only relevant to memory transfers if I use MPI to drive each GPU individually) - a better processor, motherboard, memory…?
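Here is the kind of rank-to-GPU mapping I have in mind (purely a sketch of the setup I’m imagining, not code I actually have running):

  #include <mpi.h>
  #include <cuda_runtime.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank = 0, ndev = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      cudaGetDeviceCount(&ndev);

      // Bind each MPI rank to one GPU.
      cudaSetDevice(rank % ndev);

      // ... allocate device memory, launch kernels, exchange results via MPI ...

      MPI_Finalize();
      return 0;
  }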

Sorry if these questions are very general, but I am starting my PhD (similar to another thread here) and now know I need multi-GPU support (probably starting with 2 GPUs, but I would like support for 4-way SLI, if in fact that helps).
I probably can’t spend more than $2k, but if I make the colors (results) pretty enough maybe I can spend more. I’m almost positive I will build this machine myself to reduce cost (it will be running Linux unless there are compatibility issues).

Thanks

SLI has nothing to do with CUDA and vice versa. Consider them orthogonal.

As the other guy said, SLI cannot (yet?) be used with CUDA. So syncing two or more GPUs always has to go through the host: you copy the data from one device to the host and from there to the other GPU. Whether this is a serious performance issue depends on how much synchronization work you have to do. Just exchanging some flags doesn’t take long, but shuffling hundreds of megabytes around can be a problem. Using one MPI process per GPU is quite a common approach. If you stay on one node you could use OpenMP instead, but MPI gives you the possibility to scale over multiple nodes later.
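A rough sketch of that staging copy, assuming devvalues1 and devvalues2 are already allocated on their respective cards and each GPU is driven by its own host thread or MPI rank (pinned host memory via cudaMallocHost would speed the copies up):

  // In the thread/rank that owns GPU 0: pull the data down into a host buffer.
  float *host_buf = (float *)malloc(bytes);
  cudaMemcpy(host_buf, devvalues1, bytes, cudaMemcpyDeviceToHost);

  // ... hand host_buf to the other thread, or ship it over with MPI_Send / MPI_Recv ...

  // In the thread/rank that owns GPU 1: push it back up to the second card.
  cudaMemcpy(devvalues2, host_buf, bytes, cudaMemcpyHostToDevice);
  free(host_buf);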

Cheers
Ceearem

Marmot,

I think the information that you’re looking for (but that hasn’t been spelled out here) is that when people talk about using multiple GPUs in one computer, they do so by putting the GPUs in multiple PCIe slots and sending compute jobs to them individually. They’re not using SLI. It would be great if one could link 2-3 cards with SLI and have them work as one big unit for CUDA, but AFAIK that’s not an option at this point in time. (Hint, hint, nVidia people, …)

Some issues to be aware of:

Each card needs 1-2 PCIe power connectors, of either the 6- or 8-pin type; the 6+2 style can drive either and gives you more flexibility when moving cards around between different machines. Each card also needs a decent amount of power, which depends on the card. Your power supply needs to deliver enough power on each PCIe power line - make sure you aren’t in a position where you have enough total power but multiple 12V rails can’t power all of your cards at once. And as has been stated here a number of times, leave some headroom. Power shortfalls can produce issues in other parts of the system that are time-consuming to diagnose. Spend a little more here and avoid these pitfalls.

Each card needs a (double-width) PCIe slot on the motherboard, preferably with x16 bandwidth, but if you’re going for more than two cards this won’t be possible for all of them. If you don’t send a lot of data back and forth between the CPU and the GPU, you can probably get away with x8 or lower (some users here report that x1 is sufficient for certain operations and that x4 is more than enough). A quick way to see how much bus bandwidth your problem actually needs is to time your own transfers, as in the sketch below.
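A minimal sketch of such a measurement (the buffer size is arbitrary; pinned host memory is used so the number reflects the bus rather than pageable-memory overhead):

  // Rough sketch: time one host-to-device copy and report the effective bandwidth.
  #include <cstdio>
  #include <cuda_runtime.h>

  int main()
  {
      const size_t bytes = 64 * 1024 * 1024;      // 64 MB test buffer

      float *h_buf, *d_buf;
      cudaMallocHost((void **)&h_buf, bytes);     // pinned host memory
      cudaMalloc((void **)&d_buf, bytes);

      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);

      cudaEventRecord(start, 0);
      cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
      cudaEventRecord(stop, 0);
      cudaEventSynchronize(stop);

      float ms = 0.0f;
      cudaEventElapsedTime(&ms, start, stop);
      printf("Host to device: %.2f GB/s\n", (bytes / 1e9) / (ms / 1e3));

      cudaEventDestroy(start);
      cudaEventDestroy(stop);
      cudaFree(d_buf);
      cudaFreeHost(h_buf);
      return 0;
  }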

If you’re using the latest and greatest 400 series cards, they run hot, so it’s nice to have some space between cards so the side fans on the card can draw air in. Pressing one fan against the back of another card reduces airflow, possibly shortening the working life of the card (but we don’t have much data on that as the 400 series cards have only been in service for ~ 6 mos.), and perhaps even reducing the stability/reliability of your calculations. But if you plan to put 4 cards into one machine, you’ll need a special case and motherboard, or have to cram them all against each other.

Also, if you’re just using one machine, you’ll need either video out on the motherboard or a slot dedicated to an output card. The MSI NF980-G65 motherboard has nVidia video out and can take 3 PCIe cards at x16/x8/x8, or two at x16/x16 (but they are all right next to each other).

You’ll also want a case with good airflow. The Antec Three Hundred Illusion runs about $65 at Newegg and comes with 4 fans, including a top exhaust fan, plus a spot for a side intake fan that blows right onto where the graphics cards sit. Tom’s Hardware consistently rates it as a great budget choice for a gaming case: no frills, good airflow, and washable air filters. For machines with multiple 470s or higher, the Antec 902, Cooler Master Storm Scout, or Silverstone Raven RV02 have good cooling, to name a few.

Finally, CUDA machines share a lot of the same issues as high-end gaming builds, so using their ideas as a base is a solid way to approach the problem. Check out this build: [url=“System Builder Marathon: TH's $2000 Hand-Picked Build | Tom's Hardware”]http://www.tomshardware.com/reviews/newegg...dware,2753.html[/url] , as well as the associated articles. Note, though, that the CPU matters less for CUDA calculations that aren’t bandwidth limited, so the AMD solutions might be more cost effective - but that depends on your problem.

Regards,
Martin

Thanks, this is some of the info I was interested in. I’ve read a lot about power-hogging high-end GPUs, so I’m kind of worried about that… I don’t think I need video out at all, if that matters - this machine is purely for computation, so I don’t really care about video (if anything I will be using X forwarding).

Thanks again for your suggestions.

Hey aeronaut,

I am looking for this MSI mobo here in Hong Kong but I can’t find it… and I don’t want to buy it on the internet and build the PC by myself :P.

Do you have any other suggestions for a good AM3 mobo that can take 3 GPUs with ease?

Thx a lot!
