Shopping list for a CUDA GPGPU system in the 800-1000 euro price range. Goal: a 'budget' GTX 470 (Fermi) system.

It’s hard to say, given that the GTX 470 is not quite out yet and so hasn’t driven the price of the GTX 285 down. (Conversely, the initial low quantities of GTX 400 cards will probably cause the price to spike above MSRP until the supply improves.) Right now both cards are at approximately the same price (which can’t remain true once the GTX 470 is available in quantity), but the GTX 470 has a 50% higher peak double precision rate.

So I think the 470 will be the best double precision GFLOPS/$ in about a month, once things settle down.

Wow, thanks, that’s some detailed advice! I can wait a month. I am still coding the algorithm (or algorithms; it is not like I already know which data mining algorithm is right for the job!). You really notice how little multicore programming is considered in traditional languages, C++ in my case. We could really use some new programming concepts ;) Only after all this can I start thinking about scaling up to GPU algorithms. I will watch the price of the GTX 470 closely now!

Does this translate into anything readable in the codenames they give motherboards?

Not really, but I wouldn’t seriously suggest overclocking the HT link. On the 790FX system by my feet, a pair of GTX 275s can hit a combined transfer speed of about 6.0 GB/s when copying simultaneously. One card can hit about 5.0 GB/s. HT overclocking can get the 6.0 GB/s number up to about 6.75-7 GB/s, but does nothing for the single-GPU copy case.

6.0 GB/s should be enough. But these transfer speeds would not come into play in an ordinary algorithm, am I right? Once I have my data structures in the roughly one gigabyte of memory on the graphics card, the transfer capabilities would not matter anymore. Is that correct?

Correct, if you can amortize the time required to load the card over a long calculation. Keep in mind that we are talking about speeds at which you could fill the entire memory of a GTX 285 in 0.2 seconds. If you are going to reuse this data many times in your kernels, then that 0.2 seconds at the beginning is negligible.

You only have to worry about PCI-Express bandwidth if you need to stream a lot of data onto the card or stream a lot of results off the card. Different applications need different things, so you don’t want to limit your options with horrible PCI-Express performance (like an x4 link), but unless you have a specific need, you don’t need to go crazy trying to maximize PCI-Express bandwidth either.
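If you want to sanity-check those transfer figures on your own machine, here is a minimal sketch (my own illustration, not code from this thread) that times a single host-to-device copy with CUDA events. It assumes pinned host memory, which the quoted rates require, and a buffer roughly the size of a GTX 285’s memory:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 1024ull * 1024 * 1024;   // ~1 GB, about a GTX 285's memory
    float *h = 0, *d = 0;
    cudaMallocHost((void **)&h, bytes);           // pinned host buffer
    cudaMalloc((void **)&d, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%.0f MB in %.1f ms -> %.2f GB/s\n",
           bytes / 1e6, ms, (bytes / 1e9) / (ms / 1e3));
    return 0;
}

At 5-6 GB/s the copy takes roughly the 0.2 seconds mentioned above.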

Yes, in most applications, it isn’t so important.

My case (thread here if you are interested) was unusual. I have a multi-GPU code using pthreads which does out-of-core dense matrix multiplication. The code takes very large dense matrix products, decomposes them into a series of smaller products that get pushed to multiple GPUs and solved separately (often many times in a single operation), and then reassembles the final product matrix in host memory from the partial results computed on the GPUs. So it is moving GPU-memory-sized chunks (1-2 GB) around multiple GPUs simultaneously, which is probably the absolute worst-case scenario for PCI-e bandwidth usage. Even then, the code only spends something like 2 seconds moving 10 GB of matrix data about, which is only about 5% of the total operation time.
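For illustration only (this is a sketch of the general idea, not the actual pthreads code described above), the decomposition looks something like this on a single GPU: keep A resident on the card (assuming A itself fits; otherwise it gets blocked as well), and stream column blocks of B in and the matching blocks of C back out, sized so everything fits in device memory. The block size, layout and cuBLAS calls here are my own assumptions.

#include <cublas_v2.h>
#include <cuda_runtime.h>

// C (m x n) = A (m x k) * B (k x n), all column-major, with B and C split into
// column blocks of at most nb columns so each piece fits in GPU memory.
void tiled_sgemm(const float *A, const float *B, float *C,
                 int m, int k, int n, int nb)
{
    cublasHandle_t handle;
    cublasCreate(&handle);

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, sizeof(float) * m * k);    // A stays resident
    cudaMalloc((void **)&dB, sizeof(float) * k * nb);   // one column block of B
    cudaMalloc((void **)&dC, sizeof(float) * m * nb);   // one column block of C
    cudaMemcpy(dA, A, sizeof(float) * m * k, cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    for (int j = 0; j < n; j += nb) {
        int cols = (j + nb <= n) ? nb : (n - j);
        // Column-major: columns j..j+cols-1 of B start at B + j*k, of C at C + j*m.
        cudaMemcpy(dB, B + (size_t)j * k, sizeof(float) * k * cols,
                   cudaMemcpyHostToDevice);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, cols, k, &alpha, dA, m, dB, k, &beta, dC, m);
        cudaMemcpy(C + (size_t)j * m, dC, sizeof(float) * m * cols,
                   cudaMemcpyDeviceToHost);
    }

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
}

A multi-GPU variant would hand different blocks to different host threads, each driving its own device, which is where the simultaneous PCI-e traffic in the worst case comes from.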

Thanks, very informative! I guess the main limitation on algorithm performance will be the algorithm itself! I have yet to dive into programming algorithms for the GPU, but I have already seen some implementations of neural networks. These networks generally don’t become very big, and the datasets they are trained with can easily fit in GPU memory. I am hoping that the complete data structures (neural nets & dataset) can fit into GPU memory; then it will really be fast!

Is there anyone who has experience in these types of algorithms?

I don’t have any experience with this, but I’m curious how much memory is needed to represent your neural net. If it will fit in the Fermi L2 cache (768 kB), I could imagine some fantastic speedups. Then you will only have to hit global memory to stream a dataset to each thread.

This is getting off-topic, but will there be a way to specify that a global read is “single use” in CUDA at the C level? I see that PTX 2.0 has cache modifiers to support this, and avoid cache pollution with data that you know you will only use once per kernel. It would be nice to have some decoration/function/whatever to tell the compiler it should be generating ld.cs instructions.
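I’m not aware of a documented C-level qualifier for this at the moment, but as a rough sketch (my own illustration, assuming a 64-bit build for the "l" pointer constraint and an sm_20 target), inline PTX can emit the streaming hint directly:

__device__ float load_streaming(const float *p)
{
    float v;
    // ld.global.cs = "cache streaming": data expected to be used once, so it is
    // fetched with an evict-first policy to limit cache pollution.
    asm("ld.global.cs.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

A compiler-visible decoration would obviously be nicer than sprinkling asm() through a kernel.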

Well, the weight values of the neural network take up the most memory. I will train neural networks a few layers deep. In my case this means setting the input values of the network to the gray values of an image (say 64 * 64 = 4096 float values). This then passes through the network’s layers, consisting for example of the following layer sizes: 500, 200, 50, 200, 500 (an extreme case, btw). The network then maps back to an output vector which is compared to the original image, and the resulting error is back-propagated.

(4096 * 500 + 500 * 200 + 200 * 50) * 2 ≈ 4.3 million float values, which is about 16 MB to represent this network. That will not fit in L2 cache of course, but a less naive approach should be able to split this problem up wisely. I don’t want to optimize too much for this example, because the structure of the network can change. Possibly several input networks will be connected to one bigger layer, where each network only sees a part of the image. But I will move to the “CUDA Programming and Development” forum at that time!
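As a quick back-of-the-envelope check of that count (layer sizes as above, with the 64 * 64 image as both input and output and biases ignored; this is just my own illustration):

#include <cstdio>

int main()
{
    // 4096 -> 500 -> 200 -> 50 -> 200 -> 500 -> 4096 autoencoder layer sizes
    int sizes[] = {4096, 500, 200, 50, 200, 500, 4096};
    int n = sizeof(sizes) / sizeof(sizes[0]);

    long long weights = 0;
    for (int i = 0; i + 1 < n; ++i)
        weights += (long long)sizes[i] * sizes[i + 1];   // one weight matrix per layer pair

    printf("%lld weights, about %.1f MB as floats\n",
           weights, weights * 4.0 / (1024.0 * 1024.0));  // ~4.3 million, ~16.5 MB
    return 0;
}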

I also have a question! I found a Gigabyte motherboard with the 790FX chipset for 110 euro. Can anyone tell me if it has big disadvantages compared to the currently suggested motherboard:

Gigabyte GA-790XTA-UD4, 110 euro

http://azerty.nl/0-1126-249909/gigabyte-ga…ud4-rev-1-.html

as opposed to:

MSI 790FX-GD70, 145 euro

http://azerty.nl/0-1126-204551/msi-790fx-g…oederbord-.html

That isn’t a 790FX chipset, it is a 790X chipset, which only gives you one x16 PCI-e v2 link and one x8 PCI-e v2 link in a 16-lane mechanical slot. Still a good board, but the chipset has fewer total lanes than the 790FX, and hence not two full x16 slots.

In the first version of my post, I suggested this board. Then I discovered that the description on the azerty website was wrong, as aviday points out, and edited the post to suggest the MSI motherboard instead. :)

Ah, that clarifies things :) So that cheaper motherboard does not have that chipset? Then I will leave it this way.

Do you know of other ways to cheaply push down the price a little?

You could probably find a slightly cheaper power supply than that Antec, but otherwise I don’t see much “fat to trim”.

I presently have a very similar rig under the desk, warming my toes as I type this:

Phenom II x4 945
Gigabyte MA790FXT-UD5P
2 x Gigabyte GTX-275
8 GB DDR3-1333 (Corsair, IIRC)
Samsung F1 1 TB drive
Coolermaster HAF922
Corsair 750W PSU

No complaints (except that it is like a vacuum cleaner, in that it accumulates a startling amount of dust and fluff, even though we live in a regularly vacuumed place (god bless the Roomba) with polished wooden floors and no carpets). Performance is excellent (I have hit 160 GFLOP/s in Linpack with this rig).

Wow, that is very similar. Indeed, I think I can cut costs on the power supply, and on the chassis too. But this is probably close to the lowest starting cost for a reasonable GPU computing system.

BTW, do I have to pay attention to cooling? Like buying extra fans. Or are these all included with the computer parts?

If you are buying a “boxed” CPU, there will be an acceptable heatsink/fan included. The HAF922 has three big fans: a front lower intake fan and two large extractor fans at the top and rear. I have found those, plus the PSU, to be adequate for keeping everything cool. For some very compute-intensive and long jobs, I find it necessary to ramp up the fans on the GPUs a bit (for example, I have one 275 which has been computing basically non-stop for most of today and it is sitting at a steady 70C core temperature). The Linux drivers I use don’t seem to have smart fan control included, so I usually just run the fans on active GPUs up by hand to around 65-70% speed for the duration of the compute job and back down afterwards.

Ok, that sounds like enough fans. That Linux driver problem sounds weird; you shouldn’t have to control the fans manually!

I was wondering what the system would look like if I did not need the option to upgrade to two cards. Such a system could be even cheaper! I just want to know what my options are ;)

It is because the card is a non-display card. The display driver does the fan speed control when a display is attached to the card, but this one doesn’t have one. There is a command line utility that can be used to control the fan instead.

You could potentially save a good amount of money. Personally, I would go for a board with an onboard GPU (or one supporting on-die graphics in the case of Core i3), and a single 16-lane slot for the compute GPU. That actually gives a number of advantages, especially when debugging on the hardware. You should be able to get by with a 550-600 W PSU and a smaller case (although fans in the right places are important).

A single, fully dedicated Fermi class GPU for computing is a rather large and impressive device with a lot of capacity. You might find it is all you ever need.

I am now trying to apply your advice, but I am not sure what type of motherboard I have to choose. Could you give an example?

Also, what are the differences between GTX 470 cards from different companies? There are significant price differences. To give an example:

Point of View GTX 470, 409 euro

http://azerty.nl/8-971-268106/point-of-vie…?tab=tech_specs

ASUS GTX 470, 318 euro!

http://azerty.nl/8-971-268104/asus-engtx47…?tab=tech_specs

Could you elaborate on that?

Since all companies get the same GPUs from NVIDIA (and initially even use the same reference design for the entire board), they can only price differentiate themselves in a few ways:

  • Overclocking. Many vendors offer the same card with different levels of overclocking from the factory. Presumably, all they are doing is testing boards with different clock rates and binning them based on which ones still work at higher speeds. Some vendors also include tools to assist in overclocking. Given that CUDA jobs can be pretty stressful on a card (100% load for hours or days), you should stay away from overclocked cards. They don’t offer enough benefit, and a long job might push them over the edge.

  • Device memory. Sometimes you’ll see (usually well into a product cycle) a vendor double the capacity of the memory chips on a board. This tends to be a niche market (I only ever saw a handful of 2 GB GTX 285 cards).

  • Branding. Brands develop reputations for good (or bad) customer service, style, whatever.

  • Short term market pressures. This is my best guess in the case of the two cards you link to. With GTX 470 supplies very low and erratic, some vendors might be jacking their prices up to squeeze a little more profit out before the supply chain calms down. Boards priced at the MSRP seem to be selling out very fast, leaving customers to either wait, or pay a little more to the vendors who raised their prices.