Best solution for maximizing bandwidth? More than 5.7 GB/s H->D bandwidth without Tesla

Hi all,
I’m working on a CUDA project and already got around a 20x speedup. But I still want to get more benefit from CUDA :rolleyes:

Currently, my H->D bandwidth is only 1.9 GB/s (pinned), so I want to buy better hardware to speed it up. I once read a topic in this forum where someone got 5.7 GB/s (pinned) H->D bandwidth with an Intel X58, but I also know that on Tesla it’s easy to get speeds around 70 GB/s (it’s too expensive for me). So my questions are:

  1. What’s the best solution for maximizing bandwidth, other than Tesla?
  2. Why can Tesla get ~70 GB/s bandwidth? Can someone give me a detailed technical article about this?

Thanks!

You’re mixing up several different bandwidths here.

H->D and D->H bandwidths are up to 6 GiB/s on motherboards with fast DRAM.
D->D bandwidth is up to 140 GB/s (theoretical) on the GTX280. On Tesla the device memory bandwidth is actually a bit lower, around 102 GB/s.
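
If you want to measure the H->D number on your own box, here is a minimal sketch of a pinned-memory bandwidth test using the CUDA runtime API (same idea as the bandwidthTest sample in the SDK; the 64 MiB buffer size and 50 iterations are arbitrary choices of mine):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 << 20;   /* 64 MiB test buffer */
    const int    iters = 50;

    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, bytes);  /* pinned (page-locked) host memory */
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H->D pinned: %.2f GB/s\n",
           (double)bytes * iters / (ms / 1000.0) / 1e9);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}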

You are correct. There is no magic on Tesla; I think I was just misled by some articles. In fact, 70 GB/s of H->D bandwidth was beyond my understanding.

Thanks a lot!

No, DRAM speed doesn’t matter for pinned memory.

On P45, what matters is the FSB. With a 1333 MHz FSB, you get 5.5 GB/s even with 800 MHz DDR2. With an 800 MHz FSB, you get 4 GB/s.

On X58, I’m not sure exactly, but I think it’s fast any way you slice it. Also, even non-pinned memory hits >5 GB/s, which is very impressive.

I’m not sure about ATI boards; I think some do all right, some do really poorly. Is your 2 GB/s pinned figure on an ATI board?

On X58 it is in fact memory speed that matters: DRAM clock, number of channels, and QPI speed.

Yes, someone in this forum got >5 GB/s (pageable) on X58, but he got only around 5.7 GB/s for pinned memory. Why is the gain so small?

In fact, my motherboard is a very old Dell model; 2 GB/s is already a surprise to me. That’s why I need to buy a new one. It seems X58 is the best choice. What do you think?

As an NV employee, do you know whether NV has any plan to solve this bottleneck? A new motherboard chipset for CUDA? I think there are a million people like me looking forward to it.

Because the overhead of pageable memory is mostly DRAM → DRAM copies: when those copies are very fast compared to PCI-E v2 speeds, the gain from pinned memory shrinks.
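
To see that overhead directly, rerun the same kind of test with a plain malloc’ed buffer instead of a pinned one; the driver then has to stage the data through its own internal pinned buffer, which is exactly the extra DRAM → DRAM copy mentioned above. A minimal sketch (a fragment reusing d_buf, bytes, and iters from the test earlier in the thread, plus <stdlib.h> for malloc):

/* Pageable variant: cudaMemcpy from a malloc'ed buffer forces the
 * driver to copy through an internal pinned staging buffer, so the
 * extra DRAM -> DRAM hop shows up as lower reported bandwidth. */
float *h_pageable = (float *)malloc(bytes);
for (int i = 0; i < iters; ++i)
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);
free(h_pageable);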

Thanks. BTW, do you think X58 + 9800GX2 is the best solution currently?

No. If you want to go X2 (i.e., do the work of managing two GPUs), then just wait for the GTX295. It’ll be out in a month.

Thanks, I know the GTX295 is the best solution. But my bottleneck is shifting from computation to disk/memory I/O, so I can’t get a big improvement from the GTX295.

BTW, if my boss doesn’t care about the budget, I will buy one.

It should be about $450, which is less than you’ll pay for the i7 + X58 + DDR3 setup.

What do you mean, disk I/O is your bottleneck? And why get a 9800GX2 instead of a GTX260?

$450? Sounds great. I have discussed it with my boss, and he prefers a more powerful GPU, so maybe GTX295(s) are on their way :rolleyes:

In my project, disk I/O takes up more than 40% of the runtime, GPU computation takes around 10%, and the rest is hard to tune.

Why do I prefer dual GPUs over a more powerful single GPU? Please correct me if my understanding is wrong: CUDA suggests using a CPU-GPU pair for maximum performance, which means one GPU will occupy almost all the resources of one CPU. In my testing, dual CPU + single GPU == single CPU + single GPU. BTW, I read a topic in this forum where someone already used a 9800GX2 to double their performance. So…

Using dual GPUs will only double performance if you are almost completely compute bound with very little I/O. My program, for instance, which is otherwise capable of running the entire calculation on the GPU, requires a lot of communication between the two GPUs in dual-GPU mode. Because of the extreme slowness of this communication, two GPUs are only 1.4x faster than one.

If your app is 40% disk I/O, you will get a better return on your investment by putting that money into a RAID array or a fast SSD, or otherwise investing your time in optimizing the I/O portion of your code. Going dual GPU is not a magic bullet, and it actually opens up a lot of new difficulties in programming.
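
To give a flavor of what “managing 2 GPUs” means in practice: with the current 2.x toolkits a CUDA context is tied to one host thread, so the usual pattern is one host thread per GPU, each calling cudaSetDevice() before any other CUDA work. A minimal pthreads sketch, where the commented-out body stands in for your real allocations, copies, and kernel launches:

#include <pthread.h>
#include <cuda_runtime.h>

static void *worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);        /* bind this host thread to one GPU */
    /* ... allocate, copy, and launch kernels for this GPU's share ... */
    cudaThreadSynchronize();   /* wait for this GPU's work to finish */
    return NULL;
}

int main(void)
{
    int ids[2] = { 0, 1 };
    pthread_t t[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);
    return 0;
}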

Memory-bandwidth-bound kernels (that are nice and parallel) should benefit too, in my imagination (as in, I have not started to do multi-GPU yet; that will come next year).

I already ordered 8 HDDs (RAID 0, x4 or x8) to improve my disk I/O. My plan is to use two processes to double performance. Correct me if I’m wrong: in CUDA, one GPU will hold all the resources of one CPU, so I can’t get any benefit from running two processes against one GPU on my machine. So I think dual GPUs are the best solution. What do you think?

If only 10% of your time is spent on the GPU, then forget waiting for the GTX295 and start figuring out how to speed up all the other stuff. Two cards won’t help much. Two processes that each use only 10% of the GPU should share one GPU pretty well.

An SSD will be much faster than RAID 0 if you don’t need a lot of storage. Of course, simply adding RAM will speed up disk I/O as well, since the OS will use it for caching.

I guess I should have clarified: I meant inter-GPU I/O in that post. If an algorithm is memory bound on each GPU but requires very little inter-GPU communication, then you will of course still get good scaling.

Does it scale well for HOOMD? I can imagine some parts will have not-so-nice properties.

All the individual calculations scale nicely. The problem is that the overall iterative steps require the data updated on each GPU to be communicated to the others before the next iteration. Since it is all iterative, one step after the other, there are no opportunities for overlapping memory transfers and computation. HOOMD on 2 GPUs is about 1.3 to 1.4 times faster than on a single GPU, assuming you have very good H ↔ D transfer rates.
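
For the curious, one iteration of that exchange looks roughly like the sketch below, written from the host thread that owns GPU 0 (GPU 1’s thread mirrors it). All names here are hypothetical placeholders, and the data has to bounce through host memory since the two GPUs cannot copy to each other directly:

/* One iteration of a two-GPU data exchange (sketch, hypothetical names). */
launch_compute_step();                           /* kernel(s) for my half of the data */
cudaMemcpy(h_my_boundary, d_my_boundary, boundary_bytes,
           cudaMemcpyDeviceToHost);              /* D->H: publish my updated boundary */
barrier_wait(&iteration_barrier);                /* wait until GPU 1's thread has done the same */
cudaMemcpy(d_their_boundary, h_their_boundary, boundary_bytes,
           cudaMemcpyHostToDevice);              /* H->D: pull in their updated boundary */
/* The next step depends on both copies having finished, so there is
 * nothing to overlap them with -- hence the 1.3-1.4x ceiling. */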