Best solution for maximizing bandwidth? More than 5.7 GB/s H->D bandwidth without Tesla

Hi all,
I’m working on a CUDA project and already got around a 20x speedup. But I still want to get more benefit from CUDA :rolleyes:

Currently, my H->D bandwidth is only 1.9 GB/s (pinned), so I want to buy better hardware to speed it up. I once read a topic in this forum where someone got 5.7 GB/s (pinned) H->D bandwidth with an Intel X58, but I also know that on Tesla it’s easy to get speeds around 70 GB/s (it’s too expensive for me). So my questions are:

  1. What’s the best solution for maximizing bandwidth, other than Tesla?
  2. Why can Tesla get ~70 GB/s bandwidth? Can someone give me a detailed technical article about this?

Thanks!

You’re mixing up several different bandwidths here.

H->D and D->H bandwidths are up to 6 GiB/s on motherboards with fast DRAM.
D->D bandwidth is up to 140 GB/s (theoretical) on the GTX280. On Tesla the device memory bandwidth is actually a bit lower, around 102 GB/s.
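
If you want to measure the H->D number on your own box, here is a minimal sketch of a pinned-memory bandwidth test using the CUDA runtime API (same idea as the bandwidthTest sample in the SDK; the 64 MiB buffer size and 50 iterations are arbitrary choices of mine):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t bytes = 64 << 20;   /* 64 MiB test buffer */
    const int    iters = 50;

    float *h_buf, *d_buf;
    cudaMallocHost((void **)&h_buf, bytes);  /* pinned (page-locked) host memory */
    cudaMalloc((void **)&d_buf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("H->D pinned: %.2f GB/s\n",
           (double)bytes * iters / (ms / 1000.0) / 1e9);

    cudaFreeHost(h_buf);
    cudaFree(d_buf);
    return 0;
}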

You are correct. There is no magic on Tesla; I think I was just misled by some articles. In fact, 70 GB/s of H->D bandwidth was beyond my understanding.

Thanks a lot!

No, DRAM speed doesn’t matter for pinned memory.

On P45, what matters is the FSB. With a 1333 MHz FSB, you get 5.5 GB/s even with 800 MHz DDR2. With an 800 MHz FSB, you get 4 GB/s.

On X58, I’m not sure exactly, but I think it’s fast any way you slice it. Also, even non-pinned memory hits >5 GB/s, which is very impressive.

I’m not sure about ATI boards; I think some do all right, some do really poorly. Is your 2 GB/s pinned figure on an ATI board?

On X58 it is in fact memory speed that matters: DRAM clock, number of channels, and QPI speed.

Yes, someone in this forum got >5 GB/s (pageable) on X58, but he got only around 5.7 GB/s for pinned memory. Why is the gain so small?

In fact, my motherboard is a very old Dell model; 2 GB/s is already a surprise to me. That’s why I need to buy a new one. It seems X58 is the best choice. What do you think?

As an NV employee, do you know whether NV has any plan to solve this bottleneck? A new motherboard chipset for CUDA? I think there are a million people like me looking forward to it.

Because the overhead of pageable memory is mostly DRAM → DRAM copies: when those copies are very fast compared to PCI-E v2 speeds, the gain from pinned memory shrinks.
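
To see that overhead directly, rerun the same kind of test with a plain malloc’ed buffer instead of a pinned one; the driver then has to stage the data through its own internal pinned buffer, which is exactly the extra DRAM → DRAM copy mentioned above. A minimal sketch (a fragment reusing d_buf, bytes, and iters from the test earlier in the thread, plus <stdlib.h> for malloc):

/* Pageable variant: cudaMemcpy from a malloc'ed buffer forces the
 * driver to copy through an internal pinned staging buffer, so the
 * extra DRAM -> DRAM hop shows up as lower reported bandwidth. */
float *h_pageable = (float *)malloc(bytes);
for (int i = 0; i < iters; ++i)
    cudaMemcpy(d_buf, h_pageable, bytes, cudaMemcpyHostToDevice);
free(h_pageable);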

Thanks. BTW, do you think X58 + 9800GX2 is the best solution currently?

No. If you want to go X2 (i.e., do the work of managing two GPUs), then just wait for the GTX295. It’ll be out in a month.

Thanks, I know the GTX295 is the best solution. But my bottleneck is shifting from computation to disk/memory I/O, so I can’t get a big improvement from the GTX295.

BTW, if my boss doesn’t care about the budget, I will buy one.

It should be about $450, which is less than you’ll pay for the i7 + X58 + DDR3 setup.

What do you mean, disk I/O is your bottleneck? And why get a 9800GX2 instead of a GTX260?

$450? Sounds great. I have discussed it with my boss, and he prefers a more powerful GPU, so maybe GTX295(s) are on their way :rolleyes:

In my project, disk I/O takes up more than 40% of the runtime, GPU computation takes around 10%, and the rest is hard to tune.

Why do I prefer dual GPUs over a more powerful single GPU? Please correct me if my understanding is wrong: CUDA suggests using a CPU-GPU pair for maximum performance, which means one GPU will occupy almost all the resources of one CPU. In my testing, dual CPU + single GPU == single CPU + single GPU. BTW, I read a topic in this forum where someone already used a 9800GX2 to double their performance. So…

Using dual GPUs will only double performance if you are almost completely compute bound with very little I/O. My program, for instance, which is otherwise capable of running the entire calculation on the GPU, requires a lot of communication between the two GPUs in dual-GPU mode. Because of the extreme slowness of this communication, two GPUs are only 1.4x faster than one.

If your app is 40% disk I/O, you will get a better return on your investment by putting that money into a RAID array or a fast SSD, or otherwise investing your time in optimizing the I/O portion of your code. Going dual GPU is not a magic bullet, and it actually opens up a lot of new difficulties in programming.
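
To give a flavor of what “managing 2 GPUs” means in practice: with the current 2.x toolkits a CUDA context is tied to one host thread, so the usual pattern is one host thread per GPU, each calling cudaSetDevice() before any other CUDA work. A minimal pthreads sketch, where the commented-out body stands in for your real allocations, copies, and kernel launches:

#include <pthread.h>
#include <cuda_runtime.h>

static void *worker(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);        /* bind this host thread to one GPU */
    /* ... allocate, copy, and launch kernels for this GPU's share ... */
    cudaThreadSynchronize();   /* wait for this GPU's work to finish */
    return NULL;
}

int main(void)
{
    int ids[2] = { 0, 1 };
    pthread_t t[2];
    for (int i = 0; i < 2; ++i)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 2; ++i)
        pthread_join(t[i], NULL);
    return 0;
}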

Memory-bandwidth-bound kernels (that are nice and parallel) should benefit too, in my imagination (as in, I have not started to do multi-GPU yet; that will come next year).

I already ordered 8 HDDs (RAID 0, x4 or x8) to improve my disk I/O. My plan is to use two processes to double performance. Correct me if I’m wrong: in CUDA, one GPU will hold all the resources of one CPU, so I can’t get any benefit from running two processes against one GPU on my machine. So I think dual GPUs are the best solution. What do you think?

If only 10% of your time is spent on the GPU, then forget waiting for the GTX295 and start figuring out how to speed up all the other stuff. Two cards won’t help much. Two processes that each use only 10% of the GPU should share one GPU pretty well.

An SSD will be much faster than RAID 0 if you don’t need a lot of storage. Of course, simply adding RAM will speed up disk I/O as well, since the OS will use it for caching.

I guess I should have clarified: I meant inter-GPU I/O in that post. If an algorithm is memory bound on each GPU but requires very little inter-GPU communication, then you will of course still get good scaling.

Does it scale well for HOOMD? I can imagine some parts will have not-so-nice properties.

All the individual calculations scale nicely. The problem is that the overall iterative steps require the data updated on each GPU to be communicated to the others before the next iteration. Since it is all iterative, one step after the other, there are no opportunities for overlapping memory transfers and computation. HOOMD on 2 GPUs is about 1.3 to 1.4 times faster than on a single GPU, assuming you have very good H ↔ D transfer rates.
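
For the curious, one iteration of that exchange looks roughly like the sketch below, written from the host thread that owns GPU 0 (GPU 1’s thread mirrors it). All names here are hypothetical placeholders, and the data has to bounce through host memory since the two GPUs cannot copy to each other directly:

/* One iteration of a two-GPU data exchange (sketch, hypothetical names). */
launch_compute_step();                           /* kernel(s) for my half of the data */
cudaMemcpy(h_my_boundary, d_my_boundary, boundary_bytes,
           cudaMemcpyDeviceToHost);              /* D->H: publish my updated boundary */
barrier_wait(&iteration_barrier);                /* wait until GPU 1's thread has done the same */
cudaMemcpy(d_their_boundary, h_their_boundary, boundary_bytes,
           cudaMemcpyHostToDevice);              /* H->D: pull in their updated boundary */
/* The next step depends on both copies having finished, so there is
 * nothing to overlap them with -- hence the 1.3-1.4x ceiling. */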