Best solution for maximizing bandwidth? More than 5.7 GB/s H->D bandwidth except Tesla

Just to show that I have no background at all in HPC :">

How does that compare to the scaling achieved in clusters? My guess is that this is not too bad compared to having two computers connected to each other by a fast network.

I think you would also benefit from D1 → D2 transfers without a round trip through CPU memory. I think it is time for a new wishlist post ;)

Sounds like you could run 2 jobs on 4 cards with good efficiency. (Twice as fast as 2 jobs on 2 cards.) Also, how are you doing your D1->D2 copies? Streaming through a small buffer will be faster than copying it all in and then copying it all back. (Almost like a straight PCIe transfer.)
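By "streaming" I mean pushing the data through a pair of small pinned staging buffers in chunks, so the D2H of one chunk overlaps the H2D of the previous one. A rough sketch of the idea (function name, chunk size, and the ping-pong scheme are all placeholders, and it uses today's runtime API where one host thread can drive both devices; older CUDA would need one host thread per GPU):

```cuda
// Sketch: stream a GPU-to-GPU copy through two small pinned staging buffers
// instead of one big D2H followed by one big H2D. While chunk i is being
// pushed to the destination GPU, chunk i+1 is already being pulled from the
// source GPU, so the whole thing behaves almost like one straight PCIe copy.
#include <cuda_runtime.h>

void streamedCopyD2D(void* dst, int dstDev, const void* src, int srcDev,
                     size_t nBytes, size_t chunk /* e.g. 64 KiB, tune it */)
{
    char*        staging[2];                 // ping-pong pinned host buffers
    cudaStream_t pull, push;                 // one stream per direction
    cudaEvent_t  reusable[2];                // signals "this buffer is free again"

    cudaHostAlloc((void**)&staging[0], chunk, cudaHostAllocPortable);
    cudaHostAlloc((void**)&staging[1], chunk, cudaHostAllocPortable);

    cudaSetDevice(srcDev);
    cudaStreamCreate(&pull);
    cudaSetDevice(dstDev);
    cudaStreamCreate(&push);
    cudaEventCreate(&reusable[0]);  cudaEventRecord(reusable[0], push);
    cudaEventCreate(&reusable[1]);  cudaEventRecord(reusable[1], push);

    for (size_t off = 0; off < nBytes; off += chunk) {
        size_t n   = (off + chunk <= nBytes) ? chunk : nBytes - off;
        int    buf = (off / chunk) & 1;

        cudaEventSynchronize(reusable[buf]); // previous H2D out of this buffer is done

        cudaSetDevice(srcDev);               // pull chunk: source device -> pinned host
        cudaMemcpyAsync(staging[buf], (const char*)src + off, n,
                        cudaMemcpyDeviceToHost, pull);
        cudaStreamSynchronize(pull);

        cudaSetDevice(dstDev);               // push chunk: pinned host -> destination device
        cudaMemcpyAsync((char*)dst + off, staging[buf], n,
                        cudaMemcpyHostToDevice, push);
        cudaEventRecord(reusable[buf], push);
        // no sync here: the next pull overlaps with this push
    }
    cudaSetDevice(dstDev);
    cudaStreamSynchronize(push);

    cudaSetDevice(srcDev);  cudaStreamDestroy(pull);
    cudaSetDevice(dstDev);  cudaStreamDestroy(push);
    cudaEventDestroy(reusable[0]);  cudaEventDestroy(reusable[1]);
    cudaFreeHost(staging[0]);       cudaFreeHost(staging[1]);
}
```

Whether the per-chunk overhead leaves any overlap to win on transfers as small as yours is a separate question, of course.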

MD on clusters scales much better, because the ratio of compute resources to memory transfer resources (InfiniBand) is much better (or worse, depending on how you look at it). The scaling is very nearly perfectly linear, as long as you have more than about 500 particles per CPU core.

Absolutely! This would immediately halve the amount of data I need to move over PCIe.

I will also get a decent benefit when pinned memory gets fast transfers to all GPUs, a feature that is already in the pipeline.

Hmm. I hadn’t thought about streaming. I’m just copying the whole thing in one chunk. How would the streaming help?

To give a little more detail, what it boils down to is that I have an array of N float4 elements on each GPU (say, N=64000). N/2 are updated on each GPU during the iteration, and then I need to share the updates between the GPUs before the next iteration can start. So: elements 0-31999 get copied to the host from GPU 0 and elements 32000-63999 get copied to the host from GPU 1 (these calls are made in parallel). After those memcpys are complete, the host-to-device memcpy calls are made, copying elements 32000-63999 from the host to GPU 0 and 0-31999 to GPU 1.
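In rough code form, the per-iteration exchange looks something like this (d_pos, h_exchange and stream are just placeholder names, and the sketch drives both GPUs from one host thread rather than the one-thread-per-GPU setup the code actually uses):

```cuda
// Sketch of the exchange described above. d_pos[g] is the full N-element
// array resident on GPU g; h_exchange is an N-element pinned host buffer
// (cudaHostAllocPortable, so both contexts see it as pinned).
#include <cuda_runtime.h>

#define N    64000
#define HALF (N / 2)

extern float4*      d_pos[2];      // device arrays, one per GPU
extern float4*      h_exchange;    // pinned host staging buffer, N elements
extern cudaStream_t stream[2];     // one stream per GPU

void exchangeHalves(void)
{
    // Each GPU updated its own half; first pull both halves to the host in parallel.
    cudaSetDevice(0);                                  // GPU 0 owns elements 0 .. N/2-1
    cudaMemcpyAsync(h_exchange, d_pos[0],
                    HALF * sizeof(float4), cudaMemcpyDeviceToHost, stream[0]);
    cudaSetDevice(1);                                  // GPU 1 owns elements N/2 .. N-1
    cudaMemcpyAsync(h_exchange + HALF, d_pos[1] + HALF,
                    HALF * sizeof(float4), cudaMemcpyDeviceToHost, stream[1]);

    cudaSetDevice(0);  cudaStreamSynchronize(stream[0]);
    cudaSetDevice(1);  cudaStreamSynchronize(stream[1]);

    // Then push each half to the GPU that did not compute it.
    cudaSetDevice(0);                                  // GPU 0 receives N/2 .. N-1
    cudaMemcpyAsync(d_pos[0] + HALF, h_exchange + HALF,
                    HALF * sizeof(float4), cudaMemcpyHostToDevice, stream[0]);
    cudaSetDevice(1);                                  // GPU 1 receives 0 .. N/2-1
    cudaMemcpyAsync(d_pos[1], h_exchange,
                    HALF * sizeof(float4), cudaMemcpyHostToDevice, stream[1]);

    cudaSetDevice(0);  cudaStreamSynchronize(stream[0]);
    cudaSetDevice(1);  cudaStreamSynchronize(stream[1]);
}
```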

So the total memory copied is only ~1/2 MiB from one GPU to the other and vice versa each step. Would streaming help with such a small transfer? The total PCIe traffic is ~2 MiB. At the obtainable 2 GiB/s (pinned for fast transfers to/from GPU 0, paged transfers to/from GPU 1), that works out to about 1 millisecond of communication per iteration. The computations for the iteration (say, running on 2 GTX 280s) take only 2 milliseconds, so you can see the communication is a sizable fraction of that.
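Spelling that arithmetic out (16 bytes per float4):

$$
\underbrace{2 \cdot \tfrac{N}{2} \cdot 16\,\mathrm{B}}_{\text{D2H}} \;+\; \underbrace{2 \cdot \tfrac{N}{2} \cdot 16\,\mathrm{B}}_{\text{H2D}} \;=\; 4 \cdot 32000 \cdot 16\,\mathrm{B} \;\approx\; 2\,\mathrm{MiB}, \qquad \frac{2\,\mathrm{MiB}}{2\,\mathrm{GiB/s}} \approx 1\,\mathrm{ms}.
$$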

And if you are really curious: no, not all N/2 elements need to be copied over. But filtering out only those that need to be copied is an expensive stream compaction operation that takes up almost as much time as it saves by reducing the PCIe transfers. The massive additional code complexity to track what needs to be copied wasn’t worth it. If I can just optimize the simple block memory copy a little bit more, I’ll get the same benefit without the complexity.
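For the curious, the compaction I'm talking about is essentially a copy_if: keep only the elements flagged as changed and ship just those. A minimal sketch with Thrust (the dirty-flag array, how it gets filled, and the function name are hypothetical; this isn't the code I benchmarked):

```cuda
// Sketch of the stream-compaction alternative: gather only the "dirty"
// elements into a contiguous buffer before the PCIe copy.
#include <cuda_runtime.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>

// Predicate over the dirty-flag stencil: keep an element when its flag is set.
struct NonZero {
    __host__ __device__ bool operator()(int flag) const { return flag != 0; }
};

// d_pos:    the N/2 elements this GPU just updated
// d_dirty:  one int per element, nonzero if the element actually changed (hypothetical)
// d_packed: output buffer for the compacted elements
// Returns how many elements would actually have to cross PCIe.
int compactDirty(const thrust::device_vector<float4>& d_pos,
                 const thrust::device_vector<int>&    d_dirty,
                 thrust::device_vector<float4>&       d_packed)
{
    thrust::device_vector<float4>::iterator end =
        thrust::copy_if(d_pos.begin(), d_pos.end(),    // values
                        d_dirty.begin(),               // stencil
                        d_packed.begin(), NonZero());  // keep where flag != 0
    return (int)(end - d_packed.begin());
}
```

And that only covers the sending side: the receiving GPU still needs to know which indices the packed elements belong to, which is exactly the bookkeeping that wasn't worth it.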

With just a half megabyte of data, you're right that you haven't got much meat to work with. But I can't say for sure, because it seems bandwidthTest isn't too accurate, especially at the lower end of the scale. On my system it says I should only be getting 0.5 GB/s, not 2 GB/s.

A few more ideas: maybe putting GPU 1 on pinned memory and handling the CPU->CPU transfer manually will bring a slight boost. If your floats are all bounded, you can compress them into 16-bit fixed-point numbers (losing 8 bits of precision).
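For the 16-bit idea, a rough sketch of what pack/unpack kernels could look like, assuming every component is known to stay within some bound (BOUND, the kernel names, and the plain truncation are all made up for illustration):

```cuda
// Sketch: compress bounded float4 positions into 16-bit fixed point before
// the PCIe copy, halving the bytes transferred, and expand them afterwards.
#include <cuda_runtime.h>

#define BOUND 100.0f                     // assumed bound on |x|, |y|, |z|, |w|
#define SCALE (32767.0f / BOUND)         // map [-BOUND, BOUND] onto [-32767, 32767]

__global__ void packFixed16(const float4* in, short4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = in[i];
        out[i] = make_short4((short)(v.x * SCALE), (short)(v.y * SCALE),
                             (short)(v.z * SCALE), (short)(v.w * SCALE));
    }
}

__global__ void unpackFixed16(const short4* in, float4* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        short4 v = in[i];
        out[i] = make_float4(v.x / SCALE, v.y / SCALE, v.z / SCALE, v.w / SCALE);
    }
}
```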

Ultimately, I think you’re overlooking the obvious. You need to increase the number of atoms in your benchmarking simulations. I see that this is what helps scaling the most.

Also, your code doesn’t quite follow Amdahl’s law. If you move to 4 GPUs, time spent transferring data should be less than with 2 GPUs. Even with 8 GPUs, time spent transferring data might not exceed time spent processing it. This is the bright side of your algorithm. With a larger atom count, that point might not be reached until you have a respectable cluster. In either case, whenever transfer time does not exceed processing time, you can run 2 simulations at once to double the utilization of your hardware capital. (You might have to code this capability in directly, because automatic GPU sharing via multiple contexts doesn’t work well at all.)

Oh yeah, of course. As long as you process stuff slowly, your communication is not that much of a bottleneck. But fortunately we can say we have a luxury problem ;)

Ah, yes. That is indeed already a partial solution. A Core i7 would also help you a lot at this point, given that pageable memory is not that much worse than pinned on that platform.