Bandwidth problems with Fermi-based GPUs (GTX 480, GTX 460)

I’m getting some pretty slow device-to-device memory bandwidth running on Fermi-based hardware.
Below are the results of bandwidthTest.
Summary: the GTX 280 is reported as having faster device-to-device memory transfers than either a GTX 460 or a GTX 480. That can’t be right. The GTX 460 (1 GB version) reaches roughly half its theoretical maximum, the GTX 480 falls almost 60 GB/s short of its theoretical maximum, while the GTX 280 nearly achieves its theoretical maximum.

Driver details:
GTX 480 is using 256.44 release driver
GTX 460 didn’t seem to work with 256.44, so I installed 256.40 development driver
GTX 280 is using 195.36.15 development driver
I also noticed somewhat slower PCI bandwidth on the GTX 460, but that’s not my concern at the moment; the slow device memory is.

Device 0: GeForce GTX 460
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4081.2

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4657.4

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 59033.4

Device 0: GeForce GTX 480
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5284.3

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5001.9

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 118115.3

Device 0: GeForce GTX 280
Quick Mode

Host to Device Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 5290.2

Device to Host Bandwidth, 1 Device(s), Paged memory
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 4224.9

Device to Device Bandwidth, 1 Device(s)
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 122501.0

I’m getting very similar results with my GTX 480. But when running a simple, bandwidth-limited test kernel (copying data from array A to array B) I see a device-to-device bandwidth that is very close to the theoretical maximum. So maybe it’s just the bandwidth test? How do your kernels perform?
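A bandwidth-limited copy kernel along those lines might look like the sketch below. This is not the poster’s actual code; the array size, block size, and event-based timing are my assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Copy one float per thread from a to b; with enough threads in flight
// this is purely memory-bandwidth bound.
__global__ void copyKernel(const float *a, float *b, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        b[i] = a[i];
}

int main()
{
    const int n = 32 * 1024 * 1024;          // 32M floats = 128 MB per array
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copyKernel<<<(n + 255) / 256, 256>>>(a, b, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // Each element is read once and written once, hence the factor of 2.
    double gbps = 2.0 * n * sizeof(float) / (ms / 1000.0) / 1e9;
    printf("effective bandwidth: %.1f GB/s\n", gbps);
    return 0;
}
```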

My kernels are neither compute- nor bandwidth-bound (1000 256K FFTs/IFFTs per second on 500 MB of data, plus some conversion routines and a few other kernels), so I can’t comment on whether I’m getting the bandwidth I expect. As for performance, I’m not quite getting twice the throughput on my GTX 480 that I was on my GTX 280.

Using a basic test application (just a batch of 256K FFTs/IFFTs using cuFFT, without any data transfer to/from the GPU) I get around an 84% performance increase (GTX 280: 813 ms, GTX 480: 442 ms).

I just found it odd that bandwidthTest reports LESS device-to-device bandwidth for the 480 when its theoretical maximum is higher. Perhaps the narrower memory interface of the GTX 460 and GTX 480 is to blame, but for a 32 MB transfer I wouldn’t expect that to matter. Or, as you pointed out, perhaps it’s simply an artifact of bandwidthTest.

Dan,

I see a very similar result with a machine that has a GTX 260 and a 1 GB GTX 460 installed. The GTX 460 shows a device-to-device transfer rate of 59 GB/s and the GTX 260 100 GB/s. Looking at the source code, all the test does is a number of cudaMemcpy calls using two device pointers. Would anyone care to explain why we see such poor results from the Fermi-series GPUs when doing such transfers, compared with the G200 series?

Kind Regards,

Shane Cook.
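The device-to-device measurement Shane describes boils down to something like the sketch below. It is based on his description, not the actual SDK source; the repetition count and timing details are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32 * 1024 * 1024;   // matches the 33554432-byte test size
    const int reps = 10;
    void *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    // The reported figure counts both the read and the write, hence 2x.
    double gbps = 2.0 * bytes * reps / (ms / 1000.0) / 1e9;
    printf("device-to-device bandwidth: %.1f GB/s\n", gbps);
    return 0;
}
```

If a hand-written copy kernel reaches near-theoretical bandwidth while this cudaMemcpy loop does not, that would point at the copy path rather than the memory subsystem.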

I assume this is the same issue recently discussed in the following forum thread:

I attached source code for a small bandwidth test app to one of my posts in that thread, you might want to give that a try.