Tesla C1060 Memory Bandwidth

I used bandwidthTest in the SDK to test the new C1060 I just bought. The device-to-device memory bandwidth is about 74 GB/s, which is quite different from the one in the spec (102 GB/s). I am wondering what may cause this difference. I am using CUDA 2.3 and Windows 7 64-bit.

Thanks!

I get the same result with CUDA 2.3 on Ubuntu 9.04 64-bit:

stefano@rampage:~/project$ ~/NVIDIA_GPU_Computing_SDK/C/bin/linux/release/bandwidthTest
Running on......
      device 0: Tesla C1060

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               5426.9

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               4758.4

Quick Mode
Device to Device Bandwidth
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               73475.7

&&&& Test PASSED

102 GB/s is marketing/peak performance :)

74 GB/s is good :)

eyal

Even so, it’s discouraging that even NVIDIA’s own memory bandwidth tool achieves only 74 GB/s.

How is the bandwidth measured from device to device? Might it be that reading and writing together are measured, instead of the bandwidth in one direction (reading from memory)?

How much bandwidth do you get with a simple linear access pattern like device_memory[threadIdx.x]?
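Not from the thread, but a minimal sketch of how one might time such a linear pattern; the kernel name, buffer size, and launch configuration are assumptions for illustration:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread handles one element, so warps issue
// fully coalesced, linear loads/stores -- the device_memory[threadIdx.x]
// pattern extended across the whole grid.
__global__ void linearCopy(const float *in, float *out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const size_t n = 1 << 22;                 // 4M floats = 16 MB per buffer
    float *in, *out;                          // contents don't matter for timing
    cudaMalloc((void **)&in,  n * sizeof(float));
    cudaMalloc((void **)&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int iters = 100;
    cudaEventRecord(start, 0);
    for (int k = 0; k < iters; ++k)
        linearCopy<<<(unsigned)((n + 255) / 256), 256>>>(in, out, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each iteration reads n floats and writes n floats: count both directions.
    double gbps = 2.0 * n * sizeof(float) * iters / (ms / 1e3) / 1e9;
    printf("linear copy bandwidth: %.1f GB/s\n", gbps);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(in);
    cudaFree(out);
    return 0;
}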

If you run the STREAM benchmark, you can see 82 GB/s.

Device Selected 1: “Tesla C1060”
STREAM Benchmark implementation in CUDA
Array size (double precision) = 6000000
Using 384 threads per block, 15625 blocks

Function    Rate (MB/s)    Avg time    Min time    Max time
Copy:       82258.0560     0.0012      0.0012      0.0012
Scale:      82123.8393     0.0012      0.0012      0.0012
Add:        82006.7585     0.0018      0.0018      0.0018
Triad:      82006.7585     0.0018      0.0018      0.0018
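For context on what those four functions do: each is a one-line element-wise kernel. A sketch of Triad (names assumed, not the benchmark’s actual source):

// Sketch of the STREAM Triad operation: a[i] = b[i] + scalar * c[i].
// Two reads plus one write move 24 bytes per double-precision element,
// which is how the MB/s rates above are derived from the timings.
__global__ void triad(double *a, const double *b, const double *c,
                      double scalar, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) a[i] = b[i] + scalar * c[i];
}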

First, bandwidthTest reports GiB/s, not GB/s: 102 GB/s = 95 GiB/s.

In my experience, the sustained RAM bandwidth is usually about 2/3 of the advertised value. Why? Because the advertised number is the burst speed, achieved only when the memory chips transmit on every cycle:

1600 MHz * 8 bytes/bank * 8 banks = 102.4 GB/s ≈ 95 GiB/s

Current memory technologies use burst mode to read 4 or 8 adjacent memory cells in the same row, to avoid paying the column-address latency multiple times. This obviously can’t be sustained, and switching to a different DRAM row takes even longer.

If you look at figure 18 here, it shows 5 cycles of read latency followed by 4 cycles of actual data output (44% of peak bandwidth). Figure 19 shows the best case, where you’re able to completely overlap the column-addressing latency.
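Spelling out that 44% figure: with 5 latency cycles followed by 4 data cycles, only

4 / (5 + 4) = 4/9 ≈ 44%

of the cycles actually carry data.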

Not quite sure I understand how you come to this conclusion. I would compute the theoretical peak memory bandwidth as

0.8 GHz memory clock * 2 (DDR memory) * 512-bit memory bus / 8 (since we want bytes),

which would give 102.4 GiB/s.


“which would give 102.4 GiB/s”
No, it’s 95 GiB/s

The definitions of GHz & GiB are:

1 GHz = 10^9 Hz
1 GiB = 2^30 bytes
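Worked through with those definitions:

0.8 * 10^9 Hz * 2 * (512 / 8) bytes = 102.4 * 10^9 B/s
102.4 * 10^9 B/s / 2^30 B/GiB ≈ 95.4 GiB/s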


My bad, I didn’t realize that GHz is of course decimal-based. Thanks for clarifying!
