I used bandwidthTest in the SDK to test the new C1060 I just bought. The device-to-device memory bandwidth is about 74 GB/s, which is quite different from the number in the spec (102 GB/s). I am wondering what may cause this difference? I am using CUDA 2.3 and Windows 7 64-bit.
If you run the STREAM benchmark, you can see 82 GB/s.
Device Selected 1: “Tesla C1060”
STREAM Benchmark implementation in CUDA
Array size (double precision)=6000000
using 384 threads per block, 15625 blocks
Function     Rate (MB/s)    Avg time   Min time   Max time
Copy:         82258.0560      0.0012     0.0012     0.0012
Scale:        82123.8393      0.0012     0.0012     0.0012
Add:          82006.7585      0.0018     0.0018     0.0018
Triad:        82006.7585      0.0018     0.0018     0.0018
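For reference, the Triad line above corresponds to a kernel of the form a[i] = b[i] + scalar*c[i]. Below is a minimal sketch of such a kernel, assuming the launch configuration printed above (15625 blocks of 384 threads for 6,000,000 doubles); the actual CUDA STREAM port may be structured differently:

// Minimal sketch of a double-precision STREAM Triad kernel.
// Illustrates the memory access pattern only; the real CUDA STREAM
// benchmark also times the kernel and verifies the results.
__global__ void stream_triad(const double *b, const double *c,
                             double *a, double scalar, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = b[i] + scalar * c[i];   // 2 reads + 1 write per element
}

// Launch matching the configuration above (15625 * 384 = 6000000):
//   stream_triad<<<15625, 384>>>(d_b, d_c, d_a, 3.0, 6000000);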
First, bandwidthTest reports GiB/s, not GB/s: 102 GB/s = 95 GiB/s.
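To see the two conventions side by side, here is a minimal sketch of a device-to-device copy timed with CUDA events, printing the same measurement in GB/s (10^9 bytes/s) and GiB/s (2^30 bytes/s). This is not the SDK's bandwidthTest source; the 32 MiB buffer size and 100 repetitions are arbitrary choices, and each byte is counted twice (read plus write), which is the usual accounting for a device-to-device copy:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 32u << 20;   // 32 MiB per buffer (arbitrary)
    const int    reps  = 100;
    char *src, *dst;
    cudaMalloc((void**)&src, bytes);
    cudaMalloc((void**)&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Count each byte twice: it is read from one buffer and written to the other.
    double moved = 2.0 * (double)bytes * reps;
    double secs  = ms / 1000.0;
    printf("%.1f GB/s = %.1f GiB/s\n",
           moved / secs / 1e9,
           moved / secs / (1024.0 * 1024.0 * 1024.0));
    return 0;
}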
In my experience, the sustained RAM bandwidth is usually about 2/3 of the advertised value. Why? Because the advertised number is the burst speed, i.e. what you get when the memory chips transmit data on every cycle:
1600 MHz * 8 bytes/bank * 8 banks = 102.4 GB/s ≈ 95 GiB/s
Current memory technologies use burst mode to read 4 or 8 adjacent memory cells in the same row, which avoids paying the column-address latency more than once. This obviously cannot be sustained indefinitely, and switching to a different DRAM row takes even longer.
If you look at figure 18 here, it shows 5 cycles of read latency followed by 4 cycles of actual data output, i.e. data on only 4 of every 9 cycles (about 44% of peak bandwidth). Figure 19 shows the best case, where the column-addressing latency is completely overlapped with data transfer.
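Putting those numbers together as a quick back-of-the-envelope check (the 1600 MT/s data rate and 64-byte bus width are the published C1060 figures; the 5-cycle latency / 4-cycle burst counts are just the ones read off the figures mentioned above):

#include <cstdio>

int main()
{
    // Theoretical peak: 1600 MT/s * 64 bytes per transfer (8 banks * 8 bytes/bank).
    double peak = 1600e6 * 8.0 * 8.0;
    printf("peak: %.1f GB/s = %.1f GiB/s\n",
           peak / 1e9, peak / (1024.0 * 1024.0 * 1024.0));

    // Worst case from the timing diagram: 5 latency cycles per 4 data cycles,
    // so data moves on only 4 of every 9 cycles.
    printf("worst-case fraction of peak: %.0f%%\n", 4.0 / 9.0 * 100.0);
    return 0;
}

This prints 102.4 GB/s (95.4 GiB/s) for the peak and 44% for the worst case, matching the numbers quoted above.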