Those are memory bandwidth numbers. The bandwidth test you are running measures PCI-e transfer bandwidth (for the host-to-device and device-to-host numbers). These are two completely different things. The theoretical bandwidth limit is 4 GB/s for a 16-lane PCI-e 1.0 bus and 8 GB/s for a 16-lane PCI-e 2.0 bus.
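For reference, the PCI-e limits quoted above can be reproduced with a little arithmetic. This is only a sketch: it assumes the published per-lane signalling rates (2.5 GT/s for PCI-e 1.0, 5 GT/s for 2.0) and 8b/10b line encoding, and the function name `pcie_peak_bytes_per_s` is made up for illustration. The result is per direction.

```python
def pcie_peak_bytes_per_s(transfer_rate_gt_s, lanes, payload_bits=8, line_bits=10):
    """Theoretical one-direction PCI-e bandwidth in bytes per second.

    Assumes 8b/10b encoding (PCI-e 1.x and 2.x): every 10 bits on the
    wire carry 8 bits of payload.
    """
    raw_bits_per_s = transfer_rate_gt_s * 1e9 * lanes
    payload_bits_per_s = raw_bits_per_s * payload_bits / line_bits
    return payload_bits_per_s / 8  # bits -> bytes


gen1_x16 = pcie_peak_bytes_per_s(2.5, 16)  # PCI-e 1.0, 16 lanes
gen2_x16 = pcie_peak_bytes_per_s(5.0, 16)  # PCI-e 2.0, 16 lanes

print(f"PCI-e 1.0 x16: {gen1_x16 / 1e9:.1f} GB/s")
print(f"PCI-e 2.0 x16: {gen2_x16 / 1e9:.1f} GB/s")
```

Measured host-device numbers will always sit below these values because of protocol overhead and the chipset.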
Dear all:
The GTX295 + Tesla C1060 configuration is O.K. now; the information in the NVIDIA control panel is correct.
I just had to reboot the system. Moreover, bandwidthTest reports
[codebox]-------------------+-------------------+-------------------+-----------------------+
device name        | device 0: GTX295  | device 1: GTX295  | device 2: Tesla C1060 |
-------------------+-------------------+-------------------+-----------------------+
Host to device     | 1140 MB/s         | 1141 MB/s         | 2326 MB/s             |
-------------------+-------------------+-------------------+-----------------------+
device to Host     | 1713 MB/s         | 1720 MB/s         | 1729 MB/s             |
-------------------+-------------------+-------------------+-----------------------+
device to device   | 93792 MB/s        | 91045 MB/s        | 73384 MB/s            |
-------------------+-------------------+-------------------+-----------------------+[/codebox]
"The theoretical bandwidth limit for a 16 lane PCI-e 1.0 bus is 4 GB/s, 8 GB/s for a 16 lane PCI-e 2.0 bus."
I know my machine has much lower PCIe bandwidth than the theoretical value.
What I am concerned about is the device-to-device bandwidth: the GTX295 reaches 80% of its maximum bandwidth and
the Tesla C1060 reaches 72% of its maximum bandwidth. Is this normal?
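The percentages above can be checked roughly as follows. This is a sketch under assumptions: the peak values (111.9 GB/s per GPU for the GTX295, 102 GB/s for the Tesla C1060) are taken from the published specifications and are not part of the bandwidthTest output, and MB is treated as 10^6 bytes. The helper name `fraction_of_peak` is hypothetical.

```python
# Measured device-to-device bandwidth from bandwidthTest, in MB/s.
measured_mb_s = {
    "GTX295 (device 0)": 93792,
    "GTX295 (device 1)": 91045,
    "Tesla C1060": 73384,
}

# ASSUMPTION: theoretical peak memory bandwidth in GB/s from the spec sheets.
assumed_peak_gb_s = {
    "GTX295 (device 0)": 111.9,
    "GTX295 (device 1)": 111.9,
    "Tesla C1060": 102.0,
}


def fraction_of_peak(mb_s, peak_gb_s):
    """Measured bandwidth as a fraction of the assumed theoretical peak."""
    return (mb_s * 1e6) / (peak_gb_s * 1e9)


for name, mb_s in measured_mb_s.items():
    frac = fraction_of_peak(mb_s, assumed_peak_gb_s[name])
    print(f"{name}: {100 * frac:.1f}% of assumed peak")
```

If bandwidthTest counts a MB as 2^20 bytes instead, the fractions come out a few percent higher.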
Also, in order to check the maximum allocation size on the Tesla C1060,
I compute C = A * B, where A, B, and C are square matrices of dimension N.
table 1: test cublasDgemm
[codebox]field description: (time unit: ms)
N = dimension of A, B, C
total size = size(A) + size(B) + size(C) = N^2 * 3 * 8 bytes
CPU: single thread, block version of C = A*B
h2d: data transfer from host to device, h_A → d_A and h_B → d_B
C = A*B in kernel
d2h: data transfer from device to host, d_C → h_C
speedup CPU/(C= A*B in GPU)
-------+------------+---------+------+--------+------+---------+
     N | total size |     CPU |  GPU |    GPU |  GPU | CPU/GPU |
       |       (MB) |    (ms) |  h2d |  C=A*B |  d2h |         |
-------+------------+---------+------+--------+------+---------+
  1024 |         24 |    1094 |    0 |     31 |    0 |    35.3 |
-------+------------+---------+------+--------+------+---------+
  2048 |         96 |    9938 |   31 |    219 |   31 |    45.4 |
-------+------------+---------+------+--------+------+---------+
  4096 |        384 |   82016 |  109 |   1813 |   93 |    45.2 |
-------+------------+---------+------+--------+------+---------+
  8192 |       1536 |  680718 |  421 |  14579 |  375 |    46.7 |
-------+------------+---------+------+--------+------+---------+
 13280 |     4036.5 | 3453313 | 1125 |  72641 | 1031 |    47.5 |
-------+------------+---------+------+--------+------+---------+[/codebox]
So far, I can allocate 4 GB of memory on the Tesla C1060 under Windows XP Pro 64-bit.
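The "total size" column of table 1 and the achieved DGEMM rate can be recomputed from N and the kernel time. This is just a sketch: it assumes the usual 2*N^3 operation count for a square matrix multiply and that MB means 2^20 bytes (which is what reproduces the 24 MB and 4036.5 MB entries). The function names are made up for illustration.

```python
# (N, kernel time in ms) pairs copied from table 1.
rows = [(1024, 31), (2048, 219), (4096, 1813), (8192, 14579), (13280, 72641)]


def total_size_mb(n):
    """Size of A, B and C together: 3 matrices of N*N doubles (8 bytes each)."""
    return 3 * n * n * 8 / 2**20


def dgemm_gflops(n, kernel_ms):
    """Achieved rate, ASSUMING the conventional 2*N^3 flop count for C = A*B."""
    flops = 2 * n**3
    return flops / (kernel_ms * 1e-3) / 1e9


for n, ms in rows:
    print(f"N={n:6d}  size={total_size_mb(n):8.1f} MB  "
          f"rate={dgemm_gflops(n, ms):5.1f} GFLOP/s")
```

For the small sizes the timer resolution (the 0 ms and 31 ms entries) makes the rate quite rough.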
Next, I compare the GTX295 with the Tesla C1060 by testing cublasDgemm.
table 2: GTX295 (one GPU of the two) versus Tesla C1060
time unit: ms; for each N, the first row has the same values as table 1 (Tesla C1060), so the second row is the GTX295
[codebox]-------+------------+------+--------+------+
     N | total size |  GPU |    GPU |  GPU |
       |       (MB) |  h2d |  C=A*B |  d2h |
-------+------------+------+--------+------+
  1024 |         24 |    0 |     31 |    0 |
       |            |   16 |     31 |   16 |
-------+------------+------+--------+------+
  2048 |         96 |   31 |    219 |   31 |
       |            |   62 |    235 |   31 |
-------+------------+------+--------+------+
  4096 |        384 |  109 |   1813 |   93 |
       |            |  219 |   1906 |  109 |
-------+------------+------+--------+------+
  5760 |        760 |  203 |   5062 |  203 |
       |            |  437 |   5282 |  203 |
-------+------------+------+--------+------+[/codebox]
From table 2, the Tesla C1060 is slightly faster than the GTX295 when computing C = A*B,
even though its device-to-device bandwidth is lower than that of the GTX295.
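To put a number on "slightly faster", the kernel times of table 2 can be compared directly. This is a sketch that assumes the first row of each pair is the Tesla C1060 (those rows repeat the values of table 1) and the second row is the GTX295; `extra_time_percent` is a hypothetical helper name.

```python
# N -> (kernel ms on Tesla C1060, kernel ms on GTX295), copied from table 2.
# ASSUMPTION: the first row of each pair is the Tesla C1060.
kernel_ms = {
    1024: (31, 31),
    2048: (219, 235),
    4096: (1813, 1906),
    5760: (5062, 5282),
}


def extra_time_percent(tesla_ms, gtx_ms):
    """How much longer the GTX295 kernel takes, relative to the Tesla C1060."""
    return (gtx_ms / tesla_ms - 1.0) * 100.0


for n, (tesla_ms, gtx_ms) in kernel_ms.items():
    print(f"N={n:5d}: GTX295 takes {extra_time_percent(tesla_ms, gtx_ms):4.1f}% longer")
```

The gap stays within a few percent for every N, much smaller than the difference in device-to-device bandwidth, which suggests that cublasDgemm at these sizes is limited by arithmetic rather than by memory bandwidth.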