GTX295 + Tesla C1060 on xp

Dear all:

I am using the latest driver, 190.38 (with CUDA 2.3), to run a GTX295 + Tesla C1060 on WinXP Pro 64.

Moreover, the GTX295 is connected to two LCDs.

Everything works and I can see three GPUs in the NVIDIA control panel; however, something is strange.

In the NVIDIA control panel, the information is

[codebox]GeForce GTX 295 (GPU 1)
  Processor Clock : 1242 MHz
  Graphics Clock  : 576 MHz
  Memory Clock    : 999 MHz
  (this is the same as SPEC)

GeForce GTX 295 (GPU 2)
  Processor Clock : 1315 MHz
  Graphics Clock  : 610 MHz
  Memory Clock    : 400 MHz   <-- this is only 2/5 of SPEC

Tesla C1060 (GPU 3)
  Processor Clock : 1296 MHz
  Graphics Clock  : 610 MHz
  Memory Clock    : 800 MHz
  (this is the same as SPEC)[/codebox]

Moreover, if I run “bandwidthTest.exe” from the SDK examples, I get

[codebox]-------------------+------------------+------------------+-----------------------+
device name        | device 0: GTX295 | device 1: GTX295 | device 2: Tesla C1060 |
-------------------+------------------+------------------+-----------------------+
Host to device     | 1140.1 MB/s      | 1138.9 MB/s      | 2347.5 MB/s           |
-------------------+------------------+------------------+-----------------------+
device to Host     | 1728 MB/s        | 1726.1 MB/s      | 1717.0 MB/s           |
-------------------+------------------+------------------+-----------------------+
device to device   | 36747.3 MB/s     | 91224.2 MB/s     | 73311.4 MB/s          |
-------------------+------------------+------------------+-----------------------+[/codebox]

Theoretical memory bandwidth of the GTX295 (per GPU) is 111.9 GB/s.

Theoretical memory bandwidth of the Tesla C1060 is 102.4 GB/s.
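As a sanity check on these two figures, here is a minimal Python sketch of how a theoretical memory bandwidth is usually derived from the memory clock. The bus widths (448-bit per GPU on the GTX295, 512-bit on the Tesla C1060) and the factor of 2 for double data rate are assumptions taken from the public specs, not from this thread, and `peak_memory_bandwidth_gb_s` is a hypothetical helper name.

```python
def peak_memory_bandwidth_gb_s(memory_clock_mhz, bus_width_bits, transfers_per_clock=2):
    """Peak DRAM bandwidth in GB/s, with 1 GB = 1e9 bytes.

    transfers_per_clock=2 assumes double-data-rate memory.
    """
    bytes_per_transfer = bus_width_bits / 8
    return memory_clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer / 1e9


# Assumed bus widths (public specs, not stated in this thread).
gtx295_per_gpu = peak_memory_bandwidth_gb_s(999, 448)
tesla_c1060 = peak_memory_bandwidth_gb_s(800, 512)

# Same GTX295 GPU if its memory really ran at the 400 MHz shown in the control panel.
gtx295_at_400mhz = peak_memory_bandwidth_gb_s(400, 448)

print(f"GTX295 per GPU  : {gtx295_per_gpu:.1f} GB/s")
print(f"Tesla C1060     : {tesla_c1060:.1f} GB/s")
print(f"GTX295 @ 400 MHz: {gtx295_at_400mhz:.1f} GB/s")
```

Under the same assumptions, a GPU whose memory runs at 400 MHz instead of 999 MHz would top out at 44.8 GB/s, which may be why one device reports a device-to-device figure in the 36 GB/s range.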

Question: why do device 0 of the GTX295 and the Tesla C1060 have low bandwidth?

Does this mean that driver 190.38 is not suitable, and that I need to buy a Quadro card and use the officially suggested driver 186.30?

The theoretical figures you quote are memory bandwidth numbers. The bandwidth test you are running measures PCI-e transfer bandwidth (in the case of the host-to-device and device-to-host numbers). These are two completely different things. The theoretical bandwidth limit for a 16-lane PCI-e 1.0 bus is 4 GB/s, and 8 GB/s for a 16-lane PCI-e 2.0 bus.
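For reference, the PCI-e limits can be reproduced with a little arithmetic. The sketch below assumes 2.5 GT/s per lane for PCI-e 1.0, 5 GT/s per lane for PCI-e 2.0, and 8b/10b line encoding (8 payload bits per 10 transferred bits); the result is per direction, and `pcie_bandwidth_gb_s` is a hypothetical helper name.

```python
def pcie_bandwidth_gb_s(gigatransfers_per_s, lanes, payload_bits=8, line_bits=10):
    """Theoretical one-direction PCI-e bandwidth in GB/s (1 GB = 1e9 bytes)."""
    raw_bits_per_s = gigatransfers_per_s * 1e9 * lanes
    payload_bits_per_s = raw_bits_per_s * payload_bits / line_bits
    return payload_bits_per_s / 8 / 1e9


pcie1_x16 = pcie_bandwidth_gb_s(2.5, 16)  # PCI-e 1.0, 16 lanes
pcie2_x16 = pcie_bandwidth_gb_s(5.0, 16)  # PCI-e 2.0, 16 lanes

print(f"PCI-e 1.0 x16: {pcie1_x16:.1f} GB/s")
print(f"PCI-e 2.0 x16: {pcie2_x16:.1f} GB/s")
```

Measured host-to-device and device-to-host rates normally land below these limits, since the limits ignore protocol overhead and the host memory type used for the copy.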

Dear all:

The GTX295 + Tesla C1060 configuration is O.K. now, and the information in the NVIDIA control panel is correct.

All I did was reboot the system. bandwidthTest now reports

[codebox]-------------------+------------------+------------------+-----------------------+
device name        | device 0: GTX295 | device 1: GTX295 | device 2: Tesla C1060 |
-------------------+------------------+------------------+-----------------------+
Host to device     | 1140 MB/s        | 1141 MB/s        | 2326 MB/s             |
-------------------+------------------+------------------+-----------------------+
device to Host     | 1713 MB/s        | 1720 MB/s        | 1729 MB/s             |
-------------------+------------------+------------------+-----------------------+
device to device   | 93792 MB/s       | 91045 MB/s       | 73384 MB/s            |
-------------------+------------------+------------------+-----------------------+[/codebox]

"The theoretical bandwidth limit for a 16 lane PCI-e 1.0 bus is 4 GB/s, 8 GB/s for a 16 lane PCI-e 2.0 bus."

I know my machine has much lower PCIe bandwidth than the theoretical value.

What I am concerned about is the device-to-device bandwidth: the GTX295 reaches about 80% of its maximum bandwidth and the Tesla C1060 reaches about 72%. Is this normal?
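The percentages above can be recomputed from the second bandwidthTest run and the theoretical peaks quoted earlier. This is only a sketch: it assumes 1 GB = 1000 MB when comparing the two units (bandwidthTest may count a MB differently), and `fraction_of_peak` is a hypothetical helper name.

```python
# Theoretical peaks quoted earlier in the thread, in GB/s.
PEAK_GB_S = {"GTX295": 111.9, "Tesla C1060": 102.4}

# Device-to-device results from the second bandwidthTest run, in MB/s.
MEASURED_MB_S = [
    ("device 0", "GTX295", 93792),
    ("device 1", "GTX295", 91045),
    ("device 2", "Tesla C1060", 73384),
]


def fraction_of_peak(measured_mb_s, peak_gb_s):
    # Assumption: 1 GB = 1000 MB.
    return measured_mb_s / 1000.0 / peak_gb_s


fractions = {
    device: fraction_of_peak(mb_s, PEAK_GB_S[card])
    for device, card, mb_s in MEASURED_MB_S
}

for device, fraction in fractions.items():
    print(f"{device}: {fraction:.1%} of theoretical peak")
```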

Also, in order to check the maximum allocation size on the Tesla C1060,

I compute C = A * B, where A, B, and C are square matrices of dimension N.

Table 1: cublasDgemm test

[codebox]field description (time unit: ms):
  1. N = dimension of A, B, C
  2. total size = size(A) + size(B) + size(C) = N^2 * 3 * 8 bytes
  3. CPU: single thread, block version of C = A*B
  4. h2d: data transfer from host to device, h_A --> d_A and h_B --> d_B
  5. C = A*B in kernel
  6. d2h: data transfer from device to host, d_C --> h_C
  7. speedup: CPU / (C = A*B in GPU)

-------+------------+---------+------+-------+------+---------+
   N   | total size |   CPU   | GPU  |  GPU  | GPU  | CPU/GPU |
       |    (MB)    |  (ms)   | h2d  | C=A*B | d2h  |         |
-------+------------+---------+------+-------+------+---------+
  1024 |         24 |    1094 |    0 |    31 |    0 |    35.3 |
-------+------------+---------+------+-------+------+---------+
  2048 |         96 |    9938 |   31 |   219 |   31 |    45.4 |
-------+------------+---------+------+-------+------+---------+
  4096 |        384 |   82016 |  109 |  1813 |   93 |    45.2 |
-------+------------+---------+------+-------+------+---------+
  8192 |       1536 |  680718 |  421 | 14579 |  375 |    46.7 |
-------+------------+---------+------+-------+------+---------+
 13280 |     4036.5 | 3453313 | 1125 | 72641 | 1031 |    47.5 |
-------+------------+---------+------+-------+------+---------+[/codebox]

So far, I can allocate about 4 GB of memory on the Tesla C1060 under WinXP Pro 64.
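The derived columns of table 1 can be cross-checked with a short Python sketch. The total size follows the formula in the field description (with 1 MB = 2^20 bytes), and the speedup is CPU time over GPU kernel time. The GFLOPS figure is an addition that is not in the table: it assumes the usual 2*N^3 operation count for a dense matrix product. All function names are hypothetical.

```python
def total_size_mb(n, bytes_per_element=8):
    """Size of A, B and C together, in MB (1 MB = 2**20 bytes)."""
    return 3 * n * n * bytes_per_element / 2**20


def speedup(cpu_ms, gpu_kernel_ms):
    """CPU time divided by GPU kernel time (transfers excluded)."""
    return cpu_ms / gpu_kernel_ms


def dgemm_gflops(n, kernel_ms):
    """Sustained rate, assuming 2*N^3 floating-point operations for C = A*B."""
    return 2 * n**3 / (kernel_ms * 1e-3) / 1e9


# (N, CPU ms, GPU kernel ms) taken from table 1.
ROWS = [
    (1024, 1094, 31),
    (2048, 9938, 219),
    (4096, 82016, 1813),
    (8192, 680718, 14579),
    (13280, 3453313, 72641),
]

for n, cpu_ms, kernel_ms in ROWS:
    print(
        f"N={n:5d}  size={total_size_mb(n):7.1f} MB  "
        f"speedup={speedup(cpu_ms, kernel_ms):5.1f}  "
        f"rate={dgemm_gflops(n, kernel_ms):5.1f} GFLOPS"
    )
```

The smallest timings in the table (0, 16, 31 ms) look limited by timer resolution, so the rate computed for N = 1024 is probably the least reliable one.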

Next, I compare the GTX295 with the Tesla C1060 by testing cublasDgemm.

Table 2: GTX295 (one GPU of the two) versus Tesla C1060

time unit: ms

[codebox]-------+------------+-------------+------+-------+------+
   N   | total size |   device    | GPU  |  GPU  | GPU  |
       |    (MB)    |             | h2d  | C=A*B | d2h  |
-------+------------+-------------+------+-------+------+
  1024 |         24 | Tesla C1060 |    0 |    31 |    0 |
       |            | GTX295      |   16 |    31 |   16 |
-------+------------+-------------+------+-------+------+
  2048 |         96 | Tesla C1060 |   31 |   219 |   31 |
       |            | GTX295      |   62 |   235 |   31 |
-------+------------+-------------+------+-------+------+
  4096 |        384 | Tesla C1060 |  109 |  1813 |   93 |
       |            | GTX295      |  219 |  1906 |  109 |
-------+------------+-------------+------+-------+------+
  5760 |        760 | Tesla C1060 |  203 |  5062 |  203 |
       |            | GTX295      |  437 |  5282 |  203 |
-------+------------+-------------+------+-------+------+[/codebox]

From table 2, the Tesla C1060 is slightly faster than the GTX295 when computing C = A*B,

even though its device-to-device bandwidth is lower than that of the GTX295.
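To put numbers on "slightly faster", the sketch below takes the N = 5760 entries of table 2 and computes the kernel-time ratio and the effective host-to-device rate. It assumes the first row of each pair is the Tesla C1060 (those rows match table 1) and the second is the GTX295, that h2d moves A and B only (2 * N^2 * 8 bytes), and that 1 MB = 1e6 bytes. The function name is hypothetical.

```python
N = 5760

# (h2d ms, kernel ms, d2h ms) for N = 5760, from table 2.
# Assumption: first row of the pair is the Tesla C1060, second is the GTX295.
TESLA_C1060 = (203, 5062, 203)
GTX295 = (437, 5282, 203)


def h2d_rate_mb_s(n, h2d_ms, bytes_per_element=8):
    """Effective host-to-device rate for copying A and B, in MB/s (1 MB = 1e6 bytes)."""
    bytes_moved = 2 * n * n * bytes_per_element
    return bytes_moved / 1e6 / (h2d_ms * 1e-3)


kernel_ratio = GTX295[1] / TESLA_C1060[1]
tesla_h2d = h2d_rate_mb_s(N, TESLA_C1060[0])
gtx295_h2d = h2d_rate_mb_s(N, GTX295[0])

print(f"kernel time, GTX295 / Tesla C1060: {kernel_ratio:.3f}")
print(f"h2d rate, Tesla C1060: {tesla_h2d:.0f} MB/s")
print(f"h2d rate, GTX295     : {gtx295_h2d:.0f} MB/s")
```

With these assumptions the kernel-time gap is only a few percent, while the host-to-device gap is roughly a factor of two, in line with the host-to-device rows reported by bandwidthTest.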