PCI-E 3.0 possible on the K20c?

Made the upgrade from the GTX 680 to the K20c, and I see that PCI-E 3.0 is not supported by this card? I downloaded and installed the driver and it sees the card, but maybe something is still not configured correctly?

My bandwidth speed test result is half of what it was with the 680, and I am surprised that it wants to use PCI-E 2.0 when I have been using PCI-E 3.0 with the 680. Both cards are installed, but for calculations it wants to use the K20 (device 0), which is running at a much lower speed than the 680.

I really had things configured perfectly with the 680… Is there any resource out there that would at least get this to a similar performance level?

Operating system is Windows 7 64 bit, and the motherboard supports PCI-E 3.0, which still works for the 680 alone.

I should probably re-frame this question. If the K20c only supports PCI-e 2.0, is there a way to improve the bandwidth to a level comparable to that of the GTX 680?

I have two slots, and the motherboard supports PCI-e 3.0 x16. I have heard there are BIOS tweaks for PCI-e 2.0 which will improve bandwidth. I do not want to go down that route until I make sure that I have not missed some issue during installation which could be causing the slow bandwidth.

I should mention I already turned ECC off for the K20 …

I tried to enable PCIE 3.0 speeds via a registry hack on Windows 7 for a Tesla K20c and was unsuccessful. Have not tried in Linux with the PCIE 3.0 flag passed to the NVIDIA module… but NVIDIA does advertise the K20 as PCI-E 2.0… so I sort of doubt that it will work at PCI-E 3.0…

FWIW, GTX Titan is capable of PCIE 3.0 and it has 14 SMX enabled… AND is overclockable in software… I’ve been able to get over 100% TDP numbers on a CUDA DP code. If you don’t need ECC/TCC or the 32 MPI processes at the same time, get one of those, you won’t be disappointed.

Yes, that Titan sounds nice!
My current workplace is the one that got the K20s, so I have to get the most utility out of those GPUs. Overall they are outperforming the 680s by about 30% (on the current set of algorithms), but I think even with PCI-e 2.0 I should be able to improve the bandwidth from the current 3200 MB/sec level for host to device transfers. The 680 gets close to 6700 MB/sec for the same test.

I did find a rather heated discussion on the topic of PCI-e speed and CUDA:

[url]http://setiathome.berkeley.edu/forum_thread.php?id=62704[/url]

After scanning through that thread I still am not sure how much the slower bandwidth of the K20 will matter for the type of workload I anticipate. Most of the algorithms will be bandwidth dependent, so even if I can get a 50% increase from my current level that would be worth the time spent finding a ‘hack’.

I wonder if updating the BIOS will help with this issue. Currently I have an Asus Maximus V Gene motherboard and the BIOS is American Megatrends version 0402, which is about a year old.

On some computation tasks the K20 is running slower than the 680, and for others the K20 is running faster. The slower bandwidth speed seems to be the only explanation, as I have tried modifying the thread/block sizes to see if that makes a difference on the K20 vs the 680.

Anybody out there get an increase in bandwidth from an updated BIOS? I have to evaluate the risk/reward of my options.

You could certainly try to update the BIOS, but I don’t think that would make much of a difference. If you have the flexibility to do so, install a Linux distro (Ubuntu, for example) and set the PCIE 3.0 flag as I did:

[url]https://devtalk.nvidia.com/default/topic/533200/linux/gtx-titan-drivers-for-linux-32-64-bit-release-/post/3753244/#3753244[/url]

If you can’t because this is hardware at work, I could probably try it out on my system… just don’t feel like changing cards right now, haha.

What kind of throughput are you seeing with the K20c for host/device transfers? You would want to use cudaMemcpyAsync() with pinned host memory and CUDA streams for best performance. Teslas have dual DMA engines that allow simultaneous copying to and from the device. This often allows for the complete overlap of copies with kernel execution: while kernel N is running, results produced by kernel N-1 are copied back to the host, while input data for kernel N+1 is copied down to the device.
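To get a feel for how much that overlap can buy, here is a back-of-the-envelope timing model in plain Python (it is not CUDA code and does not touch the GPU). The function names and per-chunk times are made up for illustration. It assumes one copy engine per direction plus the compute engine, each handling one chunk at a time, and enough device memory for input copies to run ahead of the kernels.

```python
def serial_time(n_chunks, t_h2d, t_kernel, t_d2h):
    """No overlap: each chunk does copy-in, kernel, copy-out back to back."""
    return n_chunks * (t_h2d + t_kernel + t_d2h)


def pipelined_time(n_chunks, t_h2d, t_kernel, t_d2h):
    """Three-stage pipeline: h2d copy, kernel, and d2h copy run concurrently."""
    h2d_done = kernel_done = d2h_done = 0.0
    for _ in range(n_chunks):
        # The next input copy starts as soon as the h2d engine is free.
        h2d_done = h2d_done + t_h2d
        # A kernel needs its input on the device and the GPU to be idle.
        kernel_done = max(h2d_done, kernel_done) + t_kernel
        # The copy-back needs the kernel result and a free d2h engine.
        d2h_done = max(kernel_done, d2h_done) + t_d2h
    return d2h_done


if __name__ == "__main__":
    # Hypothetical per-chunk times in milliseconds: copy in, compute, copy out.
    args = (4, 1.0, 3.0, 1.0)
    print("serial   :", serial_time(*args), "ms")
    print("pipelined:", pipelined_time(*args), "ms")
```

In this model, when the kernel time dominates, the pipelined total tends toward n_chunks times the kernel time plus one copy in and one copy out, which is the "complete overlap" case. When the copies dominate, the slower copy direction sets the pace instead.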

I have a K20c in an older workstation (HP xw8600) here that runs with PCIe 2, and I see the following transfer rates with pinned host memory:

^^^^ using pinned host memory
^^^^ for throughput results, 1 MB = 1,000,000 bytes
^^^^ h2d: bytes= 16777216 time= 2747.06 usec rate=6107.34MB/sec
^^^^ d2h: bytes= 16777216 time= 2511.98 usec rate=6678.89MB/sec
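For anyone checking the numbers: each rate above is just bytes divided by elapsed time, using the 1 MB = 1,000,000 bytes convention stated in the output. A minimal Python sketch of that arithmetic (the function name is made up for illustration):

```python
def rate_mb_per_s(nbytes, usec):
    """Throughput in MB/s, with 1 MB = 1,000,000 bytes."""
    seconds = usec / 1e6
    return nbytes / seconds / 1e6


if __name__ == "__main__":
    # Byte counts and times taken from the output quoted above.
    print(f"h2d: {rate_mb_per_s(16777216, 2747.06):.2f} MB/sec")
    print(f"d2h: {rate_mb_per_s(16777216, 2511.98):.2f} MB/sec")
```

The results should agree with the quoted rates up to rounding of the printed times.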

I found out today that the issue has to do with the motherboard. When I have GPUs in both slots, the effective PCI-e width goes from x16 to x8. It is still PCI-e 3.0, but the K20 cannot take advantage of that speed, while the 680 can.

With the two 680s installed (which are PCI-e 3.0) they had approx. 6300 MB/sec each. When I have a single K20 paired with a 680, the K20c gets about 3200 MB/sec and the 680 gets 6300 MB/sec. If I take one of them out and only use a single GPU, then the bandwidth doubles for either. This seems to be due to the fact that the K20 is PCI-e 2.0 while the 680 is 3.0.
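Those x8 numbers are roughly what the link arithmetic predicts. As a rough sketch, the per-lane signalling rates and line encodings below are the PCIe 2.0 and 3.0 figures (5 GT/s with 8b/10b, 8 GT/s with 128b/130b), and the function and table names are just for illustration. The result is the raw per-direction ceiling before packet and protocol overhead, so measured copies will always come in lower.

```python
# Per-lane transfer rate (GT/s) and line-encoding efficiency per PCIe generation.
PCIE_GEN = {
    2: (5.0, 8 / 10),     # 8b/10b encoding
    3: (8.0, 128 / 130),  # 128b/130b encoding
}


def pcie_peak_mb_per_s(gen, lanes):
    """Raw one-direction link ceiling in MB/s (1 MB = 1,000,000 bytes)."""
    gt_per_s, efficiency = PCIE_GEN[gen]
    # GT/s * efficiency = usable Gbit/s per lane; / 8 gives GB/s; * 1000 gives MB/s.
    return gt_per_s * efficiency / 8 * 1000 * lanes


if __name__ == "__main__":
    # Measured values are the ones reported in this thread for the x8 case.
    for gen, lanes, measured in [(2, 8, 3200), (3, 8, 6300)]:
        peak = pcie_peak_mb_per_s(gen, lanes)
        print(f"PCIe {gen}.0 x{lanes}: peak {peak:.0f} MB/sec, "
              f"measured {measured} MB/sec ({measured / peak:.0%} of peak)")
```

Both cards come out at about 80% of their x8 ceiling, which is consistent with the link width and generation being the limit rather than a driver or installation problem.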

So it appears to be an unfortunate combination of hardware, but overall the K20 usually performs better than the 680 for single precision/integers, even factoring in the slower bandwidth speed.

If I was not also a fan of FPS PC games, I would just use the K20, but on occasion I like to play Far Cry 3. The 680 really is a nice combination of video-game performance and excellent computing power, assuming you do not need double-precision.

Still, your advice is appreciated, and if there is any way to further boost that speed I would love to know about it.

Have not yet used cudaMemcpyAsync(), so thanks for that tip!