Host to Device Bandwidth Degradation [SOLVED]

I’m going to use the nbody and bandwidthTest applications from the CUDA Samples to explain the problem.
Here is a brief system config:
- i7-4960X
- Three Titans
- P9X79-E WS motherboard

The Host to Device Bandwidth before running the NBody application is as follows:
[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: GeForce GTX TITAN
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 11304.0

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12494.3

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 231508.3

Result = PASS

After running the NBody application the bandwidth test result is as follows:
[CUDA Bandwidth Test] - Starting…
Running on…

Device 0: GeForce GTX TITAN
Quick Mode

Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 7196.3

Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 12491.5

Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 230486.4

Result = PASS

As you can see, the Host to Device bandwidth dropped from 11304.0 MB/s to 7196.3 MB/s. I have to reboot the system to get it back up. Looks like a GPU resource leak? Any thoughts?

Perhaps the card is dropping down to PCIe 2.0?

The GPU-Z app reports a card’s PCIe capability and actual link speed in real time. That might quickly confirm or rule out a problematic change in the PCIe speed.

Hi Allanmac,

Thanks for your reply. The PCIe link stays at Gen 3.0; I’ve checked that with GPU-Z. I don’t think that’s the reason, though, since the Device to Host bandwidth doesn’t degrade.

I forgot to add that nbody needs to run for a couple of minutes before the degradation shows up in the bandwidth test.

Update:

This problem is more evident on multi-GPU systems running multi-GPU applications. nbody is a good test case because it takes full advantage of multi-GPU systems.

Someone from Nvidia, please respond.

I see in your signature that you have 3x Titans installed. Does this occur only on device 0?

Was anything else running (e.g., something on one of the other Titans) during the bandwidth test? Your CPU only supports 40 PCIe 3.0 lanes, so the motherboard uses PLX chips to support quad SLI. I see two PLX 8747 chips in the motherboard diagram in the manual. At a quick glance I did not see a diagram of the full PCIe topology, but my guess is that each PLX chip has 16 lanes going back to the CPU and 16 lanes going to each of two x16 PCIe 3.0 slots (the blue slots). So, presumably, simultaneous transfers between the host and both devices connected to a single PLX chip will cut the effective bandwidth of each transfer in half, because there are only 16 lanes connecting the CPU to the PLX chip.

If you are only using three of the x16 slots, then one of the devices should have 16 dedicated lanes back to the CPU (assuming you are not using the x8 slots for something else, like a NIC); does that device ever show this slowdown?

The transfer rate you see would also be consistent with 8 PCIe 3.0 lanes, so checking the number of active lanes during a transfer may be helpful as well.
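Both hypotheses in the thread (a drop to x8 lanes, or a drop to Gen 2.0) can be sanity-checked with back-of-the-envelope link math. A sketch: the per-generation encoding figures below come from the PCIe spec, while protocol overhead (TLP/DLLP framing, flow control) is deliberately left out, which is why measured numbers come in below the raw ceilings.

```python
# Back-of-the-envelope PCIe raw link-rate check (a sketch, not a measurement).
# PCIe 3.0: 8 GT/s per lane with 128b/130b encoding.
# PCIe 2.0: 5 GT/s per lane with 8b/10b encoding.

def raw_gbs(gt_per_s, encoding, lanes):
    """Raw payload bandwidth in GB/s, before protocol overhead."""
    return gt_per_s * encoding * lanes / 8  # 1 bit per transfer per lane

gen3_x16 = raw_gbs(8, 128 / 130, 16)  # ~15.75 GB/s
gen3_x8  = raw_gbs(8, 128 / 130, 8)   # ~7.88 GB/s
gen2_x16 = raw_gbs(5, 8 / 10, 16)     # 8.00 GB/s

print(f"Gen3 x16 raw: {gen3_x16:.2f} GB/s")
print(f"Gen3 x8  raw: {gen3_x8:.2f} GB/s")
print(f"Gen2 x16 raw: {gen2_x16:.2f} GB/s")
```

The healthy 11.3 GB/s reading fits a Gen 3.0 x16 link after protocol overhead, while the degraded 7.2 GB/s sits just under both the Gen 3.0 x8 and Gen 2.0 x16 ceilings, so either a lane-count drop or a generation drop would explain the symptom.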

Hi Tbenson,

Thanks for your reply.

You are correct, the P9X79-E WS has two PLX 8747 chips.
GPUs 0 and 1 are on one PLX chip, and GPU 2 is on the other. The concurrent bandwidth to GPUs 0 and 1 is half (PCIe 3.0 x8 each) of the concurrent bandwidth to GPUs 0 and 2 (PCIe 3.0 x16 each), which makes sense since GPUs 0 and 1 share the same PLX chip.

The concurrent bandwidth to GPUs 0 and 2 (or 1 and 2) is at the PCIe 3.0 x16 spec, which again makes sense since 0&2 and 1&2 are on different PLX chips.
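The sharing behavior can be expressed numerically. A sketch, assuming Tbenson's inferred topology of 16 upstream lanes per PLX 8747 switch (not confirmed from a board diagram):

```python
# Per-GPU upstream bandwidth when transfers run concurrently through
# one PLX 8747 switch with 16 PCIe 3.0 lanes back to the CPU (assumption).
UPSTREAM_LANES = 16
PER_LANE_GBS = 8 * (128 / 130) / 8  # PCIe 3.0: ~0.985 GB/s raw per lane

def per_gpu_share(active_gpus):
    """Raw upstream bandwidth each GPU gets, split evenly across transfers."""
    return UPSTREAM_LANES * PER_LANE_GBS / active_gpus

print(f"one GPU:  {per_gpu_share(1):.2f} GB/s  (full x16)")
print(f"two GPUs: {per_gpu_share(2):.2f} GB/s  (x8-equivalent each)")

# GPUs 0 and 1 share one switch, so concurrent transfers see the
# x8-equivalent figure; GPUs 0 and 2 sit behind different switches,
# so each keeps a full x16 upstream path.
```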

Now, the individual bandwidth to each GPU before running nbody is at the PCIe 3.0 x16 spec (again, what it should be).

The problem is that after running nbody for a few minutes, the individual bandwidth to any of the cards is reduced and remains degraded long after nbody has been terminated. The numbers never go back to their normal levels; I have to do a hard reboot to fix the problem.

I forgot to add that I check the PCIe links before and after the benchmarks, and they stay at Gen 3.0 (using GPU-Z). I thought the motherboard was at fault and was dropping the links from Gen 3.0 to 2.0. The problem is either the motherboard or the driver. I just want to rule out the driver as the culprit if possible :)

Never Mind… Done

Could you post the final cause and your solution?
Many thanks
Bill

Hi Bill,

I will post a fix that worked for me in a couple of days. I’d like to run more tests to make sure the fix wasn’t a fluke. Are you encountering this problem?

After more than a week of testing, I’m confident about what really happened and how I fixed it.

I was overclocking my CPU and undervolting it to reduce power consumption and heat. It looks like the amount of undervolting was too aggressive. As a result of the undervolting, the CPU would drop the PCIe bandwidth under load. I’m still not sure if this is a feature or a bug in the design of the CPU, although I am leaning more towards the former. The system was stable 24/7, which is why I didn’t suspect the undervolt in the first place.