nvvp reports low throughput for host to device and device to host on one card only

Hi,

I’m running Toolkit 8.0 on a Linux Mint machine with driver 367.57, and noticed that while running my application on two GTX 1080 cards, one card had a typical DMA transfer rate of around 6 GB/s, but the other card seemed to reach only about 1.4 GB/s. The cards are running identical code. Furthermore, a yellow triangle enclosing a “!” appears next to the slower card’s throughput figures. Both cards correctly show the CPU memory as pinned.

I can’t find any explanation of what the yellow warning means (other than the obvious “this seems too slow”), and no guidance on how to fix it. Is it simply a difference due to the PCIe bandwidth associated with each card position?

Hi kris314,

Thanks for reporting.
Would you please answer the following questions first?

  1. Did you export CUDA_VISIBLE_DEVICES=0 or 1 before you ran the profile?
  2. How did you get the DMA transfer rate data? By kernel memory analysis or by collecting metrics? If the latter, which metric are you referring to?
  3. Where do you see the yellow triangle, under guided or unguided analysis? Which analysis option?
  4. What’s the info after the yellow triangle?

Hi,

Thanks for any help.

  1. Did you export CUDA_VISIBLE_DEVICES=0 or 1 before you ran the profile?

    No, I have not, and “echo $CUDA_VISIBLE_DEVICES” reports nothing.

  2. How did you get the DMA transfer rate data? By kernel memory analysis or by collecting metrics? If the latter, which metric are you referring to?

    By hovering the cursor over an NVVP DMA transfer in a MemCpy (DtoH or HtoD) row in the timing display, and noting the throughput for several such transfers in the “Properties” box in the lower-right corner.

  3. Where do you see the yellow triangle, under guided or unguided analysis? Which analysis option?

    I don’t use guided or unguided analysis, but take the results straight from the “Properties” box in the timing display immediately after a run.

  4. What’s the info after the yellow triangle?
    A throughput number in the range 1.30–1.48 GB/s, typically around 1.35 GB/s.

I note that three such transfers are occurring in parallel, so the combined rate is maybe 4 GB/s for all three. The GPU in the “fast” card slot reports a combined rate of 6.7–8 GB/s.

Thanks again,

kris314

Thanks kris314.

I understand the problem now. It would be better if you could provide us with a small example that reproduces this issue.
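Something as small as the sketch below would already help. It only reflects my assumptions about your application (pinned host buffers, 64 MB per copy, three concurrent transfers per direction), so please adjust the sizes and stream count to match your real code, build it with nvcc, and profile it under nvvp once per GPU.

// Minimal reproducer sketch (assumptions: pinned host buffers, 64 MB per
// copy, three concurrent transfers per direction; adjust to match your app).
// Build: nvcc -o repro repro.cu    Run: ./repro <device index>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main(int argc, char** argv)
{
    const int device = (argc > 1) ? atoi(argv[1]) : 0;  // GPU index to test
    const size_t bytes = 64 << 20;                       // 64 MB per transfer
    const int nStreams = 3;                              // three parallel copies

    cudaSetDevice(device);

    float* hBuf[nStreams];
    float* dBuf[nStreams];
    cudaStream_t stream[nStreams];

    for (int i = 0; i < nStreams; ++i) {
        cudaMallocHost((void**)&hBuf[i], bytes);  // pinned host memory
        cudaMalloc((void**)&dBuf[i], bytes);
        cudaStreamCreate(&stream[i]);
    }

    // HtoD copies issued on separate streams, like the three parallel
    // transfers described above; they appear as MemCpy (HtoD) rows in nvvp.
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(dBuf[i], hBuf[i], bytes, cudaMemcpyHostToDevice, stream[i]);
    cudaDeviceSynchronize();

    // Same in the DtoH direction.
    for (int i = 0; i < nStreams; ++i)
        cudaMemcpyAsync(hBuf[i], dBuf[i], bytes, cudaMemcpyDeviceToHost, stream[i]);
    cudaDeviceSynchronize();

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    for (int i = 0; i < nStreams; ++i) {
        cudaStreamDestroy(stream[i]);
        cudaFree(dBuf[i]);
        cudaFreeHost(hBuf[i]);
    }
    return 0;
}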

Hi kris314,

It appears to indicate that one of the PCIe slots or GPUs has negotiated a slower link speed than the other.

You can run the CUDA sample /usr/local/cuda/samples/1_Utilities/bandwidthTest to check the bandwidth for each device, once with “export CUDA_VISIBLE_DEVICES=0” and once with “export CUDA_VISIBLE_DEVICES=1”.
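If switching CUDA_VISIBLE_DEVICES between runs is inconvenient, a quick self-contained check along the lines of the sketch below (just an illustration, not the bandwidthTest sample itself) times one large pinned copy per direction on every visible device. The 256 MB transfer size is an arbitrary choice.

// Rough per-device bandwidth check (a sketch, not the bandwidthTest sample):
// times one pinned HtoD and DtoH copy on every visible device using CUDA
// events and prints the resulting GB/s.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const size_t bytes = 256 << 20;   // 256 MB test transfer (assumed size)
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    void* hBuf = nullptr;
    cudaHostAlloc(&hBuf, bytes, cudaHostAllocPortable);  // pinned for all devices

    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaSetDevice(dev);

        void* dBuf = nullptr;
        cudaMalloc(&dBuf, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // Host to device
        cudaEventRecord(start);
        cudaMemcpy(dBuf, hBuf, bytes, cudaMemcpyHostToDevice);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float msH2D = 0.0f;
        cudaEventElapsedTime(&msH2D, start, stop);

        // Device to host
        cudaEventRecord(start);
        cudaMemcpy(hBuf, dBuf, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float msD2H = 0.0f;
        cudaEventElapsedTime(&msD2H, start, stop);

        const double gb = bytes / 1e9;
        printf("Device %d: HtoD %.2f GB/s, DtoH %.2f GB/s\n",
               dev, gb / (msH2D / 1e3), gb / (msD2H / 1e3));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(dBuf);
    }

    cudaFreeHost(hBuf);
    return 0;
}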

The ‘lspci -vv’ command should generate detailed information about all PCI devices.
You will find info like the following:

02:00.0 VGA compatible controller: NVIDIA Corporation Device 1b80 (rev a1) (prog-if 00 [VGA controller])
Subsystem: NVIDIA Corporation Device 119e


Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
LnkCap: Port #1, Speed 5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-

LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

If LnkSta shows a lower speed or width than LnkCap (as in the example above, 2.5GT/s x8 versus 5GT/s x16), that link has negotiated down. Then you can swap the two GPUs and try again to confirm whether it is the PCIe slot that causes the difference.
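If swapping the cards is inconvenient, the negotiated link can also be read from software. Below is a rough sketch using NVML (the library behind nvidia-smi); I am assuming nvml.h and libnvidia-ml are available on your system, and it links with -lnvidia-ml.

// Sketch: print current vs. maximum PCIe link generation and width for each
// GPU via NVML (the same information lspci and nvidia-smi report).
// Build: nvcc -o pcielink pcielink.cpp -lnvidia-ml
#include <cstdio>
#include <nvml.h>

int main()
{
    if (nvmlInit() != NVML_SUCCESS) {
        fprintf(stderr, "Failed to initialize NVML\n");
        return 1;
    }

    unsigned int count = 0;
    nvmlDeviceGetCount(&count);

    for (unsigned int i = 0; i < count; ++i) {
        nvmlDevice_t dev;
        char name[NVML_DEVICE_NAME_BUFFER_SIZE];
        unsigned int curGen = 0, maxGen = 0, curWidth = 0, maxWidth = 0;

        nvmlDeviceGetHandleByIndex(i, &dev);
        nvmlDeviceGetName(dev, name, NVML_DEVICE_NAME_BUFFER_SIZE);
        nvmlDeviceGetCurrPcieLinkGeneration(dev, &curGen);
        nvmlDeviceGetMaxPcieLinkGeneration(dev, &maxGen);
        nvmlDeviceGetCurrPcieLinkWidth(dev, &curWidth);
        nvmlDeviceGetMaxPcieLinkWidth(dev, &maxWidth);

        printf("GPU %u (%s): link gen %u (max %u), width x%u (max x%u)\n",
               i, name, curGen, maxGen, curWidth, maxWidth);
    }

    nvmlShutdown();
    return 0;
}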

Thanks so very much!

The command “sudo lspci -vv” let me find exactly the same information (re: Speed & Width) as you show above. It is something probably not easily found (or easily overlooked) in the motherboard specs. The problem certainly follows the slot, not the card.

Thanks again,

kris314