Trouble with performance of same card type on different hardware

Hi,

I’m not sure if this is the right forum, but I didn’t see a better one.

I’ve got two machines that each have a GTX 1080, but the other hardware is quite different. Both are running Linux (although different distros), with similar driver versions (378 and 375), and both are using CUDA 8.0.

Running the same CUDA-aware program, based on the neon framework (https://github.com/NervanaSystems/neon), they perform very differently: there is roughly a 30-40% speed difference between machine 1 and machine 2. See the hardware info below.

Hardware Info Machine 1:
Motherboard: Gigabyte Z87-D3HP
Memory: 4x8GB (32GB total) of DDR3 @ 1866 MHz
PCIe Info (from lspci):

LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
    ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
    ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

Hardware Info Machine 2:
Motherboard: Intel S5520SC
Memory: 12x8GB (96GB total) of DDR3 @ 800 MHz
PCIe Info (from lspci):

LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us
    ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
    ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

I also tried running cuda-z on each machine. The compute performance was nearly identical, which was expected. However, the memory transfer rates were very different. I expected some degradation for machine 2 due to the slower memory and bus speed, but not as much as I observed.
The output of cuda-z for machine 1 was consistent and roughly:
Host to Device (pinned): ~11 GiB/s and Pageable H->D: ~8000 MiB/s
Device to Host (pinned): ~11 GiB/s and Pageable D->H: ~6500 MiB/s

The output of cuda-z for machine 2 was less consistent (the values varied over the ranges indicated):
Host to Device (pinned): ~2500-5000 MiB/s and Pageable H->D: ~800-2000 MiB/s
Device to Host (pinned): ~2500-5000 MiB/s and Pageable D->H: ~800-2000 MiB/s

I only have a basic understanding of what these values mean, so, given these observations, I have several questions:

  1. The transfer rate between device and host on machine 2 is less than half that of machine 1, and the pageable transfers are nearly 10 times slower. This seems like a lot. Is a decrease of this size expected given the hardware differences?
  2. Given these numbers, is the degradation in my application’s performance in line with these differences?
  3. Is there anything I can do about this without having to muck around in the (neon) library's GPU code or buy new hardware?

I assume machine 1 is the faster machine on your actual neon test case (I don’t think you explicitly state that anywhere).

It appears that machine 1 has a PCIe Gen3 slot whereas machine 2 has a Gen2 slot (the LnkCap/LnkSta lines show 8 GT/s vs. 5 GT/s). That would account for the roughly 2x difference in pinned GPU-CPU transfer rates. Pageable transfers are often approximately 2x slower than pinned transfers.
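If you want to sanity-check the pinned vs. pageable gap independently of cuda-z, a minimal CUDA runtime sketch along these lines (buffer size and iteration count are arbitrary choices, not anything taken from neon) times both kinds of host-to-device copies with events:

// Minimal sketch: compare pageable vs. pinned host-to-device copy bandwidth
// using cudaMemcpy timed with CUDA events.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

static float h2d_bandwidth_gbs(void *host, void *dev, size_t bytes, int iters)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ((float)bytes * iters) / (ms * 1.0e6f);  // GB/s
}

int main()
{
    const size_t bytes = 256 << 20;  // 256 MiB test buffer (arbitrary size)
    const int iters = 20;
    void *dev = NULL, *pageable = NULL, *pinned = NULL;

    cudaMalloc(&dev, bytes);
    pageable = malloc(bytes);        // ordinary pageable host allocation
    cudaMallocHost(&pinned, bytes);  // page-locked (pinned) host allocation

    printf("pageable H->D: %.1f GB/s\n", h2d_bandwidth_gbs(pageable, dev, bytes, iters));
    printf("pinned   H->D: %.1f GB/s\n", h2d_bandwidth_gbs(pinned,   dev, bytes, iters));

    free(pageable);
    cudaFreeHost(pinned);
    cudaFree(dev);
    return 0;
}

Built with nvcc, the pinned number should land in the same ballpark as what cuda-z reports on each machine (Gen3 x16 tops out around 12 GB/s in practice, Gen2 x16 around half that), and the pageable number will typically come in noticeably lower.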

Is either machine (or both) running a display off of the GTX 1080? Specifically, is X configured on either machine to use the GTX 1080?
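One quick way to check is to run nvidia-smi on each machine and see whether an Xorg process shows up as using the GTX 1080.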

Machine 2 is probably significantly slower than machine 1 on general system benchmarks: the memory is much slower, and the CPU is probably from an older family as well. Without knowing the exact workload, it’s not clear what portion of the slowdown is coming from GPU-related activity and what portion is from CPU activity. However, since the GPUs are identical, it’s reasonable to assume that the host system and its architecture are the primary difference.

If you want to, you could use a profiler to get an idea of how much of the total runtime (and variation in performance) is due to, for example, GPU kernel execution, and host<->device memory transfers. This would help validate or rule out some of your theories and possible solutions.
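For example, with CUDA 8 you could run the workload under nvprof (something like nvprof python your_neon_script.py, where your_neon_script.py is just a placeholder for whatever launches your run); the default summary breaks GPU time down by kernel and by memcpy direction, which would show fairly quickly whether machine 2 is losing most of its extra time in transfers or elsewhere.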

I seriously doubt “mucking” with the GPU code in the neon library is going to be a useful exercise.

Thanks for the quick reply, and indeed machine 1 is the faster one. Looks like I’ll try some profiling and see if I can work out a bit more about what’s going on.