I’m not sure if this is the right forum, but I didn’t see a better one.
I’ve got 2 machines that each have a GTX 1080, but the other hardware is quite different. Both are running Linux (although different distros) with similar drivers (378 and 375), and both use CUDA 8.0.
The same CUDA-aware program, based on the neon framework https://github.com/NervanaSystems/neon, performs very differently on the two: there is about a 30-40% speed difference, with machine 2 the slower one. See the hardware info below.
Hardware Info Machine 1:
Motherboard: Gigabyte Z87-D3HP
Memory: 4x8GB (32GB total) of DDR3 @ 1866 MHz
PCIE Info (from lspci):
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
Hardware Info Machine 2:
Motherboard: Intel S5520SC
Memory: 12x8GB (96GB total) of DDR3 @ 800MHz
PCIE Info (from lspci):
LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <1us, L1 <4us ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+ ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
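For reference, here is a rough sketch of the theoretical one-direction bandwidth implied by the LnkSta values on the two machines. The encoding overheads are the standard ones (8b/10b for 5 GT/s links, 128b/130b for 8 GT/s links). The function name `pcie_bandwidth_gb_s` is just something I made up for illustration, and real transfers will come in lower because of packet and protocol overhead.

```python
# Theoretical one-direction PCIe bandwidth from the link speed and width
# reported by lspci. Packet/protocol overhead is ignored, so this is an
# upper bound rather than a prediction of measured throughput.

# Line-encoding efficiency per transfer rate (GT/s): payload bits / total bits.
ENCODING = {
    2.5: 8 / 10,     # PCIe 1.x, 8b/10b
    5.0: 8 / 10,     # PCIe 2.0, 8b/10b
    8.0: 128 / 130,  # PCIe 3.0, 128b/130b
}

def pcie_bandwidth_gb_s(gt_per_s, width):
    """Return the upper-bound bandwidth in GB/s (10^9 bytes/s), one direction."""
    bits_per_s = gt_per_s * 1e9 * width * ENCODING[gt_per_s]
    return bits_per_s / 8 / 1e9

machine1 = pcie_bandwidth_gb_s(8.0, 16)  # LnkSta: Speed 8GT/s, Width x16
machine2 = pcie_bandwidth_gb_s(5.0, 16)  # LnkSta: Speed 5GT/s, Width x16
print(f"Machine 1: {machine1:.2f} GB/s")
print(f"Machine 2: {machine2:.2f} GB/s")
print(f"Ratio:     {machine1 / machine2:.2f}x")
```

By this estimate the link speed alone accounts for about a factor of two between the machines.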
I also tried running cuda-z on each machine. The compute performance was nearly identical, which was expected. However, the memory transfer rates were very different. I expected some degradation for machine 2 due to the slower memory and bus speed, but not as much as I observed.
The output of cuda-z for machine 1 was consistent and roughly:
Host to Device: ~11 GiB/s and Pageable H->D ~8000 MiB/s
Device to Host: ~11 GiB/s and Pageable D->H ~6500 MiB/s
The output of cuda-z for machine 2 was less consistent (the values varied in the ranges indicated):
Host to Device: ~2500-5000 MiB/s and Pageable H->D ~800-2000 MiB/s
Device to Host: ~2500-5000 MiB/s and Pageable D->H ~800-2000 MiB/s
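To put the two sets of readings side by side, here is a small sketch that converts everything to MiB/s and computes how much slower machine 2 is. The numbers are the approximate cuda-z readings listed here. The dictionary keys and the helper name `slowdown_range` are only illustrative, and I am taking the second figure on each machine's "Device to Host" line to be the pageable device-to-host rate.

```python
# Approximate cuda-z readings, all in MiB/s (1 GiB = 1024 MiB).
GIB = 1024

machine1 = {
    "H->D": 11 * GIB,
    "D->H": 11 * GIB,
    "pageable H->D": 8000,
    "pageable D->H": 6500,
}
# Machine 2 fluctuated, so each entry is a (low, high) range.
machine2 = {
    "H->D": (2500, 5000),
    "D->H": (2500, 5000),
    "pageable H->D": (800, 2000),
    "pageable D->H": (800, 2000),
}

def slowdown_range(fast, slow_range):
    """Return (best case, worst case) slowdown factors of machine 2 vs machine 1."""
    low, high = slow_range
    return fast / high, fast / low

for name, fast in machine1.items():
    best, worst = slowdown_range(fast, machine2[name])
    print(f"{name:14s} {best:.1f}x to {worst:.1f}x slower")
```

This works out to roughly 2x to 4.5x slower for the non-pageable transfers and roughly 3x to 10x slower for the pageable ones.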
I only have a basic understanding of what these values mean, so, given these observations, I have several questions:
- The transfer rate between the device and host on machine 2 is less than half that of machine 1, and the pageable transfers are up to 10 times slower. This seems like a lot. Is this amount of decrease expected based on the hardware differences?
- Given these numbers, is the degradation in performance of my application in line with these differences?
- Is there anything I can do about this without having to muck around in the (neon) library's GPU code or buy new hardware?
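
To make the second question more concrete, here is a back-of-envelope model. It is only a sketch. It assumes (my assumptions, not measurements) that the runtime splits into GPU compute, which takes the same time on both machines, plus host/device transfers that do not overlap with compute, and that the 30-40% difference means machine 2 takes 1.3x to 1.4x as long. Under those assumptions it solves for the fraction of machine 1's runtime that would have to be spent on transfers. `transfer_fraction` is a made-up helper name.

```python
def transfer_fraction(runtime_ratio, transfer_slowdown):
    """Fraction f of machine 1's runtime spent on transfers, such that
    (1 - f) + f * transfer_slowdown == runtime_ratio.

    Assumes compute time is identical on both machines and that transfers
    do not overlap with compute (both are simplifying assumptions).
    """
    if transfer_slowdown <= 1:
        raise ValueError("transfer_slowdown must be greater than 1")
    return (runtime_ratio - 1) / (transfer_slowdown - 1)

# Runtime ratios assumed from the observed 30-40% difference; transfer
# slowdowns span the low and high ends of the cuda-z ratios.
for runtime_ratio in (1.3, 1.4):
    for slowdown in (2.0, 4.0, 10.0):
        f = transfer_fraction(runtime_ratio, slowdown)
        print(f"runtime x{runtime_ratio}, transfers {slowdown:>4.1f}x slower: "
              f"{f:.0%} of machine 1's time in transfers")
```

If the program spends somewhere between a few percent and 40% of its time on machine 1 moving data, the slowdown would be consistent with the transfer numbers under this model. I have not profiled it to check.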