low memcpy performance

Hello.
I’m trying a new laptop (CentOS 6.0, CUDA 6.0 with driver version 331.49; a Dell Inspiron with an NVIDIA GT 750M, a Core i7-4500U, and 8 GB of RAM), but the memcpy performance, even with pinned memory, is very low (on the order of 1.5 GB/s in each direction). Is this common for a laptop like this, or is it a symptom of some other problem?

I measured these results with a simple program (16 MB transfers, with and without pinned memory) and with the bandwidthTest sample included in the NVIDIA CUDA SDK.
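For reference, here is a minimal sketch of that kind of test (not the exact program I ran; the buffer size, repetition count, and event-based timing are just illustrative):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time 'reps' host-to-device copies of 'bytes' bytes and return MB/s.
static float h2dBandwidth(void* hostBuf, void* devBuf, size_t bytes, int reps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i)
        cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return (bytes / (1024.0f * 1024.0f)) * reps / (ms / 1000.0f);
}

int main()
{
    const size_t bytes = 16 * 1024 * 1024;  // 16 MB per transfer
    const int reps = 20;

    void* devBuf = NULL;
    cudaMalloc(&devBuf, bytes);

    // Pageable host memory
    void* pageable = malloc(bytes);
    printf("pageable H2D: %.1f MB/s\n", h2dBandwidth(pageable, devBuf, bytes, reps));
    free(pageable);

    // Pinned (page-locked) host memory
    void* pinned = NULL;
    cudaHostAlloc(&pinned, bytes, cudaHostAllocDefault);
    printf("pinned   H2D: %.1f MB/s\n", h2dBandwidth(pinned, devBuf, bytes, reps));
    cudaFreeHost(pinned);

    cudaFree(devBuf);
    return 0;
}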

thanks.

Mobile GPUs tend to have worse connectivity than desktop GPUs. I just benchmarked the GT 650M in my 2012 Retina MacBook Pro and got 1.6 GB/sec host-to-device and 3.2 GB/sec device-to-host. That asymmetry is very strange, and I’m not sure what causes it.

The system diagnostics report that the GPU is connected via an x8 connection, which I assume is PCI-E 2.0. That would be consistent with the 3.2 GB/sec value observed in the device-to-host direction, but the host-to-device direction is half what I would expect. For similar reasons, it seems like you have half the expected bandwidth, unless Dell decided to only connect the GPU to the CPU with an x4 connection.

Out of curiosity, are you using Linux or Windows? If you are using Linux, you can run “sudo lspci -vvv | less” and skim through the output to find your GPU and see what the electrical configuration of the connection is. Roughly speaking, you expect to see 75-80% of the theoretical bandwidth, which is 500 MB/sec per PCI-E 2.0 lane.
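For example, an x8 PCI-E 2.0 link is nominally 8 × 500 MB/sec = 4 GB/sec, so roughly 3-3.2 GB/sec in practice, while an x4 link is nominally 2 GB/sec, which would put you in the 1.5-1.6 GB/sec range you are seeing.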

Thanks for your reply, seibert. Following your suggestion, I ran the lspci -vvv command. This is the output:

08:00.0 3D controller: NVIDIA Corporation GK107M [GeForce GT 750M] (rev a1)
Subsystem: Dell Device 05f6
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 16
Region 0: Memory at f6000000 (32-bit, non-prefetchable)
Region 1: Memory at e0000000 (64-bit, prefetchable)
Region 3: Memory at f0000000 (64-bit, prefetchable)
Region 5: I/O ports at d000
[virtual] Expansion ROM at f7000000 [disabled]
Capabilities: [60] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+
Address: 00000000fee0f00c Data: 4165
Capabilities: [78] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <512ns, L1 <4us
ClockPM+ Surprise- LLActRep- BwNot-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [b4] Vendor Specific Information: Len=14 <?>
Capabilities: [100 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=01
Status: NegoPending- InProgress-
Capabilities: [128 v1] Power Budgeting <?>
Capabilities: [600 v1] Vendor Specific Information: ID=0001 Rev=1 Len=024 <?>
Kernel driver in use: nvidia

As you already guessed, it appears that Dell connected the video card using only an x4 PCI-Express 2.0 link.

I’m not sure that’s a valid conclusion. PCIe (especially in laptops) has power-saving features that reduce link performance when there is no activity. Compare these two lines:

LnkCap: Port #0, Speed 5GT/s, Width x16, ASPM L0s L1, Latency L0 <512ns, L1 <4us

LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

ASPM L0s and L1 have to do with link modifications for power management.

What does nvidia-smi -a say, particularly in the “GPU Link Info” section?

Hi txbob, this is the output of nvidia-smi -a

" PCI
Bus : 0x08
Device : 0x00
Domain : 0x0000
Device Id : 0x0FE410DE
Bus Id : 0000:08:00.0
Sub System Id : 0x05F61028
GPU Link Info
PCIe Generation
Max : N/A
Current : N/A
Link Width
Max : N/A
Current : N/A "

Apparently the nvidia-smi application only reports all of this information for the professional line of GPUs.

True dat. I forgot. Nevertheless, I don’t think lspci is telling you that Dell “connected the video card using only an x4 PCI-Express 2.0 link.” I think you are witnessing a power-management state on the link. You could also try to aggressively exercise the link and see if lspci reports anything different while you are doing that. You could also go into the BIOS setup menu and see if there are any options to modify power management settings on the PCIe link. If there are, try fiddling with them, then boot the system again and see if you get a different indication via lspci. I believe that nvidia-settings, the Linux X display control panel applet, will give you additional link info as well (click on GPU0). Finally, you could see if Dell has published technical specifications that detail the HW design.

I agree. Try running a CUDA application and lspci -vvv at the same time. If the link speed goes up to 5 GT/sec, but stays at a width of x4, then that would explain the 1.5 GB/sec you see.
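Something as simple as a loop of large copies will keep the link busy long enough to check; a rough sketch (the buffer size and iteration count are arbitrary):

#include <cstdio>
#include <cuda_runtime.h>

// Generate sustained PCIe traffic so that lspci -vvv (or nvidia-settings)
// can be run in another terminal while the link is active.
int main()
{
    const size_t bytes = 64 * 1024 * 1024;  // 64 MB per copy
    const int iters = 500;                  // 500 x 128 MB of round-trip traffic

    void* hostBuf = NULL;
    void* devBuf  = NULL;
    cudaHostAlloc(&hostBuf, bytes, cudaHostAllocDefault);  // pinned host buffer
    cudaMalloc(&devBuf, bytes);

    for (int i = 0; i < iters; ++i) {
        cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(hostBuf, devBuf, bytes, cudaMemcpyDeviceToHost);
    }

    printf("done\n");
    cudaFreeHost(hostBuf);
    cudaFree(devBuf);
    return 0;
}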

This is the result of lspci -vvv during a CUDA execution:
“LnkSta: Speed 5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-”

The speed goes from 2.5 GT/s to 5 GT/s, without any change in the Width parameter.

What does nvidia-settings say about the link?