Performance issues of data transmission speed in PCIe EP mode

Yeahhhh · November 21, 2022, 6:11am

Hi,

After confirming that connectivity in PCIe EP mode is correct:

I tried to test the data transfer speed with mmap and memcpy. The relevant information for the EP and RP is as follows:

EP side (Orin):

root@orin:~# dmesg | grep pci_epf_nv_test
[ 3754.209715] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM phys: 0x12f09c000
[ 3754.209745] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM IOVA: 0xffff0000
[ 3754.209792] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM virt: 0x00000000fc0b932a

RP side (x86):

# pci device tree
root@8208:~# lspci -tvv
-[0000:00]-+-00.0  Intel Corporation 8th Gen Core 8-core Desktop Processor Host Bridge/DRAM Registers [Coffee Lake S]
           +-01.0-[01-05]--+-00.0  NVIDIA Corporation Device 2216
           |               \-00.1  NVIDIA Corporation Device 1aef
           +-01.1-[06]----00.0  NVIDIA Corporation Device 0001

# PCIe theoretical bandwidth:
root@8208:~# lspci -s 06:00.0 -vvv
06:00.0 RAM memory: NVIDIA Corporation Device 0001
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 10
	Region 0: Memory at 94e00000 (32-bit, non-prefetchable) [size=64K]
	Region 2: Memory at 94900000 (64-bit, prefetchable) [size=128K]
	Region 4: Memory at 94e10000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <1us, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (downgraded), Width x8 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR+
			 10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-

From the LnkSta line above (Speed 8GT/s, Width x8), the theoretical bandwidth works out to about 8 GB/s.
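The 8 GB/s figure can be reproduced with a short calculation. This is only a sketch: the function name `pcie_link_gbytes_per_s` is made up for illustration, it assumes the 128b/130b line encoding used by PCIe Gen3 and later, and it ignores TLP header, DLLP and flow-control overhead, so real payload throughput will be lower.

```cpp
#include <cassert>

// Raw one-direction link bandwidth in GB/s (10^9 bytes per second).
// gt_per_s: per-lane transfer rate in GT/s (8.0 in the LnkSta line).
// lanes:    negotiated link width (8 in the LnkSta line).
// Assumes 128b/130b encoding (Gen3 and later); protocol overhead is ignored.
double pcie_link_gbytes_per_s(double gt_per_s, int lanes) {
    assert(gt_per_s > 0.0 && lanes > 0);
    const double encoding_efficiency = 128.0 / 130.0;
    // Each transfer carries one bit per lane; divide by 8 to get bytes.
    return gt_per_s * lanes * encoding_efficiency / 8.0;
}
```

Calling `pcie_link_gbytes_per_s(8.0, 8)` gives roughly 7.88 GB/s for the negotiated link. With the 16 GT/s listed under LnkCap the same function gives roughly 15.75 GB/s, so the "downgraded" link speed halves the available budget.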

So I measured the actual bandwidth with memcpy in the following two ways:

Allocate memory through malloc, then memcpy:

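#include <benchmark/benchmark.h>
#include <cstdint>
#include <cstdlib>
#include <cstring>

static void BM_memcpy(benchmark::State& state) {
    const int64_t size = state.range(0);
    // Source and destination are both ordinary heap buffers.
    char* src = (char*) malloc(size);
    memset(src, 'b', size);
    char* dest = (char*) malloc(size);
    for (auto _ : state) {
        memcpy(dest, src, size);
    }
    state.SetBytesProcessed(int64_t(state.iterations()) * size);
    free(src);
    free(dest);
}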

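Map the shared RAM via mmap, then memcpy:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SIZE (1024 * 64)
#define MAP_MASK (MAP_SIZE - 1)

static void BM_mempcy_target_addr(benchmark::State& state) {
    const uint64_t target = 0x94e00000;  // EP BAR0 physical address on the RP (Region 0 above)
    int map_fd = open("/dev/mem", O_RDWR | O_ASYNC);
    // Map the aligned 64 KiB window that contains the target address.
    void* map_base = mmap(nullptr, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, target & ~MAP_MASK);
    if (map_base == MAP_FAILED) {
        state.SkipWithError("mmap of /dev/mem failed");
        close(map_fd);
        return;
    }
    char* dest = (char*) map_base + (target & MAP_MASK);
    // Cap the copy size at the size of the mapped BAR.
    const int64_t size = state.range(0) > MAP_SIZE ? MAP_SIZE : state.range(0);
    char* src = (char*) malloc(size);
    memset(src, 'b', size);
    for (auto _ : state) {
        memcpy(dest, src, size);
    }
    state.SetBytesProcessed(int64_t(state.iterations()) * size);
    free(src);
    munmap(map_base, MAP_SIZE);
    close(map_fd);
}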

Then I ran the benchmark on the RP side (x86). The result is:

Run on (16 X 5000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.07, 0.03, 0.01
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
--------------------------------------------------------------------------------------
Benchmark                            Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------
BM_memcpy/1024                    7.57 ns         7.57 ns     95032348 bytes_per_second=125.938G/s
BM_memcpy/4096                    28.1 ns         28.1 ns     24437451 bytes_per_second=135.9G/s
BM_memcpy/16384                    219 ns          219 ns      3203777 bytes_per_second=69.7667G/s
BM_memcpy/65536                   1096 ns         1096 ns       678141 bytes_per_second=55.6696G/s
BM_mempcy_target_addr/1024        1764 ns         1763 ns       396956 bytes_per_second=553.862M/s
BM_mempcy_target_addr/4096        7375 ns         7375 ns        94748 bytes_per_second=529.644M/s
BM_mempcy_target_addr/16384   11867018 ns     11866844 ns           57 bytes_per_second=1.31669M/s
BM_mempcy_target_addr/65536   58472399 ns     58471480 ns           12 bytes_per_second=1094.55k/s

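According to the BM_mempcy_target_addr results, the actual bandwidth is at most about 500 MB/s, and it drops to roughly 1 MB/s for copies of 16384 bytes and larger. Is there anything I missed that would explain why my test results differ so greatly from the theoretical figure?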
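To put the benchmark rows next to the link budget, the Time column can be converted to throughput by hand. The helper names below are hypothetical, and the sketch assumes the `M/s` figures in the table are 1024-based (the reported values are consistent with that). The link budget is passed in as a parameter; about 7.88 GB/s is the raw figure for an 8 GT/s x8 link with 128b/130b encoding.

```cpp
#include <cassert>

// Throughput in MiB/s for one benchmark row.
// bytes:       bytes copied per iteration (the /1024, /4096, ... suffix).
// ns_per_iter: value from the Time column, in nanoseconds.
double throughput_mib_per_s(double bytes, double ns_per_iter) {
    assert(bytes > 0.0 && ns_per_iter > 0.0);
    const double bytes_per_s = bytes / (ns_per_iter * 1e-9);
    return bytes_per_s / (1024.0 * 1024.0);
}

// Fraction of the raw link bandwidth that a measured throughput represents.
// link_gbytes_per_s is in GB/s (10^9 bytes per second).
double link_utilization(double mib_per_s, double link_gbytes_per_s) {
    assert(link_gbytes_per_s > 0.0);
    return (mib_per_s * 1024.0 * 1024.0) / (link_gbytes_per_s * 1e9);
}
```

For the 1024-byte row, `throughput_mib_per_s(1024, 1764)` is about 553.6 MiB/s, which is roughly 7% of a 7.88 GB/s link. For the 16384-byte row, `throughput_mib_per_s(16384, 11867018)` is about 1.32 MiB/s, matching the table.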

Yeahhhh · November 21, 2022, 11:21pm

Hi, I found another post describing a situation somewhat similar to mine, although that user was using an Ethernet interface:

Yeahhhh · November 23, 2022, 3:09pm

Can someone help me out, please?

kayccc · December 7, 2022, 2:37am

Please try the next JetPack release; the performance issue should be improved there.

yenchao · December 7, 2022, 3:25am

Hi kayccc,

When will the next JetPack be released?

Thx
Yen

kayccc · December 7, 2022, 4:14am

Please refer to Jetson Roadmap | NVIDIA Developer; we will update it to reflect the current status in the coming weeks.

system · December 28, 2022, 4:05am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
