Performance issues of data transmission speed in PCIe EP mode


After confirming that PCIe EP mode connectivity is correct, I tried to test the data transfer speed with mmap and memcpy. The relevant information for the EP and RP sides is as follows:

EP side (Orin):

root@orin:~# dmesg | grep pci_epf_nv_test
[ 3754.209715] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM phys: 0x12f09c000
[ 3754.209745] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM IOVA: 0xffff0000
[ 3754.209792] pci_epf_nv_test pci_epf_nv_test.0: BAR0 RAM virt: 0x00000000fc0b932a

RP side (x86):

# pci device tree
root@8208:~# lspci -tvv
-[0000:00]-+-00.0  Intel Corporation 8th Gen Core 8-core Desktop Processor Host Bridge/DRAM Registers [Coffee Lake S]
           +-01.0-[01-05]--+-00.0  NVIDIA Corporation Device 2216
           |               \-00.1  NVIDIA Corporation Device 1aef
           +-01.1-[06]----00.0  NVIDIA Corporation Device 0001

# PCIe theoretical bandwidth:
root@8208:~# lspci -s 06:00.0 -vvv
06:00.0 RAM memory: NVIDIA Corporation Device 0001
	Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Interrupt: pin A routed to IRQ 10
	Region 0: Memory at 94e00000 (32-bit, non-prefetchable) [size=64K]
	Region 2: Memory at 94900000 (64-bit, prefetchable) [size=128K]
	Region 4: Memory at 94e10000 (64-bit, non-prefetchable) [size=4K]
	Capabilities: [70] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 75.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
		LnkCap:	Port #0, Speed 16GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <1us, L1 <64us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (downgraded), Width x8 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, NROPrPrP-, LTR+
			 10BitTagComp+, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 16GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-

From the LnkSta fields above, the link is negotiated at 8GT/s x8 (downgraded from the 16GT/s capability), so the theoretical bandwidth works out to about 7.9GB/s.

So I tested the actual bandwidth with memcpy, covering the following two cases:

  1. Allocate memory through malloc, then memcpy:

    static void BM_memcpy(benchmark::State& state) {
        int64_t size = state.range(0);
        char* src = (char*) malloc(size);
        memset(src, 'b', size);
        char* dest = (char*) malloc(size);
        for (auto _ : state) {
            memcpy(dest, src, size);
        }
        state.SetBytesProcessed(int64_t(state.iterations()) * size);
        free(src);
        free(dest);
    }
  2. Map the shared RAM (BAR0) via mmap, then memcpy:

    #define MAP_SIZE (1024 * 64)
    #define MAP_MASK (MAP_SIZE - 1)

    static void BM_mempcy_target_addr(benchmark::State& state) {
        uint64_t target = 0x94e00000; // EP BAR0 physical address seen from the RP
        int map_fd = open("/dev/mem", O_RDWR | O_ASYNC);
        void* map_base = mmap(0, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED,
                              map_fd, target & ~MAP_MASK);
        void* virt_addr = (char*) map_base + (target & MAP_MASK);
        int64_t size = state.range(0) > MAP_SIZE ? MAP_SIZE : state.range(0);
        char* src = (char*) malloc(size);
        memset(src, 'b', size);
        char* dest = (char*) virt_addr;
        for (auto _ : state) {
            memcpy(dest, src, size);
        }
        state.SetBytesProcessed(int64_t(state.iterations()) * size);
        free(src);
        munmap(map_base, MAP_SIZE);
        close(map_fd);
    }

Then I ran the benchmark on the RP side (x86); the result is:

    Run on (16 X 5000 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x8)
  L1 Instruction 32 KiB (x8)
  L2 Unified 256 KiB (x8)
  L3 Unified 16384 KiB (x1)
Load Average: 0.07, 0.03, 0.01
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
Benchmark                            Time             CPU   Iterations UserCounters...
BM_memcpy/1024                    7.57 ns         7.57 ns     95032348 bytes_per_second=125.938G/s
BM_memcpy/4096                    28.1 ns         28.1 ns     24437451 bytes_per_second=135.9G/s
BM_memcpy/16384                    219 ns          219 ns      3203777 bytes_per_second=69.7667G/s
BM_memcpy/65536                   1096 ns         1096 ns       678141 bytes_per_second=55.6696G/s
BM_mempcy_target_addr/1024        1764 ns         1763 ns       396956 bytes_per_second=553.862M/s
BM_mempcy_target_addr/4096        7375 ns         7375 ns        94748 bytes_per_second=529.644M/s
BM_mempcy_target_addr/16384   11867018 ns     11866844 ns           57 bytes_per_second=1.31669M/s
BM_mempcy_target_addr/65536   58472399 ns     58471480 ns           12 bytes_per_second=1094.55k/s

According to the BM_mempcy_target_addr results, the actual bandwidth tops out at about 500MB/s for small sizes and collapses to around 1MB/s for larger ones. Is there any information I have missed here that would explain why my test results differ so greatly from the theoretical bandwidth?

Hi, I found another post describing a somewhat similar situation to mine, although the author was using an Ethernet interface:

Can someone help me out, please?

Please try with the next JetPack release; the performance issue should be improved.

Hi kayccc,

When will the next JetPack be released?


Please refer to Jetson Roadmap | NVIDIA Developer, we will update to reflect the current status in coming weeks.