Gen 3 PCIe NVMe SSD with x4 lanes gets higher IOPS on the Nano compared to the Xavier NX

Hi.

I have been using a WD SN550 NVMe SSD (model number: WDS250G2B0C) for testing on both the Xavier NX and the Jetson Nano. This is a Gen 3 PCIe NVMe SSD with x4 lanes.

This is the lspci output snippet from the Nano:

LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta:	Speed 2.5GT/s (downgraded), Width x4 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

This is the lspci output snippet from the Xavier NX:

LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

As can be seen in the snippets, both the Nano and the NX negotiate a x4 link for data transfer. The Nano supports up to Gen2 speeds (500 MB/s per lane), whereas the NX supports up to Gen4 (1.97 GB/s per lane).
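
For reference, the negotiated link state can also be read straight from sysfs as a quick sanity check (the 0000:01:00.0 address matches the Nano's enumeration above; on the NX the drive shows up under domain 0005 as 0005:01:00.0):

cat /sys/bus/pci/devices/0000:01:00.0/current_link_speed
cat /sys/bus/pci/devices/0000:01:00.0/current_link_width
cat /sys/bus/pci/devices/0000:01:00.0/max_link_speed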

I get higher sequential read/write speeds on the Xavier NX, but for random read/write the IOPS are lower on the Xavier NX.

According to the datasheet, the SSD should hit the following speeds:
Random Read 4KB IOPS up to (Queues=32, Threads=16) - 165K
Random Write 4KB IOPS up to (Queues=32, Threads=16) - 160K

Actual speeds on Nano:
Random Read 4KB IOPS up to (Queues=32, Threads=16) - 157K
Random Write 4KB IOPS up to (Queues=32, Threads=16) - 162K

Actual speeds on Xavier NX:
Random Read 4KB IOPS up to (Queues=32, Threads=16) - 79.1K
Random Write 4KB IOPS up to (Queues=32, Threads=16) - 63.4K
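
(For scale: even the datasheet figure of 165K IOPS at 4 KiB works out to roughly 165,000 × 4,096 B ≈ 0.68 GB/s of payload, so neither the x4 Gen1/Gen2 link on the Nano nor the x4 Gen3 link on the NX should be the bandwidth bottleneck in these 4K random tests.)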

Both tests were conducted using our custom carrier board. Can someone please comment on why this might be happening?

Complete lspci -vvvvv output on the Nano:

01:00.0 Non-Volatile memory controller: Sandisk Corp Device 5019 (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Sandisk Corp Device 5019
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 83
	Region 0: Memory at 13000000 (64-bit, non-prefetchable) [size=16K]
	Region 4: Memory at 13004000 (64-bit, non-prefetchable) [size=256]
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=4 offset=00000000
	Capabilities: [c0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (downgraded), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range B, TimeoutDis+, NROPrPrP-, LTR+
			 10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt+, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [1b8 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [300 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
		LaneErrStat: 0
	Capabilities: [900 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
			  PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
			  T_CommonMode=0us LTR1.2_Threshold=98304ns
		L1SubCtl2: T_PwrOn=70us
	Kernel driver in use: nvme

Complete lspci -vvvvv output on the Xavier NX:

0005:01:00.0 Non-Volatile memory controller: Sandisk Corp Device 5019 (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Sandisk Corp Device 5019
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 35
	Region 0: Memory at 1f40000000 (64-bit, non-prefetchable) [size=16K]
	Region 4: Memory at 1f40004000 (64-bit, non-prefetchable) [size=256]
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=4 offset=00000000
	Capabilities: [c0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range B, TimeoutDis+, NROPrPrP-, LTR+
			 10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt+, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [1b8 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [300 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
		LaneErrStat: 0
	Capabilities: [900 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
			  PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=40us
	Kernel driver in use: nvme

Complete fio output:

On the Nano, with the fio utility, I get the speeds below for Random Read 4KB IOPS (Queues=32, Threads=16):

fio --loops=1 --size=256m --filename=/dev/nvme0n1p6 --stonewall --ioengine=libaio --direct=1 --name=4kQD32T16RandRead --bs=4096 --rw=randread --iodepth=32 --numjobs=16 --group_reporting
4kQD32T16RandRead: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.17
Starting 16 processes
Jobs: 12 (f=11): [r(1),f(1),r(1),_(1),r(1),_(2),r(8),_(1)][40.0%][r=665MiB/s][r=170k IOPS][eta 00m:09s]
4kQD32T16RandRead: (groupid=0, jobs=16): err= 0: pid=4841: Fri Sep 23 06:30:33 2022
  read: IOPS=157k, BW=613MiB/s (642MB/s)(4096MiB/6685msec)
    slat (usec): min=9, max=98778, avg=18.41, stdev=291.38
    clat (usec): min=73, max=128679, avg=2858.22, stdev=2535.26
     lat (usec): min=88, max=128691, avg=2877.22, stdev=2563.12
    clat percentiles (usec):
     |  1.00th=[  363],  5.00th=[  502], 10.00th=[ 1319], 20.00th=[ 1942],
     | 30.00th=[ 2212], 40.00th=[ 2442], 50.00th=[ 2638], 60.00th=[ 2900],
     | 70.00th=[ 3195], 80.00th=[ 3621], 90.00th=[ 4293], 95.00th=[ 4883],
     | 99.00th=[ 7308], 99.50th=[ 8717], 99.90th=[40109], 99.95th=[58459],
     | 99.99th=[92799]
   bw (  KiB/s): min=488021, max=1060836, per=100.00%, avg=681457.09, stdev=15736.47, samples=178
   iops        : min=121999, max=265208, avg=170363.24, stdev=3934.14, samples=178
  lat (usec)   : 100=0.01%, 250=0.38%, 500=4.58%, 750=1.84%, 1000=1.26%
  lat (msec)   : 2=13.98%, 4=64.30%, 10=13.31%, 20=0.12%, 50=0.16%
  lat (msec)   : 100=0.06%, 250=0.01%
  cpu          : usr=4.99%, sys=15.60%, ctx=72393, majf=0, minf=831
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1048576,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=613MiB/s (642MB/s), 613MiB/s-613MiB/s (642MB/s-642MB/s), io=4096MiB (4295MB), run=6685-6685msec

Disk stats (read/write):
  nvme0n1: ios=1043772/0, merge=0/0, ticks=2184252/0, in_queue=2212208, util=99.30%

For Random Write 4KB IOPS (Queues=32, Threads=16):

fio --loops=1 --size=256m --filename=/dev/nvme0n1p6 --stonewall --ioengine=libaio --direct=1 --name=4kQD32T16RandWrite --bs=4096 --rw=randwrite --iodepth=32 --numjobs=16 --group_reporting
4kQD32T16RandWrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.17
Starting 16 processes
Jobs: 14 (f=13): [_(1),f(1),w(6),_(1),w(7)][87.5%][w=617MiB/s][w=158k IOPS][eta 00m:01s]
4kQD32T16RandWrite: (groupid=0, jobs=16): err= 0: pid=4863: Fri Sep 23 06:31:41 2022
  write: IOPS=162k, BW=632MiB/s (662MB/s)(4096MiB/6485msec); 0 zone resets
    slat (usec): min=8, max=84853, avg=31.68, stdev=606.43
    clat (usec): min=24, max=105581, avg=2973.86, stdev=5548.81
     lat (usec): min=38, max=170046, avg=3006.24, stdev=5593.56
    clat percentiles (usec):
     |  1.00th=[  515],  5.00th=[  523], 10.00th=[  523], 20.00th=[  529],
     | 30.00th=[  668], 40.00th=[ 1713], 50.00th=[ 2245], 60.00th=[ 2343],
     | 70.00th=[ 2802], 80.00th=[ 2933], 90.00th=[ 4113], 95.00th=[ 9896],
     | 99.00th=[24511], 99.50th=[33424], 99.90th=[81265], 99.95th=[84411],
     | 99.99th=[86508]
   bw (  KiB/s): min=363328, max=835232, per=100.00%, avg=668391.54, stdev=8919.44, samples=189
   iops        : min=90832, max=208808, avg=167097.61, stdev=2229.86, samples=189
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.03%, 750=33.96%
  lat (usec)   : 1000=0.20%
  lat (msec)   : 2=8.64%, 4=46.68%, 10=5.51%, 20=3.34%, 50=1.23%
  lat (msec)   : 100=0.41%, 250=0.01%
  cpu          : usr=4.96%, sys=14.35%, ctx=30412, majf=0, minf=357
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=632MiB/s (662MB/s), 632MiB/s-632MiB/s (662MB/s-662MB/s), io=4096MiB (4295MB), run=6485-6485msec

Disk stats (read/write):
  nvme0n1: ios=134/1036966, merge=0/0, ticks=16/1262248, in_queue=1289604, util=99.29%

On the Xavier NX, with the fio utility, I get the speeds below for Random Read 4KB IOPS (Queues=32, Threads=16):

fio --loops=1 --size=256m --filename=/dev/nvme0n1p6 --stonewall --ioengine=libaio --direct=1 --name=4kQD32T16RandRead --bs=4096 --rw=randread --iodepth=32 --numjobs=16 --group_reporting
4kQD32T16RandRead: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.17
Starting 16 processes
Jobs: 4 (f=4): [_(2),r(1),_(1),r(1),_(1),r(1),_(5),r(1),_(3)][86.7%][r=312MiB/s][r=79.7k IOPS][eta 00m:02s]
4kQD32T16RandRead: (groupid=0, jobs=16): err= 0: pid=9101: Thu Sep 22 13:58:18 2022
  read: IOPS=79.1k, BW=309MiB/s (324MB/s)(4096MiB/13258msec)
    slat (usec): min=4, max=268105, avg=147.07, stdev=2301.29
    clat (usec): min=14, max=288124, avg=5764.67, stdev=13569.68
     lat (usec): min=98, max=288130, avg=5914.19, stdev=13722.22
    clat percentiles (usec):
     |  1.00th=[   306],  5.00th=[   355], 10.00th=[   379], 20.00th=[   408],
     | 30.00th=[   433], 40.00th=[   461], 50.00th=[   498], 60.00th=[   562],
     | 70.00th=[   873], 80.00th=[  2147], 90.00th=[ 26084], 95.00th=[ 41681],
     | 99.00th=[ 56361], 99.50th=[ 61604], 99.90th=[ 74974], 99.95th=[ 85459],
     | 99.99th=[154141]
   bw (  KiB/s): min=231623, max=657886, per=100.00%, avg=335004.23, stdev=6833.24, samples=379
   iops        : min=57905, max=164469, avg=83750.37, stdev=1708.28, samples=379
  lat (usec)   : 20=0.01%, 100=0.01%, 250=0.16%, 500=50.41%, 750=18.51%
  lat (usec)   : 1000=1.32%
  lat (msec)   : 2=6.90%, 4=7.13%, 10=1.38%, 20=2.79%, 50=9.41%
  lat (msec)   : 100=1.96%, 250=0.02%, 500=0.01%
  cpu          : usr=1.52%, sys=6.33%, ctx=9445, majf=0, minf=860
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1048576,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=309MiB/s (324MB/s), 309MiB/s-309MiB/s (324MB/s-324MB/s), io=4096MiB (4295MB), run=13258-13258msec

Disk stats (read/write):
  nvme0n1: ios=1038873/0, merge=0/0, ticks=190372/0, in_queue=189852, util=99.32%

For Random Write 4KB IOPS (Queues=32, Threads=16):

fio --loops=1 --size=256m --filename=/dev/nvme0n1p6 --stonewall --ioengine=libaio --direct=1 --name=4kQD32T16RandWrite --bs=4096 --rw=randwrite --iodepth=32 --numjobs=16 --group_reporting
4kQD32T16RandWrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.17
Starting 16 processes
Jobs: 8 (f=8): [_(1),w(1),_(1),w(2),_(3),w(1),_(2),w(2),_(1),w(2)][88.9%][w=205MiB/s][w=52.4k IOPS][eta 00m:02s] 
4kQD32T16RandWrite: (groupid=0, jobs=16): err= 0: pid=9122: Thu Sep 22 13:58:35 2022
  write: IOPS=63.4k, BW=247MiB/s (259MB/s)(4096MiB/16551msec); 0 zone resets
    slat (usec): min=4, max=190152, avg=162.65, stdev=2482.81
    clat (usec): min=4, max=206981, avg=7282.10, stdev=15583.39
     lat (usec): min=20, max=206987, avg=7448.28, stdev=15746.29
    clat percentiles (usec):
     |  1.00th=[   429],  5.00th=[   441], 10.00th=[   449], 20.00th=[   457],
     | 30.00th=[   465], 40.00th=[   474], 50.00th=[   486], 60.00th=[   545],
     | 70.00th=[   668], 80.00th=[  8225], 90.00th=[ 31851], 95.00th=[ 44303],
     | 99.00th=[ 61604], 99.50th=[ 74974], 99.90th=[119014], 99.95th=[141558],
     | 99.99th=[196084]
   bw (  KiB/s): min=79238, max=442459, per=100.00%, avg=271356.11, stdev=6378.75, samples=480
   iops        : min=19809, max=110613, avg=67837.91, stdev=1594.69, samples=480
  lat (usec)   : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%
  lat (usec)   : 500=55.64%, 750=15.14%, 1000=1.87%
  lat (msec)   : 2=1.57%, 4=1.83%, 10=4.86%, 20=5.35%, 50=10.97%
  lat (msec)   : 100=2.59%, 250=0.19%
  cpu          : usr=1.37%, sys=5.60%, ctx=15455, majf=0, minf=344
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=247MiB/s (259MB/s), 247MiB/s-247MiB/s (259MB/s-259MB/s), io=4096MiB (4295MB), run=16551-16551msec

Disk stats (read/write):
  nvme0n1: ios=224/1035930, merge=0/0, ticks=24/705740, in_queue=706248, util=95.46%

We haven’t done this kind of test; I will forward it to the internal team to see if they have any suggestions. Thanks

Hi @kayccc ,

I might have found the fix for this. I am getting better IOPS for random read/write after adding a custom power mode and setting it via nvpmodel. The custom power mode has all 6 CPU cores online, with a minimum frequency of 1497600 kHz and a maximum frequency of 1907200 kHz on all cores. The default mode had only 2 cores online, with a minimum frequency of 1190400 kHz and a maximum of 1497600 kHz.
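
As a rough sketch of how the custom mode is applied (assuming the entry ended up with ID 5, as the nvpmodel query output below suggests):

sudo nvpmodel -m 5          # switch to the custom power mode
sudo nvpmodel -q --verbose  # confirm the active mode and per-core settings
sudo jetson_clocks          # optional: pin clocks to the maximum of the current mode

The number of online cores and their clocks can also be watched during the fio run with tegrastats.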

Power mode details:

# nvpmodel -q --verbose
NVPM VERB: Config file: /etc/nvpmodel.conf
NVPM VERB: parsing done for /etc/nvpmodel.conf
NV Fan Mode:cool
NVPM VERB: Current mode: NV Power Mode: MODE_CUSTOM_6CORE
5
NVPM VERB: PARAM CPU_ONLINE: ARG CORE_0: PATH /sys/devices/system/cpu/cpu0/online: REAL_VAL: 1 CONF_VAL: 1
NVPM VERB: PARAM CPU_ONLINE: ARG CORE_1: PATH /sys/devices/system/cpu/cpu1/online: REAL_VAL: 1 CONF_VAL: 1
NVPM VERB: PARAM CPU_ONLINE: ARG CORE_2: PATH /sys/devices/system/cpu/cpu2/online: REAL_VAL: 1 CONF_VAL: 1
NVPM VERB: PARAM CPU_ONLINE: ARG CORE_3: PATH /sys/devices/system/cpu/cpu3/online: REAL_VAL: 1 CONF_VAL: 1
NVPM VERB: PARAM CPU_ONLINE: ARG CORE_4: PATH /sys/devices/system/cpu/cpu4/online: REAL_VAL: 1 CONF_VAL: 1
NVPM VERB: PARAM CPU_ONLINE: ARG CORE_5: PATH /sys/devices/system/cpu/cpu5/online: REAL_VAL: 1 CONF_VAL: 1
NVPM VERB: PARAM TPC_POWER_GATING: ARG TPC_PG_MASK: PATH /sys/devices/gpu.0/tpc_pg_mask: REAL_VAL: 4 CONF_VAL: 1
NVPM VERB: PARAM GPU_POWER_CONTROL_ENABLE: ARG GPU_PWR_CNTL_EN: PATH /sys/devices/gpu.0/power/control: REAL_VAL: auto CONF_VAL: on
NVPM VERB: PARAM CPU_DENVER_0: ARG MIN_FREQ: PATH /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq: REAL_VAL: 1497600 CONF_VAL: 1497600
NVPM VERB: PARAM CPU_DENVER_0: ARG MAX_FREQ: PATH /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: REAL_VAL: 1907200 CONF_VAL: 1907200
NVPM VERB: PARAM CPU_DENVER_1: ARG MIN_FREQ: PATH /sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq: REAL_VAL: 1497600 CONF_VAL: 1497600
NVPM VERB: PARAM CPU_DENVER_1: ARG MAX_FREQ: PATH /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq: REAL_VAL: 1907200 CONF_VAL: 1907200
NVPM VERB: PARAM CPU_DENVER_2: ARG MIN_FREQ: PATH /sys/devices/system/cpu/cpu4/cpufreq/scaling_min_freq: REAL_VAL: 1497600 CONF_VAL: 1497600
NVPM VERB: PARAM CPU_DENVER_2: ARG MAX_FREQ: PATH /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq: REAL_VAL: 1907200 CONF_VAL: 1907200
NVPM VERB: PARAM CPU_DENVER_3: ARG MIN_FREQ: PATH /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq: REAL_VAL: 1497600 CONF_VAL: 1497600
NVPM VERB: PARAM CPU_DENVER_3: ARG MAX_FREQ: PATH /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq: REAL_VAL: 1907200 CONF_VAL: 1907200
NVPM VERB: PARAM CPU_DENVER_4: ARG MIN_FREQ: PATH /sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq: REAL_VAL: 1497600 CONF_VAL: 1497600
NVPM VERB: PARAM CPU_DENVER_4: ARG MAX_FREQ: PATH /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq: REAL_VAL: 1907200 CONF_VAL: 1907200
NVPM VERB: PARAM CPU_DENVER_5: ARG MIN_FREQ: PATH /sys/devices/system/cpu/cpu5/cpufreq/scaling_min_freq: REAL_VAL: 1497600 CONF_VAL: 1497600
NVPM VERB: PARAM CPU_DENVER_5: ARG MAX_FREQ: PATH /sys/devices/system/cpu/cpu5/cpufreq/scaling_max_freq: REAL_VAL: 1907200 CONF_VAL: 1907200
NVPM VERB: PARAM GPU: ARG MIN_FREQ: PATH /sys/devices/17000000.gv11b/devfreq/17000000.gv11b/min_freq: REAL_VAL: 114750000 CONF_VAL: 0
NVPM VERB: PARAM GPU: ARG MAX_FREQ: PATH /sys/devices/17000000.gv11b/devfreq/17000000.gv11b/max_freq: REAL_VAL: 1109250000 CONF_VAL: 1109250000
NVPM VERB: PARAM GPU_POWER_CONTROL_DISABLE: ARG GPU_PWR_CNTL_DIS: PATH /sys/devices/gpu.0/power/control: REAL_VAL: auto CONF_VAL: auto
NVPM VERB: PARAM EMC: ARG MAX_FREQ: PATH /sys/kernel/nvpmodel_emc_cap/emc_iso_cap: REAL_VAL: 1600000000 CONF_VAL: 1600000000
NVPM VERB: PARAM DLA_CORE: ARG MAX_FREQ: PATH /sys/kernel/nvpmodel_emc_cap/nafll_dla: REAL_VAL: 1100800000 CONF_VAL: 1100800000
NVPM VERB: PARAM DLA_FALCON: ARG MAX_FREQ: PATH /sys/kernel/nvpmodel_emc_cap/nafll_dla_falcon: REAL_VAL: 640000000 CONF_VAL: 640000000
NVPM VERB: PARAM PVA_VPS: ARG MAX_FREQ: PATH /sys/kernel/nvpmodel_emc_cap/nafll_pva_vps: REAL_VAL: 819200000 CONF_VAL: 819200000
NVPM VERB: PARAM PVA_CORE: ARG MAX_FREQ: PATH /sys/kernel/nvpmodel_emc_cap/nafll_pva_core: REAL_VAL: 601600000 CONF_VAL: 601600000
NVPM VERB: PARAM CVNAS: ARG MAX_FREQ: PATH /sys/kernel/nvpmodel_emc_cap/nafll_cvnas: REAL_VAL: 576000000 CONF_VAL: 576000000
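
For completeness, a custom entry along these lines in /etc/nvpmodel.conf should reproduce the values shown above. This is only a sketch reconstructed from the query output, not the exact file contents; it reuses the PARAM definitions already present in the stock file, and the ID may need adjusting so it does not clash with an existing mode:

< POWER_MODEL ID=5 NAME=MODE_CUSTOM_6CORE >
CPU_ONLINE CORE_0 1
CPU_ONLINE CORE_1 1
CPU_ONLINE CORE_2 1
CPU_ONLINE CORE_3 1
CPU_ONLINE CORE_4 1
CPU_ONLINE CORE_5 1
TPC_POWER_GATING TPC_PG_MASK 1
GPU_POWER_CONTROL_ENABLE GPU_PWR_CNTL_EN on
CPU_DENVER_0 MIN_FREQ 1497600
CPU_DENVER_0 MAX_FREQ 1907200
CPU_DENVER_1 MIN_FREQ 1497600
CPU_DENVER_1 MAX_FREQ 1907200
CPU_DENVER_2 MIN_FREQ 1497600
CPU_DENVER_2 MAX_FREQ 1907200
CPU_DENVER_3 MIN_FREQ 1497600
CPU_DENVER_3 MAX_FREQ 1907200
CPU_DENVER_4 MIN_FREQ 1497600
CPU_DENVER_4 MAX_FREQ 1907200
CPU_DENVER_5 MIN_FREQ 1497600
CPU_DENVER_5 MAX_FREQ 1907200
GPU MIN_FREQ 0
GPU MAX_FREQ 1109250000
GPU_POWER_CONTROL_DISABLE GPU_PWR_CNTL_DIS auto
EMC MAX_FREQ 1600000000
DLA_CORE MAX_FREQ 1100800000
DLA_FALCON MAX_FREQ 640000000
PVA_VPS MAX_FREQ 819200000
PVA_CORE MAX_FREQ 601600000
CVNAS MAX_FREQ 576000000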

After this change, I get the following results:

Random read on Xavier NX: IOPS=166k

# fio --loops=5 --size=256m --filename=/dev/nvme0n1p6 --stonewall --ioengine=libaio --direct=1 --name=4kQD32T16RandRead --bs=4096 --rw=randread --iodepth=32 --numjobs=16 --group_reporting
4kQD32T16RandRead: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.17
Starting 16 processes
Jobs: 4 (f=4): [_(1),r(1),_(1),r(1),_(2),r(1),_(1),r(1),_(7)][91.4%][r=663MiB/s][r=170k IOPS][eta 00m:03s]  
4kQD32T16RandRead: (groupid=0, jobs=16): err= 0: pid=5227: Wed Sep 28 16:42:44 2022
  read: IOPS=166k, BW=647MiB/s (679MB/s)(20.0GiB/31637msec)
    slat (usec): min=3, max=86287, avg=18.97, stdev=142.77
    clat (usec): min=29, max=846669, avg=2931.08, stdev=4463.16
     lat (usec): min=50, max=846685, avg=2950.35, stdev=4492.51
    clat percentiles (usec):
     |  1.00th=[   359],  5.00th=[   865], 10.00th=[  1172], 20.00th=[  1582],
     | 30.00th=[  1926], 40.00th=[  2245], 50.00th=[  2540], 60.00th=[  2900],
     | 70.00th=[  3294], 80.00th=[  3851], 90.00th=[  4752], 95.00th=[  5735],
     | 99.00th=[  7832], 99.50th=[ 12125], 99.90th=[ 35390], 99.95th=[ 51119],
     | 99.99th=[152044]
   bw (  KiB/s): min=183131, max=1335635, per=100.00%, avg=676378.71, stdev=13854.50, samples=954
   iops        : min=45781, max=333907, avg=169094.12, stdev=3463.61, samples=954
  lat (usec)   : 50=0.01%, 100=0.02%, 250=0.39%, 500=1.51%, 750=1.84%
  lat (usec)   : 1000=3.26%
  lat (msec)   : 2=25.11%, 4=49.94%, 10=17.32%, 20=0.33%, 50=0.22%
  lat (msec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=3.33%, sys=18.75%, ctx=2239982, majf=0, minf=852
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=5242880,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=647MiB/s (679MB/s), 647MiB/s-647MiB/s (679MB/s-679MB/s), io=20.0GiB (21.5GB), run=31637-31637msec

Disk stats (read/write):
  nvme0n1: ios=5238713/0, merge=0/0, ticks=13995076/0, in_queue=14740488, util=100.00%

Random write on Xavier NX: IOPS=222k

# fio --loops=5 --size=256m --filename=/dev/nvme0n1p6 --stonewall --ioengine=libaio --direct=1 --name=4kQD32T16RandWrite --bs=4096 --rw=randwrite --iodepth=32 --numjobs=16 --group_reporting
4kQD32T16RandWrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.17
Starting 16 processes
Jobs: 15 (f=15): [w(3),_(1),w(12)][95.8%][w=894MiB/s][w=229k IOPS][eta 00m:01s]
4kQD32T16RandWrite: (groupid=0, jobs=16): err= 0: pid=5248: Wed Sep 28 16:43:14 2022
  write: IOPS=222k, BW=868MiB/s (910MB/s)(20.0GiB/23595msec); 0 zone resets
    slat (usec): min=3, max=64284, avg=31.95, stdev=402.90
    clat (usec): min=10, max=69358, avg=2227.51, stdev=3373.43
     lat (usec): min=31, max=69370, avg=2259.79, stdev=3397.75
    clat percentiles (usec):
     |  1.00th=[  478],  5.00th=[  553], 10.00th=[  586], 20.00th=[  611],
     | 30.00th=[  635], 40.00th=[  693], 50.00th=[ 1205], 60.00th=[ 1565],
     | 70.00th=[ 2024], 80.00th=[ 2737], 90.00th=[ 4555], 95.00th=[ 7832],
     | 99.00th=[18482], 99.50th=[22152], 99.90th=[30016], 99.95th=[33817],
     | 99.99th=[52167]
   bw (  KiB/s): min=649227, max=1308090, per=100.00%, avg=898054.53, stdev=8165.48, samples=732
   iops        : min=162305, max=327022, avg=224512.99, stdev=2041.37, samples=732
  lat (usec)   : 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=1.37%
  lat (usec)   : 750=41.75%, 1000=3.09%
  lat (msec)   : 2=23.21%, 4=19.00%, 10=7.98%, 20=2.83%, 50=0.74%
  lat (msec)   : 100=0.01%
  cpu          : usr=3.53%, sys=22.85%, ctx=203452, majf=0, minf=350
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,5242880,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=868MiB/s (910MB/s), 868MiB/s-868MiB/s (910MB/s-910MB/s), io=20.0GiB (21.5GB), run=23595-23595msec

Disk stats (read/write):
  nvme0n1: ios=1554/5242712, merge=0/0, ticks=1400/5108700, in_queue=5249716, util=99.62%

This might need long-term testing to confirm it holds up. I would still like to hear the internal team's comments on this.