Gen 3 PCIe NVMe SSD with x4 lanes gets higher IOPS on Nano compared to the Xavier NX

Hi.

So I have been using a WD SN550 NVMe SSD (model number: WDS250G2B0C) for testing on both the Xavier NX and the Jetson Nano. It is a Gen3 PCIe NVMe SSD with x4 lanes.

This is the lspci output snippet from the Nano:

LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
            ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
LnkSta:	Speed 2.5GT/s (downgraded), Width x4 (ok)
            TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

This is the lspci output snippet from the Xavier NX:

LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

As can be seen in the snippets, both the Nano and the NX use 4 lanes for data transfer. The Nano supports up to Gen2 (500 MB/s/lane), and in this capture its link actually trained at 2.5 GT/s (Gen1, roughly 250 MB/s/lane, so about 1 GB/s across the x4 link). The NX supports up to Gen4 (1.97 GB/s/lane) and trained at 8 GT/s (Gen3, about 3.9 GB/s across x4), the maximum the SSD supports.
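
For a quick check of the supported vs. negotiated link state, these fields can be pulled straight out of lspci (using the bus addresses from the full dumps below):

sudo lspci -vv -s 01:00.0 | grep -E 'LnkCap|LnkSta'        # Nano
sudo lspci -vv -s 0005:01:00.0 | grep -E 'LnkCap|LnkSta'   # Xavier NX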

I get higher sequential r/w speeds on the Xavier NX, but for random r/w the IOPS is lower on the Xavier NX.

According to the datasheet, the SSD should hit the following speeds:
Random Read 4KB IOPS up to (Queues=32, Threads=16) - 165K
Random Write 4KB IOPS up to (Queues=32, Threads=16) - 160K

Actual speeds on Nano:
Random Read 4KB IOPS (Queues=32, Threads=16) - 157K
Random Write 4KB IOPS (Queues=32, Threads=16) - 162K

Actual speeds on Xavier NX:
Random Read 4KB IOPS (Queues=32, Threads=16) - 79.1K
Random Write 4KB IOPS (Queues=32, Threads=16) - 63.4K

Both tests were conducted using our custom carrier board. Can someone please comment on why this might be happening?

Complete lspci -vvvvv output on the Nano:

01:00.0 Non-Volatile memory controller: Sandisk Corp Device 5019 (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Sandisk Corp Device 5019
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 83
	Region 0: Memory at 13000000 (64-bit, non-prefetchable) [size=16K]
	Region 4: Memory at 13004000 (64-bit, non-prefetchable) [size=256]
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=4 offset=00000000
	Capabilities: [c0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 128 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM L1 Enabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s (downgraded), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range B, TimeoutDis+, NROPrPrP-, LTR+
			 10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt+, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [1b8 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [300 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
		LaneErrStat: 0
	Capabilities: [900 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
			  PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=98304ns
		L1SubCtl2: T_PwrOn=70us
	Kernel driver in use: nvme

Complete lspci -vvvvv output on the Xavier NX:

0005:01:00.0 Non-Volatile memory controller: Sandisk Corp Device 5019 (rev 01) (prog-if 02 [NVM Express])
	Subsystem: Sandisk Corp Device 5019
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 35
	Region 0: Memory at 1f40000000 (64-bit, non-prefetchable) [size=16K]
	Region 4: Memory at 1f40004000 (64-bit, non-prefetchable) [size=256]
	Capabilities: [80] Power Management version 3
		Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [90] MSI: Enable- Count=1/32 Maskable- 64bit+
		Address: 0000000000000000  Data: 0000
	Capabilities: [b0] MSI-X: Enable+ Count=17 Masked-
		Vector table: BAR=0 offset=00002000
		PBA: BAR=4 offset=00000000
	Capabilities: [c0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 unlimited
			ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+ FLReset-
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x4, ASPM L1, Exit Latency L1 <8us
			ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x4 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range B, TimeoutDis+, NROPrPrP-, LTR+
			 10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt+, EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS-, TPHComp-, ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
			 AtomicOpsCtl: ReqEn-
		LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+, EqualizationPhase1+
			 EqualizationPhase2+, EqualizationPhase3+, LinkEqualizationRequest-
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Device Serial Number 00-00-00-00-00-00-00-00
	Capabilities: [1b8 v1] Latency Tolerance Reporting
		Max snoop latency: 0ns
		Max no snoop latency: 0ns
	Capabilities: [300 v1] Secondary PCI Express
		LnkCtl3: LnkEquIntrruptEn-, PerformEqu-
		LaneErrStat: 0
	Capabilities: [900 v1] L1 PM Substates
		L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1- ASPM_L1.2+ ASPM_L1.1- L1_PM_Substates+
			  PortCommonModeRestoreTime=32us PortTPowerOnTime=10us
		L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
			   T_CommonMode=0us LTR1.2_Threshold=0ns
		L1SubCtl2: T_PwrOn=40us
	Kernel driver in use: nvme

Complete fio output:

On the Nano, the fio utility gives the speeds below for Random Read 4KB IOPS (Queues=32, Threads=16):

fio --loops=1 --size=256m --filename=/dev/nvme0n1p6 --stonewall --ioengine=libaio --direct=1 --name=4kQD32T16RandRead --bs=4096 --rw=randread --iodepth=32 --numjobs=16 --group_reporting
4kQD32T16RandRead: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.17
Starting 16 processes
Jobs: 12 (f=11): [r(1),f(1),r(1),_(1),r(1),_(2),r(8),_(1)][40.0%][r=665MiB/s][r=170k IOPS][eta 00m:09s]
4kQD32T16RandRead: (groupid=0, jobs=16): err= 0: pid=4841: Fri Sep 23 06:30:33 2022
  read: IOPS=157k, BW=613MiB/s (642MB/s)(4096MiB/6685msec)
    slat (usec): min=9, max=98778, avg=18.41, stdev=291.38
    clat (usec): min=73, max=128679, avg=2858.22, stdev=2535.26
     lat (usec): min=88, max=128691, avg=2877.22, stdev=2563.12
    clat percentiles (usec):
     |  1.00th=[  363],  5.00th=[  502], 10.00th=[ 1319], 20.00th=[ 1942],
     | 30.00th=[ 2212], 40.00th=[ 2442], 50.00th=[ 2638], 60.00th=[ 2900],
     | 70.00th=[ 3195], 80.00th=[ 3621], 90.00th=[ 4293], 95.00th=[ 4883],
     | 99.00th=[ 7308], 99.50th=[ 8717], 99.90th=[40109], 99.95th=[58459],
     | 99.99th=[92799]
   bw (  KiB/s): min=488021, max=1060836, per=100.00%, avg=681457.09, stdev=15736.47, samples=178
   iops        : min=121999, max=265208, avg=170363.24, stdev=3934.14, samples=178
  lat (usec)   : 100=0.01%, 250=0.38%, 500=4.58%, 750=1.84%, 1000=1.26%
  lat (msec)   : 2=13.98%, 4=64.30%, 10=13.31%, 20=0.12%, 50=0.16%
  lat (msec)   : 100=0.06%, 250=0.01%
  cpu          : usr=4.99%, sys=15.60%, ctx=72393, majf=0, minf=831
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1048576,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=613MiB/s (642MB/s), 613MiB/s-613MiB/s (642MB/s-642MB/s), io=4096MiB (4295MB), run=6685-6685msec

Disk stats (read/write):
  nvme0n1: ios=1043772/0, merge=0/0, ticks=2184252/0, in_queue=2212208, util=99.30%

For Random Write 4KB IOPS (Queues=32, Threads=16):

fio --loops=1 --size=256m --filename=/dev/nvme0n1p6 --stonewall --ioengine=libaio --direct=1 --name=4kQD32T16RandWrite --bs=4096 --rw=randwrite --iodepth=32 --numjobs=16 --group_reporting
4kQD32T16RandWrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.17
Starting 16 processes
Jobs: 14 (f=13): [_(1),f(1),w(6),_(1),w(7)][87.5%][w=617MiB/s][w=158k IOPS][eta 00m:01s]
4kQD32T16RandWrite: (groupid=0, jobs=16): err= 0: pid=4863: Fri Sep 23 06:31:41 2022
  write: IOPS=162k, BW=632MiB/s (662MB/s)(4096MiB/6485msec); 0 zone resets
    slat (usec): min=8, max=84853, avg=31.68, stdev=606.43
    clat (usec): min=24, max=105581, avg=2973.86, stdev=5548.81
     lat (usec): min=38, max=170046, avg=3006.24, stdev=5593.56
    clat percentiles (usec):
     |  1.00th=[  515],  5.00th=[  523], 10.00th=[  523], 20.00th=[  529],
     | 30.00th=[  668], 40.00th=[ 1713], 50.00th=[ 2245], 60.00th=[ 2343],
     | 70.00th=[ 2802], 80.00th=[ 2933], 90.00th=[ 4113], 95.00th=[ 9896],
     | 99.00th=[24511], 99.50th=[33424], 99.90th=[81265], 99.95th=[84411],
     | 99.99th=[86508]
   bw (  KiB/s): min=363328, max=835232, per=100.00%, avg=668391.54, stdev=8919.44, samples=189
   iops        : min=90832, max=208808, avg=167097.61, stdev=2229.86, samples=189
  lat (usec)   : 50=0.01%, 100=0.01%, 250=0.01%, 500=0.03%, 750=33.96%
  lat (usec)   : 1000=0.20%
  lat (msec)   : 2=8.64%, 4=46.68%, 10=5.51%, 20=3.34%, 50=1.23%
  lat (msec)   : 100=0.41%, 250=0.01%
  cpu          : usr=4.96%, sys=14.35%, ctx=30412, majf=0, minf=357
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=632MiB/s (662MB/s), 632MiB/s-632MiB/s (662MB/s-662MB/s), io=4096MiB (4295MB), run=6485-6485msec

Disk stats (read/write):
  nvme0n1: ios=134/1036966, merge=0/0, ticks=16/1262248, in_queue=1289604, util=99.29%

On the Xavier NX, the fio utility gives the speeds below for Random Read 4KB IOPS (Queues=32, Threads=16):

fio --loops=1 --size=256m --filename=/dev/nvme0n1p6 --stonewall --ioengine=libaio --direct=1 --name=4kQD32T16RandRead --bs=4096 --rw=randread --iodepth=32 --numjobs=16 --group_reporting
4kQD32T16RandRead: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.17
Starting 16 processes
Jobs: 4 (f=4): [_(2),r(1),_(1),r(1),_(1),r(1),_(5),r(1),_(3)][86.7%][r=312MiB/s][r=79.7k IOPS][eta 00m:02s]
4kQD32T16RandRead: (groupid=0, jobs=16): err= 0: pid=9101: Thu Sep 22 13:58:18 2022
  read: IOPS=79.1k, BW=309MiB/s (324MB/s)(4096MiB/13258msec)
    slat (usec): min=4, max=268105, avg=147.07, stdev=2301.29
    clat (usec): min=14, max=288124, avg=5764.67, stdev=13569.68
     lat (usec): min=98, max=288130, avg=5914.19, stdev=13722.22
    clat percentiles (usec):
     |  1.00th=[   306],  5.00th=[   355], 10.00th=[   379], 20.00th=[   408],
     | 30.00th=[   433], 40.00th=[   461], 50.00th=[   498], 60.00th=[   562],
     | 70.00th=[   873], 80.00th=[  2147], 90.00th=[ 26084], 95.00th=[ 41681],
     | 99.00th=[ 56361], 99.50th=[ 61604], 99.90th=[ 74974], 99.95th=[ 85459],
     | 99.99th=[154141]
   bw (  KiB/s): min=231623, max=657886, per=100.00%, avg=335004.23, stdev=6833.24, samples=379
   iops        : min=57905, max=164469, avg=83750.37, stdev=1708.28, samples=379
  lat (usec)   : 20=0.01%, 100=0.01%, 250=0.16%, 500=50.41%, 750=18.51%
  lat (usec)   : 1000=1.32%
  lat (msec)   : 2=6.90%, 4=7.13%, 10=1.38%, 20=2.79%, 50=9.41%
  lat (msec)   : 100=1.96%, 250=0.02%, 500=0.01%
  cpu          : usr=1.52%, sys=6.33%, ctx=9445, majf=0, minf=860
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=1048576,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=309MiB/s (324MB/s), 309MiB/s-309MiB/s (324MB/s-324MB/s), io=4096MiB (4295MB), run=13258-13258msec

Disk stats (read/write):
  nvme0n1: ios=1038873/0, merge=0/0, ticks=190372/0, in_queue=189852, util=99.32%

For Random Write 4KB IOPS (Queues=32, Threads=16):

fio --loops=1 --size=256m --filename=/dev/nvme0n1p6 --stonewall --ioengine=libaio --direct=1 --name=4kQD32T16RandWrite --bs=4096 --rw=randwrite --iodepth=32 --numjobs=16 --group_reporting
4kQD32T16RandWrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.17
Starting 16 processes
Jobs: 8 (f=8): [_(1),w(1),_(1),w(2),_(3),w(1),_(2),w(2),_(1),w(2)][88.9%][w=205MiB/s][w=52.4k IOPS][eta 00m:02s] 
4kQD32T16RandWrite: (groupid=0, jobs=16): err= 0: pid=9122: Thu Sep 22 13:58:35 2022
  write: IOPS=63.4k, BW=247MiB/s (259MB/s)(4096MiB/16551msec); 0 zone resets
    slat (usec): min=4, max=190152, avg=162.65, stdev=2482.81
    clat (usec): min=4, max=206981, avg=7282.10, stdev=15583.39
     lat (usec): min=20, max=206987, avg=7448.28, stdev=15746.29
    clat percentiles (usec):
     |  1.00th=[   429],  5.00th=[   441], 10.00th=[   449], 20.00th=[   457],
     | 30.00th=[   465], 40.00th=[   474], 50.00th=[   486], 60.00th=[   545],
     | 70.00th=[   668], 80.00th=[  8225], 90.00th=[ 31851], 95.00th=[ 44303],
     | 99.00th=[ 61604], 99.50th=[ 74974], 99.90th=[119014], 99.95th=[141558],
     | 99.99th=[196084]
   bw (  KiB/s): min=79238, max=442459, per=100.00%, avg=271356.11, stdev=6378.75, samples=480
   iops        : min=19809, max=110613, avg=67837.91, stdev=1594.69, samples=480
  lat (usec)   : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%
  lat (usec)   : 500=55.64%, 750=15.14%, 1000=1.87%
  lat (msec)   : 2=1.57%, 4=1.83%, 10=4.86%, 20=5.35%, 50=10.97%
  lat (msec)   : 100=2.59%, 250=0.19%
  cpu          : usr=1.37%, sys=5.60%, ctx=15455, majf=0, minf=344
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,1048576,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=247MiB/s (259MB/s), 247MiB/s-247MiB/s (259MB/s-259MB/s), io=4096MiB (4295MB), run=16551-16551msec

Disk stats (read/write):
  nvme0n1: ios=224/1035930, merge=0/0, ticks=24/705740, in_queue=706248, util=95.46%

We haven't done this kind of test; I will forward it to the internal team to see if they have any suggestions. Thanks

Hi @kayccc,

I might have found the fix for this. I am getting higher IOPS for random read/write after adding a custom power mode and setting it via nvpmodel (see the config sketch after the power mode details below). The custom power mode has all 6 CPUs online, with a minimum frequency of 1497600 kHz and a maximum frequency of 1907200 kHz on all cores. The default mode had only 2 cores online, with a minimum frequency of 1190400 kHz and a maximum of 1497600 kHz.
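
The per-core state can be confirmed directly from sysfs (these are the same paths nvpmodel reports below; the cpufreq values are in kHz):

cat /sys/devices/system/cpu/online
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq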

Power mode details:

# nvpmodel -q --verbose
NVPM VERB: Config file: /etc/nvpmodel.conf
NVPM VERB: parsing done for /etc/nvpmodel.conf
NV Fan Mode:cool
NVPM VERB: Current mode: NV Power Mode: MODE_CUSTOM_6CORE
5
NVPM VERB: PARAM CPU_ONLINE: ARG CORE_0: PATH /sys/devices/system/cpu/cpu0/online: REAL_VAL: 1 CONF_VAL: 1
NVPM VERB: PARAM CPU_ONLINE: ARG CORE_1: PATH /sys/devices/system/cpu/cpu1/online: REAL_VAL: 1 CONF_VAL: 1
NVPM VERB: PARAM CPU_ONLINE: ARG CORE_2: PATH /sys/devices/system/cpu/cpu2/online: REAL_VAL: 1 CONF_VAL: 1
NVPM VERB: PARAM CPU_ONLINE: ARG CORE_3: PATH /sys/devices/system/cpu/cpu3/online: REAL_VAL: 1 CONF_VAL: 1
NVPM VERB: PARAM CPU_ONLINE: ARG CORE_4: PATH /sys/devices/system/cpu/cpu4/online: REAL_VAL: 1 CONF_VAL: 1
NVPM VERB: PARAM CPU_ONLINE: ARG CORE_5: PATH /sys/devices/system/cpu/cpu5/online: REAL_VAL: 1 CONF_VAL: 1
NVPM VERB: PARAM TPC_POWER_GATING: ARG TPC_PG_MASK: PATH /sys/devices/gpu.0/tpc_pg_mask: REAL_VAL: 4 CONF_VAL: 1
NVPM VERB: PARAM GPU_POWER_CONTROL_ENABLE: ARG GPU_PWR_CNTL_EN: PATH /sys/devices/gpu.0/power/control: REAL_VAL: auto CONF_VAL: on
NVPM VERB: PARAM CPU_DENVER_0: ARG MIN_FREQ: PATH /sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq: REAL_VAL: 1497600 CONF_VAL: 1497600
NVPM VERB: PARAM CPU_DENVER_0: ARG MAX_FREQ: PATH /sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq: REAL_VAL: 1907200 CONF_VAL: 1907200
NVPM VERB: PARAM CPU_DENVER_1: ARG MIN_FREQ: PATH /sys/devices/system/cpu/cpu2/cpufreq/scaling_min_freq: REAL_VAL: 1497600 CONF_VAL: 1497600
NVPM VERB: PARAM CPU_DENVER_1: ARG MAX_FREQ: PATH /sys/devices/system/cpu/cpu2/cpufreq/scaling_max_freq: REAL_VAL: 1907200 CONF_VAL: 1907200
NVPM VERB: PARAM CPU_DENVER_2: ARG MIN_FREQ: PATH /sys/devices/system/cpu/cpu4/cpufreq/scaling_min_freq: REAL_VAL: 1497600 CONF_VAL: 1497600
NVPM VERB: PARAM CPU_DENVER_2: ARG MAX_FREQ: PATH /sys/devices/system/cpu/cpu4/cpufreq/scaling_max_freq: REAL_VAL: 1907200 CONF_VAL: 1907200
NVPM VERB: PARAM CPU_DENVER_3: ARG MIN_FREQ: PATH /sys/devices/system/cpu/cpu1/cpufreq/scaling_min_freq: REAL_VAL: 1497600 CONF_VAL: 1497600
NVPM VERB: PARAM CPU_DENVER_3: ARG MAX_FREQ: PATH /sys/devices/system/cpu/cpu1/cpufreq/scaling_max_freq: REAL_VAL: 1907200 CONF_VAL: 1907200
NVPM VERB: PARAM CPU_DENVER_4: ARG MIN_FREQ: PATH /sys/devices/system/cpu/cpu3/cpufreq/scaling_min_freq: REAL_VAL: 1497600 CONF_VAL: 1497600
NVPM VERB: PARAM CPU_DENVER_4: ARG MAX_FREQ: PATH /sys/devices/system/cpu/cpu3/cpufreq/scaling_max_freq: REAL_VAL: 1907200 CONF_VAL: 1907200
NVPM VERB: PARAM CPU_DENVER_5: ARG MIN_FREQ: PATH /sys/devices/system/cpu/cpu5/cpufreq/scaling_min_freq: REAL_VAL: 1497600 CONF_VAL: 1497600
NVPM VERB: PARAM CPU_DENVER_5: ARG MAX_FREQ: PATH /sys/devices/system/cpu/cpu5/cpufreq/scaling_max_freq: REAL_VAL: 1907200 CONF_VAL: 1907200
NVPM VERB: PARAM GPU: ARG MIN_FREQ: PATH /sys/devices/17000000.gv11b/devfreq/17000000.gv11b/min_freq: REAL_VAL: 114750000 CONF_VAL: 0
NVPM VERB: PARAM GPU: ARG MAX_FREQ: PATH /sys/devices/17000000.gv11b/devfreq/17000000.gv11b/max_freq: REAL_VAL: 1109250000 CONF_VAL: 1109250000
NVPM VERB: PARAM GPU_POWER_CONTROL_DISABLE: ARG GPU_PWR_CNTL_DIS: PATH /sys/devices/gpu.0/power/control: REAL_VAL: auto CONF_VAL: auto
NVPM VERB: PARAM EMC: ARG MAX_FREQ: PATH /sys/kernel/nvpmodel_emc_cap/emc_iso_cap: REAL_VAL: 1600000000 CONF_VAL: 1600000000
NVPM VERB: PARAM DLA_CORE: ARG MAX_FREQ: PATH /sys/kernel/nvpmodel_emc_cap/nafll_dla: REAL_VAL: 1100800000 CONF_VAL: 1100800000
NVPM VERB: PARAM DLA_FALCON: ARG MAX_FREQ: PATH /sys/kernel/nvpmodel_emc_cap/nafll_dla_falcon: REAL_VAL: 640000000 CONF_VAL: 640000000
NVPM VERB: PARAM PVA_VPS: ARG MAX_FREQ: PATH /sys/kernel/nvpmodel_emc_cap/nafll_pva_vps: REAL_VAL: 819200000 CONF_VAL: 819200000
NVPM VERB: PARAM PVA_CORE: ARG MAX_FREQ: PATH /sys/kernel/nvpmodel_emc_cap/nafll_pva_core: REAL_VAL: 601600000 CONF_VAL: 601600000
NVPM VERB: PARAM CVNAS: ARG MAX_FREQ: PATH /sys/kernel/nvpmodel_emc_cap/nafll_cvnas: REAL_VAL: 576000000 CONF_VAL: 576000000
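
For reference, the custom entry in /etc/nvpmodel.conf looks roughly like the abridged sketch below (a sketch, not the full entry: the ID and name match the nvpmodel -q output above, the frequencies are in kHz, and CPU_DENVER_1 through CPU_DENVER_5 follow the same pattern):

< POWER_MODEL ID=5 NAME=MODE_CUSTOM_6CORE >
# Bring all six cores online
CPU_ONLINE CORE_0 1
CPU_ONLINE CORE_1 1
CPU_ONLINE CORE_2 1
CPU_ONLINE CORE_3 1
CPU_ONLINE CORE_4 1
CPU_ONLINE CORE_5 1
# Raise the CPU frequency floor and ceiling (values in kHz)
CPU_DENVER_0 MIN_FREQ 1497600
CPU_DENVER_0 MAX_FREQ 1907200
# ... MIN_FREQ/MAX_FREQ repeated for CPU_DENVER_1 through CPU_DENVER_5 ...

The mode is then applied and verified with:

sudo nvpmodel -m 5
sudo nvpmodel -q --verbose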

After this change, I get the following results:

Random read on Xavier NX: IOPS=166k

# fio --loops=5 --size=256m --filename=/dev/nvme0n1p6 --stonewall --ioengine=libaio --direct=1 --name=4kQD32T16RandRead --bs=4096 --rw=randread --iodepth=32 --numjobs=16 --group_reporting
4kQD32T16RandRead: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.17
Starting 16 processes
Jobs: 4 (f=4): [_(1),r(1),_(1),r(1),_(2),r(1),_(1),r(1),_(7)][91.4%][r=663MiB/s][r=170k IOPS][eta 00m:03s]  
4kQD32T16RandRead: (groupid=0, jobs=16): err= 0: pid=5227: Wed Sep 28 16:42:44 2022
  read: IOPS=166k, BW=647MiB/s (679MB/s)(20.0GiB/31637msec)
    slat (usec): min=3, max=86287, avg=18.97, stdev=142.77
    clat (usec): min=29, max=846669, avg=2931.08, stdev=4463.16
     lat (usec): min=50, max=846685, avg=2950.35, stdev=4492.51
    clat percentiles (usec):
     |  1.00th=[   359],  5.00th=[   865], 10.00th=[  1172], 20.00th=[  1582],
     | 30.00th=[  1926], 40.00th=[  2245], 50.00th=[  2540], 60.00th=[  2900],
     | 70.00th=[  3294], 80.00th=[  3851], 90.00th=[  4752], 95.00th=[  5735],
     | 99.00th=[  7832], 99.50th=[ 12125], 99.90th=[ 35390], 99.95th=[ 51119],
     | 99.99th=[152044]
   bw (  KiB/s): min=183131, max=1335635, per=100.00%, avg=676378.71, stdev=13854.50, samples=954
   iops        : min=45781, max=333907, avg=169094.12, stdev=3463.61, samples=954
  lat (usec)   : 50=0.01%, 100=0.02%, 250=0.39%, 500=1.51%, 750=1.84%
  lat (usec)   : 1000=3.26%
  lat (msec)   : 2=25.11%, 4=49.94%, 10=17.32%, 20=0.33%, 50=0.22%
  lat (msec)   : 100=0.04%, 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  cpu          : usr=3.33%, sys=18.75%, ctx=2239982, majf=0, minf=852
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=5242880,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
   READ: bw=647MiB/s (679MB/s), 647MiB/s-647MiB/s (679MB/s-679MB/s), io=20.0GiB (21.5GB), run=31637-31637msec

Disk stats (read/write):
  nvme0n1: ios=5238713/0, merge=0/0, ticks=13995076/0, in_queue=14740488, util=100.00%

Random write on Xavier NX: IOPS=222k

# fio --loops=5 --size=256m --filename=/dev/nvme0n1p6 --stonewall --ioengine=libaio --direct=1 --name=4kQD32T16RandWrite --bs=4096 --rw=randwrite --iodepth=32 --numjobs=16 --group_reporting
4kQD32T16RandWrite: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.17
Starting 16 processes
Jobs: 15 (f=15): [w(3),_(1),w(12)][95.8%][w=894MiB/s][w=229k IOPS][eta 00m:01s]
4kQD32T16RandWrite: (groupid=0, jobs=16): err= 0: pid=5248: Wed Sep 28 16:43:14 2022
  write: IOPS=222k, BW=868MiB/s (910MB/s)(20.0GiB/23595msec); 0 zone resets
    slat (usec): min=3, max=64284, avg=31.95, stdev=402.90
    clat (usec): min=10, max=69358, avg=2227.51, stdev=3373.43
     lat (usec): min=31, max=69370, avg=2259.79, stdev=3397.75
    clat percentiles (usec):
     |  1.00th=[  478],  5.00th=[  553], 10.00th=[  586], 20.00th=[  611],
     | 30.00th=[  635], 40.00th=[  693], 50.00th=[ 1205], 60.00th=[ 1565],
     | 70.00th=[ 2024], 80.00th=[ 2737], 90.00th=[ 4555], 95.00th=[ 7832],
     | 99.00th=[18482], 99.50th=[22152], 99.90th=[30016], 99.95th=[33817],
     | 99.99th=[52167]
   bw (  KiB/s): min=649227, max=1308090, per=100.00%, avg=898054.53, stdev=8165.48, samples=732
   iops        : min=162305, max=327022, avg=224512.99, stdev=2041.37, samples=732
  lat (usec)   : 20=0.01%, 50=0.01%, 100=0.01%, 250=0.01%, 500=1.37%
  lat (usec)   : 750=41.75%, 1000=3.09%
  lat (msec)   : 2=23.21%, 4=19.00%, 10=7.98%, 20=2.83%, 50=0.74%
  lat (msec)   : 100=0.01%
  cpu          : usr=3.53%, sys=22.85%, ctx=203452, majf=0, minf=350
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,5242880,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=868MiB/s (910MB/s), 868MiB/s-868MiB/s (910MB/s-910MB/s), io=20.0GiB (21.5GB), run=23595-23595msec

Disk stats (read/write):
  nvme0n1: ios=1554/5242712, merge=0/0, ticks=1400/5108700, in_queue=5249716, util=99.62%

This might need long-term testing to make sure it holds up. I would still like to hear the internal team's comments on this.