DevKit NVMe performance

Samsung Electronics Co Ltd NVMe SSD Controller SM951/PM951 (rev 01)

608 MB/s sync write
1.2 GB/s read

Not too shabby!

jetson2 /mnt# mount /dev/nvme0n1p2 ./nvmep2
jetson2 /mnt# dd if=/dev/zero of=./nvmep2/root/.ddtest bs=1M count=1000 conv=fsync
1000+0 records in 
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 1.72349 s, 608 MB/s
jetson2 /mnt# umount nvmep2
jetson2 /mnt# sync
jetson2 /mnt# mount /dev/nvme0n1p2 ./nvmep2
jetson2 /mnt# dd if=./nvmep2/root/.ddtest of=/dev/null bs=1M count=1000
1000+0 records in
1000+0 records out
1048576000 bytes (1.0 GB, 1000 MiB) copied, 0.867345 s, 1.2 GB/s
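
If you'd rather not umount/mount to drop the cache, dd's direct-I/O flags should give comparable numbers. An illustrative variant (not part of the run above):

dd if=/dev/zero of=./nvmep2/root/.ddtest bs=1M count=1000 oflag=direct   # write, bypassing the page cache
dd if=./nvmep2/root/.ddtest of=/dev/null bs=1M count=1000 iflag=direct   # read, so cached pages can't inflate the rate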

Hi gtj,

Please try the “IOzone” utility to test NVMe read/write performance:

sudo apt-get update
sudo apt-get install iozone3
sudo iozone -ecI -+n -L64 -S32 -s64m -r4k -i0 -i2 -l8 -u8 -o -m -t8 -F /mnt/file1 /mnt/file2 /mnt/file3 /mnt/file4 /mnt/file5 /mnt/file6 /mnt/file7 /mnt/file8
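
For reference, here is what those switches select (my annotation; the run log further down echoes the same settings):

# -e         include flush (fsync) in write timing
# -c         include close() in write timing
# -I         use O_DIRECT, bypassing the page cache
# -+n        no retests
# -L64 -S32  processor cache line / cache size hints (bytes / kB)
# -s64m      64 MB file per process
# -r4k       4 kB record size
# -i0 -i2    test 0 (write) and test 2 (random read/write)
# -l8 -u8    minimum / maximum number of processes
# -o         synchronous writes (O_SYNC)
# -m         multiple internal buffers
# -t8        throughput mode with 8 processes
# -F ...     one file per process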

Before running, please set max performance mode:

sudo nvpmodel -m 0
sudo jetson_clocks
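
You can confirm the mode took effect with:

sudo nvpmodel -q    # prints the currently active power mode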

I wasn’t complaining. I was impressed!

Anyway, here are the iozone results, which are not very impressive. :)

jetson2 /~# nvpmodel -m 0
NVPM WARN: patching tpc_pg_mask: (0x1:0x2)
NVPM WARN: patched tpc_pg_mask: 0x2
jetson2 /~# jetson_clocks
jetson2 /~# iozone -ecI -+n -L64 -S32 -s64m -r4k -i0 -i2 -l8 -u8 -o -m -t8 -F /mnt/file1 /mnt/file2 /mnt/file3 /mnt/file4 /mnt/file5 /mnt/file6 /mnt/file7 /mnt/file8
	Iozone: Performance Test of File I/O
	        Version $Revision: 3.429 $
		Compiled for 64 bit mode.
		Build: linux 

	Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
	             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
	             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
	             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
	             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
	             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
	             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
	             Vangel Bojaxhi, Ben England, Vikentsi Lapa.

	Run began: Sun May 17 21:10:01 2020

	Include fsync in write timing
	Include close in write timing
	O_DIRECT feature enabled
	No retest option selected
	File size set to 65536 kB
	Record Size 4 kB
	SYNC Mode. 
	Multi_buffer. Work area 16777216 bytes
	Command line used: iozone -ecI -+n -L64 -S32 -s64m -r4k -i0 -i2 -l8 -u8 -o -m -t8 -F /mnt/file1 /mnt/file2 /mnt/file3 /mnt/file4 /mnt/file5 /mnt/file6 /mnt/file7 /mnt/file8
	Output is in kBytes/sec
	Time Resolution = 0.000001 seconds.
	Processor cache size set to 32 kBytes.
	Processor cache line size set to 64 bytes.
	File stride size set to 17 * record size.
	Min process = 8 
	Max process = 8 
	Throughput test with 8 processes
	Each process writes a 65536 kByte file in 4 kByte records

	Children see throughput for  8 initial writers 	=    4251.39 kB/sec
	Parent sees throughput for  8 initial writers 	=    4208.91 kB/sec
	Min throughput per process 			=     530.02 kB/sec 
	Max throughput per process 			=     535.52 kB/sec
	Avg throughput per process 			=     531.42 kB/sec
	Min xfer 					=   64864.00 kB

	Children see throughput for 8 random readers 	=  163663.69 kB/sec
	Parent sees throughput for 8 random readers 	=  163434.76 kB/sec
	Min throughput per process 			=   20050.63 kB/sec 
	Max throughput per process 			=   21585.89 kB/sec
	Avg throughput per process 			=   20457.96 kB/sec
	Min xfer 					=   60848.00 kB

	Children see throughput for 8 random writers 	=    4073.73 kB/sec
	Parent sees throughput for 8 random writers 	=    4057.25 kB/sec
	Min throughput per process 			=     508.44 kB/sec 
	Max throughput per process 			=     510.97 kB/sec
	Avg throughput per process 			=     509.22 kB/sec
	Min xfer 					=   65212.00 kB



iozone test complete.
jetson2 /~#

I’ve just ordered an NX dev kit and I’m wondering which M.2 Key-M 2280 NVMe SSDs are compatible. The Design Guide specifies that the 3.3V rail supplies a maximum of 2.6 W. Was the drive using a low-power profile when benchmarked? What model did you use for testing?

“Samsung 950 PRO 256GB SSD (MZ-V5P256BW) V-NAND, M.2 NVM Express”, which is the same one I have in my desktop/development machine. It’s got to be about 5 years old.

The spec says “Average: 5.1 Watts, Idle: 70 mW”, but I don’t know how to tell what it’s currently drawing. The speed results were the same on my desktop as on the NX, though.
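
If nvme-cli is available, it can at least show the power states the drive advertises and the one it is currently in. A sketch on my side, not something from the test above:

sudo apt-get install nvme-cli
sudo nvme id-ctrl /dev/nvme0 | grep '^ps '     # advertised power states with their max power draw
sudo nvme get-feature /dev/nvme0 -f 0x02 -H    # feature 0x02 (Power Management) reports the current power state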

Hi gtj,

Below are our test results for NVMe read/write performance:

Seq Write: 126 MB/s
Seq Read: 237 MB/s

Tested with an Intel 256GB NVMe drive.

Just for the record, here are the results for a Corsair MP510 240GB (NVMe PCIe Gen3 x4 M.2 SSD):
sudo iozone -ecI -+n -L64 -S32 -s64m -r4k -i0 -i2 -l8 -u8 -o -m -t8 -F /mnt/file1 /mnt/file2 /mnt/file3 /mnt/file4 /mnt/file5 /mnt/file6 /mnt/file7 /mnt/file8
Iozone: Performance Test of File I/O
Version Revision: 3.429
Compiled for 64 bit mode.
Build: linux

Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
             Al Slater, Scott Rhine, Mike Wisner, Ken Goss
             Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
             Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
             Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
             Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
             Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
             Vangel Bojaxhi, Ben England, Vikentsi Lapa.

Run began: Thu May 21 20:06:55 2020

Include fsync in write timing
Include close in write timing
O_DIRECT feature enabled
No retest option selected
File size set to 65536 kB
Record Size 4 kB
SYNC Mode. 
Multi_buffer. Work area 16777216 bytes
Command line used: iozone -ecI -+n -L64 -S32 -s64m -r4k -i0 -i2 -l8 -u8 -o -m -t8 -F /mnt/file1 /mnt/file2 /mnt/file3 /mnt/file4 /mnt/file5 /mnt/file6 /mnt/file7 /mnt/file8
Output is in kBytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 32 kBytes.
Processor cache line size set to 64 bytes.
File stride size set to 17 * record size.
Min process = 8 
Max process = 8 
Throughput test with 8 processes
Each process writes a 65536 kByte file in 4 kByte records

Children see throughput for  8 initial writers 	=    9140.81 kB/sec
Parent sees throughput for  8 initial writers 	=    9136.90 kB/sec
Min throughput per process 			=    1142.09 kB/sec 
Max throughput per process 			=    1142.85 kB/sec
Avg throughput per process 			=    1142.60 kB/sec
Min xfer 					=   65496.00 kB

Children see throughput for 8 random readers 	=  190896.49 kB/sec
Parent sees throughput for 8 random readers 	=  190497.06 kB/sec
Min throughput per process 			=   22496.33 kB/sec 
Max throughput per process 			=   26319.76 kB/sec
Avg throughput per process 			=   23862.06 kB/sec
Min xfer 					=   55516.00 kB

Children see throughput for 8 random writers 	=    9693.90 kB/sec
Parent sees throughput for 8 random writers 	=    9598.86 kB/sec
Min throughput per process 			=    1201.26 kB/sec 
Max throughput per process 			=    1218.60 kB/sec
Avg throughput per process 			=    1211.74 kB/sec
Min xfer 					=   64608.00 kB

I’ve been messing with iozone “forever” and I still can’t make heads or tails of what the results mean in real life. :)

I’m sticking with dd :)

jetson2 /~# dd if=/dev/zero of=/dev/nvme0n1p3 bs=4K count=1000000
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB, 3.8 GiB) copied, 6.10829 s, 671 MB/s
jetson2 /~# 
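
One caveat with that raw-device run: without oflag=direct or conv=fsync, some of the 4 GB may still be sitting in the page cache when dd reports its rate. A variant that only counts data that actually reached the drive (illustrative, same device):

dd if=/dev/zero of=/dev/nvme0n1p3 bs=4K count=1000000 oflag=direct
# or keep buffered writes but flush before dd stops the clock:
dd if=/dev/zero of=/dev/nvme0n1p3 bs=4K count=1000000 conv=fsync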

Here are the results for a Samsung 981 on the NX devkit (I think the dd results are skewed):

root@nx-tegra194:/# dd if=/dev/zero of=/dev/nvme0n1p2 bs=4K count=1000000
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB, 3.8 GiB) copied, 53.9752 s, 75.9 MB/s

and iozone3:

	Command line used: iozone -ecI -+n -L64 -S32 -s64m -r4k -i0 -i2 -l8 -u8 -o -m -t8 -F /mnt/file1 /mnt/file2 /mnt/file3 /mnt/file4 /mnt/file5 /mnt/file6 /mnt/file7 /mnt/file8
Output is in kBytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 32 kBytes.
Processor cache line size set to 64 bytes.
File stride size set to 17 * record size.
Min process = 8 
Max process = 8 
Throughput test with 8 processes
Each process writes a 65536 kByte file in 4 kByte records

Children see throughput for  8 initial writers 	=    5003.57 kB/sec
Parent sees throughput for  8 initial writers 	=    5001.63 kB/sec
Min throughput per process 			=     625.35 kB/sec 
Max throughput per process 			=     625.63 kB/sec
Avg throughput per process 			=     625.45 kB/sec
Min xfer 					=   65508.00 kB

Children see throughput for 8 random readers 	=  184847.29 kB/sec
Parent sees throughput for 8 random readers 	=  184562.06 kB/sec
Min throughput per process 			=   21840.28 kB/sec 
Max throughput per process 			=   27732.65 kB/sec
Avg throughput per process 			=   23105.91 kB/sec
Min xfer 					=   51588.00 kB

Children see throughput for 8 random writers 	=    5305.45 kB/sec
Parent sees throughput for 8 random writers 	=    5292.98 kB/sec
Min throughput per process 			=     662.01 kB/sec 
Max throughput per process 			=     664.01 kB/sec
Avg throughput per process 			=     663.18 kB/sec
Min xfer 					=   65340.00 kB

-albertr

Hi,

We noticed that the application (iozone in this case) has to use the preadv2/pwritev2 calls with the RWF_HIPRI flag; details at https://lwn.net/Articles/670231/. Otherwise there is heavy context switching on the CPU.

We are not sure whether iozone has implemented preadv2 and pwritev2 yet. You could try this patch, which should improve performance.

diff --git a/fs/direct-io.c b/fs/direct-io.c
index c19155f..4b2abf3 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -457,8 +457,7 @@
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		dio->waiter = current;
 		spin_unlock_irqrestore(&dio->bio_lock, flags);
-		if (!(dio->iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_poll(bdev_get_queue(dio->bio_bdev), dio->bio_cookie))
+		if (!blk_poll(bdev_get_queue(dio->bio_bdev), dio->bio_cookie))
 			io_schedule();
 		/* wake up sets us TASK_RUNNING */
 		spin_lock_irqsave(&dio->bio_lock, flags);
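
Separately from the patch, blk_poll() only takes effect when polling is enabled on the block queue, so it may also be worth checking the sysfs knobs (a side note from me; exact behaviour depends on the kernel build):

cat /sys/block/nvme0n1/queue/io_poll                  # 1 = completion polling enabled for this queue
echo 1 | sudo tee /sys/block/nvme0n1/queue/io_poll    # enable it if it reads 0
cat /sys/block/nvme0n1/queue/io_poll_delay            # -1 = classic polling, 0 = adaptive hybrid polling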

The patch didn’t really improve things for iozone…

Before:

Children see throughput for  8 initial writers 	=    4315.54 kB/sec
Parent sees throughput for  8 initial writers 	=    4308.09 kB/sec
Min throughput per process 			=     538.49 kB/sec 
Max throughput per process 			=     540.10 kB/sec
Avg throughput per process 			=     539.44 kB/sec
Min xfer 					=   65340.00 kB

Children see throughput for 8 random readers 	=  237274.24 kB/sec
Parent sees throughput for 8 random readers 	=  236959.83 kB/sec
Min throughput per process 			=   27950.16 kB/sec 
Max throughput per process 			=   31743.83 kB/sec
Avg throughput per process 			=   29659.28 kB/sec
Min xfer 					=   57848.00 kB

Children see throughput for 8 random writers 	=    4425.41 kB/sec
Parent sees throughput for 8 random writers 	=    4414.20 kB/sec
Min throughput per process 			=     552.43 kB/sec 
Max throughput per process 			=     554.19 kB/sec
Avg throughput per process 			=     553.18 kB/sec
Min xfer 					=   65332.00 kB

After:

Children see throughput for  8 initial writers 	=    4661.47 kB/sec
Parent sees throughput for  8 initial writers 	=    4652.70 kB/sec
Min throughput per process 			=     582.07 kB/sec 
Max throughput per process 			=     583.53 kB/sec
Avg throughput per process 			=     582.68 kB/sec
Min xfer 					=   65372.00 kB

Children see throughput for 8 random readers 	=  237757.85 kB/sec
Parent sees throughput for 8 random readers 	=  237275.72 kB/sec
Min throughput per process 			=   25240.70 kB/sec 
Max throughput per process 			=   36438.22 kB/sec
Avg throughput per process 			=   29719.73 kB/sec
Min xfer 					=   45488.00 kB

Children see throughput for 8 random writers 	=    4781.74 kB/sec
Parent sees throughput for 8 random writers 	=    4761.44 kB/sec
Min throughput per process 			=     595.89 kB/sec 
Max throughput per process 			=     599.42 kB/sec
Avg throughput per process 			=     597.72 kB/sec
Min xfer 					=   65152.00 kB

It DID improve 4K block writes with dd though:

Before:

jetson2 /mnt# dd if=/dev/zero of=.ddtest bs=4K count=40000 oflag=direct
40000+0 records in
40000+0 records out
163840000 bytes (164 MB, 156 MiB) copied, 2.21959 s, 73.8 MB/s

After:

jetson2 /mnt# dd if=/dev/zero of=.ddtest bs=4K count=100000 oflag=direct
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 2.96477 s, 138 MB/s

Hi gtj,

It sounds like the test application may be a factor. Could you use the command below along with the patch and try again?

iozone -ecI -+n -L64 -S32 -r4k -i0 -i1 -i2 -s500m -f <path/to/output/file>

So, it appears the results from a good NVMe SSD (970 EVO Plus) on the NX are identical to x86 (using gnome-disks, 100 × 10 MiB samples) and to the devkit.

Compared to an SD card, it’s a pretty significant difference.

Note that these are just raw reads/writes and not a filesystem benchmark. Individual filesystem results will vary.

In my case it’s weird: I have a Pioneer 500GB Gen3 and I only match PC speeds on reads; people on PCs get twice my write speed, about 1 GB/s.
Another weird thing is that on a PC the disk gives double its announced performance, but in my case only on reads; writes match the announced figure.
Anyway, it’s fast enough.