Anyway, here are the iozone results, which are not very impressive. :)
jetson2 /~# nvpmodel -m 0
NVPM WARN: patching tpc_pg_mask: (0x1:0x2)
NVPM WARN: patched tpc_pg_mask: 0x2
jetson2 /~# jetson_clocks
jetson2 /~# iozone -ecI -+n -L64 -S32 -s64m -r4k -i0 -i2 -l8 -u8 -o -m -t8 -F /mnt/file1 /mnt/file2 /mnt/file3 /mnt/file4 /mnt/file5 /mnt/file6 /mnt/file7 /mnt/file8
Iozone: Performance Test of File I/O
Version $Revision: 3.429 $
Compiled for 64 bit mode.
Build: linux
Contributors:William Norcott, Don Capps, Isom Crawford, Kirby Collins
Al Slater, Scott Rhine, Mike Wisner, Ken Goss
Steve Landherr, Brad Smith, Mark Kelly, Dr. Alain CYR,
Randy Dunlap, Mark Montague, Dan Million, Gavin Brebner,
Jean-Marc Zucconi, Jeff Blomberg, Benny Halevy, Dave Boone,
Erik Habbinga, Kris Strecker, Walter Wong, Joshua Root,
Fabrice Bacchella, Zhenghua Xue, Qin Li, Darren Sawyer,
Vangel Bojaxhi, Ben England, Vikentsi Lapa.
Run began: Sun May 17 21:10:01 2020
Include fsync in write timing
Include close in write timing
O_DIRECT feature enabled
No retest option selected
File size set to 65536 kB
Record Size 4 kB
SYNC Mode.
Multi_buffer. Work area 16777216 bytes
Command line used: iozone -ecI -+n -L64 -S32 -s64m -r4k -i0 -i2 -l8 -u8 -o -m -t8 -F /mnt/file1 /mnt/file2 /mnt/file3 /mnt/file4 /mnt/file5 /mnt/file6 /mnt/file7 /mnt/file8
Output is in kBytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 32 kBytes.
Processor cache line size set to 64 bytes.
File stride size set to 17 * record size.
Min process = 8
Max process = 8
Throughput test with 8 processes
Each process writes a 65536 kByte file in 4 kByte records
Children see throughput for 8 initial writers = 4251.39 kB/sec
Parent sees throughput for 8 initial writers = 4208.91 kB/sec
Min throughput per process = 530.02 kB/sec
Max throughput per process = 535.52 kB/sec
Avg throughput per process = 531.42 kB/sec
Min xfer = 64864.00 kB
Children see throughput for 8 random readers = 163663.69 kB/sec
Parent sees throughput for 8 random readers = 163434.76 kB/sec
Min throughput per process = 20050.63 kB/sec
Max throughput per process = 21585.89 kB/sec
Avg throughput per process = 20457.96 kB/sec
Min xfer = 60848.00 kB
Children see throughput for 8 random writers = 4073.73 kB/sec
Parent sees throughput for 8 random writers = 4057.25 kB/sec
Min throughput per process = 508.44 kB/sec
Max throughput per process = 510.97 kB/sec
Avg throughput per process = 509.22 kB/sec
Min xfer = 65212.00 kB
iozone test complete.
jetson2 /~#
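For anyone skimming the numbers, here is what that command line is actually asking for; the annotations below just restate the settings iozone echoes back in its own header, and they are worth keeping in mind when comparing the write figures:
# What the iozone flags above mean (as confirmed by the run header):
#   -e -c      include fsync() and close() in the write timing
#   -I         open the files with O_DIRECT
#   -+n        no retest pass
#   -L64 -S32  CPU cache line (bytes) / cache size (kB) hints
#   -s64m      64 MB file per process
#   -r4k       4 kB record size
#   -i0 -i2    test 0 = write/rewrite, test 2 = random read/write
#   -l8 -u8    run with a minimum and maximum of 8 processes
#   -o         writes are O_SYNC
#   -m         use multiple internal buffers
#   -t8        throughput mode with 8 processes
#   -F ...     one target file per process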
I’ve just ordered an NX dev kit and I’m wondering which M.2 Key-M 2280 NVMe SSDs are compatible. The Design Guide specifies that the 3.3 V rail can only supply a maximum of 2.6 W. Was the drive using a low-power profile when benchmarked? What model did you use when testing?
“Samsung 950 PRO 256GB SSD (MZ-V5P256BW) V-NAND, M.2 NVM Express” which is the same one I have in my desktop/development machine. Got to be about 5 years old.
The spec says “Average: 5.1 Watts, Idle: 70 mW”, but I don’t know how to tell what it’s currently drawing. The speed results were the same on my desktop as on the NX, though.
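If it helps, nvme-cli can usually answer both questions; a rough sketch, assuming the drive shows up as /dev/nvme0:
# Power state descriptors the drive advertises (max power per state)
sudo nvme id-ctrl /dev/nvme0 | grep -i '^ps '
# Currently configured power state (Power Management feature, id 0x02)
sudo nvme get-feature /dev/nvme0 -f 0x02 -H
Note that with autonomous power state transitions enabled the drive can still drop into lower states on its own, so this only shows the host-configured state.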
Just for the record, here are the results for a Corsair MP510 240GB (NVMe PCIe Gen3 x4 M.2 SSD):
sudo iozone -ecI -+n -L64 -S32 -s64m -r4k -i0 -i2 -l8 -u8 -o -m -t8 -F /mnt/file1 /mnt/file2 /mnt/file3 /mnt/file4 /mnt/file5 /mnt/file6 /mnt/file7 /mnt/file8
Iozone: Performance Test of File I/O
Version Revision: 3.429
Compiled for 64 bit mode.
Build: linux
Run began: Thu May 21 20:06:55 2020
Include fsync in write timing
Include close in write timing
O_DIRECT feature enabled
No retest option selected
File size set to 65536 kB
Record Size 4 kB
SYNC Mode.
Multi_buffer. Work area 16777216 bytes
Command line used: iozone -ecI -+n -L64 -S32 -s64m -r4k -i0 -i2 -l8 -u8 -o -m -t8 -F /mnt/file1 /mnt/file2 /mnt/file3 /mnt/file4 /mnt/file5 /mnt/file6 /mnt/file7 /mnt/file8
Output is in kBytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 32 kBytes.
Processor cache line size set to 64 bytes.
File stride size set to 17 * record size.
Min process = 8
Max process = 8
Throughput test with 8 processes
Each process writes a 65536 kByte file in 4 kByte records
Children see throughput for 8 initial writers = 9140.81 kB/sec
Parent sees throughput for 8 initial writers = 9136.90 kB/sec
Min throughput per process = 1142.09 kB/sec
Max throughput per process = 1142.85 kB/sec
Avg throughput per process = 1142.60 kB/sec
Min xfer = 65496.00 kB
Children see throughput for 8 random readers = 190896.49 kB/sec
Parent sees throughput for 8 random readers = 190497.06 kB/sec
Min throughput per process = 22496.33 kB/sec
Max throughput per process = 26319.76 kB/sec
Avg throughput per process = 23862.06 kB/sec
Min xfer = 55516.00 kB
Children see throughput for 8 random writers = 9693.90 kB/sec
Parent sees throughput for 8 random writers = 9598.86 kB/sec
Min throughput per process = 1201.26 kB/sec
Max throughput per process = 1218.60 kB/sec
Avg throughput per process = 1211.74 kB/sec
Min xfer = 64608.00 kB
Here are the results for a Samsung 981 on the NX devkit (I think the dd results are skewed):
root@nx-tegra194:/# dd if=/dev/zero of=/dev/nvme0n1p2 bs=4K count=1000000
1000000+0 records in
1000000+0 records out
4096000000 bytes (4.1 GB, 3.8 GiB) copied, 53.9752 s, 75.9 MB/s
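That dd run goes through the page cache and uses tiny 4 kB blocks, so it largely measures per-call overhead rather than the drive; something along these lines (a hypothetical invocation, just as destructive to the partition as the original) would give a more representative sequential write figure:
# Bypass the page cache and use larger blocks; this still overwrites /dev/nvme0n1p2
dd if=/dev/zero of=/dev/nvme0n1p2 bs=1M count=4000 oflag=direct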
and iozone3:
Command line used: iozone -ecI -+n -L64 -S32 -s64m -r4k -i0 -i2 -l8 -u8 -o -m -t8 -F /mnt/file1 /mnt/file2 /mnt/file3 /mnt/file4 /mnt/file5 /mnt/file6 /mnt/file7 /mnt/file8
Output is in kBytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 32 kBytes.
Processor cache line size set to 64 bytes.
File stride size set to 17 * record size.
Min process = 8
Max process = 8
Throughput test with 8 processes
Each process writes a 65536 kByte file in 4 kByte records
Children see throughput for 8 initial writers = 5003.57 kB/sec
Parent sees throughput for 8 initial writers = 5001.63 kB/sec
Min throughput per process = 625.35 kB/sec
Max throughput per process = 625.63 kB/sec
Avg throughput per process = 625.45 kB/sec
Min xfer = 65508.00 kB
Children see throughput for 8 random readers = 184847.29 kB/sec
Parent sees throughput for 8 random readers = 184562.06 kB/sec
Min throughput per process = 21840.28 kB/sec
Max throughput per process = 27732.65 kB/sec
Avg throughput per process = 23105.91 kB/sec
Min xfer = 51588.00 kB
Children see throughput for 8 random writers = 5305.45 kB/sec
Parent sees throughput for 8 random writers = 5292.98 kB/sec
Min throughput per process = 662.01 kB/sec
Max throughput per process = 664.01 kB/sec
Avg throughput per process = 663.18 kB/sec
Min xfer = 65340.00 kB
We notice that the application (iozone in this case) has to use the preadv2/pwritev2 system calls with the RWF_HIPRI flag to take advantage of polling. Details at https://lwn.net/Articles/670231/. Otherwise there is heavy context switching on the CPU.
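For reference, this is roughly what the user-space side looks like; a minimal sketch assuming glibc >= 2.26 for the preadv2() wrapper (the path and the 4 kB sizes are just placeholders):
/* Minimal sketch of a polled O_DIRECT read using preadv2() + RWF_HIPRI. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	int fd = open("/mnt/file1", O_RDONLY | O_DIRECT);
	if (fd < 0) { perror("open"); return 1; }

	/* O_DIRECT needs block-aligned buffers */
	if (posix_memalign(&buf, 4096, 4096)) { close(fd); return 1; }

	struct iovec iov = { .iov_base = buf, .iov_len = 4096 };

	/* RWF_HIPRI asks the kernel to busy-poll the block layer for the
	 * completion instead of sleeping and taking an interrupt + wakeup */
	ssize_t n = preadv2(fd, &iov, 1, 0, RWF_HIPRI);
	if (n < 0)
		perror("preadv2");
	else
		printf("read %zd bytes\n", n);

	free(buf);
	close(fd);
	return 0;
}
The poll path only applies to O_DIRECT I/O and only if polling is enabled on the queue (the io_poll attribute under /sys/block/<device>/queue).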
Not sure if iozone has implemented preadv2 and pwritev2 yet. You could try this patch, which should improve the performance.
diff --git a/fs/direct-io.c b/fs/direct-io.c
index c19155f..4b2abf3 100644
--- a/fs/direct-io.c
+++ b/fs/direct-io.c
@@ -457,8 +457,7 @@
 		__set_current_state(TASK_UNINTERRUPTIBLE);
 		dio->waiter = current;
 		spin_unlock_irqrestore(&dio->bio_lock, flags);
-		if (!(dio->iocb->ki_flags & IOCB_HIPRI) ||
-		    !blk_poll(bdev_get_queue(dio->bio_bdev), dio->bio_cookie))
+		if (!blk_poll(bdev_get_queue(dio->bio_bdev), dio->bio_cookie))
 			io_schedule();
 		/* wake up sets us TASK_RUNNING */
 		spin_lock_irqsave(&dio->bio_lock, flags);
The patch didn’t really improve things for iozone…
Before:
Children see throughput for 8 initial writers = 4315.54 kB/sec
Parent sees throughput for 8 initial writers = 4308.09 kB/sec
Min throughput per process = 538.49 kB/sec
Max throughput per process = 540.10 kB/sec
Avg throughput per process = 539.44 kB/sec
Min xfer = 65340.00 kB
Children see throughput for 8 random readers = 237274.24 kB/sec
Parent sees throughput for 8 random readers = 236959.83 kB/sec
Min throughput per process = 27950.16 kB/sec
Max throughput per process = 31743.83 kB/sec
Avg throughput per process = 29659.28 kB/sec
Min xfer = 57848.00 kB
Children see throughput for 8 random writers = 4425.41 kB/sec
Parent sees throughput for 8 random writers = 4414.20 kB/sec
Min throughput per process = 552.43 kB/sec
Max throughput per process = 554.19 kB/sec
Avg throughput per process = 553.18 kB/sec
Min xfer = 65332.00 kB
After:
Children see throughput for 8 initial writers = 4661.47 kB/sec
Parent sees throughput for 8 initial writers = 4652.70 kB/sec
Min throughput per process = 582.07 kB/sec
Max throughput per process = 583.53 kB/sec
Avg throughput per process = 582.68 kB/sec
Min xfer = 65372.00 kB
Children see throughput for 8 random readers = 237757.85 kB/sec
Parent sees throughput for 8 random readers = 237275.72 kB/sec
Min throughput per process = 25240.70 kB/sec
Max throughput per process = 36438.22 kB/sec
Avg throughput per process = 29719.73 kB/sec
Min xfer = 45488.00 kB
Children see throughput for 8 random writers = 4781.74 kB/sec
Parent sees throughput for 8 random writers = 4761.44 kB/sec
Min throughput per process = 595.89 kB/sec
Max throughput per process = 599.42 kB/sec
Avg throughput per process = 597.72 kB/sec
Min xfer = 65152.00 kB
It DID improve 4K block writes with dd though:
Before:
jetson2 /mnt# dd if=/dev/zero of=.ddtest bs=4K count=40000 oflag=direct
40000+0 records in
40000+0 records out
163840000 bytes (164 MB, 156 MiB) copied, 2.21959 s, 73.8 MB/s
After:
jetson2 /mnt# dd if=/dev/zero of=.ddtest bs=4K count=100000 oflag=direct
100000+0 records in
100000+0 records out
409600000 bytes (410 MB, 391 MiB) copied, 2.96477 s, 138 MB/s
My case is weird: I have a Pioneer 500 GB Gen3 drive, and I only match PC speeds on reads; people on PCs get twice my write speed, 1 Gb/s.
Another odd thing is that on a PC the disk delivers double its advertised performance, but in my case that is only true for reads; writes match the advertised figure.
Anyway, it’s fast enough.