Slow USB3.0 Throughput

Hello everyone,

I tested USB 3.0 throughput on my TK1 (R21.4), but the results are slower than expected.

My test commands are listed below:
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
sync; sudo dd if=/dev/sdb of=/dev/null bs=500K count=1024

Testing several times, TK1 throughput is 45.0 - 65.0 MB/s,
while the same method on a PC reaches 85.0 - 91.0 MB/s.

I am not sure what causes the low throughput: settings or driver?
Could someone help me, or share your USB 3.0 throughput numbers?

Thank you.

From what I can see, drop_caches can decrease performance for a short time after running the echo operation. If you ran the drop_caches echo each time the dd test was run, your results would slow down due to some object re-creation, defeating the purpose of defragmenting parts of memory. Running drop_caches just once and then running dd multiple times would probably be beneficial. Was the drop_caches echo operation run each time dd ran, or was it run only once and then the test run multiple times?
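A sketch of that sequence, assuming the USB drive is still /dev/sdb as in the original post:

```shell
# Drop caches a single time, then run dd repeatedly; passes after the
# first can benefit from the rebuilt page cache.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
for i in 1 2 3; do
    sudo dd if=/dev/sdb of=/dev/null bs=500K count=1024
done
```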

Also, was performance mode enabled? See:
http://elinux.org/Jetson/Performance#Maximizing_CPU_performance

The test itself does not seem to isolate USB3, it looks like you’re also limited by the SATA device and drivers (including cache on the hard drive itself)…the USB could be limiting throughput, but it might also be other things. It would be interesting to be able to isolate parts of the test and actually know which part is specifically the bottleneck.
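One way to take the kernel page cache (though not the drive's own onboard cache) out of the measurement is dd's iflag=direct, which opens the source with O_DIRECT. A sketch, again assuming the drive is /dev/sdb:

```shell
# O_DIRECT bypasses the kernel page cache, so every pass reads the
# device itself; repeated runs should then report similar speeds
# instead of jumping to cached-memory speed on the second pass.
sudo dd if=/dev/sdb of=/dev/null bs=1M count=500 iflag=direct
```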

Hi linuxdev,

Thank you for your reply,

ubuntu@tegra-ubuntu:/media/ubuntu/TK1_4G/TK1$ sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
3
ubuntu@tegra-ubuntu:/media/ubuntu/TK1_4G/TK1$ sync; sudo dd if=/dev/sda of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 6.91324 s, 75.8 MB/s
ubuntu@tegra-ubuntu:/media/ubuntu/TK1_4G/TK1$ sync; sudo dd if=/dev/sda of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 0.227908 s, 2.3 GB/s
ubuntu@tegra-ubuntu:/media/ubuntu/TK1_4G/TK1$ sync; sudo dd if=/dev/sda of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 0.208973 s, 2.5 GB/s
ubuntu@tegra-ubuntu:/media/ubuntu/TK1_4G/TK1$ sync; sudo dd if=/dev/sda of=/dev/null bs=500K count=1024
1024+0 records in
1024+0 records out
524288000 bytes (524 MB) copied, 0.20719 s, 2.5 GB/s

It seems it is not really reading from the device after the first test; later runs are served from cache.

Yes, the test is running under max CPU performance mode.

You are right. Should I buy a USB3 SSD to do the test, or do you have any other suggestions for me?

Thank you.

Eliminating memory fragmentation also clears cache, so this is why reading more than once after drop_caches improves performance (cache gets rebuilt). It’s also an interesting way to supercharge a diskless Beowulf cluster with read-only mounts on a master node.

I’m not sure of the best way to test USB3 throughput. One big problem is that there are lots of communications channels with high burst rate, and no ability to sustain the rate. SATA3 is very fast on paper, but only that fast in bursts…maintaining such speed requires the hard drive behind it to maintain throughput. Cache within the drive itself helps, but sustaining a burst for something like a half terabyte is a far different story.

If you were to use an insanely fast RAID0 array you could truly test USB3, at least for bulk transfer mode. I don’t know how many disks that would require, but even with SSDs it would take “many”. For random access, hardware RAID and/or SSDs would likely help; for linear access of large amounts of streamed data (huge files like video streams, versus things like web servers hitting random locations) you’d want software RAID, and there may even be an advantage to non-SSD hard drives if the data is not fragmented on the disk (dd of an entire disk guarantees no fragmentation, whereas reading files on a file system could get in the way).

There are also other USB modes, such as isochronous (versus bulk mode of a hard drive) for video and audio streaming, which may behave differently. Video devices should transfer in isochronous mode, but some do not.

Hi All,

I bought an HDD (BUFFALO PGD 1TB USB 3.0) for testing. It contains a 1 GB DRAM buffer to increase write performance (can be above 300 MB/s).

MacBook Pro 339MB/s : https://drive.google.com/open?id=0B473fLYyTmUsVnlnaXpTbl93OTQ
PC Linux 318MB/s : https://drive.google.com/open?id=0B473fLYyTmUsTV9nVGFEblZ1dXc
nVIDIA TK1 143MB/s : https://drive.google.com/open?id=0B473fLYyTmUsWEVFUnVETDlPMlk

TK1 throughput maxes out at only 14x MB/s.
Or is there another test method that can show the TK1 USB3 throughput is in the normal range?

Does lsusb show the drive is in USB3 mode? See “lsusb -t”. If so, it does seem there should be higher transfer rate. To be complete you may want to also dd from the drive as a data source redirected to “/dev/null”…both directions for completeness.
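For example (the device node /dev/sda and the mount point below are taken from the earlier posts and may differ on your system):

```shell
# A USB3 device should appear under a bus line showing 5000M.
lsusb -t

# Read direction, after clearing the page cache:
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo dd if=/dev/sda of=/dev/null bs=1M count=500

# Write direction: conv=fsync makes dd flush to the drive before it
# reports a speed, so the number is not just page-cache buffering.
dd if=/dev/zero of=/media/ubuntu/TK1_4G/ddtest.bin bs=1M count=500 conv=fsync
rm /media/ubuntu/TK1_4G/ddtest.bin
```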

One thing I do wonder about is that memory access and cache can be quite different across those three architectures; I don’t know how clearing caches might have more effect on one architecture than another. It would still be good (because of caches) to run the test twice, once with sync and drop_caches, then once with sync but no drop_caches.

Something else to throw into the test for I/O performance without cache or USB3 getting in the way would be interesting. I am wondering about the performance on each platform of:

dd if=/dev/urandom of=/dev/null bs=500K count=2000

…see what moving data within the kernel does…USB3 would be a combination of kernel throughput and USB3 controller throughput. This somewhat narrows part of the performance testing to something more specific which is a component of USB3.

FYI, that hard drive is probably a good real world performance test, but the 1GB cache has to be randomly reading from the same part of the disk (1GB worth) to provide increased read performance. Read performance would go up in the case of randomly accessing the same 1GB. For example a performance test of reading the same 1GB part of the disk 10 times would be greater performance than reading 10GB of sectors which are otherwise exact copies but in different locations (as soon as you read from a different 1GB location the 1GB cache invalidates even if the regions are exact copies…the disk does not know they are the same).

The effect of the 1GB cache on write performance depends on how fast the buffer is written to disk…I would bet you could put more stress on the system with multiple reads of the same 1GB of a partition versus with writes just from the way that 1GB of cache works. One issue I can see with that might be that using a command line to perform multiple 1GB reads will result in opening and closing a descriptor over and over. In the case of a single program which opens the descriptor once and does the same multiple reads without closing the descriptor you would get max performance without open/close overhead.

I’m sorry, I forgot to turn on CPU performance mode.
Updated throughput: nVIDIA TK1 208 MB/s.

I agree that cache and random-access issues will affect the test result; this was just one way to check throughput.
If there is a better way to reduce those factors and show the real throughput, I can try it.

208 is much better, but probably could go higher as well. I’d still be interested in the dd test of /dev/urandom to /dev/null device on each platform type.

dd if=/dev/urandom of=/dev/null bs=500K count=2000
With or without drop_caches, my results for this test are no different.

MacBook Pro 13.3MB/s
PC Linux 17.2MB/s
nVIDIA TK1 8.2MB/s

What’s fascinating about that is that the urandom test does not even use USB, yet it’s much slower. urandom must be very inefficient…going through USB speeds it up quite a bit. Perhaps “/dev/zero” as a source would be faster? Zero has no algorithm, it just spits out bytes of NULL.
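A quick side-by-side, runnable on all three platforms since it touches no hardware:

```shell
# /dev/urandom pays for random-number generation on every byte;
# /dev/zero just fills the buffer with NUL bytes, so it is closer to a
# pure kernel copy-throughput number for comparison.
dd if=/dev/urandom of=/dev/null bs=500K count=2000
dd if=/dev/zero of=/dev/null bs=500K count=2000
```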