From what I can see, drop_caches can decrease performance for a short time after the echo operation runs. If you ran the drop_caches echo each time the dd test ran, some object creations would slow your results down and defeat part of the purpose of defragmenting memory. Running drop_caches just once and then running dd multiple times would probably be beneficial. Was the drop_caches echo run before each dd, or was it run only once before the test was run multiple times?
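A minimal sketch of the "drop once, then repeat" ordering I mean — /dev/zero to /dev/null is only a safe stand-in here, substitute your real if=/of= paths:

```shell
# Drop caches ONCE (needs root), then time several dd runs back to back.
sync
echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || echo "note: dropping caches needs root"

# Repeat the dd test without dropping caches again between runs.
for i in 1 2 3; do
    dd if=/dev/zero of=/dev/null bs=1M count=256 2>&1 | tail -n 1
done
```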
The test itself does not seem to isolate USB3; it looks like you’re also limited by the SATA device and drivers (including the cache on the hard drive itself)…the USB could be limiting throughput, but it might also be other things. It would be interesting to isolate parts of the test and actually know which part is specifically the bottleneck.
Eliminating memory fragmentation also clears the cache, which is why reading more than once after drop_caches improves performance (the cache gets rebuilt). It’s also an interesting way to supercharge a diskless Beowulf cluster with read-only mounts on a master node.
I’m not sure of the best way to test USB3 throughput. One big problem is that many communications channels have a high burst rate but no ability to sustain it. SATA3 is very fast on paper, but only that fast in bursts…maintaining that speed requires the hard drive behind it to keep up. Cache within the drive itself helps, but sustaining a burst across something like half a terabyte is a far different story.
If you were to use an insanely fast RAID0 array you could truly test USB3, at least for bulk transfer mode. I don’t know how many disks that would require, but even with SSDs it would take “many”. For random access, hardware RAID and/or SSDs would likely help; for linear access to large amounts of streamed data (huge files like video streams, versus something like a web server hitting random locations) you’d want software RAID, and there may even be an advantage to non-SSD hard drives if the data is not fragmented on the disk (dd of an entire disk guarantees no fragmentation, whereas reading files through a file system could get in the way).
There are also other USB modes, such as isochronous (versus bulk mode of a hard drive) for video and audio streaming, which may behave differently. Video devices should transfer in isochronous mode, but some do not.
Does lsusb show the drive in USB3 mode? See “lsusb -t”. If so, it does seem there should be a higher transfer rate. For completeness you may also want to dd from the drive as the data source, redirected to “/dev/null”…that covers both directions.
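Something like this would cover the check and both directions — the file path is a placeholder (point it at a file on the USB drive’s filesystem; mktemp is just a safe default so the sketch runs anywhere):

```shell
# Check the negotiated speed: "5000M" under the xHCI root hub means USB3
# SuperSpeed; "480M" means the drive fell back to USB2.
command -v lsusb >/dev/null && lsusb -t

# FILE is a placeholder; for the real test use a file on the USB drive.
FILE=$(mktemp)
dd if=/dev/zero of="$FILE" bs=1M count=64 conv=fsync 2>&1 | tail -n 1  # host -> drive
sync; echo 3 > /proc/sys/vm/drop_caches 2>/dev/null
dd if="$FILE" of=/dev/null bs=1M 2>&1 | tail -n 1                      # drive -> host
rm -f "$FILE"
```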
One thing I do wonder about is that memory access and cache can be quite different across those three architectures; I don’t know whether clearing caches has more of an effect on one architecture than another. It would still be good (because of caches) to run the test twice: once with sync and drop_caches, then once with sync but no drop_caches.
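The two-pass comparison could look like this — SRC is a placeholder and should point at the real device (e.g. /dev/sda); /dev/zero is only a safe default so the sketch runs anywhere:

```shell
# Pass 1: cold cache (sync + drop_caches). Pass 2: warm cache (sync only).
SRC=${SRC:-/dev/zero}   # placeholder; set SRC to the real device for a meaningful test
sync
echo 3 > /proc/sys/vm/drop_caches 2>/dev/null || echo "note: dropping caches needs root"
dd if="$SRC" of=/dev/null bs=1M count=256 2>&1 | tail -n 1   # cold pass
sync
dd if="$SRC" of=/dev/null bs=1M count=256 2>&1 | tail -n 1   # warm pass
```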
Something else to throw into the test would be I/O performance with neither cache nor USB3 getting in the way. I am wondering about the performance on each platform of:
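The command seems to have gone missing here; a command fitting the description (all data movement stays inside the kernel — no USB, no disk) would be something like:

```shell
# Pure in-kernel data movement: no USB controller, no storage device involved.
dd if=/dev/zero of=/dev/null bs=1M count=1024
```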
…see what moving data within the kernel does…USB3 throughput would be a combination of kernel throughput and USB3 controller throughput. This narrows part of the performance testing down to something more specific which is one component of the USB3 path.
FYI, that hard drive is probably a good real-world performance test, but the 1GB cache only improves read performance when you are repeatedly reading from the same part of the disk (1GB worth). For example, reading the same 1GB region of the disk 10 times would perform better than reading 10GB of sectors which are exact copies but in different locations…as soon as you read from a different 1GB location the cache is invalidated, even though the regions are exact copies (the disk has no idea they contain the same data).
The effect of the 1GB cache on write performance depends on how fast that buffer is written out to disk…I would bet you could put more stress on the system with repeated reads of the same 1GB of a partition than with writes, just from the way that cache works. One issue I can see is that using the command line to perform multiple 1GB reads means opening and closing a file descriptor over and over. A single program which opens the descriptor once and performs the same reads without closing it would show maximum performance without the open/close overhead.
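The repeated-open version is trivial from the command line; as noted above, the single-open version would need a small program. As a baseline sketch (DEV is a placeholder — point it at the real drive; /dev/zero is only a safe default):

```shell
# Re-read the same region ten times. Each dd invocation re-opens the source,
# so open/close overhead is included on every pass.
DEV=${DEV:-/dev/zero}   # placeholder; set DEV=/dev/sdX for the real test
for i in $(seq 1 10); do
    dd if="$DEV" of=/dev/null bs=1M count=64 2>&1 | tail -n 1
done
```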
What’s fascinating about that is that the urandom test does not even use USB, yet it’s much slower. urandom must be very inefficient…even the path through USB comes out faster. Perhaps “/dev/zero” as a source would be faster? Zero has no algorithm; it just emits NUL bytes.
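A quick back-to-back comparison of the two generators, with no device in the path at all:

```shell
# Same size, same sink; only the source generator differs.
dd if=/dev/urandom of=/dev/null bs=1M count=64 2>&1 | tail -n 1
dd if=/dev/zero    of=/dev/null bs=1M count=64 2>&1 | tail -n 1
```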