How to transmit data from a computer to Jetson TX1 as fast as possible?

@linuxdev,

The Jetson TX1's storage, eMMC 5.1, has theoretical sequential read/write speeds of 250/125 MB/s. In theory, will it keep up with gigabit ethernet?

I have doubts that you would get close to theoretical eMMC speeds. 250 MB/s would be 2 Gbit/s. At best you’ll get a burst (and we’re talking milliseconds) near peak. Averages are quite a different story. For the most part I do not believe gigabit would be challenged by eMMC speeds.

Don’t forget…you’re also limited by the receiving end. Something like scp or ftp will not request data from the sender until it is ready. You have to consider the max continuous average speed at both ends and use the lower value for receive and transmit as your expectation.

@linuxdev,

I found the theoretical eMMC 5.1 speeds online.

Could you please explain “Averages are a quite different story” in detail? Thanks.

Almost every disk has some form of cache. Performance differs greatly depending on whether there is a cache hit or miss. So far as cache benefits go, random access might get cache hits quite often…or not. Reading the entire disk in a continuous fashion (such as via dd) will never benefit from cache (the cache still fills and bursts for greater speed, but it is never read twice in this pattern, so refilling the cache always occurs as the linear access marches on).

Although not part of the disk itself, there is also the effect of system ram being used as cache (one reason I like the detailed output of xosview…I recommend examining the ram listing under xosview…enlarge the gui window and watch the details as you run find on “/”…time the find, then time the find again a second time in a row). The system tries not to waste ram: it caches/buffers disk reads and writes in main ram until something else needs the ram, then reassigns the ram to whatever needs it…so performance changes depending on what other processes do to interfere with the buffer/cache. Assuming you just wrote a file and then read it back immediately, there is a possibility that both disk cache and system buffer will speed things up. However, this is not a guarantee; it depends on the order of reading and writing, and on what other parts of the system may be using the disk in some independent operation (a disk dedicated to a single data purpose under a single program will be faster than a disk used for the operating system, because cache and buffer will never miss due to other unrelated processes).

Example:

xosview &
# Enlarge the gui to see mem better.
sudo time find / > /dev/null
# Watch the buff/cache parts of xosview.
# Compare a faster time for an immediate second run.
sudo time find / > /dev/null
# You could compile a kernel and do some web browsing, and then time again...
# ...time will go back up some depending on pressure to use ram and changes to disk.
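
If you want to force the slow first-run behavior again without rebooting, the kernel’s drop_caches control (a standard Linux interface; needs root) empties the page cache so the next run is cold again:

sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
sudo time find / > /dev/null
# This run should be slow again, similar to the very first run.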

The best a disk can do is rarely consistent. SSD versus standard hard drive also changes things. It is rare that you are actually measuring only the disk. You could somewhat isolate the disk itself and see a more “guaranteed” minimum performance if you use dd to read the entire disk from start to end (you could time it redirected to “/dev/null”, and then again piped through netcat…the end of netcat which receives the data could be timed once saving the dd to a file, and again redirecting to the remote system’s “/dev/null”).
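
A sketch of that measurement (the device name /dev/mmcblk0 and port 9000 are example values; adjust for your system):

# Linear read of the whole eMMC, timed, data discarded locally:
sudo time dd if=/dev/mmcblk0 of=/dev/null bs=1M
# On the remote machine, listen and discard (older netcat may need "nc -l -p 9000"):
nc -l 9000 > /dev/null
# Back on the Jetson, pipe the same read over the network and time it:
sudo time dd if=/dev/mmcblk0 bs=1M | nc <remote-ip> 9000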

One thing to think about when comparing a desktop system’s hard drive performance to any ARM architecture is that desktops are usually arranged to allow hardware interrupts to service relevant hardware drivers from any CPU core. ARM will only service hardware device interrupts from CPU0. This means IRQ starvation will degrade a multi-core ARM device faster than a multi-core desktop as the number of interrupts goes up, to the point where CPU0 load causes collisions between competing drivers. This is a good example of why hardware drivers should do as little work as possible, and be as fast as possible, while handing other work off to user space or to kernel work which can run on other cores. This will of course not matter if there are not a lot of hardware drivers competing under a heavy load.
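
You can watch this on any Linux system via /proc/interrupts; on the TX1 the per-CPU columns show most device IRQ counts accumulating only under CPU0:

cat /proc/interrupts
# One column of counts per CPU core; generate some disk and network
# traffic, run it again, and compare which columns actually grow.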

@linuxdev,

Thanks for the detailed answer. I observed that cache increases a lot when sudo time find runs a second time. The average I/O speed of a hard drive also depends on its cache. If we have a fast data pipe but a slow hard drive, transmitted data are cached, but hits/misses are random. Therefore, in practice, the speed of the fast data pipe is slower than the theoretical value. Could you please tell me whether I got your point or not?

I’m not sure if I’m reading it right, but the practical effect is that even if theoretical drive speeds are higher, and even if it looks like the gigabit ethernet might need an upgrade, this is seldom the case in practice. If your file transfer needs approach something like a web server, where the same file is downloaded over and over but never changes, then you could actually use something faster than gigabit. A similar situation exists for read-only NFS mounts.

I’m just guessing, but it seems you are probably interested in sending files which change (the content is not static), and the files are large. For the buffer to help you’d need system ram that is large relative to the size of the files being buffered; you’d also have to write and read the file immediately, or else read the file twice, before the buffer would help. The hard drive’s built-in cache helps more with random access where you often hit the same location multiple times, and it was never discussed whether files are read in a single linear operation, or whether there would be random access to different offsets in the file. Certainly ftp and scp just do a single linear access…it would take some other special program to do random access instead of linear access. This implies that if you are using the data immediately, and the files are not needed prior to the data transfer, you’d be better off with a custom program which networks the data right as it is created (if possible, don’t write it to disk until it is on the remote machine).
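
As a sketch of that last idea (my_generator is a hypothetical stand-in for whatever produces your data; the address and port are examples):

# On the Jetson, receive the stream (write to disk here only if you must):
nc -l 9000 > incoming.csv
# On the sending computer, produce and ship the data without touching the local disk:
my_generator | nc <jetson-ip> 9000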

If you are using large csv files I imagine you may be doing some database work. Some databases, e.g., PostgreSQL, have methods of setting up distributed systems with options related to how tightly coupled they are (meaning how close to fully synchronous they are…fully synchronous for safety under failure, or the converse, buffered and synchronized over time to get better responsiveness). If you are using a full database system like PostgreSQL you might consider using the native tools for synchronizing instead of file transfers.
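
Just as a sketch of that coupling trade-off (these are real PostgreSQL configuration parameters, but the standby name is hypothetical, and you’d want to read the replication docs before relying on any of it), it shows up in postgresql.conf roughly like this:

# postgresql.conf on the primary (sketch):
wal_level = replica                      # generate WAL suitable for streaming replication
synchronous_commit = on                  # tightly coupled: commits wait for the standby (safer, slower)
# synchronous_commit = off               # loosely coupled: commits return early (faster, small loss window)
synchronous_standby_names = 'standby1'   # 'standby1' is a hypothetical standby name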

@linuxdev,

Could you please also explain why write is much slower than read? Thanks.

@linuxdev,

Thanks for your detailed answer. For me, the first step is to transmit a large file (.txt or .csv, several GB) that does not change, as fast as possible, from a computer to the Jetson TX1.

According to the link: https://devtalk.nvidia.com/default/topic/912497/emmc-5-1-slow-sequential-write/

The write and read speeds of eMMC 5.1 on the Jetson TX1 are 60 and 235 MB/s when the CPU and eMMC frequencies are set to the maximum. Based on the chain rule you mentioned, can we conclude that it is very difficult to achieve file transmission at a speed of 100 MB/s (even if we have a very fast pipe)?

Usually writes are slower in part because the drive itself is slower at writing. Add to this that unless the disk cache determines the write is not actually a change, it will always be a cache miss. How often would you write the same data twice to the same exact spot on the drive, versus how often do you read the same thing twice in a row?
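
You can measure the asymmetry yourself with dd (the file name and sizes are just examples; conv=fdatasync makes dd wait for the data to actually reach the eMMC rather than just the ram buffer, and dd prints its own throughput figure):

# Write test: 1 GB of zeros, flushed to the device before dd exits:
dd if=/dev/zero of=testfile bs=1M count=1024 conv=fdatasync
# Drop caches so the read test is not served from ram:
sync; echo 3 | sudo tee /proc/sys/vm/drop_caches
# Read test:
dd if=testfile of=/dev/null bs=1M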

Yes, it is very difficult with a single drive to achieve 100MB/s (800Mbit/s) even with a very fast pipe. Achieving higher values for bursts is much easier than achieving this continuously.
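
To put rough numbers on the chain rule (the ~117 MB/s figure is a commonly cited practical ceiling for gigabit after protocol overhead; the disk numbers are the ones from your link):

expected sustained speed = min(source read, wire speed, destination write)
                         = min(235 MB/s, ~117 MB/s, 60 MB/s)
                         = 60 MB/s

So even before cache effects and IRQ overhead are considered, the eMMC write side caps the transfer well below 100 MB/s.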

If you have a file which is not updated often, but is often read, mount it on a drive separate from the operating system. This way you avoid having cache misses just because the operating system was reading random things all over the drive which would invalidate the cache line. If the separate drive does nothing but serve mostly the same data, then you will get more cache hits. If the file is truly read-only, things will go even faster by mounting the partition read-only. You’re still better off if you can generate the data and transmit it over the network without ever writing it to disk.
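
A minimal sketch of that setup (the device and mount point names are examples; ro and noatime are standard mount options):

# Mount the dedicated data partition read-only:
sudo mount -o ro /dev/sda1 /data
# Or make it permanent via /etc/fstab:
# /dev/sda1  /data  ext4  ro,noatime  0  2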

One place where a multi-core desktop will always outperform the ARM architecture is when multiple hardware drivers are executing on different cores at the same time. For example, one desktop core runs the driver reading a hard drive while another core runs the ethernet driver…on ARM a single core context switches between the two hardware drivers, and they never execute simultaneously; the desktop can execute both drivers at once on separate CPU cores without context switching. Now if you add software RAID on a desktop with many CPU cores, you may end up reading 6 drives simultaneously using 6 cores, while the ARM architecture just context switches more often (ARM becomes hardware-IRQ starved and can thrash instead of scaling well). This is somewhat of a contrived example, since DMA can alter how it works.

@linuxdev,

Thanks a lot for your detailed answer.

Because hard drive I/O is much slower than gigabit ethernet, if a file is sent from a computer’s hard drive to the Jetson TX1’s hard drive, the speed will be limited by hard drive I/O. Therefore, we should try our best not to get the hard drive involved. Am I correct?

According to reply #29, is there any possibility of exploiting the multicore architecture on both the computer and the Jetson TX1?

  1. Correct.

  2. Multicore on the desktop is already doing ok at spreading out the workload of hardware IRQ servicing. Things which are good for the ARM architecture are also good for the desktop, but the desktop will not suffer nearly as much as ARM as the number of hardware interrupts and the amount of I/O go up. It’s always good to do the fastest (most minimal) hardware IRQ servicing possible, and leave any other processing to user space or to software tasks not requiring CPU0. Give CPU0 as little to do as possible so that it can always service a driver right away (see the sketch below). Unless you are working on a driver, though, this will likely be out of your control.
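
For what it’s worth, on a desktop you can inspect and steer hardware IRQs via /proc (the IRQ number 30 below is hypothetical; pick a real one from the listing); on the TX1 most device IRQs stay on CPU0 regardless:

# See which IRQs exist and which CPUs have been servicing them:
cat /proc/interrupts
# On a desktop, pin an IRQ to CPU1 (bitmask 2):
echo 2 | sudo tee /proc/irq/30/smp_affinity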

@linuxdev,

Thanks for your detailed answer.

I am sorry, but I did not quite get why one should “Give CPU0 as little to do as possible so that it can always service a driver right away”. What about the other CPUs?

Regarding what you mentioned about TCP sockets, should the socket buffer size be as large as possible so that the data can be transmitted faster?

@linuxdev,

Sorry for going back to the earlier point in #24; I did not clearly understand the part about average speed. Also, why at best will there be a burst at the beginning? Thank you very much.

The cache has a high speed burst. Once you’ve used up what is in cache, data has to come from the drive itself. I suppose if you read the same small file over and over (something smaller than the cache), then you might get a lot of good throughput. However, you are using large files; even under the best of circumstances the cache will never cover the whole file. The system buffer can increase this, but that won’t be consistent.

Additional note: TCP sends in packets. There is overhead to each packet. Sending or receiving a packet implies a hardware interrupt triggering a driver, and if one has to wait for the interrupt to be serviced, it only makes sense to service as much as possible with a minimal number of interrupts. This is where jumbo frames come in (mentioned earlier in the thread).
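
A sketch of the relevant knobs (eth0 and the sizes are example values; jumbo frames only help if the NIC and every switch on the path support the larger MTU):

# Enable jumbo frames on both ends:
sudo ip link set dev eth0 mtu 9000
# Allow larger TCP socket buffers (re: the earlier socket size question...
# bigger buffers help keep the pipe full, but they won't make the disk faster):
sudo sysctl -w net.core.rmem_max=8388608
sudo sysctl -w net.core.wmem_max=8388608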