Jetson: Writing external data directly to GPU's RAM

Hello,

I have to process data every ~300usec.
In order to save time I want to write data directly to GPU’s RAM.
The size of the data: 15 MB in one scenario and 1.5 GB in another.

Is it possible to do this with the Jetson series?

Thank you,
Zvika

Hello,

The TK2 has 8GB memory but only 4 PCIe GEN2 lanes.
This means:
4 x 5 Gb/s = 20 Gb/s = 2.5 GB/s (this is a theoretical rate)

I need ~45GB/sec in both scenarios.

Thank you,
Zvika

(NOTE: I assume you mean “TX2” instead of “TK2”)

The memory is unified, meaning there is no separate GPU memory, but you have options for how it is mapped. That said, I don’t think you will be able to achieve the rate you want.
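
If it helps to see what one of those mappings looks like, below is a minimal CUDA sketch of my own (not NVIDIA sample code; the buffer names and sizes are placeholders, and error checks are omitted) using pinned, GPU-mapped host memory, which on a Jetson aliases the same physical DRAM the GPU uses:

```
#include <cstdio>
#include <cuda_runtime.h>

// Toy consumer kernel: sums the buffer so the read is observable.
__global__ void consume(const unsigned char *buf, size_t n,
                        unsigned long long *sum) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(sum, (unsigned long long)buf[i]);
}

int main() {
    const size_t bytes = 15u * 1000u * 1000u;  // the 15 MB scenario
    cudaSetDeviceFlags(cudaDeviceMapHost);     // allow host-mapped allocations

    // Pinned, GPU-mappable host memory; on Jetson the GPU can read this in
    // place, since CPU and GPU share the same physical DRAM.
    unsigned char *h_buf = nullptr;
    cudaHostAlloc((void **)&h_buf, bytes, cudaHostAllocMapped);

    // ...an external device/driver would deposit its data into h_buf here...

    unsigned char *d_buf = nullptr;
    cudaHostGetDevicePointer((void **)&d_buf, h_buf, 0);  // GPU-visible alias

    unsigned long long *d_sum = nullptr;
    cudaMalloc((void **)&d_sum, sizeof(*d_sum));
    cudaMemset(d_sum, 0, sizeof(*d_sum));

    consume<<<(unsigned)((bytes + 255) / 256), 256>>>(d_buf, bytes, d_sum);
    cudaDeviceSynchronize();

    cudaFree(d_sum);
    cudaFreeHost(h_buf);
    return 0;
}
```

cudaMallocManaged is the other common option; which mapping wins depends on the access pattern, so it’s worth trying both.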

PCIe speed is rated in “GT/s” (gigatransfers per second), which is a bit rate (not a byte rate) for encoded data. Not all of those bits carry your data; some are used by the encoding to insure data integrity. PCIe rev. 2 runs at 5 GT/s per lane with 8b/10b encoding (8 payload bits for every 10 bits on the wire), so four lanes in a single direction work out to:

5 GT/s * 8/10 * 4 = 16 Gbit/s theoretical max throughput
16 Gbit/s * (1 byte / 8 bits) = 2 GB/s

So even if you really mean gigabits instead of gigabytes, four lanes of PCIe cannot transfer even close to that speed. You might see some references quoting double this figure, but that is misleading: it only means PCIe devices work full duplex (2 GB/s in each direction simultaneously).
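
If you want to redo this arithmetic for other configurations (it also reproduces the Xavier numbers below), here is a small throwaway helper of my own, nothing official:

```
#include <cstdio>

// Payload bandwidth in GB/s given per-lane rate (GT/s), lane count, and the
// encoding's payload/raw bit ratio (8/10 for rev 2, 128/130 for rev 3 and 4).
double pcie_gbytes_per_sec(double gt_per_s, int lanes,
                           double enc_num, double enc_den) {
    double gbits = gt_per_s * lanes * (enc_num / enc_den);  // payload Gbit/s
    return gbits / 8.0;                                     // bits -> bytes
}

int main() {
    printf("rev 2 x4 : %.2f GB/s\n", pcie_gbytes_per_sec(5.0, 4, 8, 10));     // TX2
    printf("rev 3 x8 : %.2f GB/s\n", pcie_gbytes_per_sec(8.0, 8, 128, 130));  // Xavier, rev 3 device
    printf("rev 4 x8 : %.2f GB/s\n", pcie_gbytes_per_sec(16.0, 8, 128, 130)); // Xavier, rev 4 device
    return 0;
}
```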

The Jetson Xavier (which is announced, but not yet available) would dramatically improve your chances of reaching those rates:
https://developer.nvidia.com/jetson-xavier-devkit

The Xavier will have an 8-lane PCIe controller, and if your PCIe device can handle rev. 3 speeds, then you get a combination of twice as many lanes, a higher per-lane rate, and the more efficient 128b/130b encoding:

(8 GT/s * 8 * 128/130) = 63 Gbit/s theoretical actual data throughput
63 Gbit/s * (1 byte / 8 bits) ≈ 7.9 GB/s throughput

So far I do not know of any devices reaching PCIe rev. 4 speeds, but the Xavier will have rev. 4 and will be one of the first devices supporting it. If you manage to use a rev. 4 device, then:

(16 GT/s * 8 * 128/130) = 126 Gbit/s theoretical actual data throughput
126 Gbit/s * (1 byte / 8 bits) = 15.75 GB/s throughput

Even if you could actually get the full PCIe throughput of the TX2, there are a lot of reasons why you would still have trouble running at the latency you require. If you give more details of your specific use case, you can get a more realistic answer for that case, but it is unlikely you will get the latency you want even within the TX2’s PCIe transfer rate capabilities.

NVIDIA rarely releases exact dates, but it has been suggested that early access for the Xavier will begin in August, with actual distribution in the wild perhaps in September.

Hello,

Thank you very much for the detailed answer.

In this project I can use any high-end GPU, not necessarily the Jetson series.
Do you think there is a high-end GPU that comes closer to the rate I need?

Can you please specify what further information is required about this use case?

Best regards,
Zvika

I’m thinking maybe units are being mixed up, e.g., gigabytes (“GB”) versus gigabits (“Gb”). What do you actually need? And instead of talking about a single frame of 15 MB or 1.5 GB, what is the specific hardware? A camera? Stereo cameras? Some other sensor producing camera-like data? Why the 300 usec limit?

You can find a lot of products on the market right now with a very high average throughput. If you have a hard latency maximum of 300 usec, though, I think you won’t achieve it without custom hardware. A more practical expectation would be a latency of perhaps 1 or 2 ms once the first frame has been processed.

Right now the fastest PCIe devices (as far as average data throughput goes) are PCIe rev. 3 (8 GT/s per lane). Rev. 4 (16 GT/s per lane) is coming out quite soon, but I don’t know of anything yet capable of using it (when hosts come out with rev. 4 capability we’ll probably see more devices for this revision, but it won’t be cheap).

FYI, GPUs excel at running many threads (thousands) simultaneously; within a single thread, execution is probably a bit slower than on a CPU. Whether you run one operation on 100 bytes of data or the same operation on 100 bytes a thousand times in parallel, the GPU will finish in about the same time, and it won’t slow down from the extra threads. But even that single-operation time is unlikely to come in under 300 usec.
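
To illustrate, here is a toy timing sketch of my own (the kernel is a meaningless stand-in and the absolute numbers will vary by board); the point is only that 100 bytes and 100 bytes x 1000 in parallel finish in roughly the same time:

```
#include <cstdio>
#include <cuda_runtime.h>

// Trivial stand-in for real work: one operation per byte.
__global__ void inc(unsigned char *d, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1;
}

// Time one launch over n bytes with CUDA events.
static float time_kernel(unsigned char *d, size_t n) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    inc<<<(unsigned)((n + 255) / 256), 256>>>(d, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return ms;
}

int main() {
    unsigned char *d = nullptr;
    cudaMalloc((void **)&d, 100 * 1000);
    time_kernel(d, 100);  // warm-up launch so the timings are comparable
    printf("100 B        : %.3f ms\n", time_kernel(d, 100));
    printf("100 B x 1000 : %.3f ms\n", time_kernel(d, 100 * 1000));
    cudaFree(d);
    return 0;
}
```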

Hello Linuxdev, All,

There is no mix-up in the units:
15 MB every 300 usec → ~45 GB/s (gigabytes)
1.5 GB every 30 msec → ~45 GB/s

This is not a camera. It’s a sensor writing data at some TBD interval (e.g., 300 usec).

Your answers are highly appreciated,
Best regards,
Zvika

Let’s assume you have a full-sized desktop PC with a dedicated PCIe rev. 3 GPU using 16 lanes (PCIe v3 x16). Bandwidth works out like this:

(8 GT/s * 16 * 128/130) = 126 Gbit/s theoretical actual data throughput
126 Gbit/s * (1 byte / 8 bits) = 15.75 GB/s throughput

Percent towards goal:

15.75/45 ≈ 35%

Even the fastest existing configuration, PCIe rev. 3 x16, delivers only about a third of the required average bandwidth. So it isn’t even remotely possible on a desktop PC if the PCIe bus is what you are using. Convert this to the PCIe v2 x4 of a TX2 and the throughput is an even tinier fraction of what you need. So the answer is that it isn’t possible without exotic hardware.

There is a possibility of getting much faster speeds if you use a GPU with dedicated VRAM and operate without copying data to/from the outside world. A discrete GPU with its own VRAM can talk to the CUDA cores much faster than it can over PCIe. However, even with those more expensive discrete products, getting a realtime 300 usec latency would be difficult at best. It’s just a gut feeling, but the 1 or 2 ms latency range is probably the best you’ll get. On the other hand, if you scale up to much larger data (e.g., 1.5 GB submitted once instead of 15 MB submitted many times), the latency won’t slow things down much, or perhaps at all; the GPU is good at divide and conquer.
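
As a rough sketch of that “keep it in VRAM” pattern (my own illustration with a placeholder kernel, not a vetted pipeline): pay the PCIe cost once, then iterate on the data with kernels that only touch on-board memory:

```
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder for real processing: one fused multiply-add per element.
__global__ void step(float *d, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 0.5f + 1.0f;
}

int main() {
    const size_t n = 1500u * 1000u * 1000u / sizeof(float);  // the 1.5 GB case
    float *h = nullptr, *d = nullptr;
    cudaHostAlloc((void **)&h, n * sizeof(float), cudaHostAllocDefault);  // pinned: faster DMA
    cudaMalloc((void **)&d, n * sizeof(float));

    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // the one PCIe trip

    for (int it = 0; it < 100; ++it)  // every iteration stays in VRAM
        step<<<(unsigned)((n + 255) / 256), 256>>>(d, n);
    cudaDeviceSynchronize();

    float sample = 0;
    cudaMemcpy(&sample, d, sizeof(sample), cudaMemcpyDeviceToHost);  // tiny readback
    printf("sample: %f\n", sample);

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

The single cudaMemcpy at the top is the only place PCIe is involved; everything in the loop runs at VRAM bandwidth.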

If you want to see examples of products that probably still could not manage a 300 usec latency to the outside world, even over the best PCIe you can get, look at some of these:
https://www.google.com/search?q=nvidia+%22tesla%22&source=univ&tbm=shop&tbo=u&sa=X&ved=0ahUKEwjky-6C8aPcAhUs4oMKHVMXC84QsxgIJg

PCIe rev. 4 is about to come out, and if you can afford it, revision 4 approximately doubles PCIe v3. You’re then at about 2/3 of your goal for average throughput, but it is still doubtful you’d get the latency you want.

All of this is just generic information, though, since it isn’t known what you’re working on. One example: if it is a video stream, there is a possibility of hardware compression. Hardware compression can be an extremely good improvement to average throughput (though it might drop a frame now and then when an image frame isn’t suitable for good compression).

Another way to improve things (though you’ve not given enough information to say whether it applies to your case) is to batch 1.5 GB of data together instead of sending it in a lot of 15 MB chunks. This probably isn’t possible in your case; the only reason I mention it is that you put the 300 usec latency requirement on the 15 MB data chunks. A rough sketch of the comparison follows.
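
For what it’s worth, here is a toy comparison of my own (timings will depend entirely on the hardware) of paying the per-transfer setup cost 100 times versus once:

```
#include <cstdio>
#include <cuda_runtime.h>

// Time `count` host-to-device copies of `chunk` bytes each.
static float time_copies(unsigned char *dst, const unsigned char *src,
                         size_t chunk, int count) {
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    for (int i = 0; i < count; ++i)
        cudaMemcpy(dst + (size_t)i * chunk, src + (size_t)i * chunk,
                   chunk, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    return ms;
}

int main() {
    const size_t total = 1500u * 1000u * 1000u;  // 1.5 GB
    unsigned char *h = nullptr, *d = nullptr;
    cudaHostAlloc((void **)&h, total, cudaHostAllocDefault);  // pinned host buffer
    cudaMalloc((void **)&d, total);

    printf("100 x 15 MB: %.1f ms\n", time_copies(d, h, total / 100, 100));
    printf("1 x 1.5 GB : %.1f ms\n", time_copies(d, h, total, 1));

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```

With pinned memory and cudaMemcpyAsync on a stream you could also overlap the next chunk’s transfer with processing of the current one, which helps average throughput but not the worst-case latency.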