Jetson TK1 latency too high


I was trying to run some latency and bandwidth experiments on the Jetson TK1 kit, because the previous board (CARMA) had limitations in both areas (latency: 100 microseconds; maximum achievable bandwidth: 450 MB/s).

Our experiments show that bandwidth improved significantly (as expected, since the PCIe limitation is gone).
But the latency came as a surprise to us. The latency for launching async calls (kernel launch, cudaMemcpyAsync) is in the range of 400 microseconds.

Latency experiment setup:

– Host-to-device and device-to-host transfers of 4 bytes of pinned memory
– Simple kernel launch

The numbers are taken from the command-line profiler, which reports the cputime values given above.

Why is the latency so high on the Jetson kit? Is there any configuration I have missed, or is it a known issue?

Did you set CPU and GPU clocks to maximum before doing the experiments?

Thanks for the input. I tried setting the GPU and CPU to maximum performance.

The latency decreased from 400 microseconds to 50 microseconds (which is better than the CARMA board).

But that is still higher than what we get on an x86-based system (~10 microseconds).

I have one more question, about bandwidth. How do I work out the theoretical bandwidth expected on the Jetson kit for H2D and D2H transfers?

Earlier it was limited by PCIe bandwidth, but how do I calculate the expected bandwidth for the integrated GPU?

I’m not familiar with CUDA but Tegra has unified memory so maybe the bandwidth is not as relevant on Tegra as it is with dedicated GPU memory?

“Tegra K1 supports Unified Memory, however in contrast to current desktop / server GPUs, the memory on Tegra is physically unified. However, there are separate GPU and CPU caches.”

Jetson’s GPU uses the same physical memory as the CPU…a dedicated video card or dedicated GPU would probably have its own memory which would be much faster. So imagine if on x86 you had a GPU with no memory of its own…unless you do that it is an “apples-and-oranges” comparison.

True. So I did a quick calculation of the speed of the DDR3 RAM, which is shared by the CPU and GPU on Jetson.

Theoretical bandwidth = Freq * Transfers_per_clock * Width
= 933 MHz * 2 * 64 bits
= 933 MHz * 2 * 8 bytes
= ~15 GByte/sec
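
The same arithmetic as a small Python sketch, in case it helps to check the units. The 933 MHz clock, 2 transfers per clock, and 64-bit width are the figures quoted above; the function name is only illustrative, and the division by 8 converts bits to bytes.

```python
def theoretical_bandwidth_gb_s(clock_mhz, transfers_per_clock, bus_width_bits):
    """Peak memory bandwidth in GB/s, taking 1 GB as 1e9 bytes."""
    bytes_per_transfer = bus_width_bits / 8  # bits -> bytes
    bytes_per_second = clock_mhz * 1e6 * transfers_per_clock * bytes_per_transfer
    return bytes_per_second / 1e9


# Figures from the estimate above: 933 MHz, double data rate, 64-bit bus.
peak = theoretical_bandwidth_gb_s(933, 2, 64)
print(f"{peak:.3f} GB/s")
```

With these inputs the result is a little under 15 GB/s. This is only the raw DRAM peak; it says nothing about how much of it a given copy path can actually use.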

The bandwidth test numbers are as follows:

H2D max bandwidth: 0.998 MB/sec
D2H max bandwidth: 5.4 MB/sec
D2D max bandwidth: 11.6 MB/sec

D2D comes close to the maximum achievable bandwidth. Host to device is very slow, while device to host reaches about 50% of the max. Any idea why H2D is so slow compared to the other types of transfer?

Sorry, there was a typo in the numbers above. They are in GB/sec, not MB/sec.

The bandwidth test numbers are as follows:

H2D max bandwidth: 0.998 GB/sec
D2H max bandwidth: 5.4 GB/sec
D2D max bandwidth: 11.6 GB/sec
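
To put these figures next to the theoretical peak, here is a rough Python sketch. It assumes the peak from the DDR3 estimate earlier in the thread (933 MHz * 2 * 64 bits, a little under 15 GB/s); the variable and function names are only illustrative.

```python
# Assumed peak, from the DDR3 estimate earlier in the thread:
# 933 MHz * 2 transfers per clock * 64 bits / 8 bits per byte, in GB/s.
PEAK_GB_S = 933e6 * 2 * 64 / 8 / 1e9

# Measured maxima reported by the bandwidth test, in GB/s.
measured_gb_s = {"H2D": 0.998, "D2H": 5.4, "D2D": 11.6}


def fraction_of_peak(gb_s, peak=PEAK_GB_S):
    """Measured bandwidth as a fraction of the assumed theoretical peak."""
    return gb_s / peak


for name, value in measured_gb_s.items():
    print(f"{name}: {value} GB/s = {fraction_of_peak(value):.0%} of peak")
```

On these inputs H2D comes out at about 7% of the peak, D2H at about 36%, and D2D at about 78%. D2H is a little under half of the D2D figure.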

It appears there is little reason to copy memory on Jetson.

Copying global memory to thread-block shared memory does not give the speedup you’d expect on desktop GPUs.

I’ve tried a few simple kernels and the atomic operations on global memory worked faster on Jetson than the shared memory approach. You lose a lot of time doing data duplication…

The article on Zero Copy was interesting. But won’t the duplication problem exist in both D2H and H2D?

From the explanation given in the article, it means H2D == D2D, and the new pointer is returned because CPU and GPU memory are the same. So ideally both should have the same bandwidth. Correct me if my understanding is wrong.

I don’t know how to explain the difference between the different types of transfers. I would expect all three to be roughly the same, given that the memory is physically shared.

I’m noticing some other anomalies in performance. For example, running 1M C2C cuFFT in a loop appears to be CPU bound. It simply pegs one of the ARM CPUs at 100%. Very strange…