I was trying to perform some latency- and bandwidth-related experiments on the kit, as the previous board (CARMA) had limitations in both areas (latency: ~100 microseconds; bandwidth: max achievable ~450 MB/s).
Our experiments show that bandwidth improved significantly (which is expected, since the PCI limitation is gone).
But the latency came as a surprise to us: the latency for launching async calls (kernel launch, cudaMemcpyAsync) is in the range of 400 microseconds.
Latency experiment setup:
– Host-to-device transfer (and vice versa) of 4 bytes of pinned memory
– Simple kernel launch
The numbers are taken from the command-line profiler, which reports the cputime values given above.
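For reference, the setup described above can be sketched roughly like this (a minimal CUDA sketch, not the exact benchmark used; note that cudaEvent timing measures GPU-side time, so the profiler's cputime figures may differ, and error checking is omitted):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void emptyKernel() {}

int main() {
    float *h, *d;
    cudaHostAlloc(&h, sizeof(float), cudaHostAllocDefault);  // pinned 4-byte host buffer
    cudaMalloc(&d, sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Time the async 4-byte host-to-device copy.
    cudaEventRecord(start);
    cudaMemcpyAsync(d, h, sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msCopy;
    cudaEventElapsedTime(&msCopy, start, stop);

    // Time an empty kernel launch.
    cudaEventRecord(start);
    emptyKernel<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float msLaunch;
    cudaEventElapsedTime(&msLaunch, start, stop);

    printf("4B H2D copy: %.1f us, kernel launch: %.1f us\n",
           msCopy * 1000.0f, msLaunch * 1000.0f);

    cudaFreeHost(h);
    cudaFree(d);
    return 0;
}
```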
Why is the latency so high on the Jetson kit? Is there some configuration I have missed, or is it a known issue?
“Tegra K1 supports Unified Memory, however in contrast to current desktop / server GPUs, the memory on Tegra is physically unified. However, there are separate GPU and CPU caches.”
Jetson’s GPU uses the same physical memory as the CPU…a dedicated video card or dedicated GPU would have its own memory, which would be much faster. So imagine an x86 system with a GPU that has no memory of its own…unless you compare against that, it is an “apples-and-oranges” comparison.
H2D max bandwidth: 0.998 MB/sec
D2H max bandwidth: 5.4 MB/sec
D2D max bandwidth: 11.6 MB/sec
D2D comes close to the maximum achievable bandwidth. Host-to-device is very slow, while device-to-host reaches about 50% of the maximum. Any idea why H2D is so slow compared to the other transfer types?
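A rough sketch of how the three numbers above can be measured (a minimal version, not necessarily the original benchmark; buffer size and error checking are my own choices):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const size_t N = 32 << 20;  // 32 MB test buffer (arbitrary size)
    float *h, *dA, *dB;
    cudaHostAlloc(&h, N, cudaHostAllocDefault);  // pinned host buffer
    cudaMalloc(&dA, N);
    cudaMalloc(&dB, N);

    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    float ms;

    // Host to device
    cudaEventRecord(t0);
    cudaMemcpy(dA, h, N, cudaMemcpyHostToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("H2D: %.2f MB/s\n", (N / 1e6) / (ms / 1e3));

    // Device to host
    cudaEventRecord(t0);
    cudaMemcpy(h, dA, N, cudaMemcpyDeviceToHost);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("D2H: %.2f MB/s\n", (N / 1e6) / (ms / 1e3));

    // Device to device
    cudaEventRecord(t0);
    cudaMemcpy(dB, dA, N, cudaMemcpyDeviceToDevice);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    cudaEventElapsedTime(&ms, t0, t1);
    printf("D2D: %.2f MB/s\n", (N / 1e6) / (ms / 1e3));
    return 0;
}
```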
Copying from global memory to shared thread-block memory does not give the speedup you’d expect on desktop GPUs.
I’ve tried a few simple kernels, and atomic operations on global memory were faster on the Jetson than the shared-memory approach. You lose a lot of time doing the data duplication…
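To illustrate the two approaches being compared (kernel names and the reduction workload are my own illustration, not the original test kernels):

```cuda
#include <cuda_runtime.h>

// Shared-memory approach: stage data in shared memory, tree-reduce,
// then one atomicAdd per block. The staging step is the "duplication".
__global__ void reduceShared(const float *in, float *out, int n) {
    extern __shared__ float s[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;  // copy global -> shared
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(out, s[0]);  // one atomic per block
}

// Atomic approach: no staging, one atomicAdd per element directly
// on global memory.
__global__ void reduceAtomic(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, in[i]);
}
```

With physically unified memory and the TK1's cache hierarchy, the staging copy can cost more than it saves, which would match the observation above.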
The article on Zero Copy was interesting. But wouldn’t the duplication problem exist in both D2H and H2D?
From the explanation in the article, H2D == D2D, and a new pointer is returned because CPU and GPU memory are the same. So ideally both should have the same bandwidth. Correct me in case my understanding is wrong.
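For context, the zero-copy mechanism being discussed looks like this (a minimal sketch of mapped pinned memory; with this approach no H2D/D2H copy is issued at all, which is the point of the article):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] *= 2.0f;
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);  // enable mapped pinned memory

    const int n = 1024;
    float *h, *d;
    cudaHostAlloc(&h, n * sizeof(float), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d, h, 0);  // device alias of the same buffer

    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    scale<<<(n + 255) / 256, 256>>>(d, n);  // kernel reads/writes host memory in place
    cudaDeviceSynchronize();

    printf("h[0] = %f\n", h[0]);  // no explicit cudaMemcpy was needed
    cudaFreeHost(h);
    return 0;
}
```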
I don’t know how to explain the difference between the different transfer types either. With the shared memory I would expect all three to be roughly the same.
I’m noticing some other performance anomalies. For example, running a 1M-point C2C cuFFT in a loop appears to be CPU-bound: it simply pegs one of the ARM CPUs at 100%. Very strange…
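The loop I mean is essentially this (a minimal sketch; iteration count and buffer contents are arbitrary, and the plan is created once outside the loop):

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;  // 1M points
    cufftComplex *d;
    cudaMalloc(&d, N * sizeof(cufftComplex));

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);  // single 1M-point C2C transform

    for (int i = 0; i < 100; ++i)
        cufftExecC2C(plan, d, d, CUFFT_FORWARD);  // in-place forward FFT

    cudaDeviceSynchronize();  // wait for all queued transforms

    cufftDestroy(plan);
    cudaFree(d);
    return 0;
}
```

Even with the plan hoisted out of the loop as above, one ARM core sits at 100% while it runs.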