Jetson TK1 performance bottleneck

I have run the multiplication samples, specifically matrixMulCUBLAS, which is supposed to be the most performance-optimized. But the results are rather disappointing (or at least they are to me).

My target application is image stitching, which would include image alignment (warping), image stitching, and image blending. All of these operations run on three 1920x1080 color images. With performance like this (as seen below) I don't think that's going to be possible to pull off in real time.

Am I wrong in my judgement? If so, kindly give me pointers to what I should read. If I am not wrong, then my question is: would the TX1 be able to handle this load?

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GK20A" with compute capability 3.2

MatrixA(640,1280), MatrixB(640,1280), MatrixC(640,1280)
Computing result using CUBLAS...done.
Performance= 150.89 GFlop/s, Time= 6.949 msec, Size= 1048576000 Ops

I have had only minor exposure to image processing, but I would assume that the operations you list are limited by memory bandwidth, not computational throughput. On the other hand, assuming that the example you reference uses a CUBLAS *GEMM call for matrix multiplication, that is a compute-bound task.

So to find out whether any particular processor can “handle this load”, you would want to determine how many FLOPS and how much memory bandwidth are required to process a 1920x1080 RGBA image in 1/60 second, which is what I assume “real time” means. Then compare that to the specified FLOPS and bandwidth of the processor, keeping in mind that real-life peak performance may be around 75% of the theoretical peak for either metric.
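As a rough sketch of that budget calculation (the ops-per-pixel and passes-per-frame figures below are hypothetical placeholders, not measurements of any actual stitching pipeline):

```python
# Back-of-envelope budget for processing a 1920x1080 RGBA frame at 60 fps.
# The per-pixel cost figures are illustrative assumptions only.
W, H = 1920, 1080
BYTES_PER_PIXEL = 4        # RGBA, 8 bits per channel (assumption)
FRAME_RATE = 60            # "real time" assumed to mean 1/60 s per frame

frame_bytes = W * H * BYTES_PER_PIXEL       # bytes in one image
ops_per_pixel = 100                         # hypothetical compute cost
passes_per_frame = 10                       # hypothetical read/write passes

flops_needed = W * H * ops_per_pixel * FRAME_RATE        # FLOPS required
bw_needed = frame_bytes * passes_per_frame * FRAME_RATE  # bytes/s required

print(f"frame size:       {frame_bytes / 1e6:.1f} MB")
print(f"compute budget:   {flops_needed / 1e9:.1f} GFLOPS")
print(f"bandwidth budget: {bw_needed / 1e9:.1f} GB/s")
```

With these (made-up) per-pixel costs, the compute budget is small compared to the GPU's peak, while the bandwidth budget is a meaningful fraction of the TK1's measured memory throughput, which is why I'd expect these operations to be bandwidth-limited.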

To be sure, the TK1 is a low-end system compared to a full-fledged PC with a high-end GPU such as a Titan X. The performance difference is probably on the order of 20x in terms of both computational throughput and memory bandwidth. They certainly serve very different purposes.

You may want to check the various clocks on your Jetson:

http://elinux.org/Jetson/Performance#Controlling_GPU_performance

The peak theoretical throughput for single-precision floating-point operations on the Jetson TK1 is ~326 GFLOPS:

http://elinux.org/Jetson_TK1

(192 cores x 2 FLOPS per cycle (FMA) x 0.852 GHz)

Due to the Kepler architecture, you might see only ~66-75% efficiency, so an observed peak of ~200 GFLOPS may be possible. Your measurement of ~150 is therefore “in the ballpark”, but some improvement may still be possible.
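The arithmetic behind those numbers, as a quick check (the 150.89 figure is taken from the benchmark output above):

```python
# TK1 peak: 192 Kepler cores x 2 FLOPS per cycle (FMA) x 0.852 GHz
cores = 192
flops_per_cycle = 2      # one fused multiply-add per core per cycle
clock_ghz = 0.852

peak_gflops = cores * flops_per_cycle * clock_ghz   # theoretical peak
measured = 150.89                                   # from the benchmark run
efficiency = measured / peak_gflops

print(f"theoretical peak:    {peak_gflops:.1f} GFLOPS")
print(f"achievable (66-75%): {0.66 * peak_gflops:.0f}-{0.75 * peak_gflops:.0f} GFLOPS")
print(f"measured efficiency: {efficiency:.0%}")
```

So the measured run is at roughly 46% of theoretical peak, below the ~66-75% one might hope for from a well-tuned Kepler GEMM, which is why some headroom may remain (e.g. making sure the GPU is at its maximum clock).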

The TX1 has two Maxwell SMs with 256 CUDA cores in total and a core clock closer to 1 GHz, so the advertised peak is ~500 GFLOPS:

http://www.nvidia.com/object/jetson-tx1-module.html

(The 1 TFLOPS number assumes half-precision, not single-precision, floating point.)

If you can structure your problem to work primarily in half precision (e.g. cublasHgemm):

https://devblogs.nvidia.com/parallelforall/new-features-cuda-7-5/

the TX1 could provide a substantial boost (~4x); otherwise you might expect a ~2x boost out of the TX1 for this GEMM test case. The Maxwell SM should have somewhat better efficiency (actual as a percentage of peak) than Kepler.
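For a rough comparison using raw advertised peaks (the TX1 clock of 1 GHz is an assumption here; the ~2x/~4x figures quoted above additionally credit Maxwell's better achieved efficiency, so they come out somewhat higher than these ratios):

```python
# Theoretical peak comparison, TK1 vs TX1 (TX1 clock is an assumption)
tk1_fp32 = 192 * 2 * 0.852          # ~327 GFLOPS single precision
tx1_clock_ghz = 1.0                 # assumption: "closer to 1 GHz"
tx1_fp32 = 256 * 2 * tx1_clock_ghz  # ~512 GFLOPS single precision
tx1_fp16 = 2 * tx1_fp32             # ~1 TFLOPS at half precision (2x fp32 rate)

print(f"fp32 peak speedup: ~{tx1_fp32 / tk1_fp32:.1f}x")
print(f"fp16 peak speedup: ~{tx1_fp16 / tk1_fp32:.1f}x")
```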

These are all just ballpark estimates; YMMV. Others will probably chime in. This thread may be of interest:

https://devtalk.nvidia.com/default/topic/752076/jetson-tk1-performance-issues/?offset=3

As for the TK1’s available memory bandwidth, I cannot find the official specification right now, but this thread contains multiple reports indicating that its measured peak throughput is around 12-13 GB/sec:

https://devtalk.nvidia.com/default/topic/754874/jetson-tk1/tk1-memory-bandwidth/
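Taking the low end of that ~12-13 GB/s measured figure, a quick feasibility check against the 1080p stitching workload (the passes-per-pixel count and 4 bytes/pixel are hypothetical assumptions, not properties of any particular pipeline):

```python
# Is ~12 GB/s enough for real-time stitching of three 1080p images?
measured_bw = 12e9               # bytes/s, low end of the reported 12-13 GB/s
frame_bytes = 1920 * 1080 * 4    # one RGBA frame (assumption: 4 bytes/pixel)
images = 3                       # three input images, per the question
passes = 10                      # hypothetical total reads+writes per pixel

per_frame = frame_bytes * images * passes   # memory traffic per output frame
fps_limit = measured_bw / per_frame         # bandwidth-limited frame rate

print(f"traffic per output frame: {per_frame / 1e6:.0f} MB")
print(f"bandwidth-limited rate:   ~{fps_limit:.0f} fps")
```

Under these assumptions the workload lands close to the 12-13 GB/s ceiling at 60 fps, which again suggests memory bandwidth, rather than FLOPS, is the quantity to budget carefully.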

@modtali,

It’s possible the TK1 was clocking the GPU at one of its lower frequencies during your benchmark:

http://elinux.org/Jetson/Performance#Controlling_GPU_performance