Jetson TK1 performance bottleneck

I have run the multiplication samples, specifically matrixMulCUBLAS, which is supposed to be the most performance-optimized. But the results are rather disappointing (or at least they are to me).

My target application is image stitching, which would include image alignment (warping), image stitching, and image blending. All of these operations run on three 1920x1080 color images. With performance like this (as seen below) I don't think that's going to be possible to pull off in real time.

Am I wrong in my judgement? If so, kindly give me pointers to what I should read. If I am not wrong, then my question is: would the TX1 be able to handle this load?

[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GK20A" with compute capability 3.2

MatrixA(640,1280), MatrixB(640,1280), MatrixC(640,1280)
Computing result using CUBLAS...done.
Performance= 150.89 GFlop/s, Time= 6.949 msec, Size= 1048576000 Ops

I have had only minor exposure to image processing, but I would assume that the operations you list are limited by memory bandwidth, not computational throughput. On the other hand, assuming that the example you reference uses a CUBLAS *GEMM call for matrix multiplication, that is a compute-bound task.

So to find out whether any particular processor can “handle this load”, you would want to determine how many FLOPS and how much memory bandwidth are required to process a 1920x1080 RGBA image in 1/60 second, which is what I assume “real time” means. Then compare that to the specified FLOPS and bandwidth of the processor, keeping in mind that real-life peak performance may be around 75% of the theoretical peak for either metric.
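As a rough sketch of that budget calculation (the ops-per-pixel and passes-per-frame figures below are hypothetical placeholders, not measurements of any actual stitching pipeline):

```python
# Back-of-envelope budget for processing a 1920x1080 RGBA frame at 60 fps.
# The per-pixel cost figures are illustrative assumptions only.
W, H = 1920, 1080
BYTES_PER_PIXEL = 4        # RGBA, 8 bits per channel (assumption)
FRAME_RATE = 60            # "real time" assumed to mean 1/60 s per frame

frame_bytes = W * H * BYTES_PER_PIXEL       # bytes in one image
ops_per_pixel = 100                         # hypothetical compute cost
passes_per_frame = 10                       # hypothetical read/write passes

flops_needed = W * H * ops_per_pixel * FRAME_RATE        # FLOPS required
bw_needed = frame_bytes * passes_per_frame * FRAME_RATE  # bytes/s required

print(f"frame size:       {frame_bytes / 1e6:.1f} MB")
print(f"compute budget:   {flops_needed / 1e9:.1f} GFLOPS")
print(f"bandwidth budget: {bw_needed / 1e9:.1f} GB/s")
```

With these (made-up) per-pixel costs, the compute budget is small compared to the GPU's peak, while the bandwidth budget is a meaningful fraction of the TK1's measured memory throughput, which is why I'd expect these operations to be bandwidth-limited.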

To be sure, the TK1 is a low-end system compared to a full-fledged PC with a high-end GPU such as a Titan X. The performance difference is probably on the order of 20x in terms of both computational throughput and memory bandwidth. They certainly serve very different purposes.

You may want to check the various clocks on your Jetson:

http://elinux.org/Jetson/Performance#Controlling_GPU_performance

The peak theoretical throughput for single-precision floating-point operations on the Jetson TK1 is ~326 GFLOPS:

http://elinux.org/Jetson_TK1

(192 cores x 2 FLOPS per cycle (FMA) x 0.852 GHz)

Due to the Kepler architecture, you might see only ~66-75% efficiency, so an observed peak of ~200 GFLOPS may be possible. Your measurement of ~150 is therefore “in the ballpark”, but some improvement may still be possible.
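The arithmetic behind those numbers, as a quick check (the 150.89 figure is taken from the benchmark output above):

```python
# TK1 peak: 192 Kepler cores x 2 FLOPS per cycle (FMA) x 0.852 GHz
cores = 192
flops_per_cycle = 2      # one fused multiply-add per core per cycle
clock_ghz = 0.852

peak_gflops = cores * flops_per_cycle * clock_ghz   # theoretical peak
measured = 150.89                                   # from the benchmark run
efficiency = measured / peak_gflops

print(f"theoretical peak:    {peak_gflops:.1f} GFLOPS")
print(f"achievable (66-75%): {0.66 * peak_gflops:.0f}-{0.75 * peak_gflops:.0f} GFLOPS")
print(f"measured efficiency: {efficiency:.0%}")
```

So the measured run is at roughly 46% of theoretical peak, below the ~66-75% one might hope for from a well-tuned Kepler GEMM, which is why some headroom may remain (e.g. making sure the GPU is at its maximum clock).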

The TX1 has two Maxwell SMs with 256 CUDA cores in total and a core clock closer to 1 GHz, so the advertised peak is ~500 GFLOPS:

http://www.nvidia.com/object/jetson-tx1-module.html

(The 1 TFLOPS number assumes half-precision, not single-precision, floating point.)

If you can structure your problem to work primarily in half precision (e.g. cublasHgemm):

https://devblogs.nvidia.com/parallelforall/new-features-cuda-7-5/

the TX1 could provide a substantial boost (~4x); otherwise you might expect a ~2x boost out of the TX1 for this GEMM test case. The Maxwell SM should have somewhat better efficiency (actual as a percentage of peak) than Kepler.
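For a rough comparison using raw advertised peaks (the TX1 clock of 1 GHz is an assumption here; the ~2x/~4x figures quoted above additionally credit Maxwell's better achieved efficiency, so they come out somewhat higher than these ratios):

```python
# Theoretical peak comparison, TK1 vs TX1 (TX1 clock is an assumption)
tk1_fp32 = 192 * 2 * 0.852          # ~327 GFLOPS single precision
tx1_clock_ghz = 1.0                 # assumption: "closer to 1 GHz"
tx1_fp32 = 256 * 2 * tx1_clock_ghz  # ~512 GFLOPS single precision
tx1_fp16 = 2 * tx1_fp32             # ~1 TFLOPS at half precision (2x fp32 rate)

print(f"fp32 peak speedup: ~{tx1_fp32 / tk1_fp32:.1f}x")
print(f"fp16 peak speedup: ~{tx1_fp16 / tk1_fp32:.1f}x")
```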

These are all just ballpark estimates; YMMV. Others will probably chime in. This thread may be of interest:

https://devtalk.nvidia.com/default/topic/752076/jetson-tk1-performance-issues/?offset=3

As for the TK1’s available memory bandwidth, I cannot find the official specification right now, but this thread contains multiple reports indicating that its measured peak throughput is around 12-13 GB/sec:

https://devtalk.nvidia.com/default/topic/754874/jetson-tk1/tk1-memory-bandwidth/
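Taking the low end of that ~12-13 GB/s measured figure, a quick feasibility check against the 1080p stitching workload (the passes-per-pixel count and 4 bytes/pixel are hypothetical assumptions, not properties of any particular pipeline):

```python
# Is ~12 GB/s enough for real-time stitching of three 1080p images?
measured_bw = 12e9               # bytes/s, low end of the reported 12-13 GB/s
frame_bytes = 1920 * 1080 * 4    # one RGBA frame (assumption: 4 bytes/pixel)
images = 3                       # three input images, per the question
passes = 10                      # hypothetical total reads+writes per pixel

per_frame = frame_bytes * images * passes   # memory traffic per output frame
fps_limit = measured_bw / per_frame         # bandwidth-limited frame rate

print(f"traffic per output frame: {per_frame / 1e6:.0f} MB")
print(f"bandwidth-limited rate:   ~{fps_limit:.0f} fps")
```

Under these assumptions the workload lands close to the 12-13 GB/s ceiling at 60 fps, which again suggests memory bandwidth, rather than FLOPS, is the quantity to budget carefully.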

@modtali,

It’s possible the TK1 was clocking the GPU at one of its lower frequencies during your benchmark:

http://elinux.org/Jetson/Performance#Controlling_GPU_performance