I have run the matrix multiplication samples, specifically matrixMulCUBLAS, which is supposed to be the most performance-optimized one. But the results are rather disappointing (or at least they are to me).
My target application is image stitching, which would include image alignment (warping), image stitching and image blending, all performed on three 1920x1080 color images. With performance like this (as seen below), I don’t think that’s going to be possible to pull off in real time.
Am I wrong in my judgement? If so, kindly give me pointers to what I should read. If I am not wrong, then my question is: would the TX1 be able to handle this load?
[Matrix Multiply CUBLAS] - Starting...
GPU Device 0: "GK20A" with compute capability 3.2
MatrixA(640,1280), MatrixB(640,1280), MatrixC(640,1280)
Computing result using CUBLAS...done.
Performance= 150.89 GFlop/s, Time= 6.949 msec, Size= 1048576000 Ops
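
As a quick sanity check on how the log's figures relate, the reported GFlop/s can be recomputed from the Size and Time values it prints. This is only a minimal sketch: the helper name `gflops` is made up for illustration, and the numbers are copied verbatim from the log above, not re-measured.

```python
# Figures copied from the last line of the matrixMulCUBLAS log above.
size_ops = 1_048_576_000   # "Size= 1048576000 Ops"
time_msec = 6.949          # "Time= 6.949 msec"
reported_gflops = 150.89   # "Performance= 150.89 GFlop/s"


def gflops(ops, msec):
    """Convert an operation count and a time in milliseconds to GFlop/s.

    Hypothetical helper for illustration; not part of the CUDA samples.
    """
    seconds = msec / 1000.0
    return ops / seconds / 1e9


computed_gflops = gflops(size_ops, time_msec)
print(f"computed {computed_gflops:.2f} GFlop/s, reported {reported_gflops} GFlop/s")
```

The recomputed value agrees with the reported one to within a few hundredths of a GFlop/s; the small gap presumably comes from the time being printed with only three decimal places.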