Is there any other way to accelerate TV_L1 optical flow calculation?

Hello, NVIDIA,

Many thanks for the reply to my last post. Indeed, I'm doing research on optimizing TV-L1 optical flow in CUDA.
As we know, the TV-L1 algorithm uses an implicit scheme to converge U and V, so the previous result is required when computing the current one.

The TV-L1 algorithm loop looks roughly like this:

for (4 pyramid levels)
    /* main loop */
    for (4 warpings)
        Kernel<1> (compute auxiliary variables)      (input: intensity (d_I0x, d_I0y, d_I1x, d_I1y), flow (d_u, d_v), gradient (d_I1wx, d_I1wy), ...)
        for (n < 50; break when total error < TH)
            Kernel<2> (update U)                     (input: rho (d_rho), gradient (d_grad2), flow (d_u, d_v), ...)
            Kernel<3> (compute total image error)    (input: d_err, d_terr)
            Kernel<4> (update dual V from U)         (input: divergence (d_p11, d_p12, d_p21, d_p22), tau, gamma)
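For reference, the core of the U update (Kernel<2>) in the Zach–Pock–Bischof TV-L1 formulation is a per-pixel thresholding step. A host-side sketch of that logic, one call per pixel (variable names are illustrative, not the actual buffers above):

```cpp
// Per-pixel TV-L1 thresholding step (Zach/Pock/Bischof formulation).
// rho: linearized brightness residual at this pixel;
// (I1wx, I1wy): warped image gradient; lt = lambda * theta.
// Updates one flow vector (u1, u2) in place. In CUDA, this body
// becomes one thread per pixel.
void thresholdStep(float rho, float I1wx, float I1wy, float lt,
                   float& u1, float& u2) {
    float grad2 = I1wx * I1wx + I1wy * I1wy;   // |grad I1w|^2
    float d1, d2;
    if (rho < -lt * grad2) {                   // residual strongly negative
        d1 = lt * I1wx;  d2 = lt * I1wy;
    } else if (rho > lt * grad2) {             // residual strongly positive
        d1 = -lt * I1wx; d2 = -lt * I1wy;
    } else if (grad2 > 1e-10f) {               // inside the threshold band
        d1 = -rho * I1wx / grad2;
        d2 = -rho * I1wy / grad2;
    } else {
        d1 = d2 = 0.0f;                        // flat region: no update
    }
    u1 += d1;
    u2 += d2;
}
```

Because each pixel's update is independent given rho and the gradient, this step parallelizes cleanly; the serial dependency is only across iterations, not across pixels.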

I tried changing GpuMat to raw arrays, and I rewrote cuda::resize and cuda::multiply as a single fused kernel.
These efforts saved about 1 ms (OpenCV: 6 ms, mine: 5 ms, for two 224×224 CV_32F frames).
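Fusing resize and multiply is a good instinct, since both are memory-bound and the intermediate buffer can be skipped entirely. A host-side sketch of the fused per-pixel logic for propagating coarse flow up one pyramid level (nearest-neighbour resize for brevity; in CUDA each output pixel would be one thread):

```cpp
// Fused flow upsampling: resize a coarse flow field up by 2x and
// scale the values by 2 in the same pass, replacing the separate
// resize + multiply kernels. Nearest-neighbour sampling for brevity;
// a bilinear fetch would slot into the same single pass.
void upsampleFlow2x(const float* coarse, int cw, int ch,
                    float* fine /* (2*cw) x (2*ch) */) {
    int fw = 2 * cw;
    for (int y = 0; y < 2 * ch; ++y)
        for (int x = 0; x < fw; ++x)
            fine[y * fw + x] = 2.0f * coarse[(y / 2) * cw + (x / 2)];
}
```

The same fusion pattern applies to any adjacent elementwise pair in the loop (e.g. scaling a gradient right after computing it): one global read, one global write, no intermediate array.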

I also tried float16 and int32 instead of float32, but no matter which global variable I converted to float16,
the final flow was poor: after one pyramid level I could no longer get the correct answer, while float32 could.
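One likely reason float16 breaks the solver: binary16 keeps only 11 significant bits, so near a value of 1.0 the spacing between representable numbers is about 2^-10 ≈ 0.001, and the small per-iteration updates to u/v (on the order of tau times a gradient) can round away entirely. A rough host-side emulation of binary16 rounding (normal range only; no subnormal/overflow handling — a sketch, not a full IEEE conversion) shows the effect:

```cpp
#include <cmath>

// Approximate round-to-nearest binary16 quantization for normal-range
// floats: keep 11 significant bits of the mantissa. Ignores subnormals,
// infinities, and overflow -- enough to demonstrate the precision loss.
float quantizeHalf(float x) {
    if (x == 0.0f) return 0.0f;
    int e;
    float m = std::frexp(x, &e);            // x = m * 2^e, |m| in [0.5, 1)
    m = std::round(m * 2048.0f) / 2048.0f;  // 11 significant bits
    return std::ldexp(m, e);
}
```

An increment of 3e-4 added to a flow value of 1.0 vanishes after this quantization, while float32 keeps it; over 50 inner iterations the solver simply stops making progress. If you want mixed precision, the usual compromise is float16 storage for the images/gradients but float32 accumulation for u, v, and the dual variables.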

Is there any other way to accelerate? (Of course, I also tuned parameters such as Loop_num, pyramid_num, warping_num, tau, and gamma, which are mentioned at ). Or did I use float16 incorrectly?

I'm considering dividing the 224×224 image into 7×7 partitions of 32×32 to compute d_U and d_V individually, so that I can keep the variables in shared memory instead of global memory, since computing optical flow only needs the surrounding data.
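A quick budget check suggests the tiling idea fits: the divergence/gradient stencils read immediate neighbours, so each 32×32 tile needs a 1-pixel halo, i.e. 34×34 entries per buffer. Assuming (illustratively) six float buffers per tile — u, v, and the four dual components p11, p12, p21, p22 — against the common 48 KB per-block shared-memory limit:

```cpp
#include <cstddef>

// Shared-memory budget for one square tile with a halo. Stencil
// kernels (divergence, forward gradient) read immediate neighbours,
// so each buffer must cover (tile + 2*halo)^2 entries.
size_t tileSharedBytes(int tile, int halo, int nBuffers) {
    size_t side = tile + 2 * halo;
    return side * side * nBuffers * sizeof(float);
}
```

With tile=32, halo=1, nBuffers=6 this is 34·34·6·4 = 27744 bytes, comfortably under 48 KB. The caveat is correctness rather than capacity: the TV regularizer couples neighbouring pixels across tile borders, so fully independent per-tile solves will show seams unless the halos are refreshed from global memory every iteration (or at least every few iterations).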

Any help would be greatly appreciated.

Hi haifengli,

The DeepStream SDK 3.0 has been released; please migrate your project to this version to take advantage of its hardware-accelerated plugins: