Is there any other way to accelerate TV_L1 optical flow calculation?

Hello, NVIDIA,

Many thanks for the reply to my last post. Indeed, I'm doing research on optimizing TV-L1 optical flow in CUDA.
As we know, the TV-L1 algorithm uses an implicit scheme to converge U and V, so the previous result is required when computing the current one.

The TV-L1 algorithm loop looks roughly like this:

for (4 pyramid levels)
    /* main loop */
    for (4 warpings)
        Kernel<1> (compute auxiliary variables)      (input: intensity (d_I0x, d_I0y, d_I1x, d_I1y), flow (d_u, d_v), gradient (d_I1wx, d_I1wy), ...)
        for (n < 50; break when total error < TH)
            Kernel<2> (update U)                     (input: rho (d_rho), gradient (d_grad2), flow (d_u, d_v), ...)
            Kernel<3> (compute total image error)    (input: d_err, d_terr)
            Kernel<4> (update dual V from U)         (input: divergence (d_p11, d_p12, d_p21, d_p22), tau, gamma)
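For reference, the core of the U update (Kernel<2>) in the Zach–Pock–Bischof TV-L1 formulation is a per-pixel thresholding step. A host-side sketch of that logic, one call per pixel (variable names are illustrative, not the actual buffers above):

```cpp
// Per-pixel TV-L1 thresholding step (Zach/Pock/Bischof formulation).
// rho: linearized brightness residual at this pixel;
// (I1wx, I1wy): warped image gradient; lt = lambda * theta.
// Updates one flow vector (u1, u2) in place. In CUDA, this body
// becomes one thread per pixel.
void thresholdStep(float rho, float I1wx, float I1wy, float lt,
                   float& u1, float& u2) {
    float grad2 = I1wx * I1wx + I1wy * I1wy;   // |grad I1w|^2
    float d1, d2;
    if (rho < -lt * grad2) {                   // residual strongly negative
        d1 = lt * I1wx;  d2 = lt * I1wy;
    } else if (rho > lt * grad2) {             // residual strongly positive
        d1 = -lt * I1wx; d2 = -lt * I1wy;
    } else if (grad2 > 1e-10f) {               // inside the threshold band
        d1 = -rho * I1wx / grad2;
        d2 = -rho * I1wy / grad2;
    } else {
        d1 = d2 = 0.0f;                        // flat region: no update
    }
    u1 += d1;
    u2 += d2;
}
```

Because each pixel's update is independent given rho and the gradient, this step parallelizes cleanly; the serial dependency is only across iterations, not across pixels.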

I tried changing GpuMat to raw arrays, and I rewrote cuda::resize and cuda::multiply as a single fused kernel.
These efforts saved about 1 ms (OpenCV: 6 ms, mine: 5 ms, for two 224×224 CV_32F frames).
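Fusing resize and multiply is a good instinct, since both are memory-bound and the intermediate buffer can be skipped entirely. A host-side sketch of the fused per-pixel logic for propagating coarse flow up one pyramid level (nearest-neighbour resize for brevity; in CUDA each output pixel would be one thread):

```cpp
// Fused flow upsampling: resize a coarse flow field up by 2x and
// scale the values by 2 in the same pass, replacing the separate
// resize + multiply kernels. Nearest-neighbour sampling for brevity;
// a bilinear fetch would slot into the same single pass.
void upsampleFlow2x(const float* coarse, int cw, int ch,
                    float* fine /* (2*cw) x (2*ch) */) {
    int fw = 2 * cw;
    for (int y = 0; y < 2 * ch; ++y)
        for (int x = 0; x < fw; ++x)
            fine[y * fw + x] = 2.0f * coarse[(y / 2) * cw + (x / 2)];
}
```

The same fusion pattern applies to any adjacent elementwise pair in the loop (e.g. scaling a gradient right after computing it): one global read, one global write, no intermediate array.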

I also tried float16 and int32 instead of float32, but no matter which global variable I converted to float16,
the final flow was poor: after one pyramid level I could no longer get the correct answer, while float32 could.
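One likely reason float16 breaks the solver: binary16 keeps only 11 significant bits, so near a value of 1.0 the spacing between representable numbers is about 2^-10 ≈ 0.001, and the small per-iteration updates to u/v (on the order of tau times a gradient) can round away entirely. A rough host-side emulation of binary16 rounding (normal range only; no subnormal/overflow handling — a sketch, not a full IEEE conversion) shows the effect:

```cpp
#include <cmath>

// Approximate round-to-nearest binary16 quantization for normal-range
// floats: keep 11 significant bits of the mantissa. Ignores subnormals,
// infinities, and overflow -- enough to demonstrate the precision loss.
float quantizeHalf(float x) {
    if (x == 0.0f) return 0.0f;
    int e;
    float m = std::frexp(x, &e);            // x = m * 2^e, |m| in [0.5, 1)
    m = std::round(m * 2048.0f) / 2048.0f;  // 11 significant bits
    return std::ldexp(m, e);
}
```

An increment of 3e-4 added to a flow value of 1.0 vanishes after this quantization, while float32 keeps it; over 50 inner iterations the solver simply stops making progress. If you want mixed precision, the usual compromise is float16 storage for the images/gradients but float32 accumulation for u, v, and the dual variables.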

Is there any other way to accelerate? (Of course, I also tuned parameters such as Loop_num, pyramid_num, warping_num, tau, and gamma, which are mentioned at ). Or did I use float16 incorrectly?

I'm considering dividing the 224×224 image into 7×7 partitions of 32×32 to compute d_U and d_V individually, so that I can keep the variables in shared memory instead of global memory, since computing optical flow only needs the surrounding data.
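A quick budget check suggests the tiling idea fits: the divergence/gradient stencils read immediate neighbours, so each 32×32 tile needs a 1-pixel halo, i.e. 34×34 entries per buffer. Assuming (illustratively) six float buffers per tile — u, v, and the four dual components p11, p12, p21, p22 — against the common 48 KB per-block shared-memory limit:

```cpp
#include <cstddef>

// Shared-memory budget for one square tile with a halo. Stencil
// kernels (divergence, forward gradient) read immediate neighbours,
// so each buffer must cover (tile + 2*halo)^2 entries.
size_t tileSharedBytes(int tile, int halo, int nBuffers) {
    size_t side = tile + 2 * halo;
    return side * side * nBuffers * sizeof(float);
}
```

With tile=32, halo=1, nBuffers=6 this is 34·34·6·4 = 27744 bytes, comfortably under 48 KB. The caveat is correctness rather than capacity: the TV regularizer couples neighbouring pixels across tile borders, so fully independent per-tile solves will show seams unless the halos are refreshed from global memory every iteration (or at least every few iterations).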

Any help would be greatly appreciated.

Hi haifengli,

The DeepStream SDK 3.0 has been released; please migrate your project to this version to take advantage of its hardware-accelerated plugins: