Hello, NVIDIA,

Many thanks for the reply to my last post. Indeed, I'm doing research on optimizing TV-L1 optical flow (CUDA).

As we know, the TV-L1 algorithm uses an implicit scheme to make U and V converge, which means the previous result is required when computing the current one.

The TV-L1 algorithm loop looks roughly like this:

```
for (4 pyramid levels)
{
    /* main loop */
    for (4 warpings)
    {
        Kernel<1> (calc auxiliary variables)
            (input: intensity (d_I0x, d_I0y, d_I1x, d_I1y),
                    flow (d_u, d_v), warped gradient (d_I1wx, d_I1wy), ...)

        for (n < 50; BREAK when total error < TH)
        {
            Kernel<2> (calc U)                 (input: rho (d_rho), gradient (d_grad2), flow (d_u, d_v), ...)
            Kernel<3> (calc total image error) (input: d_err, d_terr)
            Kernel<4> (dual V from U)          (input: divergence (d_p11, d_p12, d_p21, d_p22), tau, gamma)
        }
    }
}
```

I tried changing GpuMat to raw arrays, and rewrote cuda::resize and cuda::multiply as a single fused kernel.

These efforts gained about 1 ms (OpenCV: 6 ms, mine: 5 ms, for 2 frames of 224x224 CV_32F).
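To make the fusion concrete, here is a minimal sketch of the kind of kernel I mean: a bilinear downscale with the per-pixel multiply folded in, so the intermediate image never touches global memory twice. All names are illustrative, not OpenCV's:

```cuda
// Fused bilinear resize + scalar multiply (replaces cuda::resize
// followed by cuda::multiply). One global read pass, one write pass.
__global__ void resizeMulKernel(const float* __restrict__ src, int srcW, int srcH,
                                float* __restrict__ dst, int dstW, int dstH,
                                float scale)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= dstW || y >= dstH) return;

    // Map the destination pixel back into the source image.
    float fx = fmaxf(0.f, (x + 0.5f) * srcW / dstW - 0.5f);
    float fy = fmaxf(0.f, (y + 0.5f) * srcH / dstH - 0.5f);
    int x0 = min(srcW - 2, (int)floorf(fx));
    int y0 = min(srcH - 2, (int)floorf(fy));
    float ax = fx - x0, ay = fy - y0;

    // Bilinear interpolation; the multiply then costs nothing extra.
    float v = (1-ay) * ((1-ax)*src[ y0   *srcW + x0] + ax*src[ y0   *srcW + x0+1])
            +    ay  * ((1-ax)*src[(y0+1)*srcW + x0] + ax*src[(y0+1)*srcW + x0+1]);
    dst[y * dstW + x] = v * scale;
}
```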

I also tried float16 and int32 instead of float32, but no matter which global variables I switched to float16, the final flow was poor: after one pyramid level I can't get the correct answer, while float32 works.
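One possible explanation (an assumption on my part, not something I have verified): fp16 only has about 3 decimal digits of precision, so the small per-iteration updates to u and v (on the order of tau * div(p)) can round to zero and the iteration stops converging. The usual mixed-precision pattern is to keep the *iterates* in float32 and use fp16 only for read-only inputs, to save bandwidth. A minimal sketch, with illustrative names:

```cuda
#include <cuda_fp16.h>

// Mixed-precision update sketch: the iterate u stays in fp32 so tiny
// per-step increments are not rounded away; the read-only divergence
// field is stored as __half to halve its memory traffic, and widened
// to float just before the arithmetic.
__global__ void updateU(float* __restrict__ u,             // iterate: keep fp32
                        const __half* __restrict__ div_p,  // read-only input: fp16 OK
                        float tau, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    u[i] += tau * __half2float(div_p[i]);  // widen, then update in fp32
}
```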

Is there any other way to accelerate this? (I have, of course, also tuned parameters such as iteration count, pyramid levels, warping count, tau, and gamma, as mentioned at https://stackoverflow.com/questions/19309567/speeding-up-optical-flow-createoptflow-dualtvl1.) Or did I use float16 incorrectly?

I'm considering dividing the 224x224 image into 7x7 tiles of 32x32 and computing d_u and d_v per tile, so that I can keep variables in shared memory instead of global memory, since computing the optical flow at a pixel only needs the surrounding data.
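The tiling idea could be sketched like this: each block stages its 32x32 tile plus a 1-pixel halo into shared memory, and all subsequent stencil reads hit shared memory. One caveat worth noting: the TV-L1 iterations couple neighbouring tiles through the halo, so running many iterations inside one kernel launch without exchanging halos would change the result. The stencil body below is a placeholder for the real update:

```cuda
#define TILE 32
#define HALO 1  // stencil radius needed by the gradient/divergence operators

// Shared-memory tiling sketch: cooperatively load TILE x TILE plus a
// halo, then compute a neighbour-based update from shared memory only.
__global__ void stencilTile(const float* __restrict__ u,
                            float* __restrict__ out, int W, int H)
{
    __shared__ float s[TILE + 2*HALO][TILE + 2*HALO];

    int gx = blockIdx.x * TILE + threadIdx.x;
    int gy = blockIdx.y * TILE + threadIdx.y;

    // Cooperative load of tile + halo, clamped at the image borders.
    for (int dy = threadIdx.y; dy < TILE + 2*HALO; dy += TILE)
        for (int dx = threadIdx.x; dx < TILE + 2*HALO; dx += TILE) {
            int sx = min(max((int)blockIdx.x * TILE + dx - HALO, 0), W - 1);
            int sy = min(max((int)blockIdx.y * TILE + dy - HALO, 0), H - 1);
            s[dy][dx] = u[sy * W + sx];
        }
    __syncthreads();

    if (gx >= W || gy >= H) return;
    int lx = threadIdx.x + HALO, ly = threadIdx.y + HALO;

    // Placeholder forward-difference stencil; the real Kernel<2>/<4>
    // math (div(p), gradients) would read the same shared tile.
    out[gy * W + gx] = (s[ly][lx+1] - s[ly][lx]) + (s[ly+1][lx] - s[ly][lx]);
}
```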

Any help would be greatly appreciated.