One kernel or several? Implementing a matching algorithm on the GPU

The kernel consists of the following phases:

  1. All threads perform the following 10 times (each of the 10 iterations reuses most of the values already read from texture in the previous iteration; this holds for the later steps too):
    256*2 texture reads + SAD (sum of absolute differences)
  2. Only some threads perform, 3 or 5 times:
    2 texture reads + SAD
  3. Only some of the threads from step 2 perform, 3 or 5 times:
    256*2 texture reads + SAD
  4. All threads perform, 16 times:
    256*2 texture reads + SAD

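The phases above could be expressed as a single kernel, with the "only some" steps as branches. This is only a hypothetical sketch (the function and flag names are invented, and `sadStep` is a stub standing in for the actual texture reads + SAD), but it shows where warp divergence would appear:

```cuda
// Placeholder for the real texture reads + SAD of each step.
__device__ float sadStep(int step, int tid, int iter) { return 0.0f; }

// Single-kernel sketch: steps 2 and 3 become if statements, so
// threads within the same warp may diverge on those branches.
__global__ void matchKernel(const int *inStep2, const int *inStep3,
                            float *result)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sad = 0.0f;

    for (int i = 0; i < 10; ++i)          // step 1: all threads, 10 times
        sad += sadStep(1, tid, i);        // 256*2 reads + SAD each

    if (inStep2[tid]) {                   // step 2: only some threads
        for (int i = 0; i < 5; ++i)       // 3 or 5 repetitions
            sad += sadStep(2, tid, i);
        if (inStep3[tid])                 // step 3: a further subset
            for (int i = 0; i < 5; ++i)
                sad += sadStep(3, tid, i);
    }

    for (int i = 0; i < 16; ++i)          // step 4: all threads, 16 times
        sad += sadStep(4, tid, i);

    result[tid] = sad;
}
```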
Above, by "threads" I mean the data sets for which the computation must be performed.
I could implement the "only some" parts with "if" statements inside a single kernel, but that would cause thread divergence. Alternatively, I could use four different kernels, but I don't know how to launch them on just the required parts of the data while keeping all the processors busy. With this second idea, only some of the threads from step 1 would survive into step 2, and likewise for step 3. To distribute them among the GPU processors, I would have to select/sort them and then launch another grid of blocks sized to match their number (for steps 2 and 3).
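The select-and-relaunch idea is essentially stream compaction. A minimal sketch under assumed names (`survives`, `survivorIdx`, `step2Kernel` are all hypothetical): each surviving data set reserves a slot with `atomicAdd`, and the next kernel's grid is sized to the survivor count:

```cuda
// Build a compacted list of surviving data-set indices.
__global__ void compactSurvivors(const int *survives, int n,
                                 int *survivorIdx, int *survivorCount)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n && survives[tid]) {
        int slot = atomicAdd(survivorCount, 1);  // reserve an output slot
        survivorIdx[slot] = tid;                 // remember who survived
    }
}

// Step-2 kernel: one thread per survivor, indices come from the list.
__global__ void step2Kernel(const int *survivorIdx, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    int tid = survivorIdx[i];   // original data-set index
    // ... 2 texture reads + SAD for data set `tid` ...
}

// Host side: copy the count back, then size the step-2 grid to match:
//   cudaMemcpy(&count, survivorCount, sizeof(int),
//              cudaMemcpyDeviceToHost);
//   step2Kernel<<<(count + 255) / 256, 256>>>(survivorIdx, count);
```

Note that `atomicAdd` does not preserve the original ordering of the survivors; if a stable ordering matters, Thrust's `thrust::copy_if` does the compaction for you.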

What do you recommend — either of the above suggestions, or some other solution? Any suggestions are appreciated, since I'm pretty new to GPU programming.

Is it better to have one longer kernel, or several smaller kernels with less divergence?

I recommend experimenting to find what gives the best performance for your app (i.e., it's hard to predict).
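To compare the variants fairly, CUDA events give GPU-side timings without relying on host clocks. A sketch (the `launch` callback standing in for whichever variant's kernel launches you want to measure):

```cuda
#include <cuda_runtime.h>

// Time one variant (a single kernel, or a sequence of smaller
// kernels) using CUDA events; returns elapsed time in milliseconds.
float timeVariant(void (*launch)(void))
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    launch();                       // launch the variant's kernel(s)
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);     // wait for the GPU to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Run each variant several times and ignore the first call, since it includes one-time initialization overhead.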