Fine-grained threading: is my logic correct?

This is a question about fine-grained parallel logic.

I am very new to CUDA and am experimenting with using it for some image processing.

A contrived algorithm might look like the one below. I've added comments showing the intended multi-threading for parallel execution.

For all pixel cells              <--  thread blocks

    For all of some loop         <--  parallelize across, say, 8 threads

        data = getSomeData
        if( !data ) continue;    <--  the same number of loop iterations is not always needed

        Vec4 a = b + c           <--  parallelize as 4 sub-threads (x,y,z,w)
        Vec4 v = mat4x4 * a      <--  parallelize as 4 sub-threads where possible



So I might be able to process the image using blocks of 32 threads, where each block handles e.g. a 16x16 tile of cells, 8 threads parallelize the main loop, and each of those 8 uses 4 sub-threads to accelerate the vector math.
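To make the idea concrete, here is a minimal sketch of how that 8x4 thread split might look in a kernel. The names (processCells, cellData), the data layout (one float4 per iteration per cell), and the placeholder arithmetic are all assumptions of mine, not a definitive implementation:

```cuda
#include <cuda_runtime.h>

// Sketch only: one block per cell, 32 threads split as 8 "loop lanes" x 4 "vector lanes".
// Launched as e.g. processCells<<<numCells, 32>>>(...).
__global__ void processCells(const float4 *cellData,
                             const float *mat4x4 /* row-major */,
                             float4 *result, int itersPerCell)
{
    int cell     = blockIdx.x;
    int loopLane = threadIdx.x / 4;   // 0..7: which loop iterations this thread handles
    int vecLane  = threadIdx.x % 4;   // 0..3: which component (x,y,z,w) it computes

    // Four consecutive threads share a loopLane, so each iteration i is
    // worked on by 4 threads; groups of 4 stride the loop by 8.
    for (int i = loopLane; i < itersPerCell; i += 8) {
        float4 d = cellData[cell * itersPerCell + i];
        if (d.x == 0.0f) continue;    // stand-in for "if( !data ) continue"

        // Vec4 a = b + c: computed redundantly in every lane here for simplicity;
        // splitting it one-component-per-lane would require exchanging components
        // (shared memory or warp shuffles) before the matrix multiply below.
        float a[4] = { d.x + 1.0f, d.y + 1.0f, d.z + 1.0f, d.w + 1.0f }; // placeholder for b + c

        // v = mat4x4 * a: each vector lane computes one row's dot product.
        float v = mat4x4[vecLane * 4 + 0] * a[0]
                + mat4x4[vecLane * 4 + 1] * a[1]
                + mat4x4[vecLane * 4 + 2] * a[2]
                + mat4x4[vecLane * 4 + 3] * a[3];

        // The four lanes write four distinct components of the same float4.
        ((float *)&result[cell * itersPerCell + i])[vecLane] = v;
    }
}
```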

In this scenario you could expect that, for a portion of the time, many of the 32 threads in a block will not be doing useful work, but they will not necessarily be delayed by divergence.

I expect I only need to insert __syncthreads() when the result of one parallel operation must be complete before it can be shared between threads through shared memory.
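The sharing pattern I have in mind might look like this sketch: each of the 4 vector lanes computes one component of a into shared memory, the threads synchronize, then each lane reads all four components to form one row of the matrix-vector product. The names and the single-block launch are assumptions:

```cuda
#include <cuda_runtime.h>

// Sketch: assumes one block of 32 threads, e.g. matVecShared<<<1, 32>>>(...),
// with b, c, and out holding 8 Vec4s (32 floats) and mat a row-major 4x4 matrix.
__global__ void matVecShared(const float *b, const float *c,
                             const float *mat, float *out)
{
    __shared__ float a[8][4];        // one Vec4 per group of 4 threads

    int group   = threadIdx.x / 4;   // 0..7: which loop lane
    int vecLane = threadIdx.x % 4;   // 0..3: which component (x,y,z,w)

    // Each lane computes its own component of a = b + c.
    a[group][vecLane] = b[group * 4 + vecLane] + c[group * 4 + vecLane];

    // Make every lane's component visible before any lane reads them all.
    // (Within a single 32-thread warp, __syncwarp() would also suffice.)
    __syncthreads();

    // v = mat * a: each lane computes one row's dot product.
    float v = 0.0f;
    for (int k = 0; k < 4; ++k)
        v += mat[vecLane * 4 + k] * a[group][k];

    out[group * 4 + vecLane] = v;
}
```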

My question is: is it reasonable to run extra threads just to parallelize small loops or per-component vector operations?