Vectorizing image processing in OpenCL on NVIDIA

Hello,

I’m writing a kernel that reads float4 pixels using read_imagef(…).

As a result, each work-item performs several scalar operations per pixel (one per channel) that could be processed in parallel.

I understand that if the workgroup is large enough, the throughput would be similar even if the float4 operations are done sequentially.
However, if the workgroup needs to be small (i.e. only a small number of pixels can be in the same workgroup), there would be a significant performance degradation unless each float4 operation is spread across 4 threads.

Is there an option to force the compiler to run each logical thread on 4 physical work-items?
That way, with a workgroup size of 16, I would get the occupancy that I would otherwise get with a workgroup size of 64.

If not, would it be worth doing that optimization explicitly? (i.e. read each float4 into local memory and process each channel with a different work-item)
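To make the explicit variant concrete, here is a minimal sketch of the index mapping it implies, written as a plain C emulation of one workgroup rather than as a real kernel. The names `run_workgroup`, `process_channel` and `tile` are hypothetical and only serve the illustration. In an actual OpenCL kernel the loop variable would presumably be get_local_id(0), and the tile would be a __local array filled from the image and followed by a barrier.

```c
#include <assert.h>
#include <stddef.h>

#define CHANNELS 4  /* components of a float4 pixel */

/* Hypothetical per-channel operation, used only for illustration. */
static float process_channel(float v) { return v * 0.5f + 1.0f; }

/*
 * CPU emulation of one workgroup in which every work-item handles one
 * channel of one pixel. Assumption: in a real kernel, `lid` would come
 * from get_local_id(0) and `tile` would be a __local float array filled
 * from the image, followed by barrier(CLK_LOCAL_MEM_FENCE).
 */
static void run_workgroup(float *tile, size_t pixels)
{
    assert(tile != NULL);
    size_t work_items = pixels * CHANNELS;  /* 16 pixels -> 64 work-items */
    for (size_t lid = 0; lid < work_items; ++lid) {
        size_t pixel   = lid / CHANNELS;    /* which pixel this item serves */
        size_t channel = lid % CHANNELS;    /* 0..3 -> x, y, z, w */
        size_t idx     = pixel * CHANNELS + channel;
        tile[idx] = process_channel(tile[idx]);
    }
}
```

With 16 pixels this mapping yields 64 work-items, which is the occupancy asked about above. Whether the extra local-memory traffic and the barrier pay off would have to be measured.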

Any idea?
Thanks in advance!

Yoav

float4 operations are done within a single work-item. There is no way around that because of register allocation, synchronization and visibility. A workgroup size of 16 is way too small. It’s half a warp, so you are always wasting at least half of the clock cycles on threads that are doing nothing. Even 64 is very small unless you use a lot of ILP (instruction-level parallelism). With anything less, you are on the wrong platform. 256 is more appropriate for most kernels.
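
The half-a-warp arithmetic can be checked with a small helper. This is only a sketch: it assumes a warp of 32 lanes (the usual figure for NVIDIA hardware), the function names are made up for the example, and it counts idle lanes only, not occupancy limits from registers or local memory.

```c
#include <assert.h>

/* Assumption: 32 lanes per warp, as on NVIDIA GPUs. */
#define WARP_SIZE 32

/* Number of warps the hardware must schedule for one workgroup. */
static unsigned warps_needed(unsigned wg_size)
{
    assert(wg_size > 0);
    return (wg_size + WARP_SIZE - 1) / WARP_SIZE;  /* round up */
}

/* Fraction of scheduled lanes that carry a real work-item. */
static double lane_utilization(unsigned wg_size)
{
    return (double)wg_size / (double)(warps_needed(wg_size) * WARP_SIZE);
}
```

For a workgroup of 16 this gives 0.5, i.e. half the lanes of the single warp are idle, while 64 and 256 fill their warps completely.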