I’m writing a kernel that reads float4 pixels using read_imagef(…).
As a result, each thread does many scalar operations that could be processed in parallel.
I understand that if the workgroup is large enough, then the throughput would be similar even if float4 operations are done sequentially.
However, if the workgroup needs to be small (i.e. only a small number of pixels can be in the same workgroup), there would be a significant performance degradation unless each float4 operation is spread across 4 threads.
Is there an option to force the compiler to run each logical thread on 4 physical work-items?
That way, with a workgroup size of 16, I could get the occupancy that I would otherwise get with a workgroup size of 64.
If not, would it be worth doing that optimization explicitly? (i.e. read each float4 into local memory and process each channel with a different work-item)
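To make the explicit variant concrete, here is a rough sketch of the indexing I have in mind. It is a plain-C emulation of a single workgroup, not real OpenCL code: the loops over lid stand in for the work-items, the local_px array stands in for __local memory, and the names run_group, process_channel and PIXELS_PER_GROUP are made up for illustration. The per-channel math is just a placeholder.

```c
#include <assert.h>

#define CHANNELS 4            /* components of one float4 pixel */
#define PIXELS_PER_GROUP 16   /* logical threads (pixels) per workgroup */
#define GROUP_SIZE (PIXELS_PER_GROUP * CHANNELS) /* 64 physical work-items */

/* Placeholder for the real per-channel math (hypothetical). */
float process_channel(float v)
{
    return v * 2.0f + 1.0f;
}

/*
 * Emulates one workgroup. `in` and `out` hold n_pixels float4 pixels,
 * stored as n_pixels * CHANNELS consecutive floats.
 */
void run_group(const float *in, float *out, int n_pixels)
{
    float local_px[GROUP_SIZE]; /* stands in for __local memory */
    int lid;

    assert(n_pixels >= 0 && n_pixels <= PIXELS_PER_GROUP);

    /* Phase 1: lane 0 of each pixel loads the whole float4
     * (in a kernel this would be the image read) into local memory. */
    for (lid = 0; lid < n_pixels * CHANNELS; ++lid) {
        int pixel = lid / CHANNELS; /* which logical thread */
        int lane = lid % CHANNELS;  /* which channel of that pixel */
        if (lane == 0) {
            int c;
            for (c = 0; c < CHANNELS; ++c)
                local_px[pixel * CHANNELS + c] = in[pixel * CHANNELS + c];
        }
    }

    /* A real kernel would need barrier(CLK_LOCAL_MEM_FENCE) here. */

    /* Phase 2: every work-item processes exactly one channel. */
    for (lid = 0; lid < n_pixels * CHANNELS; ++lid)
        out[lid] = process_channel(local_px[lid]);
}
```

So 16 pixels would occupy 64 work-items, at the cost of one local-memory round trip and a barrier per pixel load.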
Thanks in advance!