Thread Block with single thread

I have an algorithm that uses one thread per column of an image. The branching depends on the data in each column, so I expect high divergence. For this reason, I am wondering whether it is better to launch one thread block per column with a single thread each (leaving the other 31 lanes of the warp idle), or whether it would still be better to pack one thread per column into full warps and pay the divergence cost.

What performance did you measure when you tried these two alternatives? I fail to see the point of asking third parties for thought experiments, when it is possible to just try it either way, using the actual code.

With just a vague description it is not clear how the code actually works in context, but one possible perspective is this: with thread divergence, the worst case is that only one thread per warp runs at any given time. With single-thread blocks, that worst case is guaranteed: exactly one thread per warp runs at any given time, always. So it would seem better to take the chance of divergence, since any code path that threads in a warp happen to share is executed for all of them at once.
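To make that argument a bit more concrete, here is a rough cost model. It is my own simplification, not something measured: it counts only instruction issue time and ignores memory latency, occupancy limits, and scheduling. Let $t_i$ be the issue time thread $i$ of a warp would need if it ran alone, let $S$ be the set of distinct code segments executed by at least one of the warp's 32 threads, and let $t_s$ be the cost of segment $s$.

$$
T_{\text{divergent warp}} \;=\; \sum_{s \in S} t_s \;\le\; \sum_{i=1}^{32} t_i \;=\; T_{\text{32 single-thread warps}}
$$

Equality holds only if no two threads share any segment at all. The right-hand side is total issue time summed over 32 separate warps, not wall-clock time, so this is only a statement about how much work the hardware has to issue, not a prediction of runtime.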

You should probably rethink the algorithm design top to bottom to avoid these alternatives, as neither of them is an attractive choice.

I would write code for both and then run a profiler (e.g., the NVIDIA Visual Profiler, NVVP) to see whether they run at different speeds.
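For what it's worth, a minimal sketch of such a comparison could look like the following. Everything about the workload is an assumption on my part: `processColumn` is a hypothetical stand-in for the real per-column work, the image size and the block size of 128 are arbitrary, and error checking is left out for brevity. The branch here is also trivial enough that the compiler may predicate it away, so the real code is what should be timed.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real per-column work: which branch is
// taken depends on the pixel data, so threads in a warp can diverge.
__device__ int processColumn(const unsigned char* img, int width, int height, int col)
{
    int acc = 0;
    for (int row = 0; row < height; ++row) {
        unsigned char v = img[row * width + col];
        if (v & 1) acc += v;   // data-dependent branch
        else       acc -= 1;
    }
    return acc;
}

// Variant A: one thread per column, columns packed into full blocks.
__global__ void threadPerColumn(const unsigned char* img, int width, int height, int* out)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < width) out[col] = processColumn(img, width, height, col);
}

// Variant B: one block per column, intended to be launched with 1 thread per block.
__global__ void blockPerColumn(const unsigned char* img, int width, int height, int* out)
{
    int col = blockIdx.x;
    if (col < width) out[col] = processColumn(img, width, height, col);
}

int main()
{
    const int width = 4096, height = 2048;   // arbitrary test size
    const int threads = 128;                 // arbitrary block size for variant A
    const int blocksA = (width + threads - 1) / threads;

    // Fill the image with pseudo-random bytes so the branch is not uniform.
    std::vector<unsigned char> h(size_t(width) * height);
    for (size_t i = 0; i < h.size(); ++i)
        h[i] = (unsigned char)((i * 2654435761u) >> 24);

    unsigned char* dImg = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dImg, h.size());
    cudaMalloc(&dOut, width * sizeof(int));
    cudaMemcpy(dImg, h.data(), h.size(), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msA = 0.0f, msB = 0.0f;

    // Warm-up launch so one-time initialization cost is not included in the timing.
    threadPerColumn<<<blocksA, threads>>>(dImg, width, height, dOut);
    cudaDeviceSynchronize();

    cudaEventRecord(start);
    threadPerColumn<<<blocksA, threads>>>(dImg, width, height, dOut);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msA, start, stop);

    cudaEventRecord(start);
    blockPerColumn<<<width, 1>>>(dImg, width, height, dOut);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msB, start, stop);

    printf("thread per column: %.3f ms\n", msA);
    printf("block  per column: %.3f ms\n", msB);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dImg);
    cudaFree(dOut);
    return 0;
}
```

A single timed launch is noisy, so in practice you would repeat each launch several times and average, and then look at the profiler output to see where the time actually goes.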

That said, finding a different algorithm that avoids the divergence would be better.

Could you instead allocate a warp per column?
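In case it helps, here is a rough sketch of what a warp per column might look like. It rests on assumptions the original post does not confirm: that the per-column work can be split across rows and combined with a sum, that `blockDim.x` is a multiple of 32, and that the toolkit is CUDA 9 or later (for `__shfl_down_sync`). The kernel name and the per-pixel computation are made up for illustration.

```cuda
#include <cuda_runtime.h>

// One warp per column: the 32 lanes stride over the rows of the same column,
// then combine their partial results with a warp-level reduction.
// Assumes blockDim.x is a multiple of 32.
__global__ void warpPerColumn(const unsigned char* img, int width, int height, int* out)
{
    const int lane          = threadIdx.x % 32;
    const int warpInBlock   = threadIdx.x / 32;
    const int warpsPerBlock = blockDim.x / 32;
    const int col           = blockIdx.x * warpsPerBlock + warpInBlock;

    // col is the same for every lane of a warp, so the whole warp exits together.
    if (col >= width) return;

    // Each lane handles rows lane, lane + 32, lane + 64, ...
    int partial = 0;
    for (int row = lane; row < height; row += 32) {
        unsigned char v = img[row * width + col];
        partial += (v & 1) ? v : -1;   // hypothetical per-pixel work
    }

    // Tree reduction across the warp; lane 0 ends up with the column total.
    for (int offset = 16; offset > 0; offset >>= 1)
        partial += __shfl_down_sync(0xffffffffu, partial, offset);

    if (lane == 0) out[col] = partial;
}

// Example launch with 4 warps (128 threads) per block:
//   warpPerColumn<<<(width + 3) / 4, 128>>>(dImg, width, height, dOut);
```

With this layout all lanes of a warp work on the same column, so a decision that depends on the column as a whole is uniform across the warp. The trade-off is that with a row-major image the lanes read addresses `width` bytes apart, which is not coalesced, so whether this wins would again have to be measured.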