I am trying to understand if my problem could benefit from CUDA.
The problem does not map well onto a SIMD formulation. I must execute the same code on hundreds or thousands of different sets of data, but the code contains several branching instructions. If I understand correctly, when two threads in the same block take different branches, they are no longer executed in parallel but serialized, so the benefit would be lost.
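To make the concern concrete, here is a minimal sketch of the kind of kernel I have in mind. The kernel name `processItems`, the three paths, and the input values are all placeholders invented for illustration, not my real code. The input is chosen so that neighbouring threads take different paths, which is the case I am worried about.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-item work: the path taken depends on the data itself,
// so neighbouring threads may follow different branches.
__global__ void processItems(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    float x = in[i];
    if (x < 0.0f) {
        out[i] = -2.0f * x;   // path A (placeholder work)
    } else if (x < 1.0f) {
        out[i] = x * x;       // path B (placeholder work)
    } else {
        out[i] = sqrtf(x);    // path C (placeholder work)
    }
}

int main()
{
    const int n = 1024;
    const float pattern[3] = { -1.0f, 0.5f, 4.0f };

    // Cycle through three values so adjacent threads take paths A, B, C.
    std::vector<float> hIn(n), hOut(n);
    for (int i = 0; i < n; ++i) hIn[i] = pattern[i % 3];

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256;
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    processItems<<<blocks, threadsPerBlock>>>(dIn, dOut, n);

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 6; ++i)
        std::printf("in=%g out=%g\n", hIn[i], hOut[i]);

    cudaFree(dIn);
    cudaFree(dOut);
    return 0;
}
```

In my real problem each path is much longer than a single expression, which is why I am concerned about the cost of serializing them.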
What I am missing is: how many different “instruction units” are there on, say, a Tesla G870? In other words, how many threads taking different branches can be executed in parallel?
Thank you in advance.