Can CUDA be useful for me?

I am trying to understand if my problem could benefit from CUDA.

The problem is not very suitable for a SIMD formulation. I must execute the same code on hundreds/thousands of different sets of data, but the code contains several branching instructions. If I understand correctly, when two threads in the same block take different branches, they are no longer executed in parallel but serialized, so the benefit would be lost.

The point I am missing is: how many different “instruction units” are there on, say, a Tesla G870? Or, to put it differently: how many threads taking different branches can be executed in parallel?

Thank you in advance.

Branch divergence is typically not as much of a performance problem as some people imagine.

Threads within a warp (32 threads) that diverge are serialized, but the hardware takes care of this and it is transparent from the programmer's point of view.

I would recommend implementing your algorithm and finding out!

So, it’s a little bit better than that. There is one instruction decoder per multiprocessor, and a single Tesla card (I think you mean the C870) has 16 multiprocessors. However, the scheduling unit on the card is the warp, which on current devices is 32 threads. All threads in a warp must be running the same instruction, or the warp will be serialized into several passes with no-op slots for the threads not taking the current branch. The hardware is also very good about rejoining warps after branch points when possible. (Not sure how they pull that off…) Different warps in the same block can be running different instructions with no penalty.
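To make the warp-level behavior concrete, here is a minimal sketch (hypothetical kernels, written just for illustration): both kernels do the same arithmetic, but the first branches on the thread index, so even and odd lanes of the same 32-thread warp take different paths and the warp executes both branches one after the other. The second branches on the warp index, so all 32 lanes of any given warp take the same path and no serialization occurs.

```cuda
// Illustrative sketch, not a benchmark.
__global__ void divergent(float *out)
{
    int i = threadIdx.x;
    if (i % 2 == 0)           // even/odd lanes sit in the same warp:
        out[i] = i * 2.0f;    // the warp runs both branches serially,
    else                      // with half the lanes masked off in
        out[i] = i * 3.0f;    // each pass
}

__global__ void warp_uniform(float *out)
{
    int i = threadIdx.x;
    if ((i / 32) % 2 == 0)    // all 32 lanes of a warp take the same
        out[i] = i * 2.0f;    // side of the branch, so there is no
    else                      // divergence penalty
        out[i] = i * 3.0f;
}
```

Both kernels produce identical per-thread work; only the mapping of the branch condition onto warps differs, which is exactly what determines whether serialization happens.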

There is only a performance loss when not branching on warp boundaries. Even then, it depends on how much of your kernel is spent in the branching portion. A little branching can be quite acceptable if the bulk of your processing time is spent in non-branching code.
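As a sketch of what "branching on warp boundaries" means in practice (a hypothetical kernel, assumed names): if every branch condition changes value only at a multiple of 32 threads, each warp sees a uniform condition and never diverges, even though different warps take different paths.

```cuda
// Illustrative sketch: the branch boundary (64) is a multiple of the
// warp size (32), so the condition is uniform within every warp.
__global__ void warp_aligned(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;       // bounds check, uniform per warp when
                              // n is a multiple of 32

    // Threads 0..63 of each block take one path, 64..127 the other;
    // both groups are whole warps, so no warp executes both branches.
    if ((threadIdx.x / 64) == 0)
        out[i] = in[i] + 1.0f;
    else
        out[i] = in[i] - 1.0f;
}
```

The same arithmetic with a boundary of, say, 50 instead of 64 would split one warp across both branches and serialize it.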

Thank you very much for your replies.

So with 16 multiprocessors/instruction decoders I should always have at least 16 threads executing in parallel, no matter where they are branching to. Good!

I’ll have a try with a Tesla C870.

Best regards.

Well, at least 1 thread per multiprocessor, which means those 16 threads will not be from the same block.