Can CUDA be useful for me?

bite · June 12, 2008, 4:04pm

I am trying to understand if my problem could benefit from CUDA.

The problem is not very suitable for a SIMD formulation. I must execute the same code on hundreds or thousands of different data sets, but the code contains several branching instructions. If I understand correctly, when two threads in the same block take different branches, they are no longer executed in parallel but serialized, so the benefit would be lost.

The point I am missing is: how many different “instruction units” are there, say, on a Tesla G870? Or, to put it differently: how many threads taking different branches can be executed in parallel?

Thank you in advance.

Simon_Green · June 12, 2008, 5:01pm

Branch divergence is typically not as much of a performance problem as some people imagine.

Threads within a warp (32 threads) that diverge are serialized, but the hardware takes care of this, and it is transparent from the programmer's point of view.

I would recommend implementing your algorithm and finding out!

seibert · June 12, 2008, 5:03pm

So, it’s a little bit better than that. There is one instruction decoder per multiprocessor and a single Tesla card (I think you mean the C870) has 16 multiprocessors. However, the scheduling unit on the card is the warp, which on current devices is 32 threads. All threads in a warp must be running the same instruction, or the warp will be segmented into several warps with no-op slots for some of the threads. The hardware is also very good about rejoining warps after branch points when possible. (Not sure how they pull that off…) Different warps in the same block can be running different instructions with no penalty.
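To make the warp-boundary point concrete, here is a minimal sketch of two kernels that do the same amount of branching but group it differently. Everything in it (`pathA`, `pathB`, `warpTakesA`, `threadTakesA`, the kernel names, the sizes) is made up for illustration, and it assumes 1D blocks whose size is a multiple of the warp size. In the first kernel all threads of a warp agree on the branch; in the second, neighbouring threads disagree, so every warp has to run both paths.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-ins for two different pieces of real work.
__device__ float pathA(float x) {
    for (int i = 0; i < 64; ++i) x = x * 1.0001f + 0.5f;
    return x;
}
__device__ float pathB(float x) {
    for (int i = 0; i < 64; ++i) x = x * 0.9999f - 0.5f;
    return x;
}

// Same answer for all threads of a warp: even-numbered warps take path A.
__host__ __device__ bool warpTakesA(int threadInBlock, int warpSz) {
    return (threadInBlock / warpSz) % 2 == 0;
}

// Alternates between neighbouring threads, so every warp is split in two.
__host__ __device__ bool threadTakesA(int threadInBlock) {
    return threadInBlock % 2 == 0;
}

// Branch decided per warp: no warp diverges.
__global__ void branchPerWarp(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (warpTakesA(threadIdx.x, warpSize)) data[i] = pathA(data[i]);
    else                                   data[i] = pathB(data[i]);
}

// Branch decided per thread: each warp executes path A, then path B.
__global__ void branchPerThread(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (threadTakesA(threadIdx.x)) data[i] = pathA(data[i]);
    else                           data[i] = pathB(data[i]);
}

int main() {
    const int n = 1 << 20;      // assumption: a multiple of the block size
    const int threads = 256;    // assumption: a multiple of the warp size
    const int blocks = (n + threads - 1) / threads;

    float *d = nullptr;
    if (cudaMalloc((void **)&d, n * sizeof(float)) != cudaSuccess) {
        std::printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d, 0, n * sizeof(float));

    branchPerWarp<<<blocks, threads>>>(d, n);
    branchPerThread<<<blocks, threads>>>(d, n);

    // Wait for both kernels and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    std::printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

Timing the two launches (with the profiler or CUDA events) should show what divergence costs for a given path length. With very short branch bodies the compiler may use predication instead of real branches, in which case the difference can vanish.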

Performance is lost only when the branch does not fall on warp boundaries, that is, when threads within the same warp take different paths. Even then, it depends on how much of your kernel's time is spent in the branching portion. A little branching can be quite acceptable if the bulk of your processing time is spent in non-branching code.
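As a rough back-of-the-envelope model (a simplification, not a hardware specification): suppose a fraction $f$ of the non-divergent run time is spent in a region where each warp splits into $k$ paths of roughly equal length, which are then executed one after another. Ignoring latency hiding and the details of how warps rejoin:

```latex
\frac{T_{\text{divergent}}}{T_{\text{uniform}}} \approx (1 - f) + k\,f
```

For example, $f = 0.1$ and $k = 2$ gives about 1.1, a 10% slowdown. With $f = 0.1$ and $k = 32$ (every thread of a warp on its own path) the estimate is about 4.1.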

bite · June 12, 2008, 7:43pm

Thank you very much for your replies.

So with 16 multiprocessors (and therefore 16 instruction decoders), I should always have at least 16 threads executing in parallel, no matter where they branch to. Good!

I’ll have a try with a Tesla C870.

Best regards.

E.D_Riedijk · June 12, 2008, 8:43pm

Well, at least 1 thread per multiprocessor, which means those 16 threads will not be from the same block.
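For anyone checking these numbers on their own card: the multiprocessor count and warp size mentioned in this thread can be read from the CUDA runtime rather than assumed. A minimal sketch, assuming device 0 is the card of interest; `warpsPerBlock` is a hypothetical helper added only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of warps a block of the given size is split into.
int warpsPerBlock(int threadsPerBlock, int warpSz) {
    return (threadsPerBlock + warpSz - 1) / warpSz;
}

int main() {
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess || deviceCount == 0) {
        std::printf("No CUDA device found\n");
        return 1;
    }

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {  // assumption: device 0
        std::printf("Could not query device 0\n");
        return 1;
    }

    // One instruction decoder per multiprocessor, per the discussion above.
    std::printf("%s: %d multiprocessors, warp size %d\n",
                prop.name, prop.multiProcessorCount, prop.warpSize);
    std::printf("A 256-thread block is split into %d warps\n",
                warpsPerBlock(256, prop.warpSize));
    return 0;
}
```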
