Parallel thread processing in a warp

Hello guys,

I am new to CUDA programming and had a question regarding the parallel execution of threads in a warp. As we all know, a warp has 32 threads that are processed concurrently on a multiprocessor. But a multiprocessor has only 8 scalar processors, so I am not able to comprehend how 32 threads can be processed in parallel using only 8 processing elements.

Any input on this is greatly appreciated.


I have a single-core CPU machine… but it runs multiple applications simultaneously… How??

The same time-slicing logic holds here…

To be more precise, all 32 threads in a WARP execute the same instruction at the same time (provided there is no divergence). So, each processing element executes an instruction on behalf of 4 threads…

My understanding:

Each MP executes 8 threads concurrently, but it performs every instruction 4 times, which is why you have 32 threads that must perform the same operation (a ‘warp’).

This is pretty much how it works.
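To make that concrete, here is a minimal sketch (the kernel name and launch configuration are made up for illustration). One block of exactly 32 threads forms a single warp; the hardware issues each warp instruction over 4 clock cycles, 8 threads per cycle on the 8 scalar processors, which is invisible to the programmer:

```cuda
// Minimal sketch: one block of exactly one warp (32 threads).
// All 32 threads execute each instruction in lockstep (SIMT);
// on hardware with 8 scalar processors per multiprocessor, the
// single warp instruction is pipelined over 4 cycles, 8 threads
// at a time.
__global__ void lockstep(float *out)
{
    int tid = threadIdx.x;      // 0..31, one warp
    out[tid] = tid * 2.0f;      // same instruction for all 32 threads
}

// host side:
// lockstep<<<1, 32>>>(d_out);   // one block, one warp
```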

I have the same opinion.

I also ran an experiment on the effect of branching.

But it showed me something that doesn’t match my thinking.

When the threads in a warp take different branches, they should be executed serially.

I launched only one thread block, with 32 threads inside.

Branch_test_kernel3 is the kernel without a branch, and Branch_test_kernel4 is the one with branches, where each thread follows its own path.

And the result is:

Kernel 3 start computing…

Kernel 3 Computing finish!


Kernel 3 Average processing time: 1.647360 (ms)

Kernel 4 start computing…

Kernel 4 Computing finish!


Kernel 4 Average processing time: 1.638650 (ms)

The times the two kernels take are almost the same.

Did I make a mistake, or is there something I am overlooking?

Attached is the test code: the kernel and main. (1.94 KB) (944 Bytes)

I have done a similar test with an fmul/add benchmark, but with one half-warp (16 threads) following one branch and the other half following another branch: same instructions, different constants.
It wasn’t optimized away by the compiler, and the execution time was twice the unified (branch-free) computation time.
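A divergence test along those lines might look like this (a minimal sketch, not the attached code; the kernel name, constants, and loop count are made up). Because the two halves of the warp run the same instruction mix with different constants, the compiler cannot merge the branches, and the hardware must serialize the two paths:

```cuda
// Hypothetical divergence microbenchmark (not the attached code).
// Half the warp takes one branch, half the other; same arithmetic,
// different constants, so the two paths cannot be unified.
__global__ void branch_test(float *out, int iters)
{
    int tid = threadIdx.x;
    float v = (float)tid;
    if (tid < 16) {                       // first half-warp
        for (int i = 0; i < iters; ++i)
            v = v * 1.0001f + 0.5f;
    } else {                              // second half-warp
        for (int i = 0; i < iters; ++i)
            v = v * 1.0002f + 0.25f;
    }
    out[tid] = v;   // final write keeps the work from being optimized away
}
```

If the kernel is compute-bound, the divergent version should take roughly twice as long as a branch-free one doing the same amount of arithmetic.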

Your code seems to be global-memory-bound instead of GPU-bound, and as each global-memory access costs hundreds of cycles, with a write-back strategy the GPU will stall on read-after-write conditions.

You may:

  • check the generated PTX code
  • indicate what your CUDA device is (GPU / compute capability)
  • use registers to avoid global memory writes (with a final write to prevent “smart” optimization from removing all the processing!)
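For the last point, the idea is to keep the working value in a register for the whole loop and touch global memory only once at the end (a sketch; the kernel name and constants are placeholders):

```cuda
// Sketch: accumulate in a register, write to global memory once.
__global__ void reg_accumulate(float *out, int iters)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    float acc = (float)tid;            // lives in a register
    for (int i = 0; i < iters; ++i)
        acc = acc * 0.999f + 1.0f;     // all work stays on-chip

    out[tid] = acc;  // single global write; also stops the compiler
                     // from eliminating the loop as dead code
}
```

This way the timing measures arithmetic throughput rather than global-memory latency, so the cost of divergence becomes visible.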