Parallel thread processing in a warp

Hello guys,

I am new to CUDA programming and have a question about the parallel execution of threads in a warp. As we all know, a warp has 32 threads, which are processed concurrently on a multiprocessor. But a multiprocessor has only 8 scalar processors, so I cannot see how 32 threads can be processed in parallel using only 8 processing elements.

Any input on this is greatly appreciated.

Thanks
Randal

I have a single-core CPU machine… but it runs multiple applications simultaneously… How??

The same time-slicing logic applies here…

To be more precise, all 32 threads in a warp execute the same instruction at the same time (provided there is no divergence). So each processing element executes the instruction on behalf of 4 threads…
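To put numbers on it: on compute-capability-1.x hardware, one warp instruction is issued across the 8 SPs over 4 clock cycles:

32 threads per warp / 8 SPs per MP = 4 clock cycles per warp instruction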

My understanding:

Each MP executes 8 threads concurrently, but it issues every instruction 4 times; that is why you have 32 threads that must perform the same operation (a ‘warp’).

This is pretty much how it works.

I have the same opinion.

I also ran an experiment on the effect of branching.

But it shows me something that doesn’t match my expectation.

When the threads in a warp take different branches, they should be executed serially.

I launch only one thread block, with 32 threads inside.

Branch_test_kernel3 is the kernel without branches, and Branch_test_kernel4 is the one with branches, where each thread follows its own path.
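For reference, the launch is just one block of one warp. Assuming the kernels take a single output pointer (the actual signatures are in the attached files, and d_out is a hypothetical device buffer), it would look like:

// One block of 32 threads = a single warp.
Branch_test_kernel3<<<1, 32>>>(d_out);
Branch_test_kernel4<<<1, 32>>>(d_out);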

The result is:

Kernel 3 start computing…

Kernel 3 Computing finish!

Success!

Kernel 3 Average processing time: 1.647360 (ms)

Kernel 4 start computing…

Kernel 4 Computing finish!

Success!

Kernel 4 Average processing time: 1.638650 (ms)

The times the two kernels take are almost the same.

Did I make a mistake, or is there something I am overlooking?

Attached is the test code: the kernel and main.
Branch_test.cu.tar.gz (1.94 KB)
Branch_test_kernel.cu.tar.gz (944 Bytes)

I have done a similar test with an fmuladd benchmark, but with one half-warp (16 threads) following one branch and the other half following another branch: same instructions, different constants.
The branch wasn’t optimized away by the compiler, and the execution time was twice the unified computation time (no branch).
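A minimal sketch of that kind of half-warp test, assuming a register-resident fmuladd loop (the names and constants here are mine, not the original benchmark):

// Half-warp divergence sketch: threads 0-15 take one branch,
// threads 16-31 the other; same instructions, different constants.
// The value stays in a register, so the loop is compute-bound.
__global__ void half_warp_branch_sketch(float *out)
{
    int tid = threadIdx.x;
    float v = 1.0f;                      // kept in a register
    if (tid < 16) {
        for (int i = 0; i < 10000; ++i)
            v = v * 1.0001f + 0.5f;      // branch A: fused multiply-add
    } else {
        for (int i = 0; i < 10000; ++i)
            v = v * 0.9999f + 0.25f;     // branch B: same ops, different constants
    }
    out[tid] = v;  // single final write keeps the compiler from deleting the loop
}

With divergence like this, the warp executes branch A and then branch B one after the other, so the kernel takes roughly twice as long as a no-branch version.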

Your code seems to be global-memory-bound rather than compute-bound, and since each global-memory access costs hundreds of cycles, with a write-back strategy the GPU will stall on read-after-write dependencies.

You may:

  • check the generated PTX code (see the command below)
  • indicate what your CUDA device is (GPU model / compute capability)
  • use registers to avoid global-memory writes, with a single final write so that a “smart” optimization does not remove all the processing (as in the half-warp sketch above)
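For the first point, the PTX can be dumped with nvcc’s standard -ptx flag, e.g.:

nvcc -ptx Branch_test_kernel.cu -o Branch_test_kernel.ptx

Then look at the global loads/stores (ld.global / st.global) and the branches (bra) to see what the compiler actually generated.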