CUDA execution mapping onto GPUs


I have some trouble understanding how the execution of kernels maps onto the real hardware.

1. Execution (warps)
“The multiprocessor maps each thread to one scalar processor core, and each scalar thread executes independently with its own instruction address and register state.” (CUDA Programming Guide)

That’s fine, it is clear.

“The multiprocessor SIMT unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.”

Here is where my problem starts: an SM contains only 8 SPs (a TPC contains at most 24 SPs), and each SP contains two ALUs and an FPU. Yet a warp’s 32 threads are supposed to run fully in parallel, and they must share the same memory area, so a single warp cannot possibly be spread across different SMs.

I’ve read in several docs that a single SP executes a whole warp (all 32 threads), which sounds even more impossible to me: the SPs of an SM work fully in parallel, so if different SPs in the same SM were to run different warps, they would have to execute different code, which is impossible, since the control logic of these SPs is shared.

I made some measurements, and warps do seem to run fully in parallel, while blocks work “independently” of each other. Each multiprocessor executed 64 tasks (~two warps) at the same time, which matches the specification (512 threads per block or per SM).
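To make my assumption concrete, here is how I currently picture the SM time-multiplexing one warp over its 8 SPs (a Python sketch; the “quarter-warp” grouping of 8 threads per clock is my guess, not something stated in the docs):

```python
# Sketch: how a 32-thread warp might be issued over 8 SPs in 4 clocks.
# Assumption (mine, not from the docs): threads are issued in groups
# of 8, one thread per SP per clock.

WARP_SIZE = 32
SPS_PER_SM = 8

def issue_schedule(warp_size=WARP_SIZE, sps=SPS_PER_SM):
    """Return per-clock groups: schedule[clock][sp] = thread id issued."""
    clocks = warp_size // sps
    return [[clock * sps + sp for sp in range(sps)] for clock in range(clocks)]

schedule = issue_schedule()
# 4 clocks are needed to issue one instruction for the whole warp:
assert len(schedule) == 4
# Clock 0 issues threads 0..7, clock 3 issues threads 24..31:
assert schedule[0] == list(range(8))
assert schedule[3] == list(range(24, 32))
```

Under this model the warp still behaves as one SIMT unit (one instruction stream), even though only 8 threads touch the SPs in any given clock.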

2. Pipeline
I can imagine that several data items sit in an ALU “in parallel” (one after another, i.e. pipelined), but it is easier to imagine that only thread switching is performed between blocks (or among the threads of a warp?). I’d be curious how this actually works in NVIDIA GPUs.

Maybe the 4 GPU clocks/instruction figure is the result (or the reason?) of executing 4 threads per SP fully in parallel (4 threads × 8 SPs = 32 threads). Is that it?
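Following that arithmetic, here is the related latency-hiding guess I keep coming back to (a sketch; the ~24-clock arithmetic pipeline latency is a figure I have seen quoted for G80-class parts, not something I measured):

```python
# Sketch: hiding pipeline latency by interleaving warps.
# Assumed numbers (not from the docs): ~24 clocks of arithmetic
# pipeline latency, and 4 clocks to issue one warp instruction.

PIPELINE_LATENCY = 24    # clocks (assumed G80-era figure)
CLOCKS_PER_WARP_ISSUE = 4

def warps_to_hide_latency(latency=PIPELINE_LATENCY,
                          issue=CLOCKS_PER_WARP_ISSUE):
    """Warps needed so a new instruction can issue every cycle slot."""
    return -(-latency // issue)   # ceiling division

# 6 resident warps would keep the pipeline full:
assert warps_to_hide_latency() == 6
# That is 6 * 32 = 192 threads per SM, which matches the "keep at
# least ~192 threads resident" advice I have seen.
assert warps_to_hide_latency() * 32 == 192
```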

3. Memory issues
Suppose an SP executes (two blocks × 1 warp), i.e. 64 threads together: is the number of registers limited? Is it possible to use fewer threads but more registers per thread, or more shared memory?
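In other words, I imagine a trade-off like the following (a sketch; the 8192-register file and the 768-thread cap per SM are figures I believe apply to G80/G92-class parts, so treat them as assumptions):

```python
# Sketch: register pressure limiting resident threads per SM.
# Assumed hardware figures for G80/G92-class parts (not verified here):
REGISTERS_PER_SM = 8192     # 32-bit registers in the SM register file
MAX_THREADS_PER_SM = 768    # hardware cap on resident threads

def max_resident_threads(regs_per_thread,
                         reg_file=REGISTERS_PER_SM,
                         hw_limit=MAX_THREADS_PER_SM):
    """Threads that fit, given per-thread register use (granularity ignored)."""
    return min(reg_file // regs_per_thread, hw_limit)

# Few registers per thread: the hardware thread limit dominates.
assert max_resident_threads(10) == 768
# Heavy register use: the register file caps the thread count instead.
assert max_resident_threads(32) == 256
```

If this model is right, then yes: fewer threads per block leave more registers (and shared memory) for each thread.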

This information matters when trying to find an optimal solution. I ran tests on a 16384-element array. The best result came from a block size of 128 (I used a 9800GTX with 112 SPs): about 9-10,000 GPU clocks in total. Sizes of 512 and 32 took somewhat more (about 13,000), but a size of 192, for example, took about 200,000! The per-thread computation time, on the other hand, was much better at a block size of 32.
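For reference, here is how the grid shapes out in my test for different block sizes (a sketch; the 14-SM figure is simply 112 SPs / 8 SPs per SM, and I am assuming one thread per array element):

```python
# Sketch: grid shapes for a 16384-element kernel, one thread per element,
# on a 9800GTX assumed to have 112 SPs / 8 = 14 SMs.

N = 16384
SMS = 14

def grid_stats(block_size, n=N, sms=SMS):
    """Return (blocks, waves of blocks across the SMs, tail elements)."""
    blocks = -(-n // block_size)   # ceiling division
    waves = -(-blocks // sms)      # rounds needed if each SM takes one block
    tail = n % block_size          # elements left over in the last block
    return blocks, waves, tail

# 128 divides 16384 exactly: 128 blocks, no partially filled block.
assert grid_stats(128) == (128, 10, 0)
# 192 does not divide 16384: the last block covers only 64 elements.
assert grid_stats(192) == (86, 7, 64)
```

I note that 192 is the one size in my tests that does not divide 16384 evenly, though I cannot tell from this alone whether that explains the huge slowdown.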

So, my question is: how does the GPU handle warps, and how is parallel execution within the same SM (or SP?) actually achieved?