CUDA execution mapping onto GPUs


I have some trouble understanding how the execution of kernels maps onto the real hardware.

1. Execution (warps)
“The multiprocessor maps each thread to one scalar processor core, and each scalar thread executes independently with its own instruction address and register state.” (CUDA Programming Guide)

That’s fine, it is clear.

“The multiprocessor SIMT unit creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps.”

Here is where my problem starts: an SM contains only 8 SPs (a TPC contains at most 24 SPs), and each SP contains two ALUs and an FPU. Yet a warp’s 32 threads are supposed to run fully in parallel, and they must share the same memory area, so a single warp cannot possibly be spread across different SMs.

I’ve read in several docs that a single SP executes a whole warp (all 32 threads), which sounds even more impossible to me: the SPs of an SM work fully in parallel, so if different SPs in the same SM were to run different warps, they would have to execute different code, which is impossible, since the control logic of these SPs is shared.

I made some measurements, and warps do seem to run fully in parallel, while blocks work “independently” of each other. Each multiprocessor executed 64 tasks (~two warps) at the same time, which matches the specification (512 threads per block or per SM).
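To make my assumption concrete, here is how I currently picture the SM time-multiplexing one warp over its 8 SPs (a Python sketch; the “quarter-warp” grouping of 8 threads per clock is my guess, not something stated in the docs):

```python
# Sketch: how a 32-thread warp might be issued over 8 SPs in 4 clocks.
# Assumption (mine, not from the docs): threads are issued in groups
# of 8, one thread per SP per clock.

WARP_SIZE = 32
SPS_PER_SM = 8

def issue_schedule(warp_size=WARP_SIZE, sps=SPS_PER_SM):
    """Return per-clock groups: schedule[clock][sp] = thread id issued."""
    clocks = warp_size // sps
    return [[clock * sps + sp for sp in range(sps)] for clock in range(clocks)]

schedule = issue_schedule()
# 4 clocks are needed to issue one instruction for the whole warp:
assert len(schedule) == 4
# Clock 0 issues threads 0..7, clock 3 issues threads 24..31:
assert schedule[0] == list(range(8))
assert schedule[3] == list(range(24, 32))
```

Under this model the warp still behaves as one SIMT unit (one instruction stream), even though only 8 threads touch the SPs in any given clock.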

2. Pipeline
I can imagine that several data items sit in an ALU “in parallel” (one after another, i.e. pipelined), but it is easier to imagine that only thread switching is performed between blocks (or among the threads of a warp?). I’d be curious how this actually works in NVIDIA GPUs.

Maybe the 4 GPU clocks/instruction figure is the result (or the reason?) of executing 4 threads per SP fully in parallel (4 threads × 8 SPs = 32 threads). Is that it?
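Following that arithmetic, here is the related latency-hiding guess I keep coming back to (a sketch; the ~24-clock arithmetic pipeline latency is a figure I have seen quoted for G80-class parts, not something I measured):

```python
# Sketch: hiding pipeline latency by interleaving warps.
# Assumed numbers (not from the docs): ~24 clocks of arithmetic
# pipeline latency, and 4 clocks to issue one warp instruction.

PIPELINE_LATENCY = 24    # clocks (assumed G80-era figure)
CLOCKS_PER_WARP_ISSUE = 4

def warps_to_hide_latency(latency=PIPELINE_LATENCY,
                          issue=CLOCKS_PER_WARP_ISSUE):
    """Warps needed so a new instruction can issue every cycle slot."""
    return -(-latency // issue)   # ceiling division

# 6 resident warps would keep the pipeline full:
assert warps_to_hide_latency() == 6
# That is 6 * 32 = 192 threads per SM, which matches the "keep at
# least ~192 threads resident" advice I have seen.
assert warps_to_hide_latency() * 32 == 192
```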

3. Memory issues
Suppose an SP executes (two blocks × 1 warp), i.e. 64 threads together: is the number of registers limited? Is it possible to use fewer threads but more registers per thread, or more shared memory?
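In other words, I imagine a trade-off like the following (a sketch; the 8192-register file and the 768-thread cap per SM are figures I believe apply to G80/G92-class parts, so treat them as assumptions):

```python
# Sketch: register pressure limiting resident threads per SM.
# Assumed hardware figures for G80/G92-class parts (not verified here):
REGISTERS_PER_SM = 8192     # 32-bit registers in the SM register file
MAX_THREADS_PER_SM = 768    # hardware cap on resident threads

def max_resident_threads(regs_per_thread,
                         reg_file=REGISTERS_PER_SM,
                         hw_limit=MAX_THREADS_PER_SM):
    """Threads that fit, given per-thread register use (granularity ignored)."""
    return min(reg_file // regs_per_thread, hw_limit)

# Few registers per thread: the hardware thread limit dominates.
assert max_resident_threads(10) == 768
# Heavy register use: the register file caps the thread count instead.
assert max_resident_threads(32) == 256
```

If this model is right, then yes: fewer threads per block leave more registers (and shared memory) for each thread.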

This information matters when trying to find an optimal solution. I ran tests on a 16384-element array. The best result came from a block size of 128 (I used a 9800GTX with 112 SPs): about 9-10,000 GPU clocks in total. Sizes of 512 and 32 took somewhat more (about 13,000), but a size of 192, for example, took about 200,000! The per-thread computation time, on the other hand, was much better at a block size of 32.
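For reference, here is how the grid shapes out in my test for different block sizes (a sketch; the 14-SM figure is simply 112 SPs / 8 SPs per SM, and I am assuming one thread per array element):

```python
# Sketch: grid shapes for a 16384-element kernel, one thread per element,
# on a 9800GTX assumed to have 112 SPs / 8 = 14 SMs.

N = 16384
SMS = 14

def grid_stats(block_size, n=N, sms=SMS):
    """Return (blocks, waves of blocks across the SMs, tail elements)."""
    blocks = -(-n // block_size)   # ceiling division
    waves = -(-blocks // sms)      # rounds needed if each SM takes one block
    tail = n % block_size          # elements left over in the last block
    return blocks, waves, tail

# 128 divides 16384 exactly: 128 blocks, no partially filled block.
assert grid_stats(128) == (128, 10, 0)
# 192 does not divide 16384: the last block covers only 64 elements.
assert grid_stats(192) == (86, 7, 64)
```

I note that 192 is the one size in my tests that does not divide 16384 evenly, though I cannot tell from this alone whether that explains the huge slowdown.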

So, my question is: how does the GPU handle warps, and how is parallel execution within the same SM (or SP?) actually achieved?