At the start of this thread you said "I’m using a GTS 450".
That is a compute capability 2.1 device.
So the 48 ‘cores’ per SM work as 3 sets of 16 cores. Set ‘A’ processes one instruction for a warp in 2 cycles. Set ‘B’ can process an instruction for a different warp in the same 2 cycles, and set ‘C’ can process a second instruction for one of those warps in the same 2 cycles (if there is an instruction it can do in parallel with what ‘A’ or ‘B’ is doing).
The number of threads each SM can support is a separate property of the GPU, and for a GTS 450 it is 1536 threads (48 warps).
So it could for example have 8 active blocks each of 6 warps (192 threads), or fewer active blocks if they each have more threads per block.
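To make that arithmetic concrete, here is a rough sketch of the resident-block calculation, written in Python only to keep it short. The limits are the ones quoted in this thread for compute capability 2.x (1536 threads and 8 blocks per SM, warps of 32). The function name `resident_blocks` is made up for illustration, and the sketch ignores register and shared memory limits.

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = 1536 // WARP_SIZE   # 48 warps, as quoted above
MAX_BLOCKS_PER_SM = 8                  # block limit quoted in this thread


def resident_blocks(threads_per_block):
    """Blocks that fit on one SM, considering only the warp and block limits."""
    # A partially filled warp still occupies a whole warp slot, so round up.
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    return min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)


for tpb in (192, 384, 512):
    print(tpb, "threads per block ->", resident_blocks(tpb), "resident blocks")
```

With 192 threads per block this gives the 8 blocks of 6 warps mentioned above; larger blocks leave room for fewer of them.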
Don’t worry about trying to write code that runs the maximum possible number of threads per SM. An SM can be kept flat out with just 2 blocks until it needs to read or write global device memory (IO). It is when a block is waiting for IO that having another block ready to run keeps the SM from sitting idle for a microsecond. The number of blocks (and warps) needed to keep an SM running flat out depends on how much IO the kernel has to do, and that depends on the application.
"all my threads in a block (1024) could access to shared memory? "
Having 1024 threads (32 warps) in a block may be a bit high: it means that the SM cannot have a second active block (that would exceed the 48-warp maximum). If your application really needs 1024 threads per block, that’s fine, but if you have only made it that high because it is the maximum allowed, then try making it lower, say 256. NB you may also have to reduce the amount of shared memory used by each block.
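A small sketch of why the shared memory note matters, again in Python just for the arithmetic. It assumes 48 KB of shared memory per SM (the larger of the two Fermi shared memory/L1 configurations) plus the 48-warp and 8-block limits from above. The name `blocks_per_sm` and the per-block sizes are hypothetical, and register usage is ignored.

```python
def blocks_per_sm(threads_per_block, smem_per_block, smem_per_sm=48 * 1024):
    """Resident blocks per SM under warp, block and shared memory limits."""
    warps_per_block = -(-threads_per_block // 32)   # round up to whole warps
    by_warps = 48 // warps_per_block                # 48-warp limit per SM
    # If a block uses no shared memory, only the 8-block limit applies here.
    by_smem = smem_per_sm // smem_per_block if smem_per_block else 8
    return min(8, by_warps, by_smem)


KB = 1024
print(blocks_per_sm(1024, 32 * KB))  # one big block fills most of the SM
print(blocks_per_sm(256, 32 * KB))   # fewer threads, same shared memory
print(blocks_per_sm(256, 8 * KB))    # fewer threads and less shared memory
```

Under these assumptions, dropping from 1024 to 256 threads only lets more blocks become resident once the per-block shared memory comes down as well.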
Remember that this is the Fermi architecture and not G80…
Just think that you have 32 cores as in figure…
Now you have two warp schedulers and, as shown on pg 8, four execution units (2 sets of 16 cores, 1 set of 16 Load/Store Units and 1 set of 4 Special Function Units).
Q: How many groups of threads are running simultaneously per SM?
Recall that a warp is a group of 32 threads, and that it is at warp granularity that the hardware actually dispatches instructions.
That instruction can be an integer, floating-point, load, store, SFU or double-precision instruction.
Depending on the instruction, some of the 4 execution units mentioned above are engaged.
If an instruction dispatched by the warp scheduler has to wait, then the whole warp waits and other warps are given the chance to run.
Now let’s say we luckily have two warps each executing an integer instruction. Since we have a dual warp scheduler, these two warps occupy the 2 sets of 16 cores.
The remaining two execution units on pg 8 then do nothing.
A warp takes two cycles to complete execution, i.e. you can say the 16 cores execute the given integer instruction for the first 16 threads of the warp in the first cycle and for the last 16 threads of the warp in the second cycle.
So you can see that two different half-warps are actually executing on the 2 sets of 16 cores in any given cycle.
So for this case you can say that 32 threads, belonging to two different warps, are executing concurrently on an SM.
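Here is a toy model of that two-cycle picture, in Python. It only illustrates the description above for a 32-core SM (two groups of 16 cores, one warp per group); it is not a model of real hardware timing, and the function name `schedule` is made up.

```python
CORES_PER_GROUP = 16
WARP_SIZE = 32


def schedule(warps):
    """Return, per cycle, a list of (core_group, warp, first_thread, last_thread)."""
    cycles_per_instruction = WARP_SIZE // CORES_PER_GROUP   # 2 cycles per warp
    timeline = []
    for cycle in range(cycles_per_instruction):
        first = cycle * CORES_PER_GROUP
        last = first + CORES_PER_GROUP - 1
        # Each warp is pinned to its own group of 16 cores for both cycles.
        timeline.append([(group, warp, first, last)
                         for group, warp in enumerate(warps)])
    return timeline


timeline = schedule(["warp 0", "warp 1"])
for cycle, row in enumerate(timeline):
    for group, warp, first, last in row:
        print(f"cycle {cycle}: core group {group} runs {warp}, threads {first}-{last}")

# Threads in flight in any one cycle: one half-warp per core group.
threads_per_cycle = len(timeline[0]) * CORES_PER_GROUP
print("threads per cycle:", threads_per_cycle)
```

In each cycle the two core groups hold half-warps from two different warps, which is where the figure of 32 concurrent threads comes from.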
But things differ when we have instructions like double-precision or SFU instructions; check the white paper.
Q: How many threads are running simultaneously in each block (32 threads for a warp, or is it 1024)?
Physically, for an SM in the example above, it’s 32… But as a programmer you have to think of 1024 as the maximum number of threads allowed in a block, which will execute as warps of 32 on the SM. The warps on an SM can belong to different blocks… If some warp has to wait for, say, a load/store instruction, then another available warp jumps in to occupy 16 cores (not 32 cores) and gets executed in two cycles.
As given in the white paper:
“Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs.”
“While programmers can generally ignore warp execution for functional correctness and think of programming one thread, they can greatly improve performance by having threads in a warp execute the same code path and access memory in nearby addresses.”
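To illustrate the "nearby addresses" part of that quote, here is a rough Python sketch that counts how many memory lines one warp touches for a given access stride. It assumes 4-byte elements and 128-byte lines (treat both as assumptions for illustration), and `lines_touched` is a made-up helper, not a CUDA API.

```python
def lines_touched(stride, elem_bytes=4, line_bytes=128, warp_size=32):
    """Distinct memory lines accessed when thread t reads element t * stride."""
    addresses = [t * stride * elem_bytes for t in range(warp_size)]
    return len({addr // line_bytes for addr in addresses})


for stride in (1, 2, 32):
    print(f"stride {stride}: {lines_touched(stride)} line(s) for one warp")
```

In this model, neighbouring threads reading neighbouring elements share a single line, while a stride of 32 elements spreads the same warp over 32 separate lines.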
Q: How are threads distributed across CUDA cores?
I think you can figure it out now :-)
Now your statement: "I’m asking this question because I don’t see the point to have those many CUDA cores per SM (48 ) and just use 8 blocks."
Remember that you can have a max of 8 blocks per SM.
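And the total number of warps those blocks make up must not exceed the SM’s limit of 48 warps (1536 threads), so with the full 8 resident blocks each block can have at most 48/8 = 6 warps (192 threads).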
Note: I was talking in terms of Fermi hardware with 32 cores per SM … to get a better view of hardware with 48 cores … I think this is the best article
Oh, I think I replied late; I actually wrote this reply 4 hours ago and only posted it now … anyway, guys, please read my post above and point out any corrections; it will help improve my understanding too.
Maybe you should explain what you have in mind for your CUDA development, and precisely which generation you are targeting, because between G80 (my venerable GeForce 8800 GTS 320MB) and a new Fermi w/ 3xHalf-warp (such as in the 5xx generation), it’s not the same game, not to mention my GTX 260 (caching but 32 threads per SP bare)…