The warp is split across 2 cycles, 16 threads at a time. Each of the "4 processing blocks with 16 cores each" is referred to as an SMSP (SM sub-partition). Although it answers a question about instruction latency, Greg's answer here may clarify things: his "EXAMPLE 1: 1 Warp per SM Sub-partition" shows the ALU active for two consecutive cycles while processing all 32 threads.
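As a rough illustration of the arithmetic above (a minimal sketch, assuming the 16-FP32-core-per-SMSP layout described here; `coresPerSubPartition = 16` is taken from that description, not queried from the hardware):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Assumption from the discussion above: 16 FP32 cores per SM sub-partition (SMSP),
    // with 4 such sub-partitions per SM. This matches the architecture discussed here,
    // but it is not something the runtime API reports directly.
    const int coresPerSubPartition = 16;

    // A warp is prop.warpSize threads (32 on current GPUs). With only 16 FP32 lanes
    // per SMSP, one warp-wide FP32 instruction is processed over
    // warpSize / 16 = 2 consecutive cycles, as in Greg's EXAMPLE 1.
    printf("warpSize           = %d\n", prop.warpSize);
    printf("cycles per FP32 op = %d (warpSize / %d cores per SMSP)\n",
           prop.warpSize / coresPerSubPartition, coresPerSubPartition);
    printf("SMs on this device = %d\n", prop.multiProcessorCount);
    return 0;
}
```

Note that this only illustrates instruction issue/processing width; it says nothing about latency, which is what Greg's linked answer goes into.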