Hi,
just starting to learn this thing…I’m kinda confused with the warp concept.
So, a kernel contains BLOCKs, and a BLOCK contains THREADs.
Each BLOCK is run on one Multiprocessor, and each THREAD in it is run by a Streamprocessor. (alright so far?)
And then come the ALUs, which operate at twice the clock of the SPs, so we should have multiples of 16 THREADs in each block.
So what’s the deal with WARPs? Why does the hardware group THREADs into WARPs? The guide says each WARP contains the same number of THREADs, but how many exactly? How are they divided?
According to the programming guide, g80 hardware assigns 32 threads to a warp.
Also, kernels don’t really contain blocks. A kernel is just a piece of code to be run on the GPU. Threads are arranged into thread blocks by the programmer (up to 512 threads per block, with 1D, 2D, or 3D numbering of threads). Thread blocks are arranged into a grid, again by the programmer. At run time, thread blocks are scheduled dynamically for execution on multiprocessors. From a multiprocessor’s point of view, a thread block contains warps of threads. A given warp’s threads are processed by the SPs “concurrently.” Warps also come into play when coalescing global memory accesses. See the section on performance in the programming guide for more details.
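To make the block-to-warp split concrete, here is a rough sketch of the arithmetic in plain C++ (host side only, no GPU needed). The names `kWarpSizeG80`, `warps_per_block` and `idle_lanes_in_last_warp` are made up for illustration; the 32-thread warp size is the G80 figure quoted above and may differ on other hardware.

```cpp
// Assumption: warp size of 32 threads, as quoted for G80 above.
constexpr int kWarpSizeG80 = 32;

// Number of warps a block of `threads_per_block` threads is split into.
// A partially filled last warp still counts as a whole warp.
constexpr int warps_per_block(int threads_per_block) {
    return (threads_per_block + kWarpSizeG80 - 1) / kWarpSizeG80;
}

// Thread slots in the last warp that have no thread assigned to them.
constexpr int idle_lanes_in_last_warp(int threads_per_block) {
    return warps_per_block(threads_per_block) * kWarpSizeG80 - threads_per_block;
}

// Compile-time checks of the examples mentioned in the note below.
static_assert(warps_per_block(64) == 2, "64 threads fill exactly 2 warps");
static_assert(warps_per_block(100) == 4, "100 threads need 4 warps");
static_assert(idle_lanes_in_last_warp(100) == 28, "4 * 32 - 100 = 28");
```

So a 100-thread block occupies 4 warps with 28 unused slots in the last one, which is one reason block sizes that are multiples of 32 are usually suggested.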
So that’s why, in sect. 6.2, the available registers / thread is R / (B x ceil(T,32)); because the MP always processes the threads in groups of 32 (per warp). It is really splitting registers according to the number of warps, not threads.
That’s why it also says “64 threads per block is minimal”: you want to run them concurrently, and have them as a multiple of 32.
By the way, what does “run concurrently” mean? I read in a lecture that it is good to have 1000s of blocks. If I have 1000 blocks and 16 MPs in a card, that will be about 62 blocks on each MP; the limit is 8 blocks running concurrently on each MP though?
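Here is a small sketch of that formula in plain C++, reading ceil(T,32) as "T rounded up to the next multiple of 32". The function names are hypothetical, and the 8192 registers per multiprocessor used in the checks is an assumed G80 figure; check the guide for your own card.

```cpp
// Round a thread count up to the next multiple of the warp size (32 here).
constexpr int round_up_to_warp(int threads_per_block) {
    return ((threads_per_block + 31) / 32) * 32;
}

// Registers available per thread: R / (B x ceil(T, 32)), where
//   R = registers per multiprocessor,
//   B = active blocks per multiprocessor,
//   T = threads per block.
// Integer division rounds down, since a thread cannot use part of a register.
constexpr int registers_per_thread(int R, int B, int T) {
    return R / (B * round_up_to_warp(T));
}

// Assumed R = 8192 (G80-class multiprocessor).
static_assert(registers_per_thread(8192, 8, 64) == 16, "8 blocks of 64 threads");
static_assert(registers_per_thread(8192, 3, 100) == 21, "100 is charged as 128");
static_assert(registers_per_thread(8192, 3, 128) == 21, "same budget as T = 100");
```

Note that T = 100 and T = 128 give the same per-thread budget, which is the point about registers being split per warp rather than per thread.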
Hi, I believe your question is typical for any newcomer to CUDA. I’m not good at it myself, though. Understanding it takes combined knowledge of computer architecture, compilers, and operating systems.
I think each multiprocessor is SIMD, but across multiprocessors it is only SCMD (single kernel, multiple data).
This might not be correct: I think “run concurrently” means having blocks instantiated (like C++ objects) on a multiprocessor. The 1000 blocks are not all instantiated at the same time; a multiprocessor works on 8 of them (in the best case), which are scheduled in a cycle between waiting on memory and executing. The reason as many as 1000 blocks are recommended might be that, for the same task, 1000 blocks have lighter-weight threads than 100 blocks would. And the lighter the threads, the less warp divergence hurts.
My impression is that the number of blocks will not change the weight of the threads, since they are just time-sliced with a switch…
From reading other threads, it seems important to make as many threads as possible in a block, but not so many that it creates too much delay during synchronization or doesn’t leave you enough registers to use.
Will someone tell me if anything is wrong with this?
You’re right, the limit is 8 blocks per SM (streaming multiprocessor…what I think you refer to as MP) as there aren’t enough resources to process all 1000 blocks at the same time, or “concurrently”.
Someone correct me if I’m wrong…but it is good to have thousands of blocks so that your application’s performance scales well with future hardware. If your program only has 100 blocks now (not enough parallelism!), you probably won’t see much performance gain with future hardware.
This is the reason for the recommendation (!) to have 1000s of blocks. The time slicing is not statically scheduled; a switch can also occur when threads stall because they are waiting for data from a memory fetch or similar. A multiprocessor issues work from only one warp at any one time, but it keeps several blocks resident, and when the threads it is working on are stalled it can switch to another warp, possibly from another block. That way, it can handle up to 8 blocks. If you have 16 multiprocessors, you need at least 16*8=128 blocks to allow every multiprocessor to run at full capacity. The recommendation (not requirement) for future hardware is thus to have 1000s, as the number of multiprocessors will go up.
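The block-count arithmetic can be sketched like this in plain C++. The helper names are invented for illustration, and the figures (16 multiprocessors, 8 resident blocks each) are just the ones from this thread; real limits also depend on register and shared memory use.

```cpp
// Blocks needed so that every multiprocessor holds its maximum number
// of resident blocks at once.
constexpr int blocks_to_fill_device(int multiprocessors, int max_blocks_per_mp) {
    return multiprocessors * max_blocks_per_mp;
}

// Rough number of "rounds" needed to work through a grid when only
// blocks_to_fill_device(...) blocks can be resident at a time (rounded up).
// In practice blocks are handed out dynamically, so this is only an estimate.
constexpr int rounds_for_grid(int grid_blocks, int multiprocessors,
                              int max_blocks_per_mp) {
    return (grid_blocks + blocks_to_fill_device(multiprocessors, max_blocks_per_mp) - 1)
           / blocks_to_fill_device(multiprocessors, max_blocks_per_mp);
}

static_assert(blocks_to_fill_device(16, 8) == 128, "16 * 8 = 128");
static_assert(rounds_for_grid(1000, 16, 8) == 8, "1000 blocks, 128 at a time");
static_assert(rounds_for_grid(100, 16, 8) == 1, "100 blocks do not fill the card");
```

With 100 blocks the card is not even full once, while 1000 blocks keep it busy for about 8 rounds and leave room for hardware with more multiprocessors.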
Within a block, the threads experience similar scheduling. Currently there are 8 ALUs that can execute thread code. They run at twice the multiprocessor speed, which is why effectively 16 threads get advanced per multiprocessor cycle. The warp size is the number of threads scheduled together, so (currently) each 32-thread warp takes 2 multiprocessor cycles. Future hardware can do things differently, which is why you should always program in a way that vectorizes on the warp size to be on the safe side. The warp size of the hardware you are executing on can be queried with cudaGetDeviceProperties at runtime. You should take it into account when setting the grid and thread layout for a kernel to adapt your program to the actual hardware.
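A rough sketch of both points in plain C++: the cycles-per-warp arithmetic, and picking a launch layout from the warp size instead of hard-coding 32. All function names here are hypothetical. In a real program `warp_size` would come from the `warpSize` field filled in by cudaGetDeviceProperties; here it is just a parameter so the sketch builds without the CUDA toolkit.

```cpp
// Multiprocessor cycles needed to advance one warp by one instruction,
// given the number of ALUs and how much faster they are clocked.
constexpr int cycles_per_warp(int warp_size, int alus, int alu_clock_multiplier) {
    return warp_size / (alus * alu_clock_multiplier);
}

// Round a desired block size up to a multiple of the warp size.
constexpr int block_size_for(int desired_threads, int warp_size) {
    return ((desired_threads + warp_size - 1) / warp_size) * warp_size;
}

// Number of blocks needed so that the grid covers n elements (rounded up).
// The kernel must still check its index against n in the last block.
constexpr int grid_size_for(int n, int block_size) {
    return (n + block_size - 1) / block_size;
}

static_assert(cycles_per_warp(32, 8, 2) == 2, "32 / (8 * 2) = 2");
static_assert(block_size_for(100, 32) == 128, "100 rounds up to 4 warps");
static_assert(grid_size_for(1000000, 128) == 7813, "last block is partly empty");
```

The values passed in the checks are the ones from this thread; on other hardware only the inputs change, not the layout code.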
That’s not quite correct. The grid configuration (threads per block, thread arrangement (1D, 2D, etc.), and block arrangement) can affect performance. Issues come up when one tries to ensure memory coalescing, maximize occupancy, and so on. It often pays to experiment with several arrangements to determine what’s optimal for your application.