I am new to CUDA. The "Performance Guidelines" section of the CUDA manual says that a global memory access takes a large number of clock cycles, and that it would be prudent to overlap that latency with the execution of another block on the same multiprocessor.
From the programmer's point of view, how do I actually do this? When I launch my kernel, I only specify the block and grid dimensions. How do I specify how many blocks should execute on a multiprocessor?
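For reference, this is a minimal sketch of how I launch a kernel today (the kernel name, sizes, and the dummy work are just placeholders). Nowhere in it do I see a place to say "run N blocks per multiprocessor":

```cpp
#include <cuda_runtime.h>

// placeholder kernel: doubles each element
__global__ void myKernel(const float* in, float* out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx] * 2.0f;   // dummy work
}

int main()
{
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    dim3 block(256);                         // threads per block
    dim3 grid((n + block.x - 1) / block.x);  // blocks in the grid
    myKernel<<<grid, block>>>(d_in, d_out, n);  // only grid and block dimensions are given here
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```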
Is it done by launching another kernel? In that case, can a multiprocessor multi-task between blocks belonging to different kernels? If so, what carries out this multi-tasking?

I have one more reason to think the other block would come from a different kernel: the manual talks about sharing the shared memory between the two blocks. It says a block's shared memory usage must be at most half of the shared memory available. That would also cover cases like one block using a quarter of the shared memory, another using half, and yet another using an eighth, and so on. Such varying sizes seem possible only if the blocks belong to different kernels (see the sketch below for what I mean).
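For concreteness, here is the kind of per-kernel shared memory declaration I have in mind. The kernel and numbers are made up, and I am assuming 16 KB of shared memory per multiprocessor; with 4 KB per block, up to four blocks of this one kernel could share a multiprocessor, but every block of this kernel uses the same 4 KB:

```cpp
// made-up kernel: each block stages a tile of 1024 floats in shared memory
__global__ void tileKernel(const float* in, float* out)
{
    // 1024 floats * 4 bytes = 4 KB of shared memory per block,
    // i.e. 1/4 of an assumed 16 KB per multiprocessor
    __shared__ float tile[1024];

    int base = blockIdx.x * 1024;

    // each of the (e.g. 256) threads in the block loads several elements
    for (int i = threadIdx.x; i < 1024; i += blockDim.x)
        tile[i] = in[base + i];
    __syncthreads();

    // ... compute on the tile ...

    for (int i = threadIdx.x; i < 1024; i += blockDim.x)
        out[base + i] = tile[i] * 2.0f;   // dummy work
}
```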
Also, the manual talks about the "first half" and "second half" of a warp. Does that mean a warp can be scheduled out and scheduled back in only once in its lifetime? Does this half-warp business have anything to do with block scheduling?
Thanks for any help.