1. A 6.1 SM has 128 cores and 4 schedulers. How exactly are the instructions scheduled?
As the document says, each scheduler can issue one instruction at instruction issue time. First, does this “issue time” happen every cycle? Assume it does. Then only 4 instructions can be issued per cycle; doesn’t that serve only 64 cores? Or are there two issue times per cycle? That sounds weird.
2. What’s an “instruction issued” sample?
Does it mean the warp is executing and nothing else is happening? Is this actually a good sign?
3. Will FP32 operations ever trigger “pipeline busy”?
Is a pipeline busy because there are more instructions than a specific unit can handle? If so, is there no way an FP32 unit can be too busy to accept a new instruction? Edit: unless an FP32 instruction is stalled inside the pipeline for some other reason?
4. Why would an instruction not be fetched?
Does this “fetch” mean fetching code from memory into the instruction cache, or is it something else? What exactly are the reasons for such stalls?
5. Does coalesced access have to happen in a single instruction?
It seems that no problem is reported if I read the first 16 bytes of a cache line in one instruction and the next 16 bytes later. Is this OK, is it actually not OK, or should I only do this when I really have to?
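To make question 5 concrete, here is a minimal CUDA sketch of the access pattern I mean. It assumes each thread owns one 32-byte chunk and reads it as two 16-byte `float4` loads with some arithmetic in between. The names `sum_chunk_split` and `split_load_kernel` are made up for illustration, and the compiler is free to reorder the two loads, so the generated SASS may not match the source order.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not from any library): sums the 8 floats of one
// 32-byte chunk using two separate 16-byte float4 loads.
// Assumes `chunk` is 16-byte aligned.
__host__ __device__ inline float sum_chunk_split(const float4* chunk)
{
    float4 lo = chunk[0];                    // first 16 bytes
    float acc = lo.x + lo.y + lo.z + lo.w;   // work done between the two loads
    float4 hi = chunk[1];                    // next 16 bytes, read later
    return acc + hi.x + hi.y + hi.z + hi.w;
}

// Hypothetical kernel: thread i owns chunk i (32 contiguous bytes of `in`).
__global__ void split_load_kernel(const float4* in, float* out, int n_chunks)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n_chunks) {
        out[i] = sum_chunk_split(in + 2 * i);
    }
}
```

The helper is marked `__host__ __device__` only so the arithmetic can be checked on the CPU; the question is about what the two loads cost on the GPU.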
6. Non-temporal/Write Combined Writes.
Can I completely bypass the caches in the kernel, since I rarely read the data back in the same kernel launch?
7. What exactly is the instruction latency?
The document says an FP32 instruction always has a latency of 24 cycles. But some online sources say it varies, and that on Pascal the result is ready in 6 cycles.
8. Are “other” stalls mainly register bank conflicts?
9. Are there any tricks for synchronization?
What happens when a warp is blocked by __syncthreads(), exactly?
Ideally, I wish I could set some sort of synchronization point, do something else, and then ask the warp to wait until the whole block has reached that synchronization point. That way I could hopefully hide most of the synchronization latency.
Does CUDA simply block the warps until the BAR instruction is reached, or does it do some magic to minimize the impact of synchronization?
a. Why do I see texture as a stall reason when I haven’t used any textures?
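To illustrate what I am asking for in question 9, here is a rough CUDA sketch of the usual pattern with a single `__syncthreads()`. The comments mark where I wish the "arrive" and "wait" halves could be split. The kernel name `stage_then_consume`, the helper `independent_work`, and the block size of 256 are assumptions made up for this example, and the split barrier in the comments is purely hypothetical.

```cuda
#include <cuda_runtime.h>

constexpr int kBlock = 256;  // assumed block size; launch with blockDim.x == kBlock

// Hypothetical stand-in for work that does not depend on shared memory.
__host__ __device__ inline float independent_work(float x)
{
    return x * x + 1.0f;
}

__global__ void stage_then_consume(const float* in, float* out, int n)
{
    __shared__ float tile[kBlock];

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;
    tile[threadIdx.x] = v;   // stage this thread's value in shared memory

    // Wished-for "arrive" point: signal that my shared-memory write is done.
    float extra = independent_work(v);   // work I would like to overlap

    // Today, arriving and waiting both happen here, in one call.
    __syncthreads();
    // Wished-for "wait" point: only now do I need the other threads' writes.

    int neighbor = (threadIdx.x + 1) % blockDim.x;
    if (i < n) {
        out[i] = tile[neighbor] + extra;
    }
}
```

Placing the independent work before `__syncthreads()` is the closest I can get by hand; whether the hardware overlaps anything beyond that is exactly what I would like to know.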
b. Are SASS instructions the final ones executed?
They seem too complex to be executed directly (and it appears to me that every 3 instructions are packed together). Are they further translated into some sort of microcode?