Things related to stall reasons... or not so related

1. A 6.1 SM has 128 cores and 4 schedulers. How exactly are the instructions scheduled?

As said in the document, each scheduler can issue one instruction at instruction issue time. The first question is: does this “issue time” happen every cycle? Assume it does. Then only 4 instructions can be issued per cycle; doesn’t that serve only 64 cores? Or are there two issue times per cycle? That sounds weird.

2. What’s an “instruction issued” sample?

Does it mean the warp is executing and nothing else is happening? Is this actually a good sign?

3. Will FP32 operations ever trigger “pipeline busy”?

Is a pipeline busy because there are more instructions than a specific unit can handle? So is there no way an FP32 unit can be too busy to accept a new instruction? Edit: unless an FP32 instruction is stalled inside the pipeline by something else?

4. Why would an instruction not be fetched?

Does this “fetch” mean fetching code from memory into the instruction cache, or is it something else? What are the exact reasons for such stalls?

5. Does coalesced access have to happen in a single instruction?

It seems that no problem is reported if I read the first 16 bytes of the cache line in one instruction and read the next 16 bytes later. Is this OK, actually not OK, or something I should only do when I really have to?

6. Non-temporal/Write Combined Writes.

Can I completely bypass the caches in the kernel, since I rarely read back the data within the same kernel launch?

7. What exactly is the instruction latency?

The document says an FP32 instruction always has a 24-cycle latency. But some online sources say it varies, and that for Pascal the result would be ready in 6 cycles.

8. Are “other” stalls mainly register bank conflicts?

9. Are there any tricks about synchronizations?

What happens when a warp is blocked by __syncthreads(), exactly?

Ideally, I wish I could set some sort of synchronization point, do something else, and then ask the warp to wait until the whole block has reached that synchronization point, so that I could hopefully hide most of the synchronization latency.

Does CUDA simply block the warps until the BAR instruction is reached, or does it play some magic to minimize the impact of synchronization?

a. Why do I see texture as a stall reason when I haven’t used any textures?

b. Are SASS instructions the final ones executed?

They seem rather too complex to be executed directly (and it appears to me that every 3 instructions are packed together). Will they be further translated into some sort of microcode?

4x32 = 128. Each issued instruction is warp-wide, i.e. one instruction serves 32 cores, so 4 schedulers each issuing one instruction per cycle keep all 128 cores fed.

Yes, coalescing refers to the behavior of the memory controller, as it reacts to a single instruction executed warp-wide. For example refer to slide 17 here:

https://bluewaters.ncsa.illinois.edu/documents/10157/63094/NCSA02_Fundamental_CUDA_Optimization.pdf
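To illustrate, here’s a minimal sketch (my own kernel and array names; `in` is assumed to hold at least 2*n elements). Each of the two 16-byte-per-thread loads below is a separate instruction, and each is judged for coalescing on its own; since each one is contiguous across the warp, neither should be reported as a problem:

```
__global__ void split_loads(const float4 *in, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 a = in[i];      // 1st instruction: 16 bytes/thread, contiguous warp-wide
        float4 b = in[i + n];  // 2nd, later instruction: also coalesced on its own
        out[i] = make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
    }
}
```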

CUDA C/C++ does not expose any explicit method to bypass the L2 cache for arbitrary loads and stores. I would normally say “the L2 cache cannot be bypassed”, but then people argue about this or that in PTX, or bring up the example of atomics and want to argue about what exactly “bypass” should mean, which I don’t want to do. For example, refer to slide 35 here:

http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf
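What PTX does offer are cache hints, which is part of why the semantics get argued about. For example, st.global.cs marks a store as “cache streaming, likely to be accessed once”. A sketch via inline PTX (my helper name; this is a hint that may reduce cache pollution, not a guaranteed bypass):

```
// Hint only: the hardware may still allocate the line in L2;
// .cs just marks it as a low-priority / evict-first candidate.
__device__ __forceinline__ void store_streaming(float *addr, float v)
{
    asm volatile("st.global.cs.f32 [%0], %1;" :: "l"(addr), "f"(v) : "memory");
}

__global__ void write_once(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        store_streaming(out + i, 2.0f * i);  // data we don't expect to read back
}
```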

Instruction latency is the difference in time (clock cycles) between when an instruction is scheduled (i.e. issued to a functional unit in the SM) and when its results are ready. For a floating point multiply, for example, it may be scheduled on a floating point multiplier in cycle X, and in cycle X+L the result will be usable by another instruction, where L is the latency in cycles. This latency varies from one architecture to the next, and from one instruction to the next. Most instructions have a latency greater than 1 cycle. A related latency concept applies to things like global loads, where the latency is the time difference from the request of an item from global memory (i.e. when a global LD instruction is issued to a LD/ST unit) to when that item is actually available in a register to be used by another instruction.

You make several references to “the document”, but you don’t actually give a link to the document you are referring to.
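Back to latency: a toy example of why it matters (my own illustration; assume for the sake of argument a dependent-issue FP32 latency of about 6 cycles on Pascal). A single dependent chain pays the full latency on every instruction, while independent chains give the scheduler something to issue in the meantime:

```
__global__ void chains(float *out, float a, float b)
{
    // Dependent chain: each FFMA consumes the previous result, so the
    // warp waits ~L cycles between issues (execution dependency).
    float x = a;
    for (int i = 0; i < 1024; ++i)
        x = x * b + 1.0f;

    // Four independent chains: up to four FFMAs in flight per warp,
    // so most of the latency can be hidden even with few warps.
    float y0 = a, y1 = a, y2 = a, y3 = a;
    for (int i = 0; i < 256; ++i) {
        y0 = y0 * b + 1.0f;
        y1 = y1 * b + 2.0f;
        y2 = y2 * b + 3.0f;
        y3 = y3 * b + 4.0f;
    }
    out[threadIdx.x] = x + y0 + y1 + y2 + y3;  // keep everything live
}
```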

Some examples of “other” stall reasons:

http://stackoverflow.com/questions/14887807/what-are-other-issue-stall-reasons-displayed-by-the-nsight-profiler

In newer architectures (Maxwell, Pascal) the texture and L1 functionality are combined, so even if you don’t “use” texture, the texture load path may still be involved in ordinary global loads.

http://docs.nvidia.com/cuda/maxwell-tuning-guide/index.html#l1-cache

Yes.
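You can inspect this yourself: cuobjdump -sass on a built binary dumps the SASS that the GPU executes. And if I understand the “every 3 instructions packed together” observation correctly, that’s the scheduling control information encoded alongside each group of 3 instructions on Maxwell/Pascal (the maxas Control-Codes wiki documents it), not evidence of a further translation step.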

Thanks, txbob.

Aren’t instructions issued in half warps?

About the latency, I cannot find the original sources… But here’s one that refers only to Maxwell, in the “Stall Counts” section:
https://github.com/NervanaSystems/maxas/wiki/Control-Codes

For Kepler, Maxwell, and Pascal, no.

For Fermi, yes, but for understanding purposes this should be ignored. Treat Fermi as if it issued the instruction for the full 32 threads in the warp and ignore the “hotclock” (2x clock). The reason I say this is that Fermi had a “hotclock” architecture, but if we ignore the “hotclock”, then instructions were issued for a warp, not a half warp. I don’t want to go into a lot of explanation of this.

For Tesla (an architecture that preceded Fermi, and was the “original” CUDA architecture), yes, but that architecture is no longer supported by current CUDA versions, and the current documentation does not reflect Tesla behavior. Note that coincidentally, “Tesla” was also chosen as a GPU brand by NVIDIA, so we have to be careful to disambiguate the architectural family from the GPU brand, if we want to talk about the Tesla architectural family.

In any event, for your suggested architecture of 4 schedulers and 128 SP’s, that could only be Maxwell or Pascal, and those architectures do not issue instructions in half warps.

After a quick perusal, I don’t see anything there which conflicts with my description of latency.

Thank you. That clears up a lot of things.

About the latency, what I actually meant to ask was how long the latency is. After reading that maxas document, I think its numbers are right, since it’s an assembler.


And a question about synchronization: do I have to put bar.arrive and bar.sync on the same barrier in different warps? I haven’t yet figured out a way to test it.

Never mind about the synchronization thing. I think I found that they have to be in different warps.
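For anyone who finds this later, this is roughly the pattern I had in mind (a sketch only; the warp roles and counts are my own choice, and I’m ignoring the memory-ordering fine print). bar.arrive does not block, and bar.sync completes once the total arrival count on that named barrier reaches the declared thread count:

```
// Launch with 256 threads per block: warps 0-3 produce, warps 4-7 consume.
__global__ void producer_consumer(float *out)
{
    __shared__ float buf[128];
    int warp = threadIdx.x / 32;
    if (warp < 4) {
        buf[threadIdx.x] = (float)threadIdx.x;           // produce
        asm volatile("bar.arrive 1, 256;" ::: "memory"); // signal, don't wait
        // ... independent work here can overlap the consumers' wait ...
    } else {
        asm volatile("bar.sync 1, 256;" ::: "memory");   // wait for 256 arrivals
        out[threadIdx.x - 128] = buf[threadIdx.x - 128] * 2.0f;  // consume
    }
}
```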

Latency varies by instruction and by architecture. Even if you picked a specific case, it normally has to be arrived at by micro-benchmarking (which is what Scott Gray has done to some degree). It is not a published data/specification item.
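A bare-bones sketch of such a micro-benchmark (mine, and deliberately naive; a real one needs warm-up runs, locked clocks, and a check of the generated SASS):

```
__global__ void fma_latency(float *out, long long *cycles, float a, float b)
{
    float x = a;
    long long t0 = clock64();
    #pragma unroll
    for (int i = 0; i < 512; ++i)
        x = x * b + 1.0f;       // each FFMA depends on the previous one
    long long t1 = clock64();
    out[threadIdx.x] = x;       // keep the chain from being optimized away
    if (threadIdx.x == 0)
        *cycles = t1 - t0;      // ~512 x dependent-issue latency, plus overhead
}
```

Launch it as fma_latency<<<1, 32>>>(...) so a single warp has the SM to itself, and divide the measured cycles by 512. Verify with cuobjdump -sass that the loop really compiled to a chain of dependent FFMAs.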