how many clock cycles for one instruction

I am curious how many clock cycles one instruction takes (an integer op is probably different from a floating-point op). In the document, it says typically 22 or 24 for most instructions. Is there detailed information about this?

Which document? It appears you are inquiring about the latency of instructions, that is, the time one instruction takes to traverse the internal processing pipeline of a CUDA core. If so, the number will likely differ by instruction type and GPU architecture. It is also likely to be somewhat higher than on modern CPUs, which use around 15 pipeline stages for most instructions.

In the context of CUDA programming, instruction latency should not matter (to first order), since GPUs are designed as throughput machines with the assumption that programmers will run enough threads for a kernel to cover all the basic latencies, which include instruction latency. That said, the compiler does try to schedule long-latency instructions, such as loads, early.
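To illustrate the point above, here is a minimal sketch (the kernel and launch configuration are hypothetical, chosen only for illustration): with enough resident warps per SM, the warp scheduler can issue instructions from other warps while one warp waits out an instruction's latency, so per-instruction latency is largely hidden.

```cuda
// Grid-stride SAXPY: oversubscribe the GPU with threads so the scheduler
// always has another warp ready to issue while earlier warps wait on
// instruction or memory latency.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        y[i] = a * x[i] + y[i];
}

// Host side (illustrative launch config): launch many more threads than
// there are cores, rather than trying to match the core count.
// saxpy<<<numSMs * 8, 256>>>(n, 2.0f, d_x, d_y);
```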

This document:

I want to estimate the latency of a program and predict which of several programs is the fastest algorithm without running them on a GPU.

I searched almost every public resource online but found nothing about the latency of a single instruction.

By the way, how do researchers build simulation models to estimate GPU latency?

On Maxwell, for CUDA-core instructions (most integer, floating-point, and logic instructions) the pipeline depth is just 6 clocks. I’m not sure what it is on Kepler, but I know it’s deeper. Using my assembler you can easily measure the latency, throughput, and instruction queue depth of any instruction you like. Here are some more details on the Maxwell architecture (I’ve recently updated it with a bit more info and made some corrections):

The six cycles measured on Maxwell through microbenchmarks are presumably the number of execution stages of the pipeline (from register-file read to result bypass), established by executing a long sequence of dependent instructions? The total length of the pipeline (instruction fetch through register-file writeback) should be considerably longer, I would think. By “fp” I assume you are referring to single-precision operations only?
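The dependent-chain technique mentioned above can be sketched roughly as follows (a hedged illustration, not a production microbenchmark; the kernel name, ITERS, and unroll factor are my own choices, and a real measurement would inspect the generated SASS to confirm the compiler preserved the dependence chain):

```cuda
#include <cstdio>

#define ITERS 4096  // chain length; illustrative

// Time a long chain of dependent FFMAs with clock64() and divide by the
// chain length. Each iteration's input is the previous iteration's output,
// so the measured time reflects dependent issue-to-issue latency, not the
// total front-to-back pipeline depth.
__global__ void fma_latency(float *out, long long *cycles, float seed)
{
    float x = seed;
    long long start = clock64();
    #pragma unroll 16
    for (int i = 0; i < ITERS; ++i)
        x = x * x + seed;          // each FFMA depends on the previous one
    long long stop = clock64();
    *out = x;                      // keep the result live past dead-code elimination
    *cycles = stop - start;        // divide by ITERS for cycles per instruction
}
```

Launching this with a single thread (`fma_latency<<<1,1>>>(...)`) keeps other warps from perturbing the measurement; loop and timing overhead still add a small bias that longer chains amortize away.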

Best I know, for Haswell FMAs the latency between dependent instructions is 5 cycles, while the total length of the pipeline is something like 15 cycles. Since GPUs, as throughput architectures, do not put a premium on latency, I would guess that their overall pipeline length could be longer than that. But I have no direct knowledge, since in all my years of CUDA programming I have never encountered a need for that data.

I am not aware of detailed cycle-based GPU simulators that are publicly available. If they do exist, I would assume they dial in cycle counts based on a mix of reasonable assumptions and data measured via microbenchmarks like those used by Scott.

6 clocks is simply the amount of time that needs to transpire before the output register from one instruction can be used as input for a subsequent instruction. I suppose the actual pipeline could be deeper than that, but I’m not sure how. On Maxwell there is only one shared dp unit per SM, so that’s a variable latency instruction that needs to be synchronized with barriers, not stall counts.

But the more you know about all the instruction metrics, the more effective you can be at scheduling instructions to avoid bubbles in the pipeline and boost your ILP. This lets you run at low occupancy with large register allocations, which is generally more power efficient since you keep more data close to the compute units. And more power efficiency nowadays means higher clocks.
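The ILP idea above can be sketched like this (a hypothetical dot-product fragment of my own construction; the two-chain split and the atomicAdd reduction are illustrative choices): two independent accumulator chains let the scheduler issue a second FFMA while the first is still in flight, filling pipeline bubbles within a single thread instead of relying on extra warps.

```cuda
// Dot product with 2-way ILP: acc0 and acc1 form independent dependence
// chains, so back-to-back FFMAs can overlap in the pipeline.
__global__ void dot_ilp2(int n, const float *x, const float *y, float *out)
{
    float acc0 = 0.0f, acc1 = 0.0f;   // independent chains -> 2-way ILP
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    for (; i + stride < n; i += 2 * stride) {
        acc0 += x[i] * y[i];
        acc1 += x[i + stride] * y[i + stride];
    }
    if (i < n) acc0 += x[i] * y[i];   // tail element, if any
    atomicAdd(out, acc0 + acc1);      // final reduction method is illustrative
}
```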

On Kepler, for most instructions (i.e. int, fp, logic) the pipeline depth is 9 clocks. You can measure this latency easily with the asKepler assembler below!
asKepler assembler:
If you have any questions about using asKepler, you are welcome to reply to this post: