GPUs expose two performance counters, “instructions issued” and “instructions executed”. Could someone clarify the difference between the two? And what is meant by “replays”?
I have seen it stated that “instructions executed = instructions issued + replays”. Does that mean instructions executed is ALWAYS greater than or equal to instructions issued? (In my experiments, it sometimes is not.) If not, when is instructions executed smaller than instructions issued?
Replays are a technique employed by a multi-threaded processor to avoid stalling the pipeline when a long latency event occurs. When an instruction is issued, a pipelined processor will continue to issue instructions behind it as long as they do not depend on the issued instruction, or as long as the dependencies can be resolved by forwarding results. This is fine in most cases: the original instruction executes quickly and the pipeline continues to process instructions. However, sometimes the original instruction encounters a long latency event and cannot complete quickly (it may be a load that misses in the cache, there may be a conflict for a shared port, etc). In these cases, simple processors may just stall the pipeline. That is acceptable for single-threaded, in-order processors, but it is bad for multi-threaded processors, because there are usually independent instructions from other threads that could execute immediately without waiting. Here you really want the long latency instruction to ‘get out of the way’. Replays help by squashing the instructions in the pipeline and starting to execute instructions from a different thread. The original instruction is ‘replayed’ at some later time, at which point it will hopefully execute quickly. Some time is wasted because instructions are squashed, but it is typically on the order of a few instructions, rather than the hundreds of cycles that would be spent waiting for a cache miss.
This paper gives a good overview of mechanisms for tolerating long latency events (replays, skids, out of order execution, etc): http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.3533&rep=rep1&type=pdf
Assuming that a GPU uses ‘replay’ as its mechanism for handling long latency events, instructions issued should always be greater than or equal to instructions executed. In other words, instructions issued = instructions executed + replays, which is the reverse of the formula you quoted. Instructions that were squashed due to a replay are issued multiple times.
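To make the counting concrete, here is a toy model in Python. It is only a sketch of the replay idea described above, not a description of how any real GPU schedules work: the round-robin scheduler, the `simulate` function, and the per-instruction replay counts are all made up for illustration.

```python
from collections import deque

def simulate(threads):
    """Toy round-robin issue loop with replays (hypothetical model).

    threads: list of lists. Each number is how many times that
    instruction hits a long latency event (and must be replayed)
    before it finally completes. 0 means it completes on first issue.
    Returns (issued, executed, replays).
    """
    issued = executed = replays = 0
    # One queue of pending instructions per thread; skip empty threads.
    ready = deque(deque(t) for t in threads if t)
    while ready:
        stream = ready.popleft()      # pick the next thread in turn
        issued += 1                   # every attempt counts as an issue
        if stream[0] > 0:
            stream[0] -= 1            # long latency event: squash it,
            replays += 1              # it will be issued again later
        else:
            stream.popleft()          # instruction completes
            executed += 1
        if stream:
            ready.append(stream)      # thread goes to the back of the line
    return issued, executed, replays

issued, executed, replays = simulate([[0, 2, 0], [0, 0], [1]])
print(issued, executed, replays)
```

In this model every replay adds one to the issued count and nothing to the executed count, so issued = executed + replays holds by construction.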
You can play around with the profiler and see what types of instructions cause replays.
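When comparing profiler runs, a small helper like the one below may be handy. It is a sketch that assumes the replay model above (issued = executed + replays); the function name `replay_stats` and the counter values in the example are hypothetical, and the fraction it computes is not necessarily the same as any metric the profiler itself reports.

```python
def replay_stats(issued, executed):
    """Estimate replays from two counter readings (assumed model:
    issued = executed + replays).

    Returns (replays, fraction_of_issues_that_were_replays).
    A negative replay count means the readings do not fit this model.
    """
    replays = issued - executed
    fraction = replays / issued if issued else 0.0
    return replays, fraction

# Hypothetical counter readings, for illustration only.
replays, fraction = replay_stats(issued=1200, executed=1000)
print(replays, round(fraction, 3))
```

Running the same kernel with different access patterns and comparing the resulting fractions is one way to see which kinds of instructions trigger replays.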
Thanks a lot for the information. That’s probably the best explanation I’ll find for a while.
I have another question about the instruction counts measured by those counters. Are they PTX instructions rather than GPU instructions (machine code)? If so, then even assuming the operands are all ready, each instruction will take a different number of clock cycles to finish executing, right? After all, a PTX instruction can be translated into several GPU instructions.
Glad to help :)
The counters report machine instructions, not PTX instructions. Even so, they will take a different number of cycles to execute, depending on the instruction (e.g. mul is slower than add).
Thanks, Gregory, for the information. Could you let me know which documents (e.g., an Nvidia paper) say that GPUs use this kind of technique? I would need to cite them in my work.
I’ve read the paper you suggested, but it doesn’t show that GPUs use the instruction replay technique.