- Does instruction mix take a toll on performance or not? Most articles suggest that unless there is a register dependency, the scheduler can issue 2 instructions from the same warp. However, there is one article published at UC Berkeley, http://digitalassets.lib.berkeley.edu/etd/ucb/text/Volkov_berkeley_0028E_16465.pdf, that very briefly discusses instruction mix (middle of page 24). To quote,
“Another factor responsible for mapping between warps and instructions is instruction mix. It refers to the proportion in which different instruction types are present in the code. When several instruction types are present in the same code, which is the common case, executing an instruction of one type in a warp typically implies not executing instructions of other types in the same warp at the same time. In result, the number of concurrently executed instructions of any particular type may be smaller than the number of concurrently executed warps.”
However, the author does not go into detail about what it means to mix instructions or what the cost is. Any more detail on instruction mix, its performance cost, and how to avoid it would be greatly appreciated!
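To make the question concrete, here is a minimal sketch of what I understand “instruction mix” to mean (this is my own hypothetical example, not from the dissertation): a loop that is almost purely FP32 arithmetic versus one that interleaves arithmetic with shared-memory loads. My reading of the quoted passage is that in the mixed version, a cycle spent issuing a load for a warp is a cycle not spent issuing an FMA for that warp, so the throughput of any single instruction type ends up below what the warp count alone would suggest. Is that the right way to think about it?

```cuda
// Hypothetical illustration of "instruction mix" (my own example, not from the thesis).

// Kernel A: the inner loop is essentially one instruction type (FP32 arithmetic).
__global__ void pureFma(float *out, float a, float b, int iters)
{
    float x = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = a * x + b;                       // presumably a single FFMA per iteration
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}

// Kernel B: the inner loop mixes FP32 arithmetic with shared-memory loads,
// so issue slots are split between the FP32 units and the load/store units.
// Assumes blockDim.x == 256.
__global__ void mixedFmaLds(const float *in, float *out, float a, int iters)
{
    __shared__ float buf[256];
    buf[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();

    float x = threadIdx.x;
    for (int i = 0; i < iters; ++i)
        x = a * x + buf[(threadIdx.x + i) % 256];  // presumably an LDS plus an FFMA, plus some integer address math
    out[blockIdx.x * blockDim.x + threadIdx.x] = x;
}
```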
When we say a warp scheduler can schedule 2 “instructions” from a warp in a single cycle, does “instruction” mean assembly-level (SASS) instructions, or CUDA C-level statements?
I am working to improve the ILP of a program. How do you decompose a single line of code that may be composed of multiple instructions? For example, in the code below, the last line should decompose into at least 2 instructions, even though it is a single line. That case is fairly obvious since we are dealing with shared memory (and it would be just as obvious for global memory), but it is not as obvious for register reads and writes (for example, how many instructions is the second-to-last line composed of?). Is there any literature on that?
// some boiler plate definition of a device kernel
__shared__ float i;
float y = 1.0f;
y = y + 2.0f;
i = 0.1f + y;
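For what it is worth, here is my current guess at how those lines decompose (my own annotations, not verified; presumably one could check by dumping the generated SASS with cuobjdump -sass, though I am not sure how to map the output back to source lines):

```cuda
// My guesses only, not verified against actual SASS output.
__global__ void example(float *out)
{
    __shared__ float i;
    float y = 1.0f;   // possibly no runtime instruction at all: the constant may simply be folded into later uses
    y = y + 2.0f;     // my guess: at most one register-to-register FADD, possibly folded to y = 3.0f at compile time
    i = 0.1f + y;     // my guess: one FADD plus one STS (store to shared memory), i.e. at least 2 instructions,
                      // although the whole chain could also be constant-folded into a single store of 3.1f
    out[threadIdx.x] = i;  // added only so the compiler does not optimize everything away
}
```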
- If you take a look at this keynote, https://www.nvidia.com/content/GTC-2010/pdfs/2238_GTC2010.pdf, on pages 35 and 36, instruction-level parallelism increases as consecutive memory accesses grow from 2 to 4 and 8. It makes sense that ILP would improve going from 1 to 2 consecutive memory accesses, since a warp scheduler can schedule 2 instructions at once. So shouldn’t anything more than 2 show no difference? But according to the author it makes a significant difference. How can this be explained?
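For reference, here is my reconstruction of the pattern I believe those slides are measuring (hypothetical code, not the author's; the name copyIlp4 is my own): each thread issues several independent loads back to back before using any of the results, so the latencies of those loads can overlap within a single thread.

```cuda
// My reconstruction of the "ILP = 4" copy pattern from the slides (hypothetical, not the original code).
// Each thread issues 4 independent loads before any of the results are consumed.
__global__ void copyIlp4(const float *src, float *dst, int n)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (i + 3 < n) {
        float a = src[i + 0];   // 4 loads with no dependencies between them
        float b = src[i + 1];
        float c = src[i + 2];
        float d = src[i + 3];
        dst[i + 0] = a;         // results are only consumed after all 4 loads have been issued
        dst[i + 1] = b;
        dst[i + 2] = c;
        dst[i + 3] = d;
    }
}
```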