A deep dive into instruction-level parallelism

More likely than not (you should be able to confirm this with the CUDA profiler), this change

(1) reduced dynamic instruction count by allowing all but the first address to be computed by simple addition
(2) allowed the compiler to schedule loads earlier and batch them, increasing latency tolerance and improving the efficiency of memory accesses (see the sketch after this list)
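To make the kind of change I'm describing concrete, here is a minimal sketch. The column-sum scenario, the function names, and the matrix layout are my own assumptions for illustration, not the original poster's code. Variant A recomputes the full address from the loop index on every iteration; variant B computes the first address once and derives every subsequent address by adding a constant stride, which is what lets the compiler shrink the dynamic instruction count and hoist or batch the loads:

```
#include <cstdio>
#include <cuda_runtime.h>

// Variant A: the address is recomputed from the loop index every iteration,
// costing an integer multiply-add per load just to form the address.
__device__ float sum_col_recompute(const float *m, int rows, int cols, int col)
{
    float s = 0.0f;
    for (int r = 0; r < rows; ++r)
        s += m[r * cols + col];
    return s;
}

// Variant B: only the first address is computed in full; every later address
// is derived by a simple addition (an induction variable).
__device__ float sum_col_increment(const float *m, int rows, int cols, int col)
{
    const float *p = m + col;   // first (and only) full address computation
    float s = 0.0f;
    for (int r = 0; r < rows; ++r) {
        s += *p;
        p += cols;              // subsequent addresses by plain addition
    }
    return s;
}

// One thread per column; each thread sums its column both ways so the two
// variants can be compared in the profiler.
__global__ void sum_cols(const float *m, float *a, float *b, int rows, int cols)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (col < cols) {
        a[col] = sum_col_recompute(m, rows, cols, col);
        b[col] = sum_col_increment(m, rows, cols, col);
    }
}

int main()
{
    const int rows = 1024, cols = 256;
    float *m, *a, *b;
    cudaMallocManaged(&m, rows * cols * sizeof(float));
    cudaMallocManaged(&a, cols * sizeof(float));
    cudaMallocManaged(&b, cols * sizeof(float));
    for (int i = 0; i < rows * cols; ++i) m[i] = 1.0f;
    sum_cols<<<(cols + 127) / 128, 128>>>(m, a, b, rows, cols);
    cudaDeviceSynchronize();
    printf("a[0]=%f b[0]=%f (both should be %d)\n", a[0], b[0], rows);
    cudaFree(m); cudaFree(a); cudaFree(b);
    return 0;
}
```

Note that the compiler often performs this strength reduction on its own; whether it actually did in any given case is exactly the sort of thing the profiler's instruction-count metrics can confirm, as noted above.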

I would not classify this as an ILP-related technique (which is why I asked), but I suppose one could argue about the exact definition of that term. To first order, a GPU is a scalar processor that can schedule one instruction per thread per cycle for execution, i.e., the question of ILP doesn't even come up. There are various exceptions to this first-order description, which vary by GPU architecture, but I am not aware of any GPU shipped since CUDA came into existence that issues more than two instructions per thread per cycle under any conditions.

If someone has better information, I encourage them to comment. There have been too many different architectures for me to keep all the details in memory.