A deep dive into instruction-level parallelism

“Instruction” here means machine instruction: the ones you see when you use cuobjdump --dump-sass. You seem to be doing a deep dive into the details of internal GPU execution mechanics. Many of those details change between architecture generations and/or are undocumented, so I don’t think this is a very fruitful way to tackle performance issues in CUDA programs. Due to the many GPU hardware changes since Mr. Volkov made his presentation, I would claim it is of limited usefulness today.
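If you want to look at SASS yourself, here is a minimal sketch (the file name, kernel, and -arch value are my assumptions; adjust them for your GPU):

```
// Trivial kernel used only as something to disassemble.
// Assumed build/inspect commands:
//   nvcc -arch=sm_80 -cubin saxpy.cu -o saxpy.cubin
//   cuobjdump --dump-sass saxpy.cubin
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```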

I see content related to occupancy on slides 35 and 36 of the linked presentation, but occupancy is not ILP in the classical sense. In general, when programming GPUs with CUDA:

(1) It is moderately useful to worry about occupancy. But given that you have already found Volkov’s work, you should shortly come across his research demonstrating that occupancy is only weakly correlated with kernel performance (same presentation, a few slides further in). A small occupancy-query sketch follows this list.

(2) It is almost never useful to worry about instruction-level parallelism. Exceptions may exist for ninja-level programmers.
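If you do want a quick read on occupancy without agonizing over it, a minimal sketch along these lines queries the runtime’s own prediction (the kernel, block size, and device index here are my assumptions, not part of your code):

```
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel used only to have something to query occupancy for.
__global__ void axpy(float a, const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    int blockSize = 256;      // assumed launch configuration
    int blocksPerSm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, axpy, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    // device 0 assumed
    float occ = (blocksPerSm * blockSize) / float(prop.maxThreadsPerMultiProcessor);
    printf("predicted occupancy: %.0f%%\n", occ * 100.0f);
    return 0;
}
```

Treat the number as a sanity check on the launch configuration, not as a quantity to maximize.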

On modern GPUs, many real-world use cases have become memory bound, and some are struggling to expose sufficient parallelism for the ever-increasing number of execution elements. So to first order, the key to good performance is:

(1) Get lots of threads going (tens of thousands)
(2) Optimize data access patterns and eliminate memory accesses where possible (a minimal sketch illustrating both points follows this list)
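To make both points concrete, here is a sketch under assumed sizes and launch parameters (none of these numbers come from your application): a grid-stride loop in which adjacent threads touch adjacent elements, so global memory accesses are coalesced, and the launch puts tens of thousands of threads in flight.

```
#include <cuda_runtime.h>

// Grid-stride loop: adjacent threads access adjacent elements (coalesced),
// one read and one write per element, no redundant memory traffic.
__global__ void scale(const float *in, float *out, float s, int n)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
    {
        out[i] = s * in[i];
    }
}

int main()
{
    const int n = 1 << 24;                 // assumed problem size (~16M elements)
    float *in = nullptr, *out = nullptr;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    int blockSize = 256;
    int gridSize  = 256;                   // 256 * 256 = 65536 threads in flight
    scale<<<gridSize, blockSize>>>(in, out, 2.0f, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```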

Before you dive into machine-specific optimizations, think about algorithmic optimizations. No compiler and no profiler is going to help you with those. Beyond that, let your optimization efforts be guided by the CUDA profiler. It will identify bottlenecks for you. Start with the biggest bottleneck reported by the profiler. What does it say that bottleneck is?

Have you read the Best Practices Guide? It contains many ideas on how to improve performance (many of which may not be applicable to any one specific application, of course).