Is there an archived version of Nsight Visual Studio Edition that could report bottlenecks of OpenCL kernels?

I need to optimize OpenCL kernels for NVIDIA GPUs (Pascal: P1000, P2000).

Has there ever been a version of Nsight that could show occupancy, register count,
shared memory bottlenecks, etc. for OpenCL kernels?

If so, which is the latest such version?

Has there ever been a profiling tool for NVIDIA GPUs, on either Windows or Linux,
that could report these metrics for OpenCL kernels?

I know that nvvp could previously be hacked to produce measurements via its command-line interface,
but that interface was removed from CUDA Toolkit versions later than 7.5.
Unfortunately, our target devices require later versions.
Links (the first one does not mention the version 7.5 limitation, as it was written before that change):