Instruction-Level Profiling via nvprof?

Has anyone had any luck with instruction-level profiling in CUDA 7.5? An example is outlined here:

http://devblogs.nvidia.com/parallelforall/cuda-7-5-pinpoint-performance-problems-instruction-level-profiling/

but I have yet to get it to work. I get the “Kernel Profile - PC Sampling” report in nvvp, with a kernel-level sample count and the sample distribution pie chart, but there is no section below it listing source files or functions. Next to the minimize/maximize buttons of the results window there is an icon that presumably lets you add source file mappings, but it does not work: clicking it pops up a modal “Source Files Mapping” window, and nothing happens when I click “Add Mapping”.

I use out-of-tree builds via CMake, but I have also tried copying the executable into the source directory and running nvvp from there, with no luck. I’m on Linux, where nvvp has never worked well for me, so perhaps this works better on Windows.

I typically use nvprof, but I cannot find any flags for generating instruction-level profiles. Does anyone know whether this functionality is exposed in nvprof at all? I looked through the available metrics with --query-metrics, but I don’t see anything related to program counter (PC) sampling.

I am using a GTX 980 Ti (CC 5.2) and built the code for sm_52.