Nsight Compute slows down Tesla T4 processor clock during profiling

I am trying to optimize some existing CUDA kernels for use on the Tesla T4 under the CUDA 10.2 tools. I have discovered that the default “application” processor clock rate is 585 MHz and that there is no auto boost for this device. I used nvidia-smi to raise the application clock to its maximum and made the setting permanent. However, when I run Nsight Compute to profile the kernels, the processor clock is again limited to 585 MHz during all of the successive replays of these kernels. There is no throttling of the clock due to temperature issues.
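For reference, this is roughly how I set and pinned the application clocks; the MHz values below are the maximum pair nvidia-smi reported as supported for my board, so treat them as placeholders:

```
# Enable persistence mode so the clock settings survive between application runs
sudo nvidia-smi -pm 1

# List the supported memory,graphics clock pairs for the board
nvidia-smi -q -d SUPPORTED_CLOCKS

# Pin the application clocks to the maximum supported pair
# (values shown are what my T4 reported: 5001 MHz memory, 1590 MHz graphics)
sudo nvidia-smi -ac 5001,1590
```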
If I run my test application in Nsight Systems, the clock is not throttled in this fashion and gives me acceptable timings for the kernels.
My questions are: Is this limiting of the processor clock known behavior for Nsight Compute? Can the clock rate be increased during profiling? In general, my optimizations do not show much in the way of kernel speed increases at the lower clock rate, but under the maximum processor clock, the improvements are much more significant.
I am used to NVVP telling me where inefficiencies occur in the code. Looking at these kernels under Nsight Compute, I am unable to get the Python tools to return any results or suggestions for optimizations. I’m wondering whether I am doing something wrong or whether there truly are no additional optimizations I can make. I am running this against a remote T4 device, with CentOS 7 and the CUDA 10.2 tools both locally and on the remote system.

For many metrics, their value is directly influenced by the current GPU SM and memory clock frequencies. For example, if a kernel instance is profiled that has prior kernel executions in the application, the GPU might already be in a higher clocked state and the measured kernel duration, along with other metrics, will be affected. Likewise, if a kernel instance is the first kernel to be launched in the application, GPU clocks will regularly be lower. In addition, due to kernel replay, the metric value might depend on which replay pass it is collected in, as later passes would result in higher clock states.

To mitigate this non-determinism, NVIDIA Nsight Compute attempts to limit GPU clock frequencies to their base value. As a result, metric values are less impacted by the location of the kernel in the application, or by the number of the specific replay pass.
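A quick way to confirm this behavior on your own system is to watch the SM and memory clocks from a second shell while a profile is running, for example with the standard nvidia-smi query options:

```
# Sample the current SM and memory clocks once per second while the profiler runs
nvidia-smi --query-gpu=clocks.sm,clocks.mem --format=csv -l 1
```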

However, in your case, since you are using nvidia-smi to increase the clock speed, you can adjust Nsight Compute's --clock-control option to specify whether the tool should fix any clock frequencies.
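For example, with the CUDA 10.2 command-line frontend (nv-nsight-cu-cli, renamed ncu in later releases) you can tell the tool not to touch the clocks at all, so the application clocks you set with nvidia-smi remain in effect during the replay passes. The application name and output file below are placeholders:

```
# Profile without the tool forcing clocks to their base value
nv-nsight-cu-cli --clock-control none -o my_profile ./my_app
```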

Also note that you can try using Nsight Systems instead of NVVP. The Nsight Systems GUI provides significant increases in responsiveness and scalability with the size of the profile, and you can visualize significantly more information at a glance from the timeline. Nsight Systems also enables a holistic view of the entire system (CPU, GPU, OS, runtime, and the workload itself), reflecting that real-world performance is multifaceted and not just a matter of making a single kernel go fast. This is all done with low-overhead profile collection and minimal perturbation. You can refer to the NVIDIA Developer Blog post Transitioning to Nsight Systems from NVIDIA Visual Profiler / nvprof.
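As a rough sketch of that workflow (the binary and report names are placeholders), a basic Nsight Systems collection from the command line looks like the following; the resulting report can then be opened in the Nsight Systems GUI:

```
# Collect a system-wide timeline (CUDA API calls, kernels, NVTX ranges); GPU clocks are not fixed
nsys profile -t cuda,nvtx -o my_report ./my_app
```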

I believe that the --clock-control option is only available in the CLI version of Nsight Compute.

The “Clock control” option is also available in the UI in the Connection dialog. For the “Profile” activity it is available under the “Other” options tab.
For the “Interactive Profile” activity it is available at the top level.