How to use nvprof --metrics gld_efficiency on RTX 2080 Ti

I want to measure global memory load efficiency, but when I run

nvprof --metrics gld_efficiency ./HellowWorld.exe

it shows

nvprof --metrics gld_throughput .\HellowWorld.exe 32 32
======== Warning: Skipping profiling on device 0 since profiling is not supported on devices with compute capability greater than 7.2
==12656== NVPROF is profiling process 12656, command: .\HellowWorld.exe 32 32
SumOnGPU Time Cost 6.138 ms
sumMatrixOnGPU2D <<<(512,512), (32,32)>>>

==12656== Profiling application: .\HellowWorld.exe 32 32
==12656== Profiling result:
No events/metrics were profiled.

How can I measure global memory load efficiency on the RTX 2080 Ti, whose compute capability is 7.5?

When I run it on the GTX 1080 Ti (another GPU in my computer), it shows:

nvprof --metrics gld_throughput .\HellowWorld.exe 32 32
======== Warning: Skipping profiling on device 0 since profiling is not supported on devices with compute capability greater than 7.2
==16568== NVPROF is profiling process 16568, command: .\HellowWorld.exe 32 32
==16568== Error: Internal profiling error 4292:1.
SumOnGPU Time Cost 18.090 ms
======== Error: CUDA profiling error.

OS Win10 x64
CUDA 10.0

In my code, I use

cudaSetDevice();

to select the GPU, but nvprof still shows "Skipping profiling on device 0". Should I hide the 2080 Ti with an environment variable?

If I recall a previous post here correctly, you should be using the Nsight profiler for compute capability above 7.2. My RTX 2080 Ti reports a compute capability of 7.5.

I eagerly await confirmation or correction from those here who really know this stuff.

Yes, to do metric gathering on kernels on Turing GPUs and beyond, you must use Nsight Compute. I recommend using the latest version in CUDA 10.1U1 or 10.1U2 (or whatever is the latest version).

Furthermore, these efficiency metrics are not currently available in the Nsight Compute tool. However, a comparable measure of global load efficiency is global load transactions per request. That metric is not available directly either, but it can be assembled from the available metrics:

l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum (transactions)
l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum (requests)

Capture both of these metrics, then divide transactions by requests. 100% efficiency corresponds to 4 transactions per request. A higher number of transactions per request (up to a maximum of 32) indicates reduced efficiency.
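The division described above can be sketched in a few lines of Python. This is only an illustration, not the definition nvprof used for gld_efficiency. It assumes each transaction is a 32-byte sector, that every warp has all 32 threads active, and that each thread loads 4 bytes (for example a float), which is where the ideal value of 4 transactions per request comes from. The function names and the sample counter values are hypothetical; substitute the two .sum values that Nsight Compute reports for your kernel.

```python
# Assumptions (not stated in the thread): 32-byte sectors, 32 active threads per warp.
SECTOR_BYTES = 32
WARP_SIZE = 32


def transactions_per_request(sectors: float, requests: float) -> float:
    """Ratio of the two Nsight Compute counters (sectors.sum / requests.sum)."""
    if sectors <= 0 or requests <= 0:
        raise ValueError("both counters must be positive")
    return sectors / requests


def ideal_transactions_per_request(bytes_per_thread: int = 4) -> float:
    """Sectors needed when a full warp reads one contiguous, aligned block."""
    return WARP_SIZE * bytes_per_thread / SECTOR_BYTES


def approx_load_efficiency(sectors: float, requests: float,
                           bytes_per_thread: int = 4) -> float:
    """Approximate efficiency in [0, 1]: ideal ratio divided by measured ratio."""
    measured = transactions_per_request(sectors, requests)
    return min(1.0, ideal_transactions_per_request(bytes_per_thread) / measured)


if __name__ == "__main__":
    # Hypothetical counter values, for illustration only.
    sectors, requests = 32768, 1024
    ratio = transactions_per_request(sectors, requests)
    eff = approx_load_efficiency(sectors, requests)
    print(f"transactions per request: {ratio:.2f}")
    print(f"approximate load efficiency: {eff:.1%}")
```

If the kernel loads wider types (for example 8-byte doubles), pass a different bytes_per_thread, since the ideal ratio is then no longer 4.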

Reference: Nsight Compute CLI, Nsight Compute Documentation