P100 global_hit_rate and tex_cache_hit_rate

In relation to another post I have opened, I would like to understand how the two metrics mentioned in the title are calculated. I wrote this small kernel

__global__ void test(float *in, float2 *out)
{
    int x = threadIdx.x;
    float a = __ldg(in + x);   // force the load through the read-only (tex) path
    atomicAdd(&out[x].x, a);
    atomicAdd(&out[x].y, a);   // the SASS below shows two RED instructions
}

and executed it on a P100 with 1 block of 32 threads (i.e., one warp).

The SASS looks like this:

code for sm_60
                Function : _Z4testPfP6float2
        .headerflags    @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"
                                                                                /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];                        /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X;                            /* 0xf0c8000002170000 */
        /*0018*/                   ISCADD R4.CC, R0.reuse, c[0x0][0x140], 0x2;  /* 0x4c18810005070004 */
                                                                                /* 0x001dc400fc4007ec */
        /*0028*/                   SHR R5, R0, 0x1e;                            /* 0x3829000001e70005 */
        /*0030*/                   IADD.X R5, R5, c[0x0][0x144];                /* 0x4c10080005170505 */
        /*0038*/                   LDG.E.CI R4, [R4];                           /* 0xeed4a00000070404 */
                                                                                /* 0x001f8800fd8207f1 */
        /*0048*/                   ISCADD R2.CC, R0.reuse, c[0x0][0x148], 0x3;  /* 0x4c18818005270002 */
        /*0050*/                   SHR R0, R0, 0x1d;                            /* 0x3829000001d70000 */
        /*0058*/                   IADD.X R3, R0, c[0x0][0x14c];                /* 0x4c10080005370003 */
                                                                                /* 0x001ffc001e2040f1 */
        /*0068*/                   RED.E.ADD.F32.FTZ.RN [R2], R4;               /* 0xebf9000000370204 */
        /*0070*/                   RED.E.ADD.F32.FTZ.RN [R2+0x4], R4;           /* 0xebf9000040370204 */
        /*0078*/                   EXIT;                                        /* 0xe30000000007000f */
                                                                                /* 0x001f8000fc0007ff */
        /*0088*/                   BRA 0x80;                                    /* 0xe2400fffff07000f */
        /*0090*/                   NOP;                                         /* 0x50b0000000070f00 */
        /*0098*/                   NOP;                                         /* 0x50b0000000070f00 */
                                                                                /* 0x001f8000fc0007e0 */
        /*00a8*/                   NOP;                                         /* 0x50b0000000070f00 */
        /*00b0*/                   NOP;                                         /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                                         /* 0x50b0000000070f00 */

When profiling, nvprof reports these hit rates:

      1                           global_hit_rate                     Global Hit Rate in unified l1/tex      50.00%      50.00%      50.00%
      1                        tex_cache_hit_rate                                Unified Cache Hit Rate      83.33%      83.33%      83.33%
      1                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)       0.00%       0.00%       0.00%

I can’t understand these rates. For example, why is global_hit_rate 50%? There is only one load, so it should be 0%. The atomicAdds are clearly affecting tex_cache_hit_rate, but I can’t find any reason why it should be 83.33%.

Any help to clear up the matter, really appreciated

If I run a slightly modified version of vector add with 10,000,000 elements and this kernel:

__global__ void
test(const float* __restrict__ in, float2 *out, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
        float a = in[i];
        atomicAdd(&out[i].x, a);
        atomicAdd(&out[i].y, a);
    }
}

I get a similar access pattern, and the const __restrict__ qualifiers again result in LDG.E.CI.

Nsight Compute on a GTX 1080 (GP104) reports these PerfWorks metrics:

  • tex__hitrate_pct 0%
  • lts__request_total_sectors_hitrate_pct 40%

NVPROF on the GTX 1080 reports:

  • global_hit_rate 0.00%
  • tex_cache_hit_rate 83.33%
  • l2_tex_read_hit_rate 0.00%

The Nsight Compute metrics are correct. They are collected using a completely different library.

    The NVPROF tex_cache_hit_rate needs to be investigated.
    The NVPROF l2_tex_read_hit_rate does not include REDs in the calculation. The "read" portion applies only to loads; as such, the 0% is correct.
    The PerfWorks lts__request_total_sectors_hitrate_pct includes all L2 accesses (display engine, loads, stores, atomics, reductions, etc.).

    Thanks for the help, Greg. I tried your kernel on the P100 and the global_hit_rate is still 50%! This seems to be some kind of bug, doesn’t it? I can’t use Nsight Compute since it is not supported on the P100 :(

    Isn’t the P100 using the GP100 processor? Because the Release Notes (2.2) say it is not supported! https://docs.nvidia.com/nsight-compute/ReleaseNotes/index.html#gpu-support

    Is it a mistake in the docs? I can’t test this for the next few days :(

    Sorry — I was able to run Nsight Compute successfully on my Tesla P100, so I assumed it was supported. I retract my previous statement.

    Hmm, but that means it does work on the P100. Is there a way to ask NVIDIA what the risks are of using Nsight Compute on a P100? Maybe it is just an overlooked update to the documentation, since GP10x parts are supported.

    Interestingly enough, I found out that Nsight Visual Studio Edition supports the P100: https://developer.nvidia.com/nsight-visual-studio-edition-supported-gpus-full-list. Does this product use the same profiler as Nsight Compute?