P100 global_hit_rate and and tex_cache_hit_rate

ddmusc · October 29, 2018, 8:41pm

In relation to another post I have opened, I would like to understand how the two topic mentioned metrics are calculated. I wrote this small kernel

__global__ void test(
        float * in, float2 *out)
{
        int x=threadIdx.x;
                float a=__ldg(in+x);
                atomicAdd(&out[x].x,a); //      out[x]=(float)(x*j);
                atomicAdd(&out[x].y,a);
}

and executed over a P100 with 1 block and 32 threads (ie one warp)

The sass looks like

code for sm_60
                Function : _Z4testPfP6float2
        .headerflags    @"EF_CUDA_SM60 EF_CUDA_PTX_SM(EF_CUDA_SM60)"
                                                                                /* 0x083fc400e3e007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];                        /* 0x4c98078000870001 */
        /*0010*/                   S2R R0, SR_TID.X;                            /* 0xf0c8000002170000 */
        /*0018*/                   ISCADD R4.CC, R0.reuse, c[0x0][0x140], 0x2;  /* 0x4c18810005070004 */
                                                                                /* 0x001dc400fc4007ec */
        /*0028*/                   SHR R5, R0, 0x1e;                            /* 0x3829000001e70005 */
        /*0030*/                   IADD.X R5, R5, c[0x0][0x144];                /* 0x4c10080005170505 */
        /*0038*/                   LDG.E.CI R4, [R4];                           /* 0xeed4a00000070404 */
                                                                                /* 0x001f8800fd8207f1 */
        /*0048*/                   ISCADD R2.CC, R0.reuse, c[0x0][0x148], 0x3;  /* 0x4c18818005270002 */
        /*0050*/                   SHR R0, R0, 0x1d;                            /* 0x3829000001d70000 */
        /*0058*/                   IADD.X R3, R0, c[0x0][0x14c];                /* 0x4c10080005370003 */
                                                                                /* 0x001ffc001e2040f1 */
        /*0068*/                   RED.E.ADD.F32.FTZ.RN [R2], R4;               /* 0xebf9000000370204 */
        /*0070*/                   RED.E.ADD.F32.FTZ.RN [R2+0x4], R4;           /* 0xebf9000040370204 */
        /*0078*/                   EXIT;                                        /* 0xe30000000007000f */
                                                                                /* 0x001f8000fc0007ff */
        /*0088*/                   BRA 0x80;                                    /* 0xe2400fffff07000f */
        /*0090*/                   NOP;                                         /* 0x50b0000000070f00 */
        /*0098*/                   NOP;                                         /* 0x50b0000000070f00 */
                                                                                /* 0x001f8000fc0007e0 */
        /*00a8*/                   NOP;                                         /* 0x50b0000000070f00 */
        /*00b0*/                   NOP;                                         /* 0x50b0000000070f00 */
        /*00b8*/                   NOP;                                         /* 0x50b0000000070f00 */
                ............................

When profiling I get this value as Hit rates

      1                           global_hit_rate                     Global Hit Rate in unified l1/tex      50.00%      50.00%      50.00%
      1                        tex_cache_hit_rate                                Unified Cache Hit Rate      83.33%      83.33%      83.33%
      1                      l2_tex_read_hit_rate                           L2 Hit Rate (Texture Reads)       0.00%       0.00%       0.00%

I can’t understand these rates. For example, why global_hit_rate is 50%. There is only one load, it should be 0%. AtomicAdds are clearly affecting the tex_cache_hit_rate but I can’t find any reason why it should be 83.33%

Any help to clear up the matter, really appreciated

Greg · November 1, 2018, 5:39am

If I run a slightly modified version of vector add with 10,000,000 elements and the kernel

__global__ void
test(const float* __restrict__ in, float2 *out, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
		float a = in[i];
        atomicAdd(&out[i].x, a);
        atomicAdd(&out[i].y, a);
    }
}

I have similar access pattern and the const restrict results in LDG.E.CI.

Nsight Compute on GTX1080 (GP104) PerfWorks metrics report

tex__hitrate_pct 0%

lts__request_total_sectors_hitrate_pct 40%

NVPROF on GTX1080 reports

global_hit_rate 0.00%

tex_cache_hit_rate 83.33%

l2_tex_read_hit_rate 0.00%

The Nsight Compute metrics are correct. This is collect using a completely different library.

The NVPROF l1tex_cache_hit_rate needs to be investigated.
The NVPROF l2_tex_read_hit_rate does not include REDs in the calculation. The read portion implies on loads. As such the 0% is correct.
The PerfWorks lts__request_total_sectors_hitrate_pct includes all L2 access (display engine, loads, stores, atomics, reductions, etc.).

ddmusc · November 1, 2018, 6:56pm

Greg @ NV:

If I run a slightly modified version of vector add with 10,000,000 elements and the kernel
__global__ void
test(const float* __restrict__ in, float2 *out, int numElements)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements)
    {
		float a = in[i];
        atomicAdd(&out[i].x, a);
        atomicAdd(&out[i].y, a);
    }
}
I have similar access pattern and the const restrict results in LDG.E.CI.

Nsight Compute on GTX1080 (GP104) PerfWorks metrics report

tex__hitrate_pct 0%

lts__request_total_sectors_hitrate_pct 40%

NVPROF on GTX1080 reports

global_hit_rate 0.00%

tex_cache_hit_rate 83.33%

l2_tex_read_hit_rate 0.00%

The Nsight Compute metrics are correct. This is collect using a completely different library.

The NVPROF l1tex_cache_hit_rate needs to be investigated.
The NVPROF l2_tex_read_hit_rate does not include REDs in the calculation. The read portion implies on loads. As such the 0% is correct.
The PerfWorks lts__request_total_sectors_hitrate_pct includes all L2 access (display engine, loads, stores, atomics, reductions, etc.).

Thanks for the help Greg. I tried your kernel on the P100 and still the global_hit_rate is still 50%! These seems to be some kind of bug isn’t it? I can’t use Nsight since it is not supported on the P100:(

ddmusc · November 3, 2018, 8:53pm

Isn’t The P100 using the GP100 processor? because in the Release notes (2.2) it says that it is not supported!! https://docs.nvidia.com/nsight-compute/ReleaseNotes/index.html#gpu-support.

Is it a mistake in docs? Can’t test this for the next few days:(

Robert_Crovella · November 3, 2018, 9:34pm

Sorry, I was able to run nsight compute successfully on my Tesla P100, so I assumed it was supported. I retract my previous statement.

ddmusc · November 4, 2018, 12:04am

UUM but that means it works for the P100. Is there a way how to ask Nvidia to tell me what the risks are in using NSight on a P100? Maybe it is just a missed out change in documentation because GP10x are supported

ddmusc · November 4, 2018, 12:09am

Interestingly enough I found out that sidghtVisual Studio supports the P100. https://developer.nvidia.com/nsight-visual-studio-edition-supported-gpus-full-list. Does this product use the same profiler as the Nsight Compute?

Regards
Daniel

Topic		Replies	Views
L2 cache rate profiled in nsight compute is confused Nsight Compute	5	3090	July 3, 2024
Nsight and nvprof results have large differences Nsight Compute	9	1182	November 26, 2019
Using Nsight Compute to Inspect your Kernels Technical Blog	2	1685	August 31, 2020
Weird Number for L2 Cache Hitrate Nsight Compute nsight	1	1399	April 25, 2020
P100 unified cache behaviour and how to disable it CUDA Programming and Performance	8	2035	October 25, 2018
bandwidthTest example throws cudaErrorCallRequiresNewerDriver error when launched via nv-nsight-cu-cli Linux nsight , driver	4	573	January 11, 2024
L2 cache hit rate of a streaming kernel is not as expected profiled in ncu CUDA Programming and Performance nsight	2	934	March 22, 2023
Quadro GP100 and Quadro GV100 Support in 2020.1? Nsight Graphics	8	771	October 12, 2021
Can't Get NCU GUI To Import Properly Nsight Compute	8	1349	October 5, 2020
bandwidthTest example throws cudaErrorCallRequiresNewerDriver error when launched via nv-nsight-cu-cli Nsight Compute linux , driver	17	1324	February 9, 2024

P100 global_hit_rate and and tex_cache_hit_rate

Related topics