Some questions about nvprof metrics (global_hit_rate, stall_constant_memory_dependency, etc.)

GPU: Tesla V100-PCIE-16GB
CUDA Version: 9.0.176

#include <stdio.h>
#include <stdlib.h>

__global__ void saxpy(int n, float a, float *x, float *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) y[i] = a*x[i] + y[i];
}

int main(void)
{
  int N = 1000;
  float *x, *y, *d_x, *d_y;
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));

  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  // Perform SAXPY on N elements
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y);

  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = max(maxError, abs(y[i]-4.0f));
  printf("Max error: %f\n", maxError);

  cudaFree(d_x);
  cudaFree(d_y);
  free(x);
  free(y);
}
compile cmd: nvcc -arch=sm_70 -o cuda
run cmd: nvprof --print-gpu-trace --metrics all ./cuda
Some results (Min / Max / Avg):

Metric                             Description                                Min      Max      Avg
global_hit_rate                    Global Hit Rate in unified L1/TEX          33.33%   33.33%   33.33%
stall_constant_memory_dependency   Issue Stall Reasons (Immediate constant)   40.34%   40.34%   40.34%

I have some questions about these results:

1. From the CUDA C Programming Guide, I got the information that global memory on Tesla V100 (Compute Capability 7.x) does not use the L1/TEX cache, so why is global_hit_rate 33.33%? Should this metric instead be the hit rate in the L2 cache?
2. The saxpy kernel uses no constant memory, so why is stall_constant_memory_dependency not 0%?

Hi, yueye

I have posted your question to the NVIDIA developer forum; I will let you know as soon as there is any response.

Thanks!

Thanks! Waiting for your reply!

Global Hit Rate

  • All global, local, surface, and texture instructions use the L1TEX cache.
  • y[i] = a*x[i] + y[i]; compiles to:

        /*00a0*/                   LDG.E.SYS R2, [R2] ;  // load x[i], always misses
        /*00b0*/                   LDG.E.SYS R0, [R4] ;  // load y[i], always misses
        /*00d0*/                   STG.E.SYS [R4], R6 ;  // store y[i], always hits

The store always hits in the L1 cache because the load of y[i] has just brought the cache line in. One hit out of three global accesses per thread gives the reported hit rate of 33.33%.

All references of the form c[bank][offset] in the SASS below are immediate constant accesses; on Volta the kernel parameters (n, a, x, y) are passed in constant bank 0, e.g. c[0x0][0x160]. The launch dimensions you are using are extremely small, so the stall rate is very high: you are only launching 32 warps. Increase N to 10,000,000 and the stalls due to IMC will approach 0. On each SM, the first reference to each constant misses; once the value is in the local IMC cache, the rest of the warps hit.

N          smsp__warp_cycles_per_issue_stall_imc_miss [cycles/issue]
1000       23.29
100000     23.23
1000000     2.57
10000000    0.04
Full SASS for the saxpy kernel (note the c[0x0][...] immediate-constant operands):

        /*0000*/                   MOV R1, c[0x0][0x28] ;
        /*0010*/                   NOP;
        /*0020*/                   S2R R4, SR_CTAID.X ;
        /*0030*/                   S2R R2, SR_TID.X ;
        /*0040*/                   IMAD R4, R4, c[0x0][0x0], R2 ;
        /*0050*/                   ISETP.GE.AND P0, PT, R4, c[0x0][0x160], PT ;
        /*0060*/               @P0 EXIT ;
        /*0070*/                   MOV R5, 0x4 ;
        /*0080*/                   IMAD.WIDE R2, R4.reuse, R5.reuse, c[0x0][0x168] ;
        /*0090*/                   IMAD.WIDE R4, R4, R5, c[0x0][0x170] ;
        /*00a0*/                   LDG.E.SYS R2, [R2] ;
        /*00b0*/                   LDG.E.SYS R0, [R4] ;
        /*00c0*/                   FFMA R6, R2, c[0x0][0x164], R0 ;
        /*00d0*/                   STG.E.SYS [R4], R6 ;
        /*00e0*/                   EXIT ;
        /*00f0*/                   BRA `(.L_1);