nsight-compute's profiling result is different from nvprof's

I used nv-nsight-cu and nvprof to measure the running time of two kernels (kernel1 and kernel2).

With nvprof (command: nvprof ./a.out), the running times are:

kernel1: 96us
kernel2: 85us

With Nsight Compute, the running times are:

kernel1: 96us
kernel2: 57us

Moreover, the L2 => Device memory data in Nsight Compute seems weird.

My settings:
GPU: RTX 2070
CUDA: 10.0.130
Driver: 410.78
OS: Ubuntu 18.04

One possible reason for the discrepancy is that Nsight Compute invalidates all caches during data collection, while nvvp does not invalidate the L2 cache. That might also cause the differences in the numbers for L2 -> Device.

Can you please attach a report from each tool so that we can investigate further?

Here is the code I use:

#include <cuda_fp16.h>
#include <stdio.h>
#include <stdlib.h>

// Abort with a readable message if a CUDA runtime call fails.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

// Each thread stores one 4-byte float.
__global__ void stg_4_32(float* output){
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    float a = 1.0f;

    output[tid] = a;
}

// Each thread stores one 2-byte half.
__global__ void stg_2_32(half* output){
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    half a = __float2half(1.0f);

    output[tid] = a;
}

// Tiny kernel launched first as a warm-up.
__global__ void warming_up(float* x){
    x[0] += 1.0f;
}

int main(){
    float* out_f;
    half*  out_h;
    float* dummy;

    const int num_t = 512;
    const int num_b = 4096 * 4;

    gpuErrchk(cudaMalloc((void**)&out_f, num_t * num_b * sizeof(float)));
    gpuErrchk(cudaMalloc((void**)&out_h, num_t * num_b * sizeof(half)));
    gpuErrchk(cudaMalloc((void**)&dummy, sizeof(float)));

    warming_up<<<1, 1>>>(dummy);
    stg_4_32<<<num_b, num_t>>>(out_f);
    stg_2_32<<<num_b, num_t>>>(out_h);

    // Wait for the kernels to finish and surface any launch or execution error.
    gpuErrchk(cudaDeviceSynchronize());

    return 0;
}

Compile with

nvcc -arch=sm_75 main.cu

What I really want to know is: if I want to compare the performance of different kernels, which value should I trust?


By default, nvprof does a concurrent kernel trace whereas nv-nsight-cu does a serial trace.

In concurrent trace, the profiler incurs some overhead which is proportional to the number of blocks launched by the kernel, whereas serial trace is not affected by the number of blocks in the kernel.
Hence, concurrent trace reports a slightly longer duration for a short-running kernel.

For your case, which has small kernels and no concurrent kernels, you can use the serial trace number.
nvprof also provides the option "--concurrent-kernels off" to switch to serial trace.

Your answer makes a lot of sense to me. Thank you.