Nsight Compute's profiling result is different from nvprof's

I use nv-nsight-cu and nvprof to measure the running time of two kernels (kernel1 and kernel2).

With nvprof, the running times are (command: nvprof ./a.out):

kernel1: 96us
kernel2: 85us

With Nsight Compute, the running times are:

kernel1: 96us
kernel2: 57us

Moreover, the L2 => Device memory data reported by Nsight Compute seems weird.

My settings:
GPU: RTX 2070
CUDA: 10.0.130
Driver: 410.78
OS: Ubuntu 18.04

One possible reason for the discrepancy is that Nsight Compute invalidates all caches during data collection, while nvvp does not invalidate the L2 cache. That might also cause the differences in the numbers for L2 -> Device.

Can you please attach a report from each tool so that we can investigate further?

Here is the code I use:

#include <cuda_fp16.h>
#include <stdio.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

__global__ 
void stg_4_32(float* output){
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    float a = 1.0f;

    output[tid] = a;
}

__global__ 
void stg_2_32(half* output){
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    half a = __float2half(1.0f);

    output[tid] = a;
}

__global__ 
void warming_up(float* x){
    x[0] += 1.0f;
}

int main(){
    float* out_f;
    half*  out_h;
    float* dummy;

    const int num_t = 512;
    const int num_b = 4096 * 4;

    gpuErrchk(cudaMalloc((void**)&out_f, num_t * num_b * sizeof(float)));
    gpuErrchk(cudaMalloc((void**)&out_h, num_t * num_b * sizeof(half)));
    gpuErrchk(cudaMalloc((void**)&dummy, sizeof(float)));

    warming_up<<<1, 1>>>(dummy);
    
    stg_4_32<<<num_b, num_t>>>(out_f);
    stg_2_32<<<num_b, num_t>>>(out_h);
    gpuErrchk(cudaDeviceSynchronize());

    return 0;
}

Compile with

nvcc -arch=sm_75 main.cu
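
The binary can then be run under each profiler, for example (the Nsight Compute command-line tool is named nv-nsight-cu-cli in the CUDA 10.0 toolkit; the name may differ in other versions):

# example invocations; adjust the tool name/path to your install
nvprof ./a.out
nv-nsight-cu-cli ./a.out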

What I really want to know is: if I want to compare the performance of different kernels, which value should I trust?

Thanks

By default, nvprof does a concurrent kernel trace whereas nv-nsight-cu does a serial trace.

In a concurrent trace, the profiler incurs overhead proportional to the number of blocks launched by the kernel, whereas a serial trace is not affected by the block count.
Hence, the concurrent trace reports a slightly longer duration for a short-running kernel.

For your case, which has small kernels and no concurrent kernels, you can use the serial trace number.
nvprof also provides the option --concurrent-kernels off to switch to a serial trace, as shown below.
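
For example, running something along these lines should give timings closer to the serial-trace numbers from Nsight Compute:

# disable concurrent kernel tracing so nvprof measures each kernel serially
nvprof --concurrent-kernels off ./a.out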

Your answer makes a lot of sense to me. Thank you.