Nsight Compute returning incorrect memory statistics

Dear all,

I am getting incorrect memory statistics while profiling an application on both a P40 and a V100.

Below are the system details:

For P40 GPU:
CUDA version: 10.1
OS: Ubuntu 18.04

For V100 GPU:
CUDA version: 10.0
OS: Ubuntu 16.04


I am running a simple vector-addition program, modified so that all the arrays are allocated as mapped (zero-copy) pinned memory on the host.

Below is the kernel code:

#include <cuda_runtime.h>
#include <cuda_profiler_api.h>  // needed for cudaProfilerStart/Stop below

__global__
void add(int n, float *x, float *y, float *z)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) z[i] = x[i] + y[i];  // first store to z
  if (i < n) z[i]++;              // read-modify-write of z
}

And this is the main program, which allocates the arrays on the host and obtains device pointers to the same memory:

int main() {
    int N = 1<<10;
    float *x, *y, *z, *d_x, *d_y, *d_z;

    cudaDeviceReset();
    // Allocate mapped (zero-copy) pinned memory on the host
    cudaHostAlloc((void **)&x,  N*sizeof(float), cudaHostAllocMapped );
    cudaHostAlloc((void **)&y,  N*sizeof(float), cudaHostAllocMapped );
    cudaHostAlloc((void **)&z,  N*sizeof(float), cudaHostAllocMapped );
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Get device pointers to the mapped host allocations
    cudaHostGetDevicePointer((void **)&d_x, x, 0);
    cudaHostGetDevicePointer((void **)&d_y, y, 0);
    cudaHostGetDevicePointer((void **)&d_z, z, 0);

    cudaDeviceSynchronize();
    cudaProfilerStart();
    add<<<(N+255)/256, 256>>>(N, d_x, d_y, d_z);
    cudaProfilerStop();
    cudaDeviceSynchronize();

    cudaFreeHost(x);
    cudaFreeHost(y);
    cudaFreeHost(z);
    cudaDeviceReset();
}
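Side note: I am assuming both boards support mapped pinned memory out of the box. On systems without unified addressing, mapping must be enabled before the first allocation; below is a minimal sketch of such a check (the helper name check_mapped_support is mine, just for illustration):

#include <cstdio>
#include <cuda_runtime.h>

// Verify the device can map host memory and enable mapping before
// any cudaHostAlloc call (required on pre-unified-addressing setups).
int check_mapped_support(int dev)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    if (!prop.canMapHostMemory) {
        fprintf(stderr, "Device %d cannot map host memory\n", dev);
        return -1;
    }
    cudaSetDeviceFlags(cudaDeviceMapHost);  // must precede mapped allocations
    return 0;
}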

Compiling:

nvcc -O0 -Xcicc -O0 -Xptxas -O0 -Xptxas -dlcm=cg -Xptxas -dscm=wb -o [OUTPUT_FILENAME] [FILENAME]
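(For reference, as I understand the ptxas flags: -dlcm=cg sets the default load cache modifier to cache-global, so global loads bypass L1, and -dscm=wb selects write-back caching for stores.)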


Inconsistent results obtained:

P40:

Loads and Stores from System Memory: Not Available
Loads from Device Memory: 31.94 KB
Stores to Device Memory: 1.23 MB
(I can't understand why there are 1.23 MB of stores to Device Memory; see the size estimate below.)

V100:

Shows 0 B of loads and stores for System Memory, Device Memory, the L1 cache, and the L2 cache.
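For scale: with N = 1<<10, each array is 1024 * 4 B = 4 KB. The kernel reads x and y once each (roughly 8 KB of loads) and writes z twice, re-reading it once for the increment (a few KB more), so both the 1.23 MB of stores on the P40 and the all-zero counters on the V100 look far off from what the code should generate.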


Thanks in advance.
Shweta

Please let us know which exact version of Nsight Compute you are using to collect this data, as well as the display driver version. For the memory metrics you are mentioning, which exact metrics are you referring to, and how do you obtain them (i.e. via the UI/Interactive Profile activity, the Nsight Compute CLI, …)? Do you get them from the memory workload analysis chart? Thank you.

Dear Felix,

For V100:
Nsight version: NsightCompute-1.0
Driver version: 410.79
Command used:
nv-nsight-cu-cli --section MemoryWorkloadAnalysis -f -o [OUTPUT_PROFILE] [EXECUTABLE]

For P40:
Nsight version: NsightCompute-2019.3
Driver version: 418.67
Command used:
nv-nsight-cu-cli --section MemoryWorkloadAnalysis.* -f -o [OUTPUT_PROFILE] [EXECUTABLE]

On both systems I obtain the metrics with the Nsight Compute CLI, and I am quoting the results from the Memory Workload Analysis chart.

Thanks a lot!

Hi Shweta,

For the problem you are seeing on GV100, I believe the issue is the different versions of Nsight Compute you are using for collecting the data and for viewing it. I assume you open the GV100 report not with Nsight Compute 1.0 but with Nsight Compute 2019.3 (or later); is that correct?

The set of metrics used for Volta-architecture GPUs changed after Nsight Compute 1.0. However, the memory diagram in the UI is not embedded in the report; the metrics it uses to populate the chart are defined by the UI itself. If you open the 1.0 report with a later UI, it looks for the wrong (new) metric names and cannot fill in the chart.

You can either re-collect the data with Nsight Compute 2019.3 or later on the GV100, open the report with the Nsight Compute 1.0 UI, or use any Nsight Compute CLI to read the report and print the raw metrics to the command line, using

nv-nsight-cu-cli --page raw -i <report>
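If it helps, you can also collect the relevant counters directly and bypass the chart definitions entirely, for example (metric names per the 2019.x naming scheme; they differ in 1.0):

nv-nsight-cu-cli --metrics dram__bytes_read.sum,dram__bytes_write.sum [EXECUTABLE]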

Let me know if this helps; otherwise we'll need to check more on this issue.
As for the problem you are seeing on the P40, we still need to look into it.

Dear Felix,

Thanks for your suggestion. You were correct: the issue was the difference between the Nsight Compute versions I was using to collect and to read the results. I used Nsight Compute 2019.4 on both machines to collect and read the stats, and it now gives the expected results. Thanks again for your help.

Regards,
Shweta