Nsight returning incorrect results

pandeyshweta2401 · August 13, 2019, 11:35am

Dear all,

I am getting incorrect memory statistics while running an application on both P40 and V100.

Below are the system details:

For P40 GPU:
CUDA version: CUDA 10.1
OS: Ubuntu 18.04

For V100 GPU:
CUDA version: CUDA 10.0
OS: Ubuntu 16.04

I am running a simple addition program, modified so that all the arrays are allocated in pinned (mapped) host memory.

Below is the kernel code:

__global__
void add(int n, float *x, float *y, float *z)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) z[i] = x[i] + y[i];
  if (i < n) z[i]++;
}

And this is the main program, which allocates the memory on the host side and passes device pointers to that memory to the kernel.

#include <cuda_profiler_api.h>  // declares cudaProfilerStart/cudaProfilerStop

int main() {
    int N = 1<<10;
    float *x, *y, *z, *d_x, *d_y, *d_z;

    cudaDeviceReset();
    //Allocating memory onto host
    cudaHostAlloc((void **)&x,  N*sizeof(float), cudaHostAllocMapped );
    cudaHostAlloc((void **)&y,  N*sizeof(float), cudaHostAllocMapped );
    cudaHostAlloc((void **)&z,  N*sizeof(float), cudaHostAllocMapped );
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    //Getting device pointer
    cudaHostGetDevicePointer((void **)&d_x, x, 0);
    cudaHostGetDevicePointer((void **)&d_y, y, 0);
    cudaHostGetDevicePointer((void **)&d_z, z, 0);

    cudaDeviceSynchronize();
    cudaProfilerStart();
    add<<<(N+255)/256, 256>>>(N, d_x, d_y, d_z);
    cudaProfilerStop();
    cudaDeviceSynchronize();

    cudaFreeHost(x);
    cudaFreeHost(y);
    cudaFreeHost(z);
    cudaDeviceReset();
}

Compiling:

nvcc -O0 -Xcicc -O0 -Xptxas -O0 -Xptxas -dlcm=cg -Xptxas -dscm=wb -o [OUTPUT_FILENAME] [FILENAME]

Inconsistent Results obtained:

P40:

Loads and Stores from System Memory: Not Available
Loads from Device Memory : 31.94 KB
Stores to Device Memory : 1.23 MB
(I can't understand why there are 1.23 MB of stores to device memory.)

V100:

Shows 0B of loads and stores from System memory, Device memory, L1 cache and L2 cache.

Thanks in advance.
Shweta

felix_dt · August 13, 2019, 12:56pm

Please let us know which exact version of Nsight Compute you are using to collect this data, as well as the display driver version. Which exact memory metrics are you referring to, and how do you obtain them (i.e. via the UI/Interactive Profile activity, the Nsight Compute CLI command line, …)? Do you get them from the memory workload analysis chart? Thank you.

pandeyshweta2401 · August 13, 2019, 5:38pm

Dear Felix,

For V100:
Nsight version : NsightCompute-1.0
Driver version : 410.79
Command used :
nv-nsight-cu-cli --section MemoryWorkloadAnalysis -f -o [OUTPUT_PROFILE] [EXECUTABLE]

For P40:
Nsight version : NsightCompute-2019.3
Driver Version: 418.67
Command used :
nv-nsight-cu-cli --section MemoryWorkloadAnalysis.* -f -o [OUTPUT_PROFILE] [EXECUTABLE]

I use the Nsight Compute CLI to obtain the metrics on both systems.
I am quoting the results from the memory workload analysis chart.

Thanks a lot!

felix_dt · August 15, 2019, 8:44am

Hi Shweta,

For the problem you are seeing on GV100, I believe the issue is due to the different versions of Nsight Compute you are using for collecting the data and likely for viewing it. I assume you open the GV100 report not with Nsight Compute 1.0, but with Nsight Compute 2019.3 (or later), is that correct?

The set of metrics used for Volta-architecture GPUs changed after Nsight Compute 1.0. The memory diagram in the UI, however, is not embedded in the report; the metrics it uses to show the data are defined by the UI. If you open a 1.0 report with a later UI, the UI looks for the new metric names, which the old report does not contain, and it cannot fill in the chart.

You can either re-collect the data using Nsight Compute 2019.3 or later on GV100, or use Nsight Compute 1.0 UI to view the report, or use any Nsight Compute CLI to read the report and print the raw metrics to the command line, using

nv-nsight-cu-cli --page raw -i <report>

Let me know if this helps, otherwise we’ll need to check more on this issue.
As for the problem you are seeing on P40, we’ll still need to look into this.

pandeyshweta2401 · August 20, 2019, 7:32am

Dear Felix,

Thanks for your suggestion. You were correct: the issue was caused by the difference between the Nsight Compute versions I was using to collect and to read the results. It now gives the expected results. I used Nsight Compute 2019.4 on both machines to collect and read the stats. Thanks again for your help.

Regards,
Shweta
