Dear all,
I am getting incorrect memory statistics when profiling an application on both a P40 and a V100.
Below are the system details:
For P40 GPU:
CUDA version: CUDA 10.1
OS: Ubuntu 18.04
For V100 GPU:
CUDA version: CUDA 10.0
OS: Ubuntu 16.04
I am running a simple vector-addition program, modified so that all the arrays are allocated in pinned (page-locked), mapped host memory.
Below is the kernel code:
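__global__
void add(int n, float *x, float *y, float *z)
{
    // One thread per element
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // First access to z: store the sum
    if (i < n) z[i] = x[i] + y[i];
    // Second access to z: read it back and increment it
    if (i < n) z[i]++;
}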
And this is the main program, which allocates the memory on the host and passes the corresponding device pointers to the kernel.
int main() {
    int N = 1 << 10;
    float *x, *y, *z, *d_x, *d_y, *d_z;
    cudaDeviceReset();

    // Allocating memory on the host
    cudaHostAlloc((void **)&x, N * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&y, N * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc((void **)&z, N * sizeof(float), cudaHostAllocMapped);

    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
    }

    // Getting device pointers
    cudaHostGetDevicePointer((void **)&d_x, x, 0);
    cudaHostGetDevicePointer((void **)&d_y, y, 0);
    cudaHostGetDevicePointer((void **)&d_z, z, 0);
    cudaDeviceSynchronize();

    cudaProfilerStart();
    add<<<(N + 255) / 256, 256>>>(N, d_x, d_y, d_z);
    cudaProfilerStop();
    cudaDeviceSynchronize();

    cudaFreeHost(x);
    cudaFreeHost(y);
    cudaFreeHost(z);
    cudaDeviceReset();
}
Compiling:
nvcc -O0 -Xcicc -O0 -Xptxas -O0 -Xptxas -dlcm=cg -Xptxas -dscm=wb -o [OUTPUT_FILENAME] [FILENAME]
Inconsistent results obtained:
P40:
Loads and stores from system memory: Not available
Loads from device memory: 31.94 KB
Stores to device memory: 1.23 MB
(I can't understand why there are 1.23 MB of stores to device memory.)
V100:
Shows 0 B of loads and stores for system memory, device memory, L1 cache, and L2 cache.
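For reference, here is a rough back-of-the-envelope sketch of the traffic I would expect from this kernel. It counts only the accesses to x, y and z that appear in the source and assumes that none of them are merged or served from a cache. The names kN, kArrayBytes, expected_load_bytes and expected_store_bytes are made up for this illustration and are not part of the program above.

```cpp
#include <cstddef>

// Problem size and element type taken from the program above.
constexpr std::size_t kN = 1u << 10;                      // N = 1024 elements
constexpr std::size_t kArrayBytes = kN * sizeof(float);   // bytes per array

// Source-level accesses per thread (assumption: nothing is merged or cached):
//   z[i] = x[i] + y[i];  -> loads x[i] and y[i], stores z[i]
//   z[i]++;              -> loads z[i],          stores z[i]
constexpr std::size_t kLoadsPerThread = 3;
constexpr std::size_t kStoresPerThread = 2;

// The launch uses (N + 255) / 256 = 4 blocks of 256 threads, i.e. exactly
// N threads, so every element is touched by exactly one thread.
constexpr std::size_t expected_load_bytes() {
    return kLoadsPerThread * kArrayBytes;
}

constexpr std::size_t expected_store_bytes() {
    return kStoresPerThread * kArrayBytes;
}

// Compile-time checks of the arithmetic (assumes a 4-byte float).
static_assert(sizeof(float) == 4, "this estimate assumes 4-byte floats");
static_assert(kArrayBytes == 4096, "each array is 4 KB");
static_assert(expected_load_bytes() == 12288, "12 KB of loads in total");
static_assert(expected_store_bytes() == 8192, "8 KB of stores in total");
```

That comes to about 12 KB of loads and 8 KB of stores in total. Since the arrays are mapped host allocations, I would expect this to show up as system-memory traffic. The 1.23 MB of device-memory stores reported on the P40 is roughly 150 times the total I would expect the kernel to store anywhere.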
Thanks in advance.
Shweta