Example CUDA program very slow on GTX 1070

Hello all,

I am running the code from the example https://developer.nvidia.com/blog/even-easier-introduction-cuda/. nvprof tells me this example's add kernel takes about 3.5ms to run, while the author got 680us on his MacBook's GT 750M. 3.5ms does not seem right for a discrete desktop 1070 GPU.

Attached below is the code I am using; I copied it essentially verbatim from the example:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;

  // Run kernel on 1M elements on the GPU
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

I am using the following command to compile:

/usr/local/cuda-11.6/bin/nvcc add.cu -g -o cudaAdd

My results from nvprof:

matthew@sonOfAnton:~/cudaSandbox/simpleCudaExample$ /usr/local/cuda-11/bin/nvprof ./cudaAdd
==3514== NVPROF is profiling process 3514, command: ./cudaAdd
Max error: 0
==3514== Profiling application: ./cudaAdd
==3514== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  4.5185ms         1  4.5185ms  4.5185ms  4.5185ms  add(int, float*, float*)
      API calls:   95.96%  133.33ms         2  66.663ms  100.06us  133.23ms  cudaMallocManaged
                    3.25%  4.5196ms         1  4.5196ms  4.5196ms  4.5196ms  cudaDeviceSynchronize
                    0.47%  655.75us         2  327.88us  281.24us  374.51us  cudaFree
                    0.25%  342.70us       101  3.3930us     410ns  200.93us  cuDeviceGetAttribute
                    0.03%  43.863us         1  43.863us  43.863us  43.863us  cudaLaunchKernel
                    0.03%  41.799us         1  41.799us  41.799us  41.799us  cuDeviceGetName
                    0.00%  6.0910us         1  6.0910us  6.0910us  6.0910us  cuDeviceGetPCIBusId
                    0.00%  4.1280us         3  1.3760us     711ns  2.3750us  cuDeviceGetCount
                    0.00%  2.2040us         2  1.1020us     420ns  1.7840us  cuDeviceGet
                    0.00%     832ns         1     832ns     832ns     832ns  cuDeviceTotalMem
                    0.00%     742ns         1     742ns     742ns     742ns  cuDeviceGetUuid

==3514== Unified Memory profiling result:
Device "NVIDIA GeForce GTX 1070 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
      82  99.902KB  4.0000KB  996.00KB  8.000000MB  974.4760us  Host To Device
      24  170.67KB  4.0000KB  0.9961MB  4.000000MB  341.4110us  Device To Host
      12         -         -         -           -  4.460260ms  Gpu page fault groups
Total CPU Page faults: 36

Data from running nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03    Driver Version: 510.47.03    CUDA Version: 11.6     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:07:00.0  On |                  N/A |
| 28%   29C    P8     6W / 151W |     37MiB /  8192MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1065      G   /usr/lib/xorg/Xorg                 35MiB |
+-----------------------------------------------------------------------------+

Any help would be appreciated, thank you! :)

Answered my own question. This blog post explains it fairly well, but here's the TL;DR version. On a Pascal GPU (10 series) or later, cudaMallocManaged memory is not populated on the device up front; the kernel's first touch of each page triggers a GPU page fault and on-demand migration, which is exactly what the "Gpu page fault groups" line in the nvprof output above shows (~4.5ms, essentially the entire kernel time). To get full performance you need to take an extra step and prefetch the managed memory to the GPU before launching the kernel. Not sure if the solution is backwards compatible with Maxwell GPUs, particularly the Jetson Nano.
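For anyone who lands here later, the fix is roughly the sketch below (untested as written here; it follows the documented cudaMemPrefetchAsync API and slots into the add.cu from my original post, between the host-side initialization loop and the kernel launch):

```cuda
// Sketch: prefetch the managed arrays to the GPU before the kernel launch,
// so the add kernel does not pay for demand-paging on first touch.
// x, y, and N are the same variables as in add.cu above.
int device = -1;
cudaGetDevice(&device);

// Migrate both arrays to the current device on the default stream.
cudaMemPrefetchAsync(x, N * sizeof(float), device, NULL);
cudaMemPrefetchAsync(y, N * sizeof(float), device, NULL);

// Launch as before; the pages are already resident on the GPU.
add<<<numBlocks, blockSize>>>(N, x, y);
cudaDeviceSynchronize();
```

With the prefetch in place, the "Gpu page fault groups" entry should disappear from the nvprof unified-memory section and the kernel time should drop to the microsecond range the blog post reports. On pre-Pascal devices (which lack concurrentManagedAccess), cudaMemPrefetchAsync may simply return an error rather than migrate, so check the return codes if you need to support those.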
