An Even Easier Introduction to CUDA

Hello, I’m new to CUDA and also new to C++. I’m having a weird issue with the sample: the GPU code is about 3x slower than the CPU version. Here is the code (the only change I made was int N = 1<<20; to int N = 1<<25;).

CPU CODE :

#include <iostream>
#include <math.h>

using namespace std;


// function to add the elements of two arrays
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
      y[i] = x[i] + y[i];
}



int main(void)
{
  int N = 1<<25; // 32M elements

  std::cout << "Iterations : " << N << "\n";

  float *x = new float[N];
  float *y = new float[N];

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run add() on 32M elements on the CPU
  add(N, x, y);

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  delete [] x;
  delete [] y;

  return 0;
}

GPU CODE :

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<25;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 32M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

GPU : RTX4800

compile commands :

GPU code :

 nvcc .\test2.cu -o test2 -arch=sm_90a

CPU code :

  nvcc .\test.cu -o test 

CPU RESULT :

PS C:\Users\llefe\CUDA> Measure-command {.\test.exe}


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 0
Milliseconds      : 892
Ticks             : 8920678
TotalDays         : 1,03248587962963E-05
TotalHours        : 0,000247796611111111
TotalMinutes      : 0,0148677966666667
TotalSeconds      : 0,8920678
TotalMilliseconds : 892,0678

GPU RESULT :

PS C:\Users\llefe\CUDA> Measure-command {.\test2.exe}


Days              : 0
Hours             : 0
Minutes           : 0
Seconds           : 2
Milliseconds      : 882
Ticks             : 28823809
TotalDays         : 3,33608900462963E-05
TotalHours        : 0,000800661361111111
TotalMinutes      : 0,0480396816666667
TotalSeconds      : 2,8823809
TotalMilliseconds : 2882,3809
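Note that Measure-Command times the whole process, which for the GPU binary includes CUDA context creation and unified-memory page migration, not just the add() call. A minimal sketch of timing only the kernel with CUDA events (standard cudaEvent runtime-API usage, not code from the original post; the kernel is the blog's grid-stride form):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void add(int n, float *x, float *y)
{
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<25;
  float *x, *y;
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;

  // Record events around the launch; both land in stream 0,
  // so the elapsed time brackets exactly the kernel.
  cudaEventRecord(start);
  add<<<numBlocks, blockSize>>>(N, x, y);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);   // waits for the kernel as well

  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  printf("add() kernel time: %f ms\n", ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(x);
  cudaFree(y);
  return 0;
}
```

This separates the kernel's cost from process startup, which on Windows can easily dominate a sub-second workload.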

Posting in case anyone else runs into the issue where increasing the number of blocks beyond 1 fails to improve the profiling / timing reported for the add() call. Comparing the code in the post with the vectorAdd example in the cuda-samples repo, I noticed that the calls for memory allocation were different. In the blog post, it was cudaMallocManaged, whereas in the vectorAdd example, it was cudaMalloc and cudaMemcpy. This made all the difference to the profiler, and, with the latter memory strategy, I saw the expected performance pickup.

==186921== Profiling application: ./add_test
==186921== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   59.39%  1.2993ms         2  649.65us  628.58us  670.72us  [CUDA memcpy HtoD]
                   39.16%  856.87us         1  856.87us  856.87us  856.87us  [CUDA memcpy DtoH]
                    1.45%  31.712us         1  31.712us  31.712us  31.712us  add(int, float*, float*)
==186266== Profiling application: ./add_grid
==186266== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:  100.00%  3.0770ms         1  3.0770ms  3.0770ms  3.0770ms  add(int, float*, float*)
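For reference, the explicit-copy pattern from the vectorAdd sample looks roughly like this as a complete program (a condensed sketch in the cuda-samples style, with error checking omitted for brevity):

```cuda
#include <iostream>
#include <math.h>
#include <cuda_runtime.h>

__global__ void add(int n, float *x, float *y)
{
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<25;

  // Plain host allocations instead of cudaMallocManaged
  float *x = new float[N];
  float *y = new float[N];
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  // Separate device allocations
  float *d_x, *d_y;
  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));

  // Bulk copies up front, so the kernel never page-faults
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);

  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, d_x, d_y);

  // Copy the result back for the host-side check
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);

  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  cudaFree(d_x);
  cudaFree(d_y);
  delete [] x;
  delete [] y;
  return 0;
}
```

The copies still cost time, but they show up as separate memcpy entries in the profile rather than being folded into the kernel's reported duration, which matches the two profiler outputs above.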

I’m very new to CUDA, and certainly wouldn’t know if things have evolved since the 2017 post, but is this the expected behavior in late 2023?

thanks!
Dustin

So I’m trying to run the Colab notebook linked from this course (direct link to the Colab notebook), and the GPU parts just aren’t working. The CPU program works fine, but all of the GPU programs (without any changes to the code) end up with a maxError of 1, and if you print out the contents of y it’s all just 2, even after the kernel has been run and the synchronisation called. It feels like the kernel either isn’t actually getting run, or the changes made on the GPU aren’t being reflected on the CPU.
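One way to tell those two cases apart is to ask the runtime whether the launch failed, since a failed launch (for example, a binary compiled for the wrong -arch for the notebook's GPU) is otherwise silent and leaves y untouched. A minimal self-contained sketch using the standard error-query calls (not code from the notebook):

```cuda
#include <iostream>
#include <cuda_runtime.h>

__global__ void add(int n, float *x, float *y)
{
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  add<<<(N + 255) / 256, 256>>>(N, x, y);

  // A launch that never runs (e.g. wrong -arch) reports here:
  cudaError_t launchErr = cudaGetLastError();
  if (launchErr != cudaSuccess)
    std::cerr << "launch failed: " << cudaGetErrorString(launchErr) << "\n";

  // Errors that occur while the kernel executes surface on the sync:
  cudaError_t syncErr = cudaDeviceSynchronize();
  if (syncErr != cudaSuccess)
    std::cerr << "sync failed: " << cudaGetErrorString(syncErr) << "\n";

  std::cout << "y[0] = " << y[0] << " (expect 3)" << std::endl;
  cudaFree(x);
  cudaFree(y);
  return 0;
}
```

If y stays at 2 and the launch error prints something like "no kernel image is available for execution on the device", the kernel genuinely never ran.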

I had to change the mem-prefetch code a little bit for CUDA 13.0:
cudaMemLocation hostLoc;
hostLoc.type = cudaMemLocationTypeHost;
cudaMemPrefetchAsync(x, N*sizeof(float), hostLoc, 0);
cudaMemPrefetchAsync(y, N*sizeof(float), hostLoc, 0);
But I didn’t notice a significant difference after adding prefetching.
Without prefetching:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Category Operation
75.4 1381541 1 1381541.0 1381541.0 1381541 1381541 0.0 CUDA_KERNEL add(int, float *, float *)
20.3 371721 139 2674.3 959.0 767 32288 5376.0 MEMORY_OPER [CUDA memcpy Unified Host-to-Device]
4.4 79828 24 3326.2 1103.5 383 18400 5299.6 MEMORY_OPER [CUDA memcpy Unified Device-to-Host]

With prefetching:
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Category Operation
75.4 1381541 1 1381541.0 1381541.0 1381541 1381541 0.0 CUDA_KERNEL add(int, float *, float *)
20.3 371721 139 2674.3 959.0 767 32288 5376.0 MEMORY_OPER [CUDA memcpy Unified Host-to-Device]
4.4 79828 24 3326.2 1103.5 383 18400 5299.6 MEMORY_OPER [CUDA memcpy Unified Device-to-Host]
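One possibility (my guess, not verified): prefetching to the host mainly helps the host-side init and error-check loops, while the kernel time in the profile is dominated by host-to-device page faults. To cut the kernel's cost you would prefetch to the device before the launch instead. With the same cudaMemLocation-based signature, that fragment would look roughly like this (slots into the tutorial's main() in place of the host prefetch; device ordinal 0 is an assumption):

```cuda
// Prefetch the managed arrays to the GPU before launching the kernel,
// using the cudaMemLocation-based cudaMemPrefetchAsync signature.
cudaMemLocation devLoc;
devLoc.type = cudaMemLocationTypeDevice;
devLoc.id = 0;   // device ordinal 0 (assumed single-GPU setup)
cudaMemPrefetchAsync(x, N*sizeof(float), devLoc, 0 /*flags*/, 0 /*stream*/);
cudaMemPrefetchAsync(y, N*sizeof(float), devLoc, 0, 0);

add<<<numBlocks, blockSize>>>(N, x, y);
cudaDeviceSynchronize();
```

With the data resident on the device before the launch, the unified host-to-device memcpy rows should shrink and the kernel row should drop accordingly.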