Parallelism On Multiple Blocks Seems Broken

Okay, so I installed CUDA (my GPU is an RTX 2060) and followed NVIDIA’s official tutorial.

So for one thread, the code is:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}
int main(void)
{
  int N = 1<<20;
  float *x, *y;
  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);
  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();
  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;
  // Free memory
  cudaFree(x);
  cudaFree(y);
  return 0;
}

Running it under nvprof, the add kernel takes about 50 ms.
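For reference, I am building and profiling roughly like this (the file name add.cu is just what I happened to call it):

nvcc add.cu -o add
nvprof ./add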

For 256 threads and one block, the code is:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}
int main(void)
{
  int N = 1<<20;
  float *x, *y;
  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
  // Run kernel on 1M elements on the GPU
  add<<<1, 256>>>(N, x, y);
  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();
  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;
  // Free memory
  cudaFree(x);
  cudaFree(y);
  return 0;
}

Now nvprof gives me 2 ms, which is all well and good. But when I run with multiple threads AND multiple blocks, nvprof shows no performance increase at all over the multiple-threads, single-block version. Here is the multiple-block code:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}
int main(void)
{
  int N = 1<<20;
  float *x, *y;
  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
  // Run kernel on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);
  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();
  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;
  // Free memory
  cudaFree(x);
  cudaFree(y);
  return 0;
}

This code should execute on multiple blocks, but the performance is still the same as for the previous version. Does anyone know why? Any help would be much appreciated. I am using the code from this tutorial:

https://developer.nvidia.com/blog/even-easier-introduction-cuda/

For whatever reason, the forum renders the __global__ keyword in bold; it should read as global with double underscores before and after.

Are you on Windows or Linux?

Please read here about code formatting.

Thanks for the link, and I am on Linux. I will try to reformat my code to be more legible.

There is an issue with that tutorial in connection with Unified Memory. The two GPUs for which performance is quoted in that tutorial (GT 740, Kepler K80) are both pre-Pascal GPUs and operate under the pre-Pascal UM regime. You can read more about it in the UM section of the programming guide. Specifically, this means that UM allocations are transferred en masse to the GPU at the point of kernel launch, so the kernel code exhibits no page-faulting activity.

On your Turing GPU, however, the UM regime is a post-Pascal one, allowing demand-paged transfer of data to the GPU. This is great for flexibility, but it can have a negative performance impact. You can “rectify” this issue by inserting the following lines of code immediately before the kernel launch:

cudaMemPrefetchAsync(x, N*sizeof(float), 0);  // prefetch x to device 0 before the kernel runs
cudaMemPrefetchAsync(y, N*sizeof(float), 0);  // prefetch y to device 0 before the kernel runs

This transfers the data to the GPU before the kernel launch, so no page-faulting activity takes place during kernel execution. You should then see execution times in the low tens of microseconds on your GPU in nvprof. You will also see differences in nvprof’s reporting of data transfer and page-faulting activity.
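As a side note, rather than hardcoding device 0 you can query the current device first. A minimal sketch of the same prefetch, assuming a single-GPU setup and the default stream:

int device = -1;
cudaGetDevice(&device);                                 // ask the runtime which GPU is current
cudaMemPrefetchAsync(x, N*sizeof(float), device, NULL); // migrate x to that GPU before launch
cudaMemPrefetchAsync(y, N*sizeof(float), device, NULL); // migrate y to that GPU before launch
add<<<numBlocks, blockSize>>>(N, x, y);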

You can read additional commentary here.


Thank you very much, Robert.