Parallelism On Multiple Blocks Seems Broken

borderlinevegan · March 13, 2021, 9:58pm

Okay, so I downloaded CUDA; my GPU is an RTX 2060, and I followed NVIDIA’s official tutorial.

So for one thread, the code is:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}
int main(void)
{
  int N = 1<<20;
  float *x, *y;
  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);
  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();
  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;
  // Free memory
  cudaFree(x);
  cudaFree(y);
  return 0;
}

And using nvprof gives me 50ms.

For 256 threads, and one block, the code is:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = threadIdx.x;
  int stride = blockDim.x;
  for (int i = index; i < n; i += stride)
      y[i] = x[i] + y[i];
}
int main(void)
{
  int N = 1<<20;
  float *x, *y;
  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
  // Run kernel on 1M elements on the GPU
  add<<<1, 256>>>(N, x, y);
  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();
  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;
  // Free memory
  cudaFree(x);
  cudaFree(y);
  return 0;
}

Now nvprof gives me 2ms, which is all fine and good. But when I launch multiple threads AND multiple blocks, nvprof shows no performance improvement at all over the version with one block of 256 threads. Here is the multiple-block code:

#include <iostream>
#include <math.h>
// Kernel function to add the elements of two arrays
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}
int main(void)
{
  int N = 1<<20;
  float *x, *y;
  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
  // Run kernel on 1M elements on the GPU
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);
  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();
  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;
  // Free memory
  cudaFree(x);
  cudaFree(y);
  return 0;
}

This code should execute on multiple blocks, but the performance is still the same as the previous code. Can anyone tell me why? It would be much appreciated. I am using the code from this tutorial:

https://developer.nvidia.com/blog/even-easier-introduction-cuda/

borderlinevegan · March 13, 2021, 10:01pm

For whatever reason, the word “global” shows up in bold in my post, but it should read as “__global__” (with two underscores on each side).

Robert_Crovella · March 13, 2021, 10:12pm

Are you on Windows or Linux?

Robert_Crovella · March 13, 2021, 10:17pm

please read here about code formatting

borderlinevegan · March 13, 2021, 10:20pm

Thanks for the link, and I am on Linux. I will try to reformat my code to be more legible.

Robert_Crovella · March 13, 2021, 10:27pm

There is an issue with that tutorial in connection with Unified Memory (UM). The two GPUs where performance is quoted in that tutorial (GT 740, Kepler K80) are both pre-Pascal GPUs and operate in a pre-Pascal UM regime. You can read more about it in the UM section of the programming guide. Specifically, this means that UM allocations are transferred en masse to the GPU at the point of kernel launch. Therefore the kernel code exhibits no page-faulting activity.

On your Turing GPU, however, the UM regime is a post-Pascal regime, allowing for demand-paged transfer of data to the GPU. This is great, but it can have a negative performance impact: the data is migrated page by page while the kernel runs, so the kernel time reported by nvprof includes the page-fault handling. You can “rectify” this issue by inserting the following lines of code immediately prior to the kernel launch:

cudaMemPrefetchAsync(x, N*sizeof(float), 0);
cudaMemPrefetchAsync(y, N*sizeof(float), 0);

This will transfer the data to the GPU prior to the kernel launch, so no page-faulting activity takes place during kernel execution. You should then see kernel execution times in the low tens of microseconds on your GPU in nvprof. You will also see differences in how nvprof reports data transfer and page-faulting activity.
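For reference, here is a minimal sketch of the multiple-block program from the question with the two prefetch calls placed before the launch. It assumes a CUDA toolkit in which cudaMemPrefetchAsync takes an integer device ID as its third argument (the form used above); newer toolkits may expose a different signature. The cudaGetDevice lookup is an addition for illustration and is not part of the original suggestion, which simply passes device 0.

```cuda
#include <iostream>
#include <cmath>
#include <cuda_runtime.h>

// Grid-stride kernel from the question: y[i] = x[i] + y[i]
__global__
void add(int n, float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;
  // Allocate Unified Memory, accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));
  // Initialize x and y on the host (pages start out resident on the CPU)
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
  // Assumption: prefetch to whichever device is current, instead of a
  // hard-coded 0. Uses the (pointer, bytes, device ID) form of the call.
  int device = 0;
  cudaGetDevice(&device);
  cudaMemPrefetchAsync(x, N*sizeof(float), device);
  cudaMemPrefetchAsync(y, N*sizeof(float), device);
  // Launch enough 256-thread blocks to cover all N elements
  int blockSize = 256;
  int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);
  // Wait for the GPU to finish before touching the data on the host
  cudaDeviceSynchronize();
  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;
  cudaFree(x);
  cudaFree(y);
  return 0;
}
```

With the prefetch in place, the host-to-device migration should show up as its own activity in the profiler rather than being folded into the kernel time, which is what lets the difference between the one-block and many-block launches become visible.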

You can read additional commentary here.
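If it is unclear which UM regime a given setup falls under, one way to check (a sketch, not something from the tutorial) is to query the cudaDevAttrConcurrentManagedAccess device attribute. As far as I know, a value of 1 corresponds to the demand-paged regime described above, and a value of 0 to the transfer-at-launch regime. The attribute is also reported as 0 on Windows even for newer GPUs, which is likely why the operating system matters here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
  // Query the device that is current for this host thread
  int device = 0;
  if (cudaGetDevice(&device) != cudaSuccess) {
    printf("No usable CUDA device found\n");
    return 1;
  }
  int major = 0, minor = 0, concurrent = 0;
  cudaDeviceGetAttribute(&major, cudaDevAttrComputeCapabilityMajor, device);
  cudaDeviceGetAttribute(&minor, cudaDevAttrComputeCapabilityMinor, device);
  // 1: managed memory can be accessed concurrently by CPU and GPU
  //    (demand paging available); 0: managed allocations are moved
  //    to the GPU at kernel launch
  cudaError_t err = cudaDeviceGetAttribute(
      &concurrent, cudaDevAttrConcurrentManagedAccess, device);
  if (err != cudaSuccess) {
    printf("Attribute query failed: %s\n", cudaGetErrorString(err));
    return 1;
  }
  printf("Device %d, compute capability %d.%d\n", device, major, minor);
  printf("concurrentManagedAccess = %d\n", concurrent);
  return 0;
}
```

When the attribute is 1, prefetching before the launch is the part that matters for the timings discussed in this thread.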

borderlinevegan · March 14, 2021, 3:09am

Thank you very much Robert.
