Example not working on more than a million elements

fofo · June 9, 2017, 6:35am

Hi,

I am new to CUDA, and my kernel doesn’t seem to execute for more than a million elements. I also tried it with the simple example from:

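#include <iostream>
#include <math.h>

// Kernel: adds the elements of x into y, one element at a time
__global__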
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

If I raise N to more than 1 million, y remains empty. I also checked for an error using cudaGetLastError(), but the return value is “cudaSuccess”.
I tested it on two different systems, both running Windows with VS 2015, and neither with the newest graphics card: a GTX 980 and a GT 755M. Both systems show the same problem.

Can someone tell me what limitation is causing this?

Cheers

Robert_Crovella · June 11, 2017, 5:00pm

Your code is incomplete and does not show any kernel call.

My best guess would be that you are hitting a WDDM TDR (Timeout Detection and Recovery) timeout. If you’re not sure what that is, please use Google.

You might also be compiling your code for an incorrect architecture. You may get useful info if you run your code with cuda-memcheck.
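
To make the timeout guess checkable, here is a minimal sketch of error checking around the launch. Kernel launches are asynchronous, so cudaGetLastError() called right after the launch generally reports only launch problems; errors raised while the kernel runs tend to show up in the return value of cudaDeviceSynchronize() (a watchdog timeout is typically reported as cudaErrorLaunchTimeout). The helper name cudaOk is hypothetical and not part of the CUDA API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): prints a message when a CUDA
// runtime call failed and returns true when it succeeded.
inline bool cudaOk(cudaError_t err, const char *what)
{
  if (err != cudaSuccess) {
    std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
  }
  return true;
}

// Intended use around the launch from the original post:
//   add<<<1, 1>>>(N, x, y);
//   cudaOk(cudaGetLastError(), "kernel launch");          // launch-time errors
//   cudaOk(cudaDeviceSynchronize(), "kernel execution");  // errors raised while running
```

If the second check prints a launch timeout message, that points at the watchdog rather than at a limit on the number of elements.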

cbuchner1 · June 12, 2017, 2:11pm

you sure, txbob?

__global__ void add(int n, float *x, float *y)
{
}

add<<<1, 1>>>(N, x, y);

that looks like a kernel call to me.

1 block, 1 thread per block. That is the most inefficient way to run it: completely serial execution.

He’s likely running into a timeout with this setup.
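
For comparison, a rough sketch of a parallel version using a grid-stride loop, so the work is spread over many threads and each launch finishes far sooner. The names addParallel and launchAdd, and the block size of 256, are assumptions for illustration and not something from the original post.

```cuda
#include <cuda_runtime.h>

// Grid-stride version of add(): each thread handles elements
// index, index + stride, index + 2*stride, ...
__global__
void addParallel(int n, const float *x, float *y)
{
  int index  = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

// Hypothetical host wrapper; 256 threads per block is an assumed size.
void launchAdd(int n, const float *x, float *y)
{
  int blockSize = 256;
  int numBlocks = (n + blockSize - 1) / blockSize;  // round up to cover all n elements
  addParallel<<<numBlocks, blockSize>>>(n, x, y);
}
```

The host code stays the same apart from calling launchAdd(N, x, y) in place of the <<<1, 1>>> launch, followed by cudaDeviceSynchronize() as before.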

fofo · June 12, 2017, 2:15pm

Thanks for the replies. Disabling WDDM TDR resolved the issue, so a timeout seems to have been the cause.
(I just updated the code; I had forgotten to copy and paste the whole thing.)
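
For anyone landing here later: a small sketch, assuming device 0 is the GPU in question, of how to ask the runtime whether a run-time limit applies to kernels on a device. The function name kernelTimeoutEnabled is hypothetical; cudaDeviceGetAttribute and cudaDevAttrKernelExecTimeout come from the CUDA runtime API.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns 1 if kernels on the device are subject to a run-time limit,
// 0 if they are not, and -1 if the query itself failed.
int kernelTimeoutEnabled(int device)
{
  int enabled = 0;
  cudaError_t err =
      cudaDeviceGetAttribute(&enabled, cudaDevAttrKernelExecTimeout, device);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "attribute query failed: %s\n", cudaGetErrorString(err));
    return -1;
  }
  return enabled;
}
```

A result of 1 suggests long-running kernels on that GPU can be stopped by the watchdog, which would be consistent with what happened here.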
