Example not working on more than a million elements

Hi,

I am new to CUDA and have the problem that the kernel doesn’t seem to execute for more than a million elements. I’ve also tried it on the simple example from:

#include <iostream>
#include <math.h>

__global__
void add(int n, float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  int N = 1<<20;
  float *x, *y;

  // Allocate Unified Memory – accessible from CPU or GPU
  cudaMallocManaged(&x, N*sizeof(float));
  cudaMallocManaged(&y, N*sizeof(float));

  // initialize x and y arrays on the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }
  // Run kernel on 1M elements on the GPU
  add<<<1, 1>>>(N, x, y);

  // Wait for GPU to finish before accessing on host
  cudaDeviceSynchronize();

  // Check for errors (all values should be 3.0f)
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = fmax(maxError, fabs(y[i]-3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  // Free memory
  cudaFree(x);
  cudaFree(y);
  
  return 0;
}

If I raise N to more than 1 million, y remains empty. I also checked for an error using cudaGetLastError() but the return value is “cudaSuccess”.
I tested it on two different systems, both running Windows with VS 2015, and neither has the newest graphics card: a GTX 980 and a GT 755M. Both systems show the same problem.

Can someone tell me what limitation is causing this?

Cheers

Your code is incomplete and does not show any kernel call.

My best guess would be that you are hitting a WDDM TDR (Timeout Detection and Recovery) timeout. If you’re not sure what that is, please use Google.

You might also be compiling your code for an incorrect architecture. You may get useful info if you run your code with cuda-memcheck.
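One way to check the timeout theory from inside the program is sketched below. It is only an illustration under a few assumptions: the helper `ok` is a made-up name, device 0 is assumed to be the GPU in question, and the kernel is the serial one from the question. A kernel that gets killed during execution is normally reported by the next synchronizing call rather than by a `cudaGetLastError()` placed right after the launch, which would likely explain seeing `cudaSuccess` there.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): report a failed CUDA call, return false on error.
static bool ok(cudaError_t err, const char *what)
{
  if (err == cudaSuccess) return true;
  std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
  return false;
}

// Same serial kernel as in the question.
__global__ void add(int n, const float *x, float *y)
{
  for (int i = 0; i < n; i++)
    y[i] = x[i] + y[i];
}

int main(void)
{
  // kernelExecTimeoutEnabled is nonzero when a run-time limit applies to kernels on this device.
  cudaDeviceProp prop;
  if (!ok(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties")) return 1;
  std::printf("Run-time limit on kernels: %s\n",
              prop.kernelExecTimeoutEnabled ? "yes" : "no");

  int N = 1 << 20;
  float *x, *y;
  if (!ok(cudaMallocManaged(&x, N * sizeof(float)), "cudaMallocManaged(x)")) return 1;
  if (!ok(cudaMallocManaged(&y, N * sizeof(float)), "cudaMallocManaged(y)")) return 1;
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  add<<<1, 1>>>(N, x, y);
  // Catches launch problems such as an invalid configuration.
  if (!ok(cudaGetLastError(), "kernel launch")) return 1;
  // Catches problems during execution, for example a kernel terminated by a timeout.
  if (!ok(cudaDeviceSynchronize(), "kernel execution")) return 1;

  std::printf("y[0] = %f\n", y[0]);
  cudaFree(x);
  cudaFree(y);
  return 0;
}
```

If the watchdog is the culprit, the second check should fail with a launch-timeout message instead of the program continuing silently.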

Are you sure, txbob?

__global__ void add(int n, float *x, float *y)
{
}

 add<<<1, 1>>>(N, x, y);

that looks like a kernel call to me.

That launches 1 block with 1 thread per block. It is the most inefficient way to run this: completely serial execution.

He’s likely running into a timeout with this setup.
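For comparison, here is a minimal sketch of the same addition spread over many threads with a grid-stride loop, so that each thread does only a small share of the work and the kernel should finish far sooner. The block size of 256, the array size of 1<<24, and the helper name `blocksFor` are assumptions chosen for illustration, not values from the thread.

```cuda
#include <cmath>
#include <iostream>
#include <cuda_runtime.h>

// Grid-stride loop: each thread starts at its global index and steps by the total thread count.
__global__ void add(int n, const float *x, float *y)
{
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

// Hypothetical helper: number of blocks needed to cover n elements, rounded up.
static int blocksFor(int n, int threadsPerBlock)
{
  return (n + threadsPerBlock - 1) / threadsPerBlock;
}

int main(void)
{
  int N = 1 << 24;  // assumed size, well above the 1M from the question
  float *x, *y;
  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  int threadsPerBlock = 256;  // assumed block size
  add<<<blocksFor(N, threadsPerBlock), threadsPerBlock>>>(N, x, y);

  // Synchronize and report any error raised while the kernel was running.
  cudaError_t err = cudaDeviceSynchronize();
  if (err != cudaSuccess) {
    std::cerr << "Kernel failed: " << cudaGetErrorString(err) << std::endl;
    return 1;
  }

  // All values should be 3.0f.
  float maxError = 0.0f;
  for (int i = 0; i < N; i++)
    maxError = std::fmax(maxError, std::fabs(y[i] - 3.0f));
  std::cout << "Max error: " << maxError << std::endl;

  cudaFree(x);
  cudaFree(y);
  return 0;
}
```

This does not remove the watchdog limit; it only makes it much less likely that a kernel of this size runs long enough to hit it.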

Thanks for the replies. Disabling WDDM TDR has resolved the issue, so the timeout seems to have been the cause.
(I just updated the code; I forgot to copy and paste the whole thing.)