Invalid configuration argument: kernels fail to work with big arrays


I am trying a simple CUDA tutorial I found on this site. It is supposed to square all elements of an array.

The problem is this: the program works well with arrays of 100,000 and 200,000 items, but fails when N gets bigger. With a larger N, no calculation at all is done on the array, and I get an “Invalid configuration argument” error when I launch my kernel.

I suspect it has something to do with block_size, but I haven’t found the bug yet…

Here is the code:

#include <conio.h>   // console I/O (Windows only)
#include <stdio.h>
#include <stdlib.h>  // malloc(), free()
#include <cuda.h>

// Kernel that executes on the CUDA device
__global__ void square_array(float *a, int N)
{
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx<N) a[idx] = a[idx] * a[idx];
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg,
                                  cudaGetErrorString( err) );
    }
}

// main routine that executes on the host
int main(void)
{
	float *a_h, *a_d;  // Pointer to host & device arrays
	int N = 100000;  // Number of elements in arrays
	size_t size = N * sizeof(float);
	a_h = (float *)malloc(size);        // Allocate array on host
	cudaMalloc((void **) &a_d, size);   // Allocate array on device

	// Initialize host array and copy it to CUDA device
	for (int i=0; i<N; i++)
		a_h[i] = (float)i;
	cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

	// Do calculation on device:
	int block_size = 4;
	int n_blocks = N/block_size + (N%block_size == 0 ? 0:1);
	square_array <<< n_blocks, block_size >>> (a_d, N);

	// Check for any CUDA errors
	checkCUDAError("kernel launch");

	// Retrieve result from device and store it in host array
	cudaMemcpy(a_h, a_d, sizeof(float)*N, cudaMemcpyDeviceToHost);

	// Check for any CUDA errors
	checkCUDAError("memcpy device to host");

	// Print results
	for (int i=0; i<1000; i++)
		printf("%d %f\n", i, a_h[i]);

	// Cleanup
	free(a_h); cudaFree(a_d);

	// Wait for key to exit
	printf("Press any key to exit...\n");
	getch();
}

Thanks in advance for your help.

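You can’t have more than 65,535 blocks in one grid dimension (that is the limit on GPUs of compute capability below 3.0). With block_size = 4, N = 200,000 needs 50,000 blocks, which still fits, but anything above 262,140 elements needs more than 65,535 blocks, so the launch is rejected with “Invalid configuration argument”. And please read up in the programming guide: if you use only 4 threads per block you will have serious performance issues. 16 is the absolute reasonable minimum, and normally it should not be less than 64.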
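To make the arithmetic concrete, here is a rough host-side sketch in plain C (it should also compile inside a .cu file). The helper names blocks_needed and launch_fits are made up for illustration, and the constant 65535 is an assumption that holds for older GPUs; newer devices allow far more blocks in the x dimension, so in real code it is safer to read the limit from maxGridSize in the struct filled by cudaGetDeviceProperties.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed per-dimension grid limit (GPUs of compute capability below 3.0). */
#define MAX_GRID_DIM_X 65535

/* Hypothetical helper: blocks needed to cover n elements,
   using the same rounding-up formula as the posted code. */
int blocks_needed(int n, int block_size)
{
    assert(n >= 0 && block_size > 0);
    return n / block_size + (n % block_size == 0 ? 0 : 1);
}

/* Hypothetical helper: 1 if the 1D launch fits in the assumed limit, else 0. */
int launch_fits(int n, int block_size)
{
    return blocks_needed(n, block_size) <= MAX_GRID_DIM_X;
}

/* Hypothetical helper: prints the block count and whether it fits. */
void report(int n, int block_size)
{
    printf("N=%d block_size=%d -> %d blocks (%s)\n",
           n, block_size, blocks_needed(n, block_size),
           launch_fits(n, block_size) ? "ok" : "too many blocks");
}
```

Calling report(300000, 4) gives 75,000 blocks, which is over the assumed limit, while report(300000, 256) gives 1,172 blocks and fits easily. So raising block_size to something like 256 both fixes the launch error and addresses the performance concern about tiny blocks.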

Thank you!