blocksize causes kernel error

Hi, I’m currently experiencing strange behavior on my new GeForce 540M:

I have written a very simple kernel to add vectors. When I put in pretty large vectors, it crashes depending on the blocksize, e.g.:

BS: 256 Works with ~1.5 million floating point values as input, but crashes at 2 million

BS: 512 Works with ~16 million values, but crashes at 32 million

BS: 1024 Works until my main application says “bad_alloc” …probably no cuda problem ;)

Does anyone know about this problem or has a solution?

#include "cuda_add.h"

__global__ void addiere(float *a, float *b, float *c, unsigned long count)
{
	unsigned long idx = blockIdx.x * blockDim.x + threadIdx.x;

	if (idx < count)
		c[idx] = a[idx] + b[idx];
}

void cu_addiere(float *pfA, float *pfB, float *pfC, unsigned long ulCount, struct etime *sTime)
{
	cudaEvent_t start_function, start_kernel, stop_kernel, stop_function;

	cudaEventCreate(&start_function);
	cudaEventCreate(&start_kernel);
	cudaEventCreate(&stop_kernel);
	cudaEventCreate(&stop_function);

	cudaEventRecord(start_function, 0);

	float *d_pfA=NULL;
	float *d_pfB=NULL;
	float *d_pfC=NULL;

	cudaMalloc((void**)&d_pfA, sizeof(float)*ulCount);
	cudaMalloc((void**)&d_pfB, sizeof(float)*ulCount);
	cudaMalloc((void**)&d_pfC, sizeof(float)*ulCount);

	cudaMemcpy(d_pfA, pfA, sizeof(float)*ulCount, cudaMemcpyHostToDevice);
	cudaMemcpy(d_pfB, pfB, sizeof(float)*ulCount, cudaMemcpyHostToDevice);

	cudaEventRecord(start_kernel, 0);

	addiere<<<ulCount/1024+1, 1024>>>(d_pfA, d_pfB, d_pfC, ulCount);

	cudaEventRecord(stop_kernel, 0);

	cudaMemcpy(pfC, d_pfC, sizeof(float)*ulCount, cudaMemcpyDeviceToHost);

	cudaEventRecord(stop_function, 0);
	cudaEventSynchronize(stop_function);

	cudaEventElapsedTime(&(sTime->pre_kernel), start_function, start_kernel);
	cudaEventElapsedTime(&(sTime->kernel), start_kernel, stop_kernel);
	cudaEventElapsedTime(&(sTime->post_kernel), stop_kernel, stop_function);
}

This is a long shot, but try to free the memory when you are done with it in cu_addiere():

cudaFree(d_pfC); etc.

If this doesn’t work, could you post the code where you call cu_addiere()?

Freeing the memory didn’t help.

unsigned long count = 1024*1024*16;

	float *a = new float[count];

	float *b = new float[count];

	float *c = new float[count];

	for (int i = 0; i < count; i++)
	{
		a[i] = b[i] = (float)i;
		c[i] = 0.0f;
	}
	etime times_cuda;   




So when you say 32M elements, you actually mean 1024*1024*32 = 33554432 elements.

Running this many elements with 512 threads per block needs 33554432/512 = 65536 blocks, but the maximum number of blocks you can schedule in one grid dimension is 65535. This is why it crashes with 32M elements but not with 16M.

I could not get a crash using BS: 256 and 2M elements (1024*1024*2).

I have no explanation why this would crash.

I was not able to reproduce the “bad_alloc” either.

Thanks brano, topic solved.

<- shame on snowball :(