Size limit on dynamically allocated shared memory

I have this simple piece of code:

dim3 gridDim(count_useful / numGroupPerBlock, 1);
dim3 blockDim(numGroupPerBlock, 8);  // numGroupPerBlock * 8 threads per block

// Dynamic shared memory requested per block, in bytes.
size_t sharedBytes = 5 * numGroupPerBlock * 8 * sizeof(u64);
printf("Size: %zu\n", sharedBytes);  // %zu because sizeof yields a size_t

processParallelMessagesBalanced<<<gridDim, blockDim, sharedBytes>>>
    (d_chunkPointer, numBlocks_loc[0], d_numText, numGroupPerBlock);
printf("Error: %s\n", cudaGetErrorString(cudaGetLastError()));

The variables have the following values:

count_useful = 32

numGroupPerBlock = 32

The program prints:

Size: 10240

Error: out of memory

In the kernel code there is a variable declared as follows:

extern __shared__ u64 balanced[];

which I use for computation. From what I know, the size limit on shared memory is 16 KB, well above the 10240 bytes I request, so why does the program give me this error?

Thanks for any replies.

EDIT: the error message is not “out of memory” but “too many resources requested for launch”.

You ask for 8*32 = 256 threads per block. With 8192 registers available, each thread cannot use more than 8192/256 = 32 registers. Your kernel probably uses more than that.

OK, I solved it just before reading your answer: my kernel uses 40 registers per thread.

Thanks for the help.
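For anyone who lands here with the same error, the arithmetic from this thread can be written down as a small host-side sketch. It is plain C++ (it should also compile under nvcc), and it assumes the 8192-register budget quoted in the reply; other GPUs have different limits. The helper names `maxRegistersPerThread`, `blockFits` and `maxThreadsPerBlock` are made up for illustration and are not part of the CUDA API. The sketch also ignores the granularity with which real hardware allocates registers, so treat the results as an approximation.

```cpp
// Assumption: 8192 registers are available to a block, the figure used in
// the reply above. Check the limit for your own GPU.
constexpr int kRegisterBudget = 8192;
constexpr int kWarpSize = 32;

// Largest per-thread register count that still lets a block of this size launch.
constexpr int maxRegistersPerThread(int threadsPerBlock) {
    return kRegisterBudget / threadsPerBlock;
}

// True when threadsPerBlock threads, each using regsPerThread registers,
// fit within the budget.
constexpr bool blockFits(int threadsPerBlock, int regsPerThread) {
    return threadsPerBlock * regsPerThread <= kRegisterBudget;
}

// Largest block size (rounded down to a whole number of warps) that fits
// for a kernel using regsPerThread registers per thread.
constexpr int maxThreadsPerBlock(int regsPerThread) {
    return (kRegisterBudget / regsPerThread) / kWarpSize * kWarpSize;
}

// The case from this thread: blockDim(32, 8) = 256 threads, 40 registers each.
static_assert(maxRegistersPerThread(32 * 8) == 32, "budget is 32 registers per thread");
static_assert(!blockFits(32 * 8, 40), "256 * 40 = 10240 registers exceeds 8192");
static_assert(maxThreadsPerBlock(40) == 192, "192 * 40 = 7680 registers fits");
```

Under these assumptions the launch fails because 256 * 40 = 10240 registers are needed, and a block of at most 192 threads would fit. The per-thread register count itself is reported by compiling with `nvcc --ptxas-options=-v`, and `-maxrregcount` can cap it at the cost of possible register spilling.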