"Invalid Device Function", why?

Hello all,

Another CUDA newbie here; your assistance and patience are appreciated.

I’m running a calculation on a 3D array of data which is allocated as linear vector in memory:

// Allocate Memory

	unsigned int domainSize = Ni * Nj * Nk;

	unsigned int memSize = sizeof(float) * domainSize;

	float* p = (float*)malloc(memSize);

		float* cudaP;

		cudaMalloc((void**)&cudaP, memSize);	

	cudaMemcpy(cudaP, p, memSize, cudaMemcpyHostToDevice);

Ni, Nj and Nk are normally around 180 points each, and there are two other arrays which are similarly allocated.

Execution parameters are defined as follows:

int threadsInX = 8;

		int threadsInY = 8;

		int threadsInZ = 8;

		int blocksInX = (Ni+threadsInX-1)/threadsInX;

		int blocksInY = (Nj+threadsInY-1)/threadsInY;

		int blocksInZ = (Nk+threadsInZ-1)/threadsInZ;

		dim3 Dg = dim3(blocksInX, blocksInY*blocksInZ);

		dim3 Db = dim3(threadsInX, threadsInY, threadsInZ);

And the data is accessed from inside the main kernel like this:

unsigned int blockIdxz = __float2uint_rd(blockIdx.y * invBlocksInY);

	unsigned int blockIdxy = blockIdx.y - __umul24(blockIdxz, blocksInY);

	float pC = (float)1/(float)3;

	// Resolve i,j,k indices

	unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;

	unsigned int j = __umul24(blockIdxy, blockDim.y) + threadIdx.y;

	unsigned int k = __umul24(blockIdxz, blockDim.z) + threadIdx.z;

	p[IDX(i,j,k,Nj,Nk)] = pC*(pz[IDX(i+1,j,k,Nj,Nk)] + pz[IDX(i-1,j,k,Nj,Nk)] + pz[IDX(i,j+1,k,Nj,Nk)] + ... etc...

	// Wait for everyone	


Where IDX is just a macro to access linear memory:

#define IDX(i,j,k,Nj,Nk) ((((i)) * (Nj) * (Nk)) + (((j)) * (Nk)) + ((k)))


When my data size is below 192 x 192 x 192, i.e. 7077888 points total which results in approx 27MB per array, everything works like a charm.

However, when I cross that threshold and make my data size a little bigger, the kernel throws an exception:

cutilCheckMsg() CUTIL CUDA error : Kernel execution failed : invalid device function .

Running in EmuDebug shows that this happens the moment the kernel is executed.

My device/code uses 13 registers per thread block, so we are way below the 8192 limit.

Also, I am not explicitly using any shared memory declarations in my code.

Anyone has any ideas to point me at the right direction?

Any help would be greatly appreciated.