Hello all,

Another CUDA newbie here; your assistance and patience are appreciated.

I’m running a calculation on a 3D array of data which is allocated as linear vector in memory:

```
// Allocate Memory
unsigned int domainSize = Ni * Nj * Nk;
unsigned int memSize = sizeof(float) * domainSize;
float* p = (float*)malloc(memSize);
float* cudaP;
cudaMalloc((void**)&cudaP, memSize);
cudaMemcpy(cudaP, p, memSize, cudaMemcpyHostToDevice);
```

Ni, Nj and Nk are normally around 180 points each, and there are two other arrays which are similarly allocated.

Execution parameters are defined as follows:

```
int threadsInX = 8;
int threadsInY = 8;
int threadsInZ = 8;
int blocksInX = (Ni+threadsInX-1)/threadsInX;
int blocksInY = (Nj+threadsInY-1)/threadsInY;
int blocksInZ = (Nk+threadsInZ-1)/threadsInZ;
dim3 Dg = dim3(blocksInX, blocksInY*blocksInZ);
dim3 Db = dim3(threadsInX, threadsInY, threadsInZ);
```

And the data is accessed from inside the main kernel like this:

```
unsigned int blockIdxz = __float2uint_rd(blockIdx.y * invBlocksInY);
unsigned int blockIdxy = blockIdx.y - __umul24(blockIdxz, blocksInY);
float pC = (float)1/(float)3;
// Resolve i,j,k indices
unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x;
unsigned int j = __umul24(blockIdxy, blockDim.y) + threadIdx.y;
unsigned int k = __umul24(blockIdxz, blockDim.z) + threadIdx.z;
p[IDX(i,j,k,Nj,Nk)] = pC*(pz[IDX(i+1,j,k,Nj,Nk)] + pz[IDX(i-1,j,k,Nj,Nk)] + pz[IDX(i,j+1,k,Nj,Nk)] + ... etc...
// Wait for everyone
__syncthreads();
```

Where IDX is just a macro to access linear memory:

```
#define IDX(i,j,k,Nj,Nk) ((((i)) * (Nj) * (Nk)) + (((j)) * (Nk)) + ((k)))
```

PROBLEM DESCRIPTION:

When my data size is below 192 x 192 x 192, i.e. 7077888 points total which results in approx 27MB per array, everything works like a charm.

However, when I cross that threshold and make my data size a little bigger, the kernel throws an exception:

```
cutilCheckMsg() CUTIL CUDA error : Kernel execution failed : invalid device function .
```

Running in EmuDebug shows that this happens the moment the kernel is executed.

My device/code uses 13 registers per thread block, so we are way below the 8192 limit.

Also, I am not explicitly using any shared memory declarations in my code.

Anyone has any ideas to point me at the right direction?

Any help would be greatly appreciated.

Regards,

Jon.