Another CUDA newbie here; your assistance and patience are appreciated.
I’m running a calculation on a 3D array of data which is allocated as linear vector in memory:
// Allocate Memory unsigned int domainSize = Ni * Nj * Nk; unsigned int memSize = sizeof(float) * domainSize; float* p = (float*)malloc(memSize); float* cudaP; cudaMalloc((void**)&cudaP, memSize); cudaMemcpy(cudaP, p, memSize, cudaMemcpyHostToDevice);
Ni, Nj and Nk are normally around 180 points each, and there are two other arrays which are similarly allocated.
Execution parameters are defined as follows:
int threadsInX = 8; int threadsInY = 8; int threadsInZ = 8; int blocksInX = (Ni+threadsInX-1)/threadsInX; int blocksInY = (Nj+threadsInY-1)/threadsInY; int blocksInZ = (Nk+threadsInZ-1)/threadsInZ; dim3 Dg = dim3(blocksInX, blocksInY*blocksInZ); dim3 Db = dim3(threadsInX, threadsInY, threadsInZ);
And the data is accessed from inside the main kernel like this:
unsigned int blockIdxz = __float2uint_rd(blockIdx.y * invBlocksInY); unsigned int blockIdxy = blockIdx.y - __umul24(blockIdxz, blocksInY); float pC = (float)1/(float)3; // Resolve i,j,k indices unsigned int i = __umul24(blockIdx.x, blockDim.x) + threadIdx.x; unsigned int j = __umul24(blockIdxy, blockDim.y) + threadIdx.y; unsigned int k = __umul24(blockIdxz, blockDim.z) + threadIdx.z; p[IDX(i,j,k,Nj,Nk)] = pC*(pz[IDX(i+1,j,k,Nj,Nk)] + pz[IDX(i-1,j,k,Nj,Nk)] + pz[IDX(i,j+1,k,Nj,Nk)] + ... etc... // Wait for everyone __syncthreads();
Where IDX is just a macro to access linear memory:
#define IDX(i,j,k,Nj,Nk) ((((i)) * (Nj) * (Nk)) + (((j)) * (Nk)) + ((k)))
When my data size is below 192 x 192 x 192, i.e. 7077888 points total which results in approx 27MB per array, everything works like a charm.
However, when I cross that threshold and make my data size a little bigger, the kernel throws an exception:
cutilCheckMsg() CUTIL CUDA error : Kernel execution failed : invalid device function .
Running in EmuDebug shows that this happens the moment the kernel is executed.
My device/code uses 13 registers per thread block, so we are way below the 8192 limit.
Also, I am not explicitly using any shared memory declarations in my code.
Anyone has any ideas to point me at the right direction?
Any help would be greatly appreciated.