How to surpass 512 threads?

Hi, when I set up a GPU grid as follows to solve a system of linear equations:
dim3 dimBlock(1,512);
dim3 dimGrid((dim+dimBlock.x-1)/dimBlock.x, (dim+dimBlock.y-1)/dimBlock.y);
solveAk<<<128,128>>>(d_A,d_B,k,dim,d_returnValue);
LUbuildKernel<<<dimGrid,dimBlock>>>(d_A,d_B,k,dim,d_returnValue);
I only get correct answers when N < 512 (N is the order of the linear system).
If I set up the GPU grid as follows instead:
dim3 dimBlock(16,16);
dim3 dimGrid((dim+dimBlock.x-1)/dimBlock.x, (dim+dimBlock.y-1)/dimBlock.y);
solveAk<<<128,128>>>(d_A,d_B,k,dim,d_returnValue);
LUbuildKernel<<<dimGrid,dimBlock>>>(d_A,d_B,k,dim,d_returnValue);
I only get correct answers when N < 16.
How can I design the GPU grid, or modify the kernel functions, to surpass 512 threads?
Thanks.
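
For reference, the 512 figure is a limit on threads per block (on GPUs of that generation; later ones allow 1024), not on the whole launch, because the grid multiplies it. The following is a minimal host-side sketch in plain C++, so it runs without a GPU, of the same ceiling-division arithmetic used for dimGrid above. The names ceilDiv, Launch2D and makeLaunch are made up for this illustration and are not part of CUDA.

```cpp
#include <cassert>

// Ceiling division, same arithmetic as (dim + dimBlock.x - 1) / dimBlock.x.
inline int ceilDiv(int n, int block) { return (n + block - 1) / block; }

// Hypothetical helper (not CUDA API): describes a 2D launch configuration.
struct Launch2D {
    int blockX, blockY;  // threads per block in x and y
    int gridX, gridY;    // blocks in x and y
    int threadsPerBlock() const { return blockX * blockY; }
    long long totalThreads() const {
        return 1LL * gridX * gridY * threadsPerBlock();
    }
};

// Builds the launch that dimBlock(blockX, blockY) and the dimGrid formula
// from the question would produce for a dim x dim matrix.
inline Launch2D makeLaunch(int dim, int blockX, int blockY) {
    assert(dim > 0 && blockX > 0 && blockY > 0);
    Launch2D l;
    l.blockX = blockX;
    l.blockY = blockY;
    l.gridX = ceilDiv(dim, blockX);
    l.gridY = ceilDiv(dim, blockY);
    return l;
}
```

For dim = 1000, makeLaunch(1000, 16, 16) describes a 63 x 63 grid of 256-thread blocks, roughly a million threads in total. Both configurations in the question therefore already launch far more than 512 threads, which suggests the N limit comes from somewhere other than the thread count.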

My kernel functions are as follows:
__global__ void LUbuildKernel(float *A, float *B, int k, int dim, int *returnValue)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	int j = blockIdx.y * blockDim.y + threadIdx.y;
	if (i != k && j < dim && i < dim)
	{
		*(A+i*dim+j) = *(A+i*dim+j) - (*(A+i*dim+k)) * (*(A+k*dim+j));
	}
	returnValue[0] = 1;
}
__global__ void solveAk(float *A, float *B, int k, int dim, int *returnValue)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	if (fabsf(*(A+k*dim+k)) < 0.000000001) returnValue[0] = 100000000;
	while (i < dim) {
		if (i > k)
		{
			*(A+k*dim+i) = *(A+k*dim+i) / (*(A+k*dim+k));
		}
		if (i != k) *(B+i) = *(B+i) - *(A+i*dim+k) * (*(B+k));
		i += blockDim.x * gridDim.x;
	}
}
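
One thing worth checking in LUbuildKernel above, offered as a guess rather than a diagnosis: the thread with j == k overwrites A[i*dim+k] while the other threads of row i still read that element as their multiplier, and CUDA gives no ordering between threads in different blocks. That would fit the symptom that results break once a row spans more than one block in the j direction. A common remedy is to take a snapshot of column k before the update. The sketch below is plain host-side C++, so it runs without a GPU, and only emulates the update. updateWithSnapshot, updateInPlace, colK and jOrder are hypothetical names; in a real kernel colK would be a separate device buffer filled by an earlier launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates the A-update of LUbuildKernel on the host, visiting the columns j
// in the order given by jOrder (a stand-in for arbitrary thread scheduling).
// Multipliers come from colK, a copy of column k taken before any write,
// so the result cannot depend on the visiting order.
inline void updateWithSnapshot(std::vector<float>& A, int k, int dim,
                               const std::vector<int>& jOrder) {
    assert(A.size() == static_cast<std::size_t>(dim) * dim);
    std::vector<float> colK(dim);  // hypothetical scratch buffer
    for (int i = 0; i < dim; ++i) colK[i] = A[i * dim + k];

    for (int i = 0; i < dim; ++i) {
        if (i == k) continue;  // row k is only read, as in the kernel
        for (int j : jOrder)
            A[i * dim + j] -= colK[i] * A[k * dim + j];
    }
}

// Same update without the snapshot: A[i*dim+k] is read in place, so a column
// visited after j == k sees the already-overwritten multiplier.
inline void updateInPlace(std::vector<float>& A, int k, int dim,
                          const std::vector<int>& jOrder) {
    assert(A.size() == static_cast<std::size_t>(dim) * dim);
    for (int i = 0; i < dim; ++i) {
        if (i == k) continue;
        for (int j : jOrder)
            A[i * dim + j] -= A[i * dim + k] * A[k * dim + j];
    }
}
```

Running both functions on the same small matrix with the columns visited in different orders shows the difference: the snapshot version gives one result regardless of order, while the in-place version changes with it.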

Without going into too much detail of how your code works, maybe this will help. I use this to make a simple grid of threads that I can index linearly. If you don't care how your blocks are set up and just want a billion threads, this may help.

// tools for starting a linear kernel
#define K_THREADS 256
#define K_INDEX() ((gridDim.x * blockIdx.y + blockIdx.x) * blockDim.x + threadIdx.x)
inline dim3 K_GRID(int n, int threads = K_THREADS) {
	int blocks = (int)ceilf(sqrtf((float)n/threads));
	dim3 grid(blocks, blocks);
	return grid;
}

In host code:

Kernel<<<K_GRID(n), K_THREADS>>>(n);

In your kernel:

__global__ void Kernel(int n) {
	int i = K_INDEX();
	int work = i < n;

	if (work) {

		// --> do something

	}
}

Thank you very much for your reply, Adamjmac. Your GPU grid for 1D vector processing is great! But when I use it for the 2D matrix processing in my kernel function above, __global__ void LUbuildKernel(float *A, float *B, int k, int dim, int *returnValue), I cannot get the right result. I do not know whether my algorithm is OK for large-scale matrices. I am new to CUDA.
Thanks.
Waiting for further replies.
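
One way to reuse the linear index for a 2D matrix, as a sketch only: launch with n = dim * dim, treat K_INDEX() as an index into the row-major array, and recover the row and column with a division and a remainder. The host-side C++ below (plain C++, so it runs without a GPU) mirrors K_GRID and that mapping. kGridBlocks, RowCol and toRowCol are hypothetical names introduced here.

```cpp
#include <cassert>
#include <cmath>

const int K_THREADS = 256;

// Host mirror of K_GRID above: side length of the square grid of blocks.
inline int kGridBlocks(int n, int threads = K_THREADS) {
    return (int)std::ceil(std::sqrt((float)n / threads));
}

struct RowCol { int row, col; };

// Hypothetical helper: maps a linear index (what K_INDEX() returns in the
// kernel) to a row/column pair of a row-major dim x dim matrix.
inline RowCol toRowCol(int linear, int dim) {
    assert(dim > 0 && linear >= 0);
    RowCol rc;
    rc.row = linear / dim;  // plays the role of i in LUbuildKernel
    rc.col = linear % dim;  // plays the role of j in LUbuildKernel
    return rc;
}
```

In the kernel this would read int idx = K_INDEX(); followed by i = idx / dim and j = idx % dim, guarded by idx < dim * dim. The mapping only changes how threads are numbered. It does not change what each thread reads or writes, so any ordering problem between threads in the algorithm itself would remain.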