I need to build a small library of kernels that will process 1D arrays of 128 million integers. All of the examples I have seen
deal with 2D problems, so I was hoping I could get a sanity check from the experts to see if my approach is correct for a 1D
problem. Here is my main and mykernel.

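__global__ void mykernel(int *A, int *B, int *C)
{
    // N (the element count) is assumed to be defined elsewhere, e.g. as a constant.
    int Q  = gridDim.x * blockIdx.y + blockIdx.x;          // linear index of this block in the 2D grid
    int R  = Q * blockDim.x * blockDim.y;                  // index of the first thread in this block
    int IX = R + blockDim.x * threadIdx.y + threadIdx.x;   // 1D index into the arrays

    if (IX < N)
        A[IX] = B[IX] + C[IX];
}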

Since I have 128 million (2**27) integers to manipulate, I thought I would launch a kernel with 128 million threads.
I decided each thread block would have 256 threads.
I (arbitrarily) decided to make the grid square, so I computed the square root of N in my dimGrid initializations.

Inside of my kernel I now have a 2D grid that is composed of 2D thread blocks.
I simply go through some calculations to create IX, the 1D index into the arrays, from all of the 2D grid and block information.
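Since the main was not shown, here is a minimal sketch of what the launch setup described above might look like. The helper names squareGridSide and makeLaunchDims are hypothetical, and the 16 x 16 block shape is an assumption (any 2D shape with 256 threads would do). For N = 2**27 and 256 threads per block, 524,288 blocks are needed, so the smallest square grid is 725 x 725. That launches 342,272 more threads than there are elements, which is why the kernel needs the `if (IX < N)` guard.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical helper: side length of the smallest square grid that holds
// enough blocks to cover n elements with threadsPerBlock threads per block.
inline unsigned int squareGridSide(long long n, int threadsPerBlock)
{
    assert(n > 0 && threadsPerBlock > 0);
    long long blocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // ceil(n / threadsPerBlock)
    long long side = 0;
    while (side * side < blocks)   // integer search avoids floating-point sqrt rounding
        ++side;
    return (unsigned int)side;
}

// Hypothetical helper: fills in a square 2D grid of 2D blocks for n elements.
inline void makeLaunchDims(long long n, dim3 *grid, dim3 *block)
{
    *block = dim3(16, 16);                        // 16 x 16 = 256 threads per block (assumed shape)
    unsigned int side = squareGridSide(n, 256);
    *grid = dim3(side, side);                     // square grid, possibly with spare threads
}
```

With these dimensions the launch would be written as `mykernel<<<grid, block>>>(A, B, C);`, where A, B and C are device pointers.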

In my eyes your indexing is a bit unintuitive. This is my way:

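__global__ void mykernel(int *A, int *B, int *C)
{
    int Q  = blockIdx.x * blockDim.x + threadIdx.x;   // global x coordinate of this thread
    int R  = blockIdx.y * blockDim.y + threadIdx.y;   // global y coordinate of this thread
    int IX = gridDim.x * blockDim.x * R + Q;          // row-major 1D index (row width = threads in x)
    if (IX < N)
        A[IX] = B[IX] + C[IX];
}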

@BigMac: Your solution will fail to launch; there is a limit of 65535 (65K) blocks for each dimension of dimGrid.
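The grid limit is a property of the device, so it can be checked at run time instead of being hard-coded. The following is only a sketch: gridFits and gridFitsOnDevice are hypothetical helper names, while cudaGetDeviceProperties and the maxGridSize field of cudaDeviceProp are part of the CUDA runtime API. For example, covering 2**27 elements with a 1D grid of 256-thread blocks takes 524,288 blocks, which is over a 65535 limit in x.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: true if every grid dimension is within the given limits.
inline bool gridFits(dim3 grid, const int maxGridSize[3])
{
    return grid.x <= (unsigned int)maxGridSize[0] &&
           grid.y <= (unsigned int)maxGridSize[1] &&
           grid.z <= (unsigned int)maxGridSize[2];
}

// Hypothetical helper: checks a grid against the limits reported by device dev.
inline bool gridFitsOnDevice(dim3 grid, int dev)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess)
        return false;                        // treat a failed query as "does not fit"
    return gridFits(grid, prop.maxGridSize); // maxGridSize holds the x, y, z limits
}
```

Calling cudaGetLastError() right after a kernel launch is another way to notice a configuration the device rejected.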

@bzigon: Use a fixed 1D grid and have each thread work on multiple elements. Several SDK examples use this arrangement (I think MonteCarlo is one of them).
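A minimal sketch of that arrangement, assuming a grid of 1024 blocks of 256 threads (both numbers are arbitrary choices for illustration, not taken from the SDK samples). The names addStrided, launchAddStrided and elementsPerThread are hypothetical, and the element count is passed as a parameter n rather than read from a global N. Each thread starts at its global index and steps forward by the total number of threads in the grid, so for 2**27 elements every thread handles 512 of them.

```cuda
#include <cuda_runtime.h>

// Each thread handles elements i, i + stride, i + 2*stride, ... so a fixed-size
// 1D grid covers any n without exceeding the grid dimension limit.
__global__ void addStrided(int *A, const int *B, const int *C, size_t n)
{
    size_t stride = (size_t)gridDim.x * blockDim.x;   // total number of threads in the grid
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        A[i] = B[i] + C[i];
}

// Hypothetical helper: the most elements any single thread has to process.
inline size_t elementsPerThread(size_t n, size_t blocks, size_t threads)
{
    size_t total = blocks * threads;
    return (n + total - 1) / total;                   // ceil(n / total)
}

// Hypothetical host wrapper; dA, dB and dC must be device pointers.
inline cudaError_t launchAddStrided(int *dA, const int *dB, const int *dC, size_t n)
{
    const int threads = 256;    // threads per block, as in the original post
    const int blocks  = 1024;   // assumed fixed grid size, well under the limit
    addStrided<<<blocks, threads>>>(dA, dB, dC, n);
    return cudaGetLastError();  // reports an invalid launch configuration, if any
}
```

Because consecutive threads touch consecutive elements on every pass of the loop, memory accesses stay coalesced, and the same kernel works unchanged if the grid size is later tuned.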