I have a very simple problem that I cannot solve. I have a GTX 295, CUDA 2.3, Windows XP 32-bit, and VS 2008. I simply want a kernel to set every element of a large vector to 0 (for this example). Here are the relevant code sections:

__global__ void zero(cufftComplex *a, size_t ne)
{
    // global linear index of this thread
    size_t t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < ne)
    {
        a[t].x = 0.0f;
        a[t].y = 0.0f;
    }
}

long n1 = 65535;
long n2 = 257;
long nt = 256; // number of threads per block
long ne = n1 * n2;

I chose the parameters n1 and n2 to illustrate the point. I know that compute capability 1.3 devices have a maximum size of 65535 for each dimension of the grid. With nt = 256 threads per block, ne = n1 * n2 elements need more than 65535 blocks. My question is: how do I do trivial operations within a kernel on a large 1D array when the size of the array makes the number of blocks exceed 65535?
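One possible approach, sketched here as an assumption rather than a confirmed answer, is to keep the 1D grid, cap it at 65535 blocks, and let each thread loop over several elements. The kernel name zero_strided and the test harness in main are hypothetical; only the sizes n1, n2, nt come from the code above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>   // only for the cufftComplex type

// Hypothetical variant of zero(): each thread handles elements
// t, t + stride, t + 2*stride, ... so the grid may be smaller than ne / nt.
__global__ void zero_strided(cufftComplex *a, size_t ne)
{
    size_t stride = (size_t)blockDim.x * gridDim.x;
    for (size_t t = (size_t)blockIdx.x * blockDim.x + threadIdx.x; t < ne; t += stride)
    {
        a[t].x = 0.0f;
        a[t].y = 0.0f;
    }
}

int main()
{
    const size_t n1 = 65535, n2 = 257, ne = n1 * n2;
    const unsigned int nt = 256;            // threads per block

    size_t blocks = (ne + nt - 1) / nt;     // blocks needed for one element per thread
    if (blocks > 65535) blocks = 65535;     // cap at the per-dimension grid limit

    cufftComplex *d_a = 0;
    if (cudaMalloc((void **)&d_a, ne * sizeof(cufftComplex)) != cudaSuccess) return 1;
    cudaMemset(d_a, 0xFF, ne * sizeof(cufftComplex));   // fill with non-zero bytes first

    zero_strided<<<(unsigned int)blocks, nt>>>(d_a, ne);
    cudaError_t err = cudaGetLastError();

    // Copy the last element back; cudaMemcpy waits for the kernel to finish.
    cufftComplex last;
    cudaMemcpy(&last, d_a + (ne - 1), sizeof(last), cudaMemcpyDeviceToHost);
    printf("launch: %s, last element = (%f, %f)\n",
           cudaGetErrorString(err), last.x, last.y);

    cudaFree(d_a);
    return err == cudaSuccess ? 0 : 1;
}
```

With this layout the launch configuration no longer depends on the array size, at the cost of a loop inside the kernel.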

Yes, I could use a 2D grid, but I am unsure of the 2D -> linear indexing. The data is actually 2D of size n1 x n2, just stored in a 1D array of n1*n2 elements. Given a 2D index (i, j), the 1D index is k = i + j*n1. But with the 2D grid, I am unsure of how to remap. I see 2D arrays discussed in the programming guide but do not see how to actually allocate them on the device.
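For what it is worth, here is a minimal sketch of one way the remapping might look, assuming the array stays a plain cudaMalloc allocation of n1*n2 elements. The 2D grid is used only to get enough blocks: the block index is flattened as blockIdx.y * gridDim.x + blockIdx.x and then treated like a 1D block index. The kernel name zero_2d and the grid shape chosen in main are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>   // only for the cufftComplex type

// Hypothetical variant of zero() for a 2D grid of 1D blocks.
__global__ void zero_2d(cufftComplex *a, size_t ne)
{
    // Flatten the 2D block index, then add the thread offset within the block.
    size_t block = (size_t)blockIdx.y * gridDim.x + blockIdx.x;
    size_t t = block * blockDim.x + threadIdx.x;
    if (t < ne)   // surplus threads in the padded grid do nothing
    {
        a[t].x = 0.0f;
        a[t].y = 0.0f;
    }
}

int main()
{
    const size_t n1 = 65535, n2 = 257, ne = n1 * n2;
    const unsigned int nt = 256;                  // threads per block

    size_t blocks = (ne + nt - 1) / nt;           // total blocks needed
    unsigned int gx = blocks > 65535 ? 65535u : (unsigned int)blocks;
    unsigned int gy = (unsigned int)((blocks + gx - 1) / gx);
    dim3 grid(gx, gy);                            // gx * gy >= blocks, each dim <= 65535

    cufftComplex *d_a = 0;
    if (cudaMalloc((void **)&d_a, ne * sizeof(cufftComplex)) != cudaSuccess) return 1;
    cudaMemset(d_a, 0xFF, ne * sizeof(cufftComplex));   // fill with non-zero bytes first

    zero_2d<<<grid, nt>>>(d_a, ne);
    cudaError_t err = cudaGetLastError();

    // Copy the last element back; cudaMemcpy waits for the kernel to finish.
    cufftComplex last;
    cudaMemcpy(&last, d_a + (ne - 1), sizeof(last), cudaMemcpyDeviceToHost);
    printf("grid = %u x %u, launch: %s, last element = (%f, %f)\n",
           gx, gy, cudaGetErrorString(err), last.x, last.y);

    cudaFree(d_a);
    return err == cudaSuccess ? 0 : 1;
}
```

This grid shape is not tuned: it simply fills the x dimension first, so up to one row of blocks may be idle. If the kernel needs the 2D coordinates, they can be recovered from the linear index as i = t % n1 and j = t / n1, matching k = i + j*n1.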