Max blocks per grid

I have a very simply problem I cannot solve. I have a GTX295, CUDA 2.3, Win XP 32bit, VS 2008. I simply want a kernel to set every element of a large vector to 0 (for this example). Here are the relevant code sections:

global void zero(cufftComplex *a, size_t ne )
{
// subset range bins
size_t t = blockIdx.x * blockDim.x + threadIdx.x;
if (t<ne)
{
a[t].x = 0.0f;
a[t].y = 0.0f;
}
}

long n1 = 65535; long n2 = 257; long nt = 256; // number of threads per block long ne = n1 * n2;
cufftComplex *data_d;
cudaMalloc( (void **) &data_d,sizeof(cufftComplex)*ne);

long n_blocks    = ne/nt + (ne%nt == 0 ? 0:1);      
zero<<< n_blocks, nt >>>(data_d, ne);

cudaFree(data_d);

I choose the parameters n1 and n2 to illustrate the point. I know that the 1.3 compute capable device have a max size of each dim in the grid of 65535. My question is: How do I do trivial operations within a kernel on a large 1D array where the size of the array makes the blocksize exceed 65535?

b

why not using 2D grid?

maximum dimension of 2D grid is (65535, 65535) and each block can has 256 threads (or 512 threads), then

you can access array of arbitrary size. Even 1D array, what you need to do is to determine a index mapping

map 2D index to 1D array

Yes, I could use a 2D grid but I am unsure of the 2D -> linear indexing. The data is actually 2D of size n1,n2, just stored in a 1D array n1n2. Given an index in 2D i,j the 1D index k = i + ji. But with the 2D grid, I am unsure of how to remap. I see 2D arrays discussed in the programming guide but do not see how to actually allocate them on the device.

[codebox]/*

suppose that matrix a with dimension n1 x n2 is stored in 1D array by row-major

of row of matrix a is n1 and

of col of matrix a is n2

n2 (col)

      ----------------  

      |               |

      |    a          |

n1 (row) | |

      |               |

      ---------------- 

The following code use index (j,i) to represent row i and col j

i

----------------  

|               |

j | a(j,i) |

|               |

|               |

---------------- 

*/

global void zero(cufftComplex *a, int row, int col )

{

int i = blockIdx.x * blockDim.x + threadIdx.x;

int j = blockIdx.y * blockDim.y + threadIdx.y;

if (i < col && j < row ){

// row-major index

// in this setting, every warp of threads access contiguous data entry

int k = j*col + i ; 	

	a[k].x = 0.0f;

	a[k].y = 0.0f;

}

}

/*

in main program, use 2D grid with natural index in CUDA

i

 ---------------------------------------

 | block(0,0) | block(1,0) | block (2,0) ...

j ---------------------------------------

 | block(0,1) | block(1,1) | block (2,1) ...

 ---------------------------------------     

row of matrix has # of grid = (n1 + dimBlock.y – 1) / dimBlock.y

col of matrix has # of grid = (n2 + dimBlock.x – 1) / dimBlock.x

<-- dimBlock.x -->

^ ------------------

| | |

dimBlock.y | block(i,j) |

| | |

v ------------------

*/

int main()

{

int n1 = 65535;

int n2 = 257;

int ne = n1 * n2;

cufftComplex *data_d;

cudaMalloc( (void **) &data_d,sizeof(cufftComplex)*ne);

dim3 dimBlock(16, 16); // nt = 256, number of threads per block

dim3 dimGrid((n2 + dimBlock.x – 1) / dimBlock.x, (n1 + dimBlock.y – 1) / dimBlock.y);

zero<<<dimGrid, dimBlock>>>(data_d, n1, n2);

cudaFree(data_d);

}[/codebox]