How to decide the size of block and grid for a kernel?
If I want to use CUDA instead of OpenGL to render, what values should be set?
How to structure the block and grid configuration, and how to map data to threads is entirely up to the CUDA programmer. In other words, CUDA’s approach is light on constraints and emphasizes flexibility.
Some useful heuristics for an initial configuration (see the sketch after this list):
(1) choose a multiple of 32, between 128 and 256, as the number of threads per block;
(2) make a grid that has as many threads as there are output elements, that is, each thread is assigned one data element;
(3) let each thread gather the input data it needs to produce its output element;
(4) threads in each thread block may cooperate to pull input data into shared memory, functioning as a software-controlled cache.
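A minimal sketch of heuristics (1)–(3), assuming a hypothetical element-wise addition kernel (the names addKernel, a, b, out are illustrative, not from the question): 256 threads per block, a grid rounded up so every output element gets a thread, and each thread gathering its own inputs.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void addKernel(const float *a, const float *b, float *out, int n)
{
    // One thread per output element (heuristic 2).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                    // guard against the rounded-up grid
        out[i] = a[i] + b[i];     // each thread gathers its own inputs (heuristic 3)
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *a, *b, *out;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&out, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Heuristic (1): threads per block is a multiple of 32, between 128 and 256.
    const int threadsPerBlock = 256;
    // Heuristic (2): round up so the grid covers all n output elements.
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    addKernel<<<blocks, threadsPerBlock>>>(a, b, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);   // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(out);
    return 0;
}
```

Heuristic (4) would come into play when neighboring threads reuse each other's inputs (a stencil or tiled matrix multiply, for example); in that case the block would first stage those inputs into __shared__ memory and synchronize before computing.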
I would try to keep the block and grid organization as simple as possible, preferring 1D over 2D over 3D. Obviously, for some use cases (processing of 2D tiles of matrices or images) a 2D organization is entirely natural and fine to use.
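For the 2D case, here is a sketch of a natural configuration for image-like data (the image dimensions and the invertKernel name are hypothetical): a 16x16 block still gives 256 threads, and the grid is rounded up in both dimensions so every pixel is covered.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void invertKernel(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}

int main()
{
    const int width = 1920, height = 1080;
    unsigned char *img;
    cudaMallocManaged(&img, width * height);
    cudaMemset(img, 0, width * height);

    // 16x16 = 256 threads per block; grid rounded up in both dimensions
    // so every pixel gets a thread.
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    invertKernel<<<grid, block>>>(img, width, height);
    cudaDeviceSynchronize();

    printf("img[0] = %d\n", img[0]);   // expect 255
    cudaFree(img);
    return 0;
}
```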
If you haven’t done so yet, I recommend reading (or at least perusing) the CUDA Best Practices Guide.