Here is a simple question.
What is the optimal layout of the grid?
For example, is a 10000x1 grid equivalent to a 100x100 one in terms of performance?
And what if the number of required grid blocks is an arbitrary number like 9261?
My first idea is
dimGrid( (int)sqrt(9261)+1, (int)sqrt(9261)+1 )
and check the block index inside the kernel.
Does this seem like a reasonable approach?
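
To make the idea concrete, here is a rough host-side C++ sketch of the arithmetic only; it does not launch anything. The function names (gridSide, linearBlockId, countActiveBlocks) are made up for illustration. The assumption is that inside the kernel the same expression would be evaluated as blockIdx.y * gridDim.x + blockIdx.x, and any block whose linear index is >= the required count would return immediately.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: smallest side s with s * s >= numBlocks.
// The loops correct any floating-point rounding in sqrt.
unsigned gridSide(unsigned numBlocks) {
    unsigned s = static_cast<unsigned>(std::sqrt(static_cast<double>(numBlocks)));
    while (s * s < numBlocks) ++s;
    while (s > 0 && (s - 1) * (s - 1) >= numBlocks) --s;
    return s;
}

// Hypothetical helper: the linear block index the kernel would compute
// from blockIdx.x, blockIdx.y and gridDim.x.
unsigned linearBlockId(unsigned bx, unsigned by, unsigned gridDimX) {
    return by * gridDimX + bx;
}

// Host-side stand-in for the launch: walks the square grid and counts
// the blocks that pass the guard (the rest would exit early in the kernel).
unsigned countActiveBlocks(unsigned numBlocks) {
    unsigned side = gridSide(numBlocks);
    assert(side * side >= numBlocks);  // the grid must cover every required block
    unsigned active = 0;
    for (unsigned by = 0; by < side; ++by) {
        for (unsigned bx = 0; bx < side; ++bx) {
            if (linearBlockId(bx, by, side) < numBlocks) {
                ++active;
            }
        }
    }
    return active;
}
```

For 9261 this gives a 97x97 grid (9409 blocks), so 148 blocks would do nothing. Note that (int)sqrt(n)+1 also gives 97 here, but for a perfect square such as 10000 it gives 101 where 100 would be enough.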