Can Dg*Ns <= 256KB be a general heuristic for kernel configuration?

Hi, I would appreciate it if NVIDIA employees could help with this question. (8800GTX)
The kernel launch parameters <<<Dg, Db, Ns>>> are hard to configure. The best way is to tune them, but for a big system we would like to avoid tuning. We need a principle for configuring them so that performance is not far from the best case, say >80% of the best performance.
One thing I have found so far: Ns should be below 8KB at the very least, and ideally below 4KB.
For Dg, I have two conflicting opinions:

A: “Given a fixed Ns, Dg * Ns <= 256KB is a safe and reasonably good principle for determining Dg (the number of blocks).” Say, given Ns = 4KB, Dg is best at 64.

B: “I have never seen a crash with Dg * Ns > 256KB, say Dg.x = 256, Ns = 4KB.”

My questions:
1. Is Dg > 256KB/Ns dangerous?
2. Is Dg > 256KB/Ns pointless?
In my tests, a big Dg did not bring a significant performance gain over setting it to 256KB/Ns. I also found that Dg doesn’t matter as long as it’s not too small (say, <32).
3. Is Dg * Ns <= 256KB a good heuristic?

Ns sets the amount of dynamically allocated shared memory per block, and it is entirely dependent on the algorithm you are using. Some algorithms require shared memory, some don’t. In most of the ones I’ve worked with, the shared memory size is tied to the block size. I don’t think there is any general prescription that can tell you the “optimal” shared memory size for your algorithm.
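To illustrate what "shared memory size tied to the block size" can look like, here is a minimal host-side sketch. The helper name `shared_bytes_for_block` and the per-thread layout (a fixed number of elements per thread) are assumptions made for illustration, not something taken from a particular algorithm.

```cpp
#include <cstddef>

// Hypothetical helper: bytes of dynamic shared memory (the Ns launch
// parameter) for a kernel that stores `elems_per_thread` values of
// `elem_size` bytes for every thread in the block.
constexpr std::size_t shared_bytes_for_block(std::size_t threads_per_block,
                                             std::size_t elems_per_thread,
                                             std::size_t elem_size) {
    return threads_per_block * elems_per_thread * elem_size;
}
```

With this kind of layout, Ns follows from Db rather than being chosen independently: a 256-thread block holding one float per thread needs 1KB, and the same kernel at 512 threads needs 2KB.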

If I understand correctly, you are using Dg as the number of blocks. Choosing this well is easy: the programming guide says that you need at least 100 blocks to be in an optimal performance region, and 1000 if you want to scale to future devices. My performance tests agree with this. Execution time increases in a stair-step fashion and there is a lot of overhead until Dg = 200 or so. After that, everything scales nearly perfectly linearly. I know of no reason to have a heuristic that forces Dg to be small.
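As a rough sketch of that guideline, the number of blocks can be derived from the problem size and then checked against the 100-block figure quoted above. The names `blocks_for`, `kMinBlocksForGoodScaling`, and `grid_is_large_enough` are hypothetical, and the sketch assumes one thread per element.

```cpp
#include <cstddef>

// Hypothetical helper: number of blocks (Dg.x) needed to cover `n` elements
// with one thread per element, rounding up so the tail is not dropped.
constexpr std::size_t blocks_for(std::size_t n, std::size_t threads_per_block) {
    return (n + threads_per_block - 1) / threads_per_block;
}

// Assumed threshold: the "at least 100 blocks" figure quoted in the text.
constexpr std::size_t kMinBlocksForGoodScaling = 100;

// True when the grid is at or above the assumed threshold.
constexpr bool grid_is_large_enough(std::size_t num_blocks) {
    return num_blocks >= kMinBlocksForGoodScaling;
}
```

Under these assumptions, the Dg = 64 case from the question (Ns = 4KB, Dg * Ns = 256KB) falls below the threshold, while Dg = 256 passes it.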

The block size, Db, is a much more complicated beast, and I doubt there can be any general formula for the optimal value. The block size needs to balance memory access patterns, read-after-write memory dependencies, warp occupancy, and a host of other issues, all working together in a non-trivial way. Figuring out the optimal value would require nothing less than a full-fledged device simulator, which of course we all already have: the GPU itself! I determine Db by writing kernels that adapt to their block size, then benchmarking the code on test cases at all possible block sizes. Hopefully the optimal block size obtained from these tests will remain optimal under real-world conditions.
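The benchmarking approach described above can be sketched as a small host-side sweep. This is only an outline under stated assumptions: `best_block_size` is a hypothetical helper, and `run_ms` stands in for whatever code launches the kernel at a given block size, waits for it to finish, and returns the elapsed time in milliseconds (for example, measured with CUDA events).

```cpp
#include <functional>
#include <limits>
#include <vector>

// Hypothetical helper: try every candidate block size and return the one
// with the lowest measured time. `run_ms(block_size)` is assumed to launch
// the kernel with that block size, synchronize, and return elapsed ms.
// Returns -1 when no candidates are given.
inline int best_block_size(const std::vector<int>& candidates,
                           const std::function<double(int)>& run_ms) {
    int best = -1;
    double best_ms = std::numeric_limits<double>::infinity();
    for (int bs : candidates) {
        const double ms = run_ms(bs);
        if (ms < best_ms) {
            best_ms = ms;
            best = bs;
        }
    }
    return best;
}
```

Keeping the timing code behind a callable means the same sweep can be reused for different kernels and test cases; in practice each measurement would probably be repeated several times to reduce noise.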

Of course, not all algorithms can be written to have an adaptable block size, so sometimes you are just stuck with it. If such an algorithm is performing poorly (as measured by the sustained memory throughput and/or GFLOPs), then maybe it is time to look for a new algorithm.