In the presentation “Better Performance at Lower Occupancy” (http://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pdf), the key message is that better performance can be achieved by increasing the number of elements per thread at lower occupancy. Performance increases because each thread is given more ILP (instruction-level parallelism). Normally, we employ a modify-compile-run-repeat cycle (aka auto-tuning) to determine the optimal number of elements per thread.
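The ILP argument can be sketched with Little's law: the amount of work an SM needs in flight equals latency times throughput, and independent work within a thread reduces how many threads are needed to supply it. A minimal illustration in Python, where the latency and throughput numbers are placeholder assumptions in the spirit of the slides' back-of-envelope figures, not values for any specific GPU:

```python
def required_threads_per_sm(latency_cycles, ops_per_cycle, ilp_per_thread):
    """Threads needed per SM to hide latency, given the ILP inside each thread.

    Little's law: operations that must be in flight = latency x throughput.
    Each thread contributes `ilp_per_thread` independent operations,
    so higher ILP lowers the thread count (occupancy) required.
    """
    needed_in_flight = latency_cycles * ops_per_cycle
    return needed_in_flight / ilp_per_thread

# Assumed 18-cycle arithmetic latency, 32 ops/cycle per SM (illustrative only):
print(required_threads_per_sm(18, 32, 1))  # no ILP: 576 threads needed
print(required_threads_per_sm(18, 32, 4))  # 4-way ILP: only 144 threads
```

Processing more elements per thread is one way to create that per-thread ILP, which is why it can pay off even as occupancy drops.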

My question is: is there any way to determine the optimal number of elements per thread through analysis, without running the code? For example, for the matrix multiplication example in the presentation, is there some formula or model that could calculate or predict the number (in this case 36; see page 65 of the slides)?

From having spent significant time optimizing various matrix operations on GPUs, I’d say the answer is “No”, but I’d be happy if someone can point to research that says otherwise. By “optimal” I assume we mean within 5% of the best solution found empirically for a specific matrix algorithm with one specific compiler version on one specific GPU SKU.

Your response is very informative, and it makes the original question more complete.

By summing up your response, the question becomes the following:
For a particular application (e.g., vector addition or matrix multiplication), we employ a modify-compile-run-repeat cycle to determine the optimal number of elements per thread. This requires modifying, compiling, and running the code. Is there any way to determine the optimal number without running the code? By “optimal” we mean within 5% of the best solution found empirically for an algorithm with one specific compiler version on one specific GPU SKU. In other words, can we build an analytical model to determine the optimal number of elements per thread?

My idea is that we could possibly use the assembly code to build an analytical model that predicts the optimal number of elements per thread, but I am not sure whether that is doable. Has anyone done a similar project before? Any pointers or comments are welcome.
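For what it's worth, a static model built from the assembly might start from an instruction histogram of the inner loop. A minimal sketch, where the opcode names and per-instruction issue costs are purely assumed placeholders (real costs depend on the architecture, dual issue, dependency stalls, and memory behavior, which is exactly where such models get hard):

```python
# Assumed issue costs per instruction class (placeholder values, not real):
ISSUE_CYCLES = {"FFMA": 1, "LDG": 4, "IADD": 1}

def static_cycle_estimate(instr_counts):
    """Optimistic lower bound on inner-loop cycles: sum of issue costs,
    ignoring stalls, cache effects, and instruction-level overlap."""
    return sum(ISSUE_CYCLES[op] * n for op, n in instr_counts.items())

# e.g. a hypothetical inner loop with 2 loads, 4 FMAs, 1 integer add per element:
print(static_cycle_estimate({"LDG": 2, "FFMA": 4, "IADD": 1}))  # 13
```

One could then evaluate such an estimate for each candidate elements-per-thread value and pick the minimum, but the quality of the prediction stands or falls with the cost numbers.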

You could certainly build a model, but in my (dated) experience, models involving significant memory traffic are very hard to get within a reasonable error limit, say 10%. One difficulty is accounting for the exact stream of transactions presented to the DRAM, and the DRAM's response to those transactions. Simple approaches like dialing a read/write ratio into the model often don't cut it; modern memory hierarchies are too complex for that.

A model for something simple like a vector addition should be doable, but if you need a model that is reasonably accurate across a variety of algorithms, I personally would consider the amount of model validation / calibration, and re-validation for each new architecture or memory type to be too painful.
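For the simple case, a vector-addition model could be as crude as a roofline-style bound: the kernel cannot finish faster than either its memory traffic or its arithmetic allows. A sketch, with hypothetical bandwidth and throughput figures:

```python
def estimated_time_s(bytes_moved, flops, bw_bytes_per_s, flops_per_s):
    """Roofline-style lower bound: max of memory time and compute time."""
    return max(bytes_moved / bw_bytes_per_s, flops / flops_per_s)

# Vector add c = a + b over n floats: read a, read b, write c -> 12*n bytes,
# and n additions. Assumed 150 GB/s bandwidth and 1 TFLOP/s (illustrative).
n = 1 << 24
t = estimated_time_s(12 * n, n, 150e9, 1000e9)
print(t)  # memory time dominates: the kernel is predicted bandwidth-bound
```

This kind of bound says little about the best elements-per-thread value, though; that depends on the latency-hiding and scheduling effects the simple model leaves out.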

I actually see auto-tuning as a best practice for computations like this, and it is a technique that was established for CPU-based computations for well over a decade before CUDA came along (see ATLAS, for example). I understand that building a framework for auto-tuning is quite a bit of work, but once you have it, things should be relatively painless.
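The tuning loop itself is short once the benchmark step exists. A sketch, where `benchmark` stands in for the real modify-compile-run step and the stand-in cost function below is made up purely to exercise the loop, not real GPU timings:

```python
def autotune(benchmark, candidates):
    """Return the candidate with the lowest measured time, plus all timings.

    `benchmark` would recompile and time the kernel for a given
    elements-per-thread value; here it is any callable returning a time.
    """
    timings = {c: benchmark(c) for c in candidates}
    best = min(timings, key=timings.get)
    return best, timings

# Stand-in cost function with a minimum at 8 (NOT a real measurement):
fake_time_ms = lambda ept: abs(ept - 8) + 1.0
best, timings = autotune(fake_time_ms, [1, 2, 4, 8, 16, 32])
print(best)  # 8
```

The expensive part in practice is the harness around this loop: code generation or templating for each candidate, recompilation, and careful timing. But as noted above, once that framework exists it amortizes well across kernels and architectures.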