I am a graduate student writing a paper on optimizing GPU program performance by dynamically adjusting the block size and loop-unrolling level according to the input. I tried different programs in the SDK, but it seems that our method is insensitive to the different inputs.
Do you guys know where I can find some other programs to test our approach? Or do any of you have programs written that you would be willing to release for testing? I tried to Google but I didn’t find very good resources.
I am concerned that you’re taking concepts from the CPU world and not-quite-correctly applying them to CUDA.
For one thing, I know loop unrolling has very different purposes in the two camps. In addition, “block size” may mean different things depending on context (i.e., the “block size” in a matrix-multiply algorithm vs. the “size of a thread block” in CUDA).
Thanks for pointing this out. Sorry, the wording may have been a little vague. By “block size” I mean the size of a thread block in CUDA. By loop unrolling, I mean duplicating the loop body several times to reduce the number of condition checks.
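To make that concrete, here is a minimal sketch of the transformation I mean (plain C; the function names are just illustrative) — the body is duplicated four times so the loop condition is checked a quarter as often:

```c
#include <stddef.h>

/* Original: one condition check per element. */
float sum_rolled(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Unrolled by 4: one condition check per four elements,
   plus a cleanup loop for any remainder. */
float sum_unrolled(const float *a, size_t n) {
    float s = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; ++i)   /* remainder */
        s += a[i];
    return s;
}
```

Our tuner would pick the unroll factor (2, 4, 8, …) based on the input.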
Thing is, on a CPU loop unrolling does more than reduce condition checks. It lets superscalar architectures better reorder instructions to hide their latency, and it also suffers fewer branch mispredictions. (Neither of which applies to a GPU.) On a GPU, loop unrolling has a very different, and even more critical, role: in certain cases it allows arrays to be placed into registers, which gives the fastest performance. For this to work the loop has to be unrolled completely. Any other amount ruins the optimization.
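To illustrate that point (a hypothetical sketch, not code from any real project): a per-thread array that is only ever indexed by a fully unrolled loop counter can be promoted to registers, because every index becomes a compile-time constant:

```cuda
#define N 8  // compile-time trip count, so the compiler can unroll fully

__global__ void scale(const float *in, float *out) {
    float buf[N];  // per-thread array
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * N;

    #pragma unroll              // complete unrolling: each index into buf
    for (int i = 0; i < N; ++i) // is a compile-time constant, so buf can
        buf[i] = in[base + i];  // live entirely in registers

    #pragma unroll
    for (int i = 0; i < N; ++i)
        out[base + i] = 2.0f * buf[i];
}
```

If the trip count were a runtime variable (or the unroll were partial), the remaining loop would index `buf` with a non-constant, and the array would spill to slow local memory — which is why “unroll a bit more or a bit less” isn’t a tunable knob here.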
Regarding block size: there is leeway for some kernels, but usually a kernel’s author knows exactly what block size is optimal and there’s no point second-guessing. This is determined by resource utilization, meaning shared memory and register count. The choice may be worth re-evaluating when new architectures come out with different resources (particularly shared memory size). This, btw, is the real point behind ATLAS-like automatic block size tuning: it gets past different processors having very different cache sizes and hierarchies. Yet all CUDA GPUs that have been released are very similar, and have identical shared memory and other characteristics (although the number of registers did change). There’s no reason, yet, for an automatic approach to tuning.
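As a concrete instance of that resource arithmetic (the per-SM limits below are illustrative GT200-era figures, not taken from this thread), the number of resident blocks is just the tightest of three ceilings:

```c
#include <stdio.h>

/* Hypothetical per-SM limits, for illustration only. */
#define REGS_PER_SM        16384
#define SMEM_PER_SM        16384   /* bytes */
#define MAX_THREADS_PER_SM 1024

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* How many blocks of a given size fit on one SM, given the kernel's
   register usage per thread and shared memory usage per block. */
int blocks_per_sm(int block_size, int regs_per_thread, int smem_per_block) {
    return min3(REGS_PER_SM / (regs_per_thread * block_size),
                SMEM_PER_SM / smem_per_block,
                MAX_THREADS_PER_SM / block_size);
}
```

For example, a kernel using 16 registers/thread and 4 KB of shared memory per 256-thread block gets 4 resident blocks (1024 threads, full occupancy), while bumping register use to 32/thread halves that — the author already knows these numbers, which is why the block size is usually fixed up front.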
However, the most important thing is that adjusting for more shared memory or registers usually isn’t as simple as changing the kernel’s block dimensions. (It depends on how the kernel is written.) Other things have to be done manually (such as changing blocking factors or moving more variables into registers). An automatic adjustment strategy is algorithm-specific. Hence what ATLAS does is constrained to its own matrix-multiply code, and if another library wants to do something similar, it has to figure out how to accomplish that for itself.
But, you know, go ahead. Let us know if you find something interesting.