I am a graduate student writing a paper on optimizing GPU program performance by dynamically adjusting the block size and loop-unrolling level according to the input. I tried different programs in the SDK, but it seems that our method is insensitive to the different inputs.
Do you guys know where I can find some other programs to test our approach? Or do any of you have programs written that you would be willing to release for testing? I tried to Google but I didn’t find very good resources.
I am concerned that you’re taking concepts from the CPU world and not-quite-correctly applying them to CUDA.
For one thing, I know loop unrolling has very different purposes in the two camps. In addition, “block size” may mean different things depending on context (i.e., the “block size” in a matrix-multiply algorithm vs. the “size of a thread block” in CUDA).
Thanks for pointing this out. Sorry, the wording may have been a little vague. By “block size” I mean the size of a thread block in CUDA. By loop unrolling, I mean duplicating the loop body several times to reduce the number of condition checks.
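To make that concrete, here is a minimal sketch of the transformation I mean (plain C; the function names are just illustrative) — the body is duplicated four times so the loop condition is checked a quarter as often:

```c
#include <stddef.h>

/* Original: one condition check per element. */
float sum_rolled(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Unrolled by 4: one condition check per four elements,
   plus a cleanup loop for any remainder. */
float sum_unrolled(const float *a, size_t n) {
    float s = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; ++i)   /* remainder */
        s += a[i];
    return s;
}
```

Our tuner would pick the unroll factor (2, 4, 8, …) based on the input.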
Thing is, on a CPU loop unrolling does more than reduce condition checks. It lets superscalar architectures better reorder instructions to hide their latency, and it also suffers fewer branch mispredictions. (Neither of which applies to a GPU.) On a GPU, loop unrolling has a very different, and even more critical, role: in certain cases it allows arrays to be placed into registers, which gives the fastest performance. For this to work the loop has to be unrolled completely. Any other amount ruins the optimization.
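To illustrate that point (a hypothetical sketch, not code from any real project): a per-thread array that is only ever indexed by a fully unrolled loop counter can be promoted to registers, because every index becomes a compile-time constant:

```cuda
#define N 8  // compile-time trip count, so the compiler can unroll fully

__global__ void scale(const float *in, float *out) {
    float buf[N];  // per-thread array
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * N;

    #pragma unroll              // complete unrolling: each index into buf
    for (int i = 0; i < N; ++i) // is a compile-time constant, so buf can
        buf[i] = in[base + i];  // live entirely in registers

    #pragma unroll
    for (int i = 0; i < N; ++i)
        out[base + i] = 2.0f * buf[i];
}
```

If the trip count were a runtime variable (or the unroll were partial), the remaining loop would index `buf` with a non-constant, and the array would spill to slow local memory — which is why “unroll a bit more or a bit less” isn’t a tunable knob here.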
Regarding block size: there is leeway for some kernels, but usually a kernel’s author knows exactly what block size is optimal and there’s no point second-guessing. This is determined by resource utilization, meaning shared memory and register count. The choice may be worth re-evaluating when new architectures come out with different resources (particularly shared memory size). This, btw, is the real point behind ATLAS-like automatic block size tuning: it gets past different processors having very different cache sizes and hierarchies. Yet all CUDA GPUs that have been released are very similar, and have identical shared memory and other characteristics (although the number of registers did change). There’s no reason, yet, for an automatic approach to tuning.
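As a concrete instance of that resource arithmetic (the per-SM limits below are illustrative GT200-era figures, not taken from this thread), the number of resident blocks is just the tightest of three ceilings:

```c
#include <stdio.h>

/* Hypothetical per-SM limits, for illustration only. */
#define REGS_PER_SM        16384
#define SMEM_PER_SM        16384   /* bytes */
#define MAX_THREADS_PER_SM 1024

static int min3(int a, int b, int c) {
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* How many blocks of a given size fit on one SM, given the kernel's
   register usage per thread and shared memory usage per block. */
int blocks_per_sm(int block_size, int regs_per_thread, int smem_per_block) {
    return min3(REGS_PER_SM / (regs_per_thread * block_size),
                SMEM_PER_SM / smem_per_block,
                MAX_THREADS_PER_SM / block_size);
}
```

For example, a kernel using 16 registers/thread and 4 KB of shared memory per 256-thread block gets 4 resident blocks (1024 threads, full occupancy), while bumping register use to 32/thread halves that — the author already knows these numbers, which is why the block size is usually fixed up front.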
However, the most important thing is that adjusting for more shared memory or registers usually isn’t as simple as changing the kernel’s block dimensions. (It depends on how the kernel is written.) Other things have to be done manually (such as changing blocking factors or moving more variables into registers). An automatic adjustment strategy is algorithm-specific. Hence what ATLAS does is constrained to its own matrix-multiply code, and if another library wants to do something similar, it has to figure out how to accomplish that for itself.
But, you know, go ahead. Let us know if you find something interesting.