Problem with unrolling loops

bit_mapper · November 9, 2011, 7:54pm

I tried three approaches with a loop inside a kernel:

1. Run without any unrolling.
2. Put "#pragma unroll" immediately before the loop so that the compiler unrolls it automatically.
3. Rather than using #pragma unroll, unroll the loop manually by enumerating every i. (0<i<10)

The running times for the three cases are:

1. 168s
2. 168s
3. 61s

Surprisingly, #pragma unroll has no effect on the kernel's running time, and the result is much slower than the version I unrolled manually. Does anyone know what is going on here? Thanks.

njuffa · November 9, 2011, 9:46pm

The first thing you might want to check for case (2) is whether there is actually any unrolling taking place. The performance numbers suggest there is not. You can use cuobjdump to compare the machine code generated.

There are various reasons why the compiler may not be able to unroll a loop that a programmer has requested to be unrolled. At least in some instances you will see an advisory warning with a brief explanation when that happens. Do you see any such messages in the compiler output? In my experience a common unrolling inhibitor is “unstructured” control flow (goto, conditional return; possibly also break and continue). I seem to recall there is also a size limit on unrolled code.

Could you post the code with the loop in question?
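
As a hedged illustration of the control-flow point above: one way to remove an early exit from a loop with a constant trip count is to keep the result in a variable and guard the loop body with it. This is only a sketch and not code from this thread. The names HD, TRIP_COUNT, first_above_break and first_above_flat are made up for the example, and whether a particular nvcc version really declines to unroll the break version has to be checked in the generated code.

```cpp
// HD expands to CUDA qualifiers under nvcc and to nothing in a plain C++
// compiler, so the functions can also be checked on the host.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

enum { TRIP_COUNT = 8 };  // hypothetical compile-time trip count

// Early-exit version: returns the index of the first element above limit,
// or -1 if there is none. The break makes the control flow less regular.
HD inline int first_above_break(const int* v, int limit)
{
    int idx = -1;
    for (int i = 0; i < TRIP_COUNT; i++) {
        if (v[i] > limit) {
            idx = i;
            break;
        }
    }
    return idx;
}

// Same result without break: every iteration runs, but only the first
// match is recorded because idx is no longer negative afterwards.
HD inline int first_above_flat(const int* v, int limit)
{
    int idx = -1;
#pragma unroll
    for (int i = 0; i < TRIP_COUNT; i++) {
        if (idx < 0 && v[i] > limit) {
            idx = i;
        }
    }
    return idx;
}
```

The flat version always executes all TRIP_COUNT iterations, so it trades some extra work for simpler control flow. Whether that is a net win would need to be measured.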

bit_mapper · November 9, 2011, 10:40pm

Thanks for the reply. I don’t see any compiler output reporting unroll information. Is there an option that I need to add to the makefile to show it?

I do have an IF condition inside the loop, but I don’t know how to avoid it. Each thread deals with a single item in an array, but may process it differently depending on the item's value, via the conditional branches.

#pragma unroll
for (j = 0; j < LEN; j++) {
    if (DB & 0x200000) {
        #pragma unroll
        for (i = 0; i < HASHES; i++)
            hash[i] ^= row_matrix[i][j];
    }
    DB <<= 1;
}
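
On the question of how to avoid the IF: one possible approach, sketched here under assumptions, is to turn the tested bit into an all-ones or all-zeros mask and AND it into the value being XORed. The sketch assumes that DB is a 32-bit unsigned int and that LEN and HASHES are both 10, none of which is confirmed for the original code at this point in the thread. The names HD, xor_rows_branch and xor_rows_masked are made up for the example.

```cpp
// HD expands to CUDA qualifiers under nvcc and to nothing in a plain C++
// compiler, so both versions can be compared on the host.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Assumed values; the original post does not state them.
enum { LEN = 10, HASHES = 10 };

// The loop as posted, assuming DB is a 32-bit unsigned int.
HD inline void xor_rows_branch(unsigned int DB, unsigned int hash[HASHES],
                               const unsigned int row_matrix[HASHES][LEN])
{
    for (int j = 0; j < LEN; j++) {
        if (DB & 0x200000) {
            for (int i = 0; i < HASHES; i++)
                hash[i] ^= row_matrix[i][j];
        }
        DB <<= 1;
    }
}

// Branch-free variant: mask is 0xFFFFFFFF when bit 21 (0x200000) of DB is
// set and 0 otherwise, so XOR with (value & mask) changes nothing when the
// bit is clear.
HD inline void xor_rows_masked(unsigned int DB, unsigned int hash[HASHES],
                               const unsigned int row_matrix[HASHES][LEN])
{
#pragma unroll
    for (int j = 0; j < LEN; j++) {
        const unsigned int mask = 0u - ((DB >> 21) & 1u);
#pragma unroll
        for (int i = 0; i < HASHES; i++)
            hash[i] ^= row_matrix[i][j] & mask;
        DB <<= 1;
    }
}
```

The masked version always performs every XOR, so it does more arithmetic when few bits are set. Whether it is faster than the branch on a given GPU would have to be measured.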

njuffa · November 9, 2011, 11:51pm

What are the values of LEN and HASHES? What is DB, a 32-bit unsigned int variable? I tried the following with CUDA 4.0 and see from cuobjdump output that both loops were unrolled.

#define LEN    10
#define HASHES 10
#define row_matrix(row,col) row_matrix[row*LEN+col]

    unsigned int *hash = parms.argy;
    unsigned int *row_matrix = parms.argz;
    for (i = ctaStart + threadIdx.x; i < parms.n; i += totalThreads) {
        unsigned int DB = parms.argx[i];
#pragma unroll
        for (int j = 0; j < LEN; j++) {
            if (DB & 0x200000) {
#pragma unroll
                for (int k = 0; k < HASHES; k++) {
                    hash[k] ^= row_matrix (k,j);
                }
            }
            DB <<= 1;
        }
    }

bit_mapper · November 10, 2011, 11:42pm

But I’m using SM_20, which prevents me from using cuobjdump. I’m using SM_20 because I want to use more than 16K of shared memory.
So how can I tell whether my loops, including a three-level nested loop, are actually unrolled?

njuffa · November 11, 2011, 2:36am

Most unrolling optimizations happen at the PTX level, meaning you can use the -keep command-line option of nvcc and inspect the generated .ptx file. As for looking at the generated machine code, cuobjdump from CUDA 4.0 can disassemble sm_2x code:

[…]

New & Improved Developer Tools

[…]

GPU binary disassembler for Fermi architecture (cuobjdump)

SeanB · November 11, 2011, 9:47pm

cuobjdump works fine with sm_20.

devkec · November 17, 2011, 7:03pm

You can force loop unrolling by using templates:

Loop Unrolling over Template Arguments

This is quite an old version of my helper-lib. Scroll down to “Partial Unroller” and use this code.
You have to create a functor containing your loop body, since nvcc doesn’t support C++11’s lambda functions (yet).

As the built-in #pragma unroll doesn’t always do what you want, this works more cleanly and easily than manually unrolling the loops.
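
For readers without access to the linked helper library, the general idea of unrolling over a template argument can be sketched as follows. This is not devkec's code: StaticFor, XorColumn and the HD macro are hypothetical names chosen for this illustration. The sketch assumes the trip count is known at compile time and that the compiler inlines the recursive calls.

```cpp
// HD expands to CUDA qualifiers under nvcc and to nothing in a plain C++
// compiler, so the helper can also be exercised on the host.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Hypothetical helper: calls f(Begin), f(Begin + 1), ..., f(End - 1)
// through template recursion, so no run-time loop remains.
template <int Begin, int End>
struct StaticFor {
    template <typename F>
    static HD void step(F& f)
    {
        f(Begin);
        StaticFor<Begin + 1, End>::step(f);
    }
};

// The recursion ends when Begin has reached End.
template <int End>
struct StaticFor<End, End> {
    template <typename F>
    static HD void step(F&) {}
};

// Functor holding the loop body, because lambdas are assumed unavailable.
struct XorColumn {
    unsigned int* hash;
    const unsigned int* column;
    HD void operator()(int i) const { hash[i] ^= column[i]; }
};
```

A call such as `XorColumn body = {hash, column}; StaticFor<0, 4>::step(body);` then expands into four calls of the body with i = 0, 1, 2, 3.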

bit_mapper · November 17, 2011, 8:37pm

Thanks for pointing me there. But can I call a function or lambda function inside a kernel compiled with nvcc? I thought host functions can only be called on the host side, not the device side. Is that correct?

devkec · November 24, 2011, 10:09am

You can call the function from inside the kernel, just decorate it with __device__.

I usually create a functor with all necessary values and pass this to the Unroller. The only disadvantage is that the code gets scattered in your source files:

struct func_t {
    float* val;
    __device__ void operator()(int i) { val[i] = val[i]*val[i]; }
};

__global__ void mykernel (..., int N, ...) {
    ...
    func_t func;
    func.val = ...;
    UnrollerP<16>::step(func, N);
    ...
}

The compiler optimizes the extra values in func_t away.

This is also great for testing out different unroll sizes: make the kernel a template function with an int parameter and use this for the Unroller.
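
The implementation of UnrollerP is not shown in this thread. A minimal sketch of what a partial unroller with a similar interface could look like, with the unroll factor exposed as a template parameter, is given below. All names here (HD, Chunk, PartialUnroller, Square, square_all) are hypothetical, and the real library may work differently.

```cpp
// HD expands to CUDA qualifiers under nvcc and to nothing in a plain C++
// compiler, so the sketch can also be checked on the host.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
#endif

// Calls f(base), f(base + 1), ..., f(base + K - 1) without a loop.
template <int K>
struct Chunk {
    template <typename F>
    static HD void run(F& f, int base)
    {
        Chunk<K - 1>::run(f, base);
        f(base + K - 1);
    }
};

template <>
struct Chunk<0> {
    template <typename F>
    static HD void run(F&, int) {}
};

// Hypothetical partial unroller: the main loop advances U iterations at a
// time, and a remainder loop handles the last n % U iterations.
template <int U>
struct PartialUnroller {
    template <typename F>
    static HD void step(F& f, int n)
    {
        int i = 0;
        for (; i + U <= n; i += U)
            Chunk<U>::run(f, i);
        for (; i < n; ++i)
            f(i);
    }
};

// Loop body as a functor, mirroring func_t above.
struct Square {
    float* val;
    HD void operator()(int i) const { val[i] = val[i] * val[i]; }
};

// U is a template parameter, so different unroll sizes can be compared by
// instantiating square_all<4>, square_all<16>, and so on.
template <int U>
HD void square_all(float* val, int n)
{
    Square body = {val};
    PartialUnroller<U>::step(body, n);
}
```

In a real kernel the same pattern would apply to a `template <int U> __global__` function, launched for instance as `mykernel<16><<<grid, block>>>(...)`. Timing a few instantiations side by side is then a matter of changing one number.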
