Problem with unrolling loops

I tried three ways when there is loop inside kernel.

  1. run without any unrolling.
  2. put"#pragma unroll" immediately before the loop for the compiler to optimize automatically.
  3. Rather than using #pragma unroll, I manually unroll the loop by enumerating every i. (0<i<10)

The running time for the above is

  1. 168s
  2. 168s
  3. 61s

So surprisingly, using #pragma unroll has no effect on kernel efficiency, and even worse than manually unrolling myself. Anyone knows what’s going on there? Thanks

The first thing you might want to check for case (2) is whether there is actually any unrolling taking place. The performance numbers suggest there is not. You can use cuobjdump to compare the machine code generated.

There are various reasons why the compiler may not be able to unroll a loop that a programmer has requested to be unrolled. At least in some instances you will see an advisory warning with a brief explanation when that happens. Do you see any such messages in the compiler output? In my experience a common unrolling inhibitor are issues with “unstructured” control flows (goto, conditional return; possibly also break and continue). I seem to recall there is also a size limit on unrolled code.

Could you post the code with the loop in question?

Thanks for the reply. I don’t see any compiler output reporting the unroll information. Is there any option that I need to add up to makefile to show that?

I do have some IF condition inside the loop, but I don’t know how to avoid it. Each of the thread is dealing with one single item in a array, but might have different processing according to the value of the single item via the conditional branches.

#pragma unroll

 for (j = 0; j < LEN; j++) {

    if (DB & 0x200000) {

       #pragma unroll

       for (i=0;i<HASHES; i++)

         hash[i] ^= row_matrix[i][j];

    }

    DB<<=1;

 }

What are the values of LEN and HASHES? What is DB, a 32-bit unsigned int variable? I tried the following with CUDA 4.0 and see from cuobjdump output that both loops were unrolled.

#define LEN    10

#define HASHES 10

#define row_matrix(row,col) row_matrix[row*LEN+col]

unsigned int *hash = parms.argy;

    unsigned int *row_matrix = parms.argz;

    for (i = ctaStart + threadIdx.x; i < parms.n; i += totalThreads) {

        unsigned int DB = parms.argx[i];

#pragma unroll

        for (int j = 0; j < LEN; j++) {

            if (DB & 0x200000) {

#pragma unroll

                for (int k = 0; k < HASHES; k++) {

                    hash[k] ^= row_matrix (k,j);

                }

            }

            DB <<= 1;

        }

    }

But I’m using SM_20, which prevents me from using cuobjdump. I’m using SM_20 because I want to use more than 16K shared memory.
Then how can I know whether my several loops or even three-level inner loop is well unrolled?

Most unrolling optimizations happen at the PTX level, meaning you can use the -keep commandline option of nvcc and inspect the generated .ptx file. As for looking at the generated machine code, cuobjdump from CUDA 4.0 can disassemble sm_2x code:

http://developer.nvidia.com/cuda-toolkit-40

[…]

New & Improved Developer Tools

[…]

GPU binary disassembler for Fermi architecture (cuobjdump)

cuobjdump works fine with sm_20.

You can force a loop unrolling by using templates:

Loop Unrolling over Template Arguments

This is a quite old version of my helper-lib. Scroll down to “Partial Unroller” and use this code.
You have to create a functor containing your loop body, since nvcc doesn’t support C++11’s lamda functions (yet)

As the buil-in #pragma unroll doesn’t always do what you want, this works cleaner and easier than manually unrolling the loops.

Thanks for pointing me there. But can I call the function or lambda function inside NVCC kernel? Supposed host functions can only be called on host side, rather than device side. Is it correct?

You can call the function from inside the kernel, just decorate it with device

I usually create a functor with all necessary values and pass this to the Unroller. The only disadvantage is that the code gets scattered in your source files:

struct func_t {

 float* val;

 __device__ void operator()(int i) { val[i] = val[i]*val[i]; }

};

__global__ void mykernel (..., int N, ...) {

...

func_t func;

func.val = ...;

UnrollerP<16>::step(func, N);

...

}

The compiler optimizes the extra values in func_t away.

This is also great for testing out different Unroll sizes: Make the kernel a template-function with an int parameter and use this for the Unroller.