I tried three ways of handling a loop inside a kernel:
1. Run without any unrolling.
2. Put "#pragma unroll" immediately before the loop so the compiler optimizes automatically.
3. Rather than using #pragma unroll, manually unroll the loop by enumerating every i (0 < i < 10).
The running times for the above were:
1. 168 s
2. 168 s
3. 61 s
So surprisingly, using #pragma unroll has no effect on kernel efficiency, and is even worse than unrolling manually. Does anyone know what's going on here? Thanks
The first thing you might want to check for case (2) is whether there is actually any unrolling taking place. The performance numbers suggest there is not. You can use cuobjdump to compare the machine code generated.
There are various reasons why the compiler may not be able to unroll a loop that a programmer has requested to be unrolled. At least in some instances you will see an advisory warning with a brief explanation when that happens. Do you see any such messages in the compiler output? In my experience, common unrolling inhibitors are issues with "unstructured" control flow (goto, conditional return; possibly also break and continue). I seem to recall there is also a size limit on unrolled code.
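To illustrate the difference: a loop with a trip count known at compile time is a straightforward unrolling candidate, while a data-dependent early exit is the kind of "unstructured" control flow that can inhibit unrolling. A minimal sketch in plain C++ with hypothetical function names (in CUDA the same distinction applies to device code under #pragma unroll):

```cpp
#include <cassert>

// Fixed trip count: a compiler asked to unroll can replicate the
// body 10 times with no loop-control overhead.
int sum_fixed(const int *a) {
    int s = 0;
    for (int i = 0; i < 10; i++)   // trip count known at compile time
        s += a[i];
    return s;
}

// Data-dependent 'break': the number of iterations is unknown until
// runtime, which can prevent the compiler from unrolling the loop.
int sum_until_negative(const int *a, int n) {
    int s = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] < 0) break;       // early exit depends on runtime data
        s += a[i];
    }
    return s;
}
```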
Could you post the code with the loop in question?
Thanks for the reply. I don't see any compiler output reporting unroll information. Is there any option I need to add to the makefile to show that?
I do have some IF conditions inside the loop, but I don't know how to avoid them. Each thread deals with a single item in an array, but may process it differently depending on the value of that item, via the conditional branches.
#pragma unroll
for (j = 0; j < LEN; j++) {
    if (DB & 0x200000) {
#pragma unroll
        for (i = 0; i < HASHES; i++)
            hash[i] ^= row_matrix[i][j];
    }
    DB <<= 1;
}
What are the values of LEN and HASHES? What is DB, a 32-bit unsigned int variable? I tried the following with CUDA 4.0 and see from cuobjdump output that both loops were unrolled.
#define LEN 10
#define HASHES 10
#define row_matrix(row,col) row_matrix[row*LEN+col]

unsigned int *hash = parms.argy;
unsigned int *row_matrix = parms.argz;
for (i = ctaStart + threadIdx.x; i < parms.n; i += totalThreads) {
    unsigned int DB = parms.argx[i];
#pragma unroll
    for (int j = 0; j < LEN; j++) {
        if (DB & 0x200000) {
#pragma unroll
            for (int k = 0; k < HASHES; k++) {
                hash[k] ^= row_matrix(k,j);
            }
        }
        DB <<= 1;
    }
}
But I’m using SM_20, which prevents me from using cuobjdump. I’m using SM_20 because I want to use more than 16K shared memory.
Then how can I tell whether my loops, including the three-level nested inner loop, are actually unrolled?
Most unrolling optimizations happen at the PTX level, meaning you can use the -keep commandline option of nvcc and inspect the generated .ptx file. As for looking at the generated machine code, cuobjdump from CUDA 4.0 can disassemble sm_2x code:
[…]
New & Improved Developer Tools
[…]
GPU binary disassembler for Fermi architecture (cuobjdump)
This is a quite old version of my helper-lib. Scroll down to “Partial Unroller” and use this code.
You have to create a functor containing your loop body, since nvcc doesn't support C++11's lambda functions (yet).
As the built-in #pragma unroll doesn't always do what you want, this works more cleanly and is easier than manually unrolling the loops.
Thanks for pointing me there. But can I call the function or functor from inside a kernel? I assumed host functions can only be called on the host side, not the device side. Is that correct?
You can call the function from inside the kernel; just decorate it with __device__.
I usually create a functor with all necessary values and pass this to the Unroller. The only disadvantage is that the code gets scattered in your source files:
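The pattern looks roughly like this. This is a minimal sketch with hypothetical names, not the actual helper-lib code; in a kernel you would additionally mark the unroller's step() and the functor's operator() with __device__:

```cpp
#include <cassert>

// Compile-time recursive unroller: instantiates body(i) once per
// iteration in [Begin, End), so the "loop" is unrolled by the compiler.
template <int Begin, int End>
struct Unroller {
    template <typename Body>
    static void step(Body &body) {
        body(Begin);                       // one unrolled iteration
        Unroller<Begin + 1, End>::step(body);
    }
};

// Recursion terminator: Begin == End, nothing left to do.
template <int End>
struct Unroller<End, End> {
    template <typename Body>
    static void step(Body &) {}
};

// Functor carrying the loop body and all the values it needs,
// standing in for the inner hash loop from the earlier posts.
struct XorRow {
    unsigned int *hash;
    const unsigned int *row;
    void operator()(int i) { hash[i] ^= row[i]; }
};
```

A call like Unroller<0, HASHES>::step(body) then replaces the inner for loop, with each iteration emitted as straight-line code.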