CUDA Fortran: unroll directive for kernel code?


I’m trying to compare the performance of CUDA C and CUDA Fortran because we have a large Fortran code base and want to avoid porting it to C. So far I’ve been able to optimize most aspects of CUDA Fortran but as soon as I add a well placed #pragma unroll to the C code the Fortran codes falls behind.

E.g. this kernel code in CUDA C is unrolled:

uint end = min(gpu_shared_mem_block_size, n - block_offset);
#pragma unroll 16
for (uint j = 0; j < end; j++)
	sum += block_a[threadIdx.y][j] * block_b[j][threadIdx.x];

And I’m wondering if there is a way to do the same in this CUDA Fortran code:

block_end = min(16, n - shared_block_offset)
do k = 1, block_end
	sum = sum + A_shared(threadidx%x, k) * B_shared(k, threadidx%y)
end do

I’ve searched the documentation and have found “!$pgi unroll” and “!$acc unroll”. It seems that both only apply to host code and do not change the way kernel code is generated. Did I miss something?

Thanks in advance.

Hi S. Soll,

It seems that both only apply to host code and do not change the way kernel code is generated. Did I miss something?

No, you didn’t miss anything. The unroll directive only applies to host code. However, the compiler may automatically unroll loops if it finds it advantageous to do so.

  • Mat

Thanks for the reply.

I noticed that with -O3 the loop gets unrolled into groups of 4. That increases performance quite a bit. Is there any way to let the compiler unroll the entire loop?

You can try using the “-Munroll” flag.

Setting “-Munroll=c:16” will completely unroll loops having a loop count of 16 or less. However, this only works on loops who’s counts are known at compile time and in your case, the count isn’t known until runtime.

Instead, you can try using “-Munroll:n:16” which will unroll single block loops by a 16. However, this may not be optimal when block_end is less than 16.

“-Munroll=m:” is the same as “-Munroll=n:” except sets the unroll factor for multi-block loops.

  • Mat

Thanks for the hint. I already tried -Munroll (sorry, I should have written that in my first post) but it had no effect on the performance.

I tried it again with n:16, m:16 and c:16,:n:16:m:16 and looked at the code generated by CUDA Fortran. In the generated C code the code is always unrolled with a block size of 4. The resulting PTX code also looks like that (4 groups of ld.shared instructions).

However the -Munroll option triggered the unroll with a block size of 4. If I omit -Munroll no unrolling is performed. This is pretty much what the man page says and the default value of 4 also matches (the default value is mentioned for the c option). However the options n and m do not seem to change this default value for loops with unknown counts.

Without -Munroll (and no -O option) the loop was not unrolled. But apart from it’s presence (and the default value of 4) the -Munroll option does not seem to affect the generated GPU code. Is this intended for GPU code?

Hi Mat,
Is there a plan to support unroll of loops in kernel code soon?


Hi Tuan,

The Accelerator unroll clause was added to the 11.0 compilers so is available now. The unroll clause applies to a loop directive so it’s location changes it’s meaning.

“!$acc do parallel unroll(n)” unrolls the parallel (block) dimension, “!$acc do vector unroll(n)” unrolls the vector (thread) dimension, while “!$acc do seq unroll(n)” will control the unrolling of loops in the kernel itself. We changed the design a bit in order give users greater control.

Hope this helps,

That’s a good news.


Tuan: You are right, the unroll directive in CUDA Fortran is not being used when generating the GPU code. We will add that to our work list. We have worked on the unroller for the PGI Accelerator programming model, and that shares the code generator with CUDA Fortran, so we should be able to make use of it in CUDA Fortran as well. Thanks for pointing this out, we’ll make this high priority.