[bug?] #pragma unroll cannot make loop counter constant and does not enable constexpr

HolyChen · September 6, 2019, 7:18am

Now, I have some code like follows.

constexpr int N = 3;

#prgram unroll
for (int i = 0; i < N; i++)
{
   foo(i);
}

It’s clear that N is constant. As I expected, the variable i should be constant as well after unroll. But the nvcc cannot make the variable i constant in compiling, so the foo(i) executes during running not compiling.

The problem caused is not only about slight performance loss, but correctness!

For example, the function foo use a constexpr array A in host.

After CUDA 8, the constexpr variable in host can be used by device function if its address is not in need during running. However, the variable i is not recognized as constant, nor compiler optimized foo(i) in compiling. Even worse, the nvcc doesn’t try to link array A which is in host, so the behavior of foo(i) in running is UNDEFINED!

An alternative method is making use of constant memory in CUDA, but the performance loss cannot be erased.

Robert_Crovella · September 6, 2019, 2:12pm

I don’t see any evidence of that in the cases I have tried that look like what you have shown.

You’ve made a claim without any evidence to support it.

I don’t believe what you are saying is true. All of the 6 or so test cases I have just put together show that the compiler is well able to recognize that N is constant and do appropriate optimizations. The loop is gone, and any call to foo is gone from what I can see.

HolyChen · September 6, 2019, 2:25pm

I didn’t say the N cannot be recognized as constant. What I said is the loop counter, that is i. The i is in a fixed range of 0 → N - 1, so, the i should be used as constant.

Robert_Crovella · September 6, 2019, 2:35pm

You’ve provided no evidence to support that claim.

My tests show that i is recognized as constant (within the context of each loop iteration), the loop is unrolled, and the compiler applies all kinds of optimizations and removes the loop and any call to foo entirely.

It’s OK if you don’t believe me. Here’s my test case:

$ cat t24.cu
#include <stdio.h>

constexpr int N = 3;

__device__ int foo(int i, int k) {
  return i*k;
}

__global__ void k(int j, int w){

  int val = j;
#pragma unroll
  for (int i = 0; i < N; i++){
    val += foo(i,j)-w;}
  if (val == 0) printf("val == 0\n");
}

$ nvcc -c t24.cu
$ cuobjdump -sass ./t24.o

Fatbin elf code:
================
arch = sm_30
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_30
                Function : _Z1kii
        .headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                                     /* 0x22f2c28202804307 */
        /*0008*/                   MOV R1, c[0x0][0x44];             /* 0x2800400110005de4 */
        /*0010*/                   MOV R0, c[0x0][0x140];            /* 0x2800400500001de4 */
        /*0018*/                   MOV R3, c[0x0][0x144];            /* 0x280040051000dde4 */
        /*0020*/                   SHL R0, R0, 0x2;                  /* 0x6000c00008001c03 */
        /*0028*/                   IMUL32I R3, R3, 0x3;              /* 0x100000000c30dca2 */
        /*0030*/                   ISETP.NE.AND P0, PT, R0, R3, PT;  /* 0x1a8e00000c01dc23 */
        /*0038*/               @P0 EXIT;                             /* 0x80000000000001e7 */
                                                                     /* 0x2002f2f2f0420047 */
        /*0048*/                   MOV32I R4, 0x0;                   /* 0x1800000000011de2 */
        /*0050*/                   MOV R6, RZ;                       /* 0x28000000fc019de4 */
        /*0058*/                   MOV32I R5, 0x0;                   /* 0x1800000000015de2 */
        /*0060*/                   MOV R7, RZ;                       /* 0x28000000fc01dde4 */
        /*0068*/                   JCAL 0x0;                         /* 0x1000000000011c07 */
        /*0070*/                   EXIT;                             /* 0x8000000000001de7 */
        /*0078*/                   BRA 0x78;                         /* 0x4003ffffe0001de7 */
                .................

Fatbin ptx code:
================
arch = sm_30
code version = [6,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
$

You can’t get to this level of optimization without recognizing that:

N is constant and therefore the loop is unrolled
for each unrolled loop iteration, i is a known value at compile time
furthermore, the arithmetic can now be deduced for each iteration
the arithmetic at each iteration is now constant, so can be replaced with a fixed value based on the inputs j and w
the fixed values for each loop iteration can all be summed
the final value can be computed simply as a sum of 2 products

The compiler is doing all that.

So N is recongnized as globally constant, and the value of i is known at compile time, at each loop iteration.

HolyChen · September 6, 2019, 2:37pm

The sample code is as follows.

template <typename T>
struct Foo
{
    static constexpr T f1(int N, int i)
    {
        return arr[N][i];
    }

    static constexpr T arr[5][5] = { /* some value */ };
}

njuffa · September 6, 2019, 2:52pm

Best I can tell, that’s a code snippet without a loop. Are you using a release build for your experiments? Debug builds have all optimizations disabled.

HolyChen · September 6, 2019, 2:53pm

Yes, all builds are in release mode.

njuffa · September 6, 2019, 3:02pm

My suggestion would be to post a minimal, self-contained, complete (buildable and runable) example code along with CUDA version and command line used to invoke the compiler, if you want others to look at code generation issues, desire help in debugging, etc.

Topic		Replies	Views
BUG? nvcc fails to unroll the loop CUDA Programming and Performance	6	6090	May 26, 2009
#pragma unroll CUDA Programming and Performance	20	5858	July 27, 2010
Does CUDA automatically unroll loops? CUDA Programming and Performance	4	5847	September 16, 2011
loop unrolling CUDA Programming and Performance	7	1559	April 4, 2011
Use #pragma unroll with macro CUDA Programming and Performance	7	361	November 26, 2024
#Pragma unroll doesn't work? CUDA Programming and Performance	8	6126	September 19, 2008
NVCC loop bug since cuda 5.5 CUDA Programming and Performance	5	1669	June 12, 2014
Cuda compiler loop unroll bug? CUDA Programming and Performance	14	2575	October 25, 2017
Different output of code when not unrolling loop CUDA Programming and Performance	16	1249	August 22, 2022
forcing loop unrolls CUDA Programming and Performance	4	762	October 11, 2018

[bug?] #pragma unroll cannot make loop counter constant and does not enable constexpr

Related topics