[bug?] #pragma unroll cannot make loop counter constant and does not enable constexpr

Now, I have some code like follows.

constexpr int N = 3;

#prgram unroll
for (int i = 0; i < N; i++)
{
   foo(i);
}

It’s clear that N is constant. As I expected, the variable i should be constant as well after unroll. But the nvcc cannot make the variable i constant in compiling, so the foo(i) executes during running not compiling.

The problem caused is not only about slight performance loss, but correctness!

For example, the function foo use a constexpr array A in host.

After CUDA 8, the constexpr variable in host can be used by device function if its address is not in need during running. However, the variable i is not recognized as constant, nor compiler optimized foo(i) in compiling. Even worse, the nvcc doesn’t try to link array A which is in host, so the behavior of foo(i) in running is UNDEFINED!

An alternative method is making use of constant memory in CUDA, but the performance loss cannot be erased.

I don’t see any evidence of that in the cases I have tried that look like what you have shown.

You’ve made a claim without any evidence to support it.

I don’t believe what you are saying is true. All of the 6 or so test cases I have just put together show that the compiler is well able to recognize that N is constant and do appropriate optimizations. The loop is gone, and any call to foo is gone from what I can see.

I didn’t say the N cannot be recognized as constant. What I said is the loop counter, that is i. The i is in a fixed range of 0 -> N - 1, so, the i should be used as constant.

You’ve provided no evidence to support that claim.

My tests show that i is recognized as constant (within the context of each loop iteration), the loop is unrolled, and the compiler applies all kinds of optimizations and removes the loop and any call to foo entirely.

It’s OK if you don’t believe me. Here’s my test case:

$ cat t24.cu
#include <stdio.h>

constexpr int N = 3;

__device__ int foo(int i, int k) {
  return i*k;
}

__global__ void k(int j, int w){

  int val = j;
#pragma unroll
  for (int i = 0; i < N; i++){
    val += foo(i,j)-w;}
  if (val == 0) printf("val == 0\n");
}

$ nvcc -c t24.cu
$ cuobjdump -sass ./t24.o

Fatbin elf code:
================
arch = sm_30
code version = [1,7]
producer = <unknown>
host = linux
compile_size = 64bit

        code for sm_30
                Function : _Z1kii
        .headerflags    @"EF_CUDA_SM30 EF_CUDA_PTX_SM(EF_CUDA_SM30)"
                                                                     /* 0x22f2c28202804307 */
        /*0008*/                   MOV R1, c[0x0][0x44];             /* 0x2800400110005de4 */
        /*0010*/                   MOV R0, c[0x0][0x140];            /* 0x2800400500001de4 */
        /*0018*/                   MOV R3, c[0x0][0x144];            /* 0x280040051000dde4 */
        /*0020*/                   SHL R0, R0, 0x2;                  /* 0x6000c00008001c03 */
        /*0028*/                   IMUL32I R3, R3, 0x3;              /* 0x100000000c30dca2 */
        /*0030*/                   ISETP.NE.AND P0, PT, R0, R3, PT;  /* 0x1a8e00000c01dc23 */
        /*0038*/               @P0 EXIT;                             /* 0x80000000000001e7 */
                                                                     /* 0x2002f2f2f0420047 */
        /*0048*/                   MOV32I R4, 0x0;                   /* 0x1800000000011de2 */
        /*0050*/                   MOV R6, RZ;                       /* 0x28000000fc019de4 */
        /*0058*/                   MOV32I R5, 0x0;                   /* 0x1800000000015de2 */
        /*0060*/                   MOV R7, RZ;                       /* 0x28000000fc01dde4 */
        /*0068*/                   JCAL 0x0;                         /* 0x1000000000011c07 */
        /*0070*/                   EXIT;                             /* 0x8000000000001de7 */
        /*0078*/                   BRA 0x78;                         /* 0x4003ffffe0001de7 */
                .................

Fatbin ptx code:
================
arch = sm_30
code version = [6,4]
producer = <unknown>
host = linux
compile_size = 64bit
compressed
$

You can’t get to this level of optimization without recognizing that:

  • N is constant and therefore the loop is unrolled
  • for each unrolled loop iteration, i is a known value at compile time
  • furthermore, the arithmetic can now be deduced for each iteration
  • the arithmetic at each iteration is now constant, so can be replaced with a fixed value based on the inputs j and w
  • the fixed values for each loop iteration can all be summed
  • the final value can be computed simply as a sum of 2 products

The compiler is doing all that.

So N is recongnized as globally constant, and the value of i is known at compile time, at each loop iteration.

The sample code is as follows.

template <typename T>
struct Foo
{
    static constexpr T f1(int N, int i)
    {
        return arr[N][i];
    }

    static constexpr T arr[5][5] = { /* some value */ };
}

Best I can tell, that’s a code snippet without a loop. Are you using a release build for your experiments? Debug builds have all optimizations disabled.

Yes, all builds are in release mode.

My suggestion would be to post a minimal, self-contained, complete (buildable and runable) example code along with CUDA version and command line used to invoke the compiler, if you want others to look at code generation issues, desire help in debugging, etc.