constexpr int N = 3;
#prgram unroll
for (int i = 0; i < N; i++)
{
foo(i);
}
It’s clear that N is constant. As I expected, the variable i should be constant as well after unroll. But the nvcc cannot make the variable i constant in compiling, so the foo(i) executes during running not compiling.
The problem caused is not only about slight performance loss, but correctness!
For example, the function foo use a constexpr array A in host.
After CUDA 8, the constexpr variable in host can be used by device function if its address is not in need during running. However, the variable i is not recognized as constant, nor compiler optimized foo(i) in compiling. Even worse, the nvcc doesn’t try to link array A which is in host, so the behavior of foo(i) in running is UNDEFINED!
An alternative method is making use of constant memory in CUDA, but the performance loss cannot be erased.
I don’t see any evidence of that in the cases I have tried that look like what you have shown.
You’ve made a claim without any evidence to support it.
I don’t believe what you are saying is true. All of the 6 or so test cases I have just put together show that the compiler is well able to recognize that N is constant and do appropriate optimizations. The loop is gone, and any call to foo is gone from what I can see.
I didn’t say the N cannot be recognized as constant. What I said is the loop counter, that is i. The i is in a fixed range of 0 → N - 1, so, the i should be used as constant.
You’ve provided no evidence to support that claim.
My tests show that i is recognized as constant (within the context of each loop iteration), the loop is unrolled, and the compiler applies all kinds of optimizations and removes the loop and any call to foo entirely.
It’s OK if you don’t believe me. Here’s my test case:
Best I can tell, that’s a code snippet without a loop. Are you using a release build for your experiments? Debug builds have all optimizations disabled.
My suggestion would be to post a minimal, self-contained, complete (buildable and runable) example code along with CUDA version and command line used to invoke the compiler, if you want others to look at code generation issues, desire help in debugging, etc.