I think the problem you are running into is that the details of pragma handling are not defined by the C++ standard, so a portable solution may not be possible; I do not know that for sure, however.
The preprocessor operates in multiple stages, one of which performs macro expansion, but it is not clear at which stage a #pragma is handled. For the record, the following compiles fine for me with CUDA 12.3 on Windows using MSVC 2019 as the host compiler, and the disassembly shows the desired amount of unrolling (I tried factors of 1, 2, 4, and 8):
#define UNROLL_FACTOR 4

__global__ void kernel (const float * __restrict__ x,
                        const float * __restrict__ y,
                        float * __restrict__ z,
                        int len)
{
    int stride = gridDim.x * blockDim.x;
    int tid = blockDim.x * blockIdx.x + threadIdx.x;
#pragma unroll UNROLL_FACTOR
    for (int i = tid; i < len; i += stride) {
        z[i] = x[i] + y[i];
    }
}

int main (void)
{
    kernel<<<1,1>>>(0,0,0,0);
    return 0;
}
C++11 provides the _Pragma() operator, which takes a string literal argument and, unlike the #pragma directive, can be incorporated into macros. It can be used instead of #pragma. The following works for me in the compilation context specified above:
_Pragma ("unroll " UNROLL_FACTOR)
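If this does not work in your compilation environment, then depending on which pre-processor phase handles the pragma, you may want to experiment with stringifying UNROLL_FACTOR, etc. Strictly speaking, _Pragma() expects a single string literal, so building the complete string through macro stringification is the more standard-conforming route.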
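As a rough illustration of the stringification idea, here is a minimal host-side sketch. The macro names DO_PRAGMA, UNROLL_LOOP, PRAGMA_TEXT and the functions add_vectors, unroll_pragma_text, and check_pragma_text are hypothetical names made up for this example, and the choice of "GCC unroll" for plain gcc is my assumption about what that compiler recognizes. Whether a given toolchain actually honors the resulting pragma still needs to be checked in the disassembly.

```cpp
#include <cassert>
#include <cstring>

#define UNROLL_FACTOR 4

// Two-level stringification: PRAGMA_TEXT expands its argument first,
// then PRAGMA_TEXT_ turns the expanded tokens into a string literal.
#define PRAGMA_TEXT_(x) #x
#define PRAGMA_TEXT(x) PRAGMA_TEXT_(x)

// DO_PRAGMA stringifies its argument and hands the string to _Pragma.
#define DO_PRAGMA(x) _Pragma(#x)

// Assumption: choose a pragma spelling that the compiler at hand knows.
// UNROLL_LOOP expands n (e.g. UNROLL_FACTOR -> 4) before DO_PRAGMA sees it.
#if defined(__CUDACC__) || defined(__clang__)
#define UNROLL_LOOP(n) DO_PRAGMA(unroll n)
#elif defined(__GNUC__)
#define UNROLL_LOOP(n) DO_PRAGMA(GCC unroll n)
#else
#define UNROLL_LOOP(n) /* no unroll hint for other compilers */
#endif

// Host-side stand-in for the kernel body (hypothetical helper).
void add_vectors (const float *x, const float *y, float *z, int len)
{
    UNROLL_LOOP(UNROLL_FACTOR)
    for (int i = 0; i < len; i++) {
        z[i] = x[i] + y[i];
    }
}

// Returns the pragma text built from UNROLL_FACTOR, for inspection.
const char *unroll_pragma_text (void)
{
    return PRAGMA_TEXT(unroll UNROLL_FACTOR);
}

// Sanity check that the macro was expanded before stringification.
void check_pragma_text (void)
{
    assert(std::strcmp(unroll_pragma_text(), "unroll 4") == 0);
}
```

The extra macro level is what matters here: applying # directly to UNROLL_FACTOR would yield the text "UNROLL_FACTOR" rather than "4".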
[Later:]
Experimenting a bit, it seems that the pre-processor of the host compiler on Linux (i.e., gcc) processes #pragma very early, ahead of any macro expansion, and this seems to apply to _Pragma() as well. Bummer.