__device__ int count(int a, int b) {
    int c = a*b;
    return c;
}

__global__ void testKernel1(float* a, float *b, float *c)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    int ia = a[tid];
    int ib = b[tid];
    int ic = c[tid];
    //#pragma unroll
    for(int i=0; i<2048; i++) {
        ic=count(ia,ib);
        ia=count(ib,ic);
        ib=count(ia,ic);
    }
    c[tid] = ic;
    a[tid] = ia;
    b[tid] = ib;
}

When I uncomment the pragma, I get those nvopencc errors (same as yours, can’t unlink temp file…).
Hello,
the problem is that the compiler (not the preprocessor) cannot fully unroll your loop, which runs 2048 iterations in your code. Instead of reporting an error, the compiler just crashes during unrolling. Try a smaller unroll factor, for example:
#pragma unroll 10
This should unroll the loop by a factor of 10, i.e. the body is replicated 10 times per trip, rather than expanding all 2048 iterations.
You will have to find the practical limit for your own kernel. As far as I know, the maximum workable unroll count depends on the register usage and code size of your kernel, so there is no general hard limit.
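To make the partial unroll concrete, here is a minimal sketch of the same kernel with an explicit factor. The factor 8, the kernel name testKernelUnroll8 and the host helper runUnroll8 are assumptions for illustration only; nothing in this thread says which factor your toolchain will accept.

```cuda
#include <vector>
#include <cuda_runtime.h>

__device__ int count(int a, int b) {
    return a * b;
}

// Same loop as the original kernel, but unrolled by a factor of 8
// instead of fully. 2048 is a multiple of 8, so no remainder loop is needed.
__global__ void testKernelUnroll8(float *a, float *b, float *c)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int ia = a[tid];
    int ib = b[tid];
    int ic = c[tid];
    #pragma unroll 8
    for (int i = 0; i < 2048; i++) {
        ic = count(ia, ib);
        ia = count(ib, ic);
        ib = count(ia, ic);
    }
    c[tid] = ic;
    a[tid] = ia;
    b[tid] = ib;
}

// Hypothetical host helper: launches one block of n threads (n <= 1024)
// with every input set to `init` and returns c[0], or -1.0f on a CUDA error.
float runUnroll8(int n, float init)
{
    std::vector<float> h(n, init);
    size_t bytes = n * sizeof(float);
    float *da = nullptr, *db = nullptr, *dc = nullptr;
    if (cudaMalloc(&da, bytes) != cudaSuccess ||
        cudaMalloc(&db, bytes) != cudaSuccess ||
        cudaMalloc(&dc, bytes) != cudaSuccess) return -1.0f;
    cudaMemcpy(da, h.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, h.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dc, h.data(), bytes, cudaMemcpyHostToDevice);

    testKernelUnroll8<<<1, n>>>(da, db, dc);

    // The blocking copy also waits for the kernel and surfaces launch errors.
    cudaError_t err = cudaMemcpy(h.data(), dc, bytes, cudaMemcpyDeviceToHost);
    cudaFree(da);
    cudaFree(db);
    cudaFree(dc);
    return err == cudaSuccess ? h[0] : -1.0f;
}
```

Calling runUnroll8(64, 1.0f) keeps every product at 1, which avoids integer overflow in the repeated multiplications. Changing the number after the pragma and rebuilding is then a quick way to see where the compiler starts to struggle.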
Unrolling by up to about 45 improves runtime, while unrolling further causes performance to drop gradually, even though inspecting the .cubin file reveals no additional registers used (7 reg, 0 lmem). I presume I’m hitting the L1 cache limit? The binary data in the .cubin consists of about 500 lines, each holding four 32-bit instruction/operand words (roughly 500 × 4 × 4 = 8000 bytes), so slightly less than 8 KB of data.
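One way to look for that sweet spot empirically is to time the same loop at several unroll factors. The following is a rough sketch, assuming a toolkit whose #pragma unroll accepts a template parameter as its constant expression; loopKernel and timeUnroll are hypothetical names, and count() is written inline as a plain multiply.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// The unroll factor is a template parameter, so each instantiation
// is compiled with a different amount of unrolling.
template <int UNROLL>
__global__ void loopKernel(int *data)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int ia = data[tid], ib = ia, ic = ia;
    #pragma unroll UNROLL
    for (int i = 0; i < 2048; i++) {
        ic = ia * ib;
        ia = ib * ic;
        ib = ia * ic;
    }
    data[tid] = ia + ib + ic;
}

// Hypothetical helper: returns the time of one launch in milliseconds,
// or -1.0f if a CUDA call failed.
template <int UNROLL>
float timeUnroll(int blocks, int threads)
{
    int n = blocks * threads;
    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return -1.0f;
    cudaMemset(d, 0, n * sizeof(int));  // zeros keep the products from overflowing

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    loopKernel<UNROLL><<<blocks, threads>>>(d);  // warm-up launch, not timed
    cudaEventRecord(start);
    loopKernel<UNROLL><<<blocks, threads>>>(d);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    bool ok = (cudaGetLastError() == cudaSuccess);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return ok ? ms : -1.0f;
}
```

For example, printing timeUnroll<8>(64, 256), timeUnroll<45>(64, 256) and timeUnroll<128>(64, 256) side by side gives a first impression of where the curve turns. A single timed launch is noisy, so averaging several runs would be more trustworthy, and the timings alone do not tell you whether a cache limit is the cause.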