compiler bug?

Hello,

I have a problem compiling some code, which looks like a CUDA bug to me. I hope this is the right place for it.

[codebox]
#pragma unroll
for(int j=0; j<(BLOCK_DIM_X<<5); ++j)
	fSum += pfSMTmp[threadIdx.x + (j<<5)];
[/codebox]

This gives:

nvopencc ERROR: C:\CUDA\bin/…/open64/lib//be.exe returned non-zero status -1073741819

1>nvopencc INTERNAL ERROR: cannot unlink temp file C:/DOKUME~1/cjohn/LOKALE~1/Temp/ccBI#.a03880

It looks like it is related to #pragma unroll and the shift operation in (BLOCK_DIM_X<<5). Without #pragma unroll everything is fine, and with (BLOCK_DIM_X/32) the code works fine as well.

Cheers

Christoph

Sorry, this was my fault. The shift should of course go the other way (>>); then it works as it should.
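To put numbers on the typo, here is a small sketch comparing the two loop bounds. The thread never states the value of BLOCK_DIM_X, so 256 is purely an assumed value for illustration; the constant names are made up as well.

```cpp
// Assumed value for illustration only; the thread does not give BLOCK_DIM_X.
constexpr int BLOCK_DIM_X = 256;

// Loop bound with the mistyped left shift: multiplies by 32.
constexpr int kTripCountLeftShift = BLOCK_DIM_X << 5;

// Loop bound with the intended right shift: divides by 32.
constexpr int kTripCountRightShift = BLOCK_DIM_X >> 5;

// Checked at compile time, so the numbers hold for the assumed block size.
static_assert(kTripCountLeftShift == 8192, "left shift gives 256 * 32 iterations");
static_assert(kTripCountRightShift == BLOCK_DIM_X / 32, "right shift matches the /32 variant");
```

Under that assumption, the mistyped bound asks #pragma unroll to fully unroll thousands of iterations, while the corrected bound leaves only a handful, which would fit with the crash going away once the shift is fixed.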

I’ve stumbled onto something similar:

[codebox]
__device__ int count(int a, int b) {
	int c = a*b;
	return c;
}

__global__ void testKernel1(float* a, float *b, float *c)
{
	int tid = threadIdx.x + blockIdx.x*blockDim.x;
	int ia = a[tid];
	int ib = b[tid];
	int ic = c[tid];

	//#pragma unroll
	for(int i=0; i<2048; i++) {
		ic=count(ia,ib);
		ia=count(ib,ic);
		ib=count(ia,ic);
	}
	c[tid] = ic;
	a[tid] = ia;
	b[tid] = ib;
}
[/codebox]

When I uncomment the pragma, I get those nvopencc errors (same as yours, can’t unlink temp file…).

Any ideas?

I’m using CUDA 2.0 on Windows XP 32 and VS 2005.

Hello,
the problem is that the compiler cannot fully unroll your loop, which runs for 2048 iterations in your code. Instead of reporting an error, the compiler just crashes while unrolling. Try unrolling by a smaller amount, for example with

#pragma unroll 10
This should unroll the loop only by a factor of 10 instead of expanding all 2048 iterations.

You will have to find the practical limit on the unroll count yourself. As far as I know, the maximum possible count depends on the register usage and code size of your kernel, so there is no general hard limit here.
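For anyone wondering what unrolling by a fixed factor amounts to, here is a rough hand-written sketch on the host side. The function names and the factor of 4 are hypothetical and chosen only for illustration; what the compiler actually emits for a given #pragma unroll count may look different.

```cpp
// Reference version: an ordinary rolled loop summing n values.
int sumRolled(const int* data, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

// Hand-unrolled by a factor of 4: the body is replicated four times per
// pass, so the loop condition is checked roughly a quarter as often.
int sumUnrolledBy4(const int* data, int n) {
    int sum = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        sum += data[i];
        sum += data[i + 1];
        sum += data[i + 2];
        sum += data[i + 3];
    }
    // Remainder loop: handles the last n % 4 elements.
    for (; i < n; ++i)
        sum += data[i];
    return sum;
}
```

Both functions return the same result. The unrolled one trades fewer branches for more code, which is why the useful factor depends on the code size of the kernel.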

Thanks, you were right.

Unrolling by up to about 45 benefits runtime, while unrolling further causes performance to drop gradually, even though inspecting the .cubin file reveals no additional registers used (7 reg, 0 lmem). I presume I’m hitting the L1 cache limit? The binary data in the .cubin consists of about 500 lines, each holding four 32-bit instructions/operands, totalling slightly less than 8 KB of data.
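The size estimate can be checked with a few lines of arithmetic. The figures are the approximate ones quoted in the post; the constant names are made up for this sketch.

```cpp
// Approximate figures quoted in the post above.
constexpr int kLines        = 500;  // lines of binary data in the .cubin
constexpr int kWordsPerLine = 4;    // 32-bit words per line
constexpr int kBytesPerWord = 4;    // 32 bits = 4 bytes

// Estimated size of the kernel's binary code in bytes.
constexpr int kCodeBytes = kLines * kWordsPerLine * kBytesPerWord;

// 8 KB taken as 8 * 1024 bytes.
constexpr int kEightKB = 8 * 1024;

static_assert(kCodeBytes < kEightKB, "estimate is slightly below 8 KB");
```

That gives 8000 bytes against 8192, so "slightly less than 8 KB" holds. Whether that size is actually the threshold behind the slowdown is not something this arithmetic can confirm.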