Possible nvcc compiler error

Greetings.

Here is my problem: my kernel starts to produce garbage, when i add a totally innocent line, which should not affect anything.

I have double and triple-checked, and there’s no error in the algorithm itself, and it runs ok in emulator mode. so i assume this is a compiler error.

The ptx contains very long completely unrolled loop (this is only a small part):

min.s32 	%r83, %r69, %r82;

min.s32 	%r84, %r64, %r83;

min.s32 	%r85, %r59, %r84;

min.s32 	%r86, %r54, %r85;

min.s32 	%r87, %r49, %r86;

Is it possible that the problem is related to the insane number of registers used by this code? I compile with --maxrregcount 10, but that doesn’t seem to make a difference.

Without --maxrregcount it says “reg = 33” in the cubin, but the ptx says there’s a lot more:

.reg .u16 %rh<4>;

.reg .u32 %r<91>;

.reg .u64 %rd<18>;

.reg .pred %p<8>;

.shared .align 4 .b8 __cuda_shared40[1152];

I know how to get rid of the problem in this particular case, but i just want to understand what’s happening and make sure there’s no compiler error.

Here is the reduced to a minimum version of the kernel in question:

[codebox]global void encodeResidualGPU(

int*output,

int*samples,

int*allcoefs,

int*shifts,

int frameSize,

int frameOffset)

{

__shared__ struct {

int residual[256];

int coefs[32];

} shared;

const int tid = threadIdx.x + threadIdx.y * blockDim.x;

const int step = blockDim.x * (blockDim.y - 1);

int total = 0;

shared.residual[tid] = 0; // when this line is removed, or moved elsewhere, the bug dissapears.

if (threadIdx.y == 0) shared.coefs[threadIdx.x] = allcoefs[threadIdx.x];

__syncthreads();

for(int pos = 0; pos < frameSize - blockIdx.x - 1; pos += step)

{

shared.residual[tid] = tid;

__syncthreads();

for(unsigned int s=blockDim.y/2; s > 0; s>>=1)

{

    if (threadIdx.y < s)

	shared.residual[tid] += shared.residual[tid + s * blockDim.x];

    __syncthreads();

}

__syncthreads();

int best = 0x7fffffff;

for (int k = 0; k < 15; k++)

    best = min(best, step * (k + 1) + ((shared.residual[0] - (step >> 1)) >> k));

total += best;

__syncthreads();

}

output[blockIdx.x + blockIdx.y * gridDim.x] = total;

}

[/codebox]

That’s how i invoke it:

dim3 dimBlock(32, 8);

dim3 dimGrid(31, 1, 1);

encodeResidualGPU<<<dimGrid, dimBlock>>>(d_res, d_data, d_coefs, d_shifts, 4608, 4608);

When running correctly, it fills d_res with a number 17640.

The bug causes it to fill d_res with garbage (with some occasional 17640 occurences).

I’m using GTS 250 with OS Vista x64, and CUDA toolkit 2.3 (x64).