Greetings.
Here is my problem: my kernel starts to produce garbage when I add a totally innocent line that should not affect anything.
I have double- and triple-checked: there's no error in the algorithm itself, and it runs OK in emulator mode, so I assume this is a compiler error.
The .ptx contains a very long, completely unrolled loop (this is only a small part):
min.s32 %r83, %r69, %r82;
min.s32 %r84, %r64, %r83;
min.s32 %r85, %r59, %r84;
min.s32 %r86, %r54, %r85;
min.s32 %r87, %r49, %r86;
Is it possible that the problem is related to the insane number of registers used by this code? I compile with --maxrregcount 10, but that doesn’t seem to make a difference.
Without --maxrregcount it says "reg = 33" in the cubin, but the .ptx declares a lot more:
.reg .u16 %rh<4>;
.reg .u32 %r<91>;
.reg .u64 %rd<18>;
.reg .pred %p<8>;
.shared .align 4 .b8 __cuda_shared40[1152];
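For reference, this is roughly how I checked the actual register usage (the file name encode.cu is just what I'm calling it here). The %r<91>-style counts in the .ptx are virtual registers that ptxas re-allocates, which is presumably why the cubin can report only 33:

```shell
# Ask ptxas to print per-kernel resource usage (registers, shared/local memory).
# The .ptx register declarations are virtual; ptxas does the real allocation.
nvcc -c encode.cu --ptxas-options=-v --maxrregcount 10
```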
I know how to get rid of the problem in this particular case, but I just want to understand what's happening and make sure it's not a compiler error.
Here is a minimal reduced version of the kernel in question:
[codebox]__global__ void encodeResidualGPU(
int *output,
int *samples,
int *allcoefs,
int *shifts,
int frameSize,
int frameOffset)
{
__shared__ struct {
int residual[256];
int coefs[32];
} shared;
const int tid = threadIdx.x + threadIdx.y * blockDim.x;
const int step = blockDim.x * (blockDim.y - 1);
int total = 0;
shared.residual[tid] = 0; // when this line is removed, or moved elsewhere, the bug disappears.
if (threadIdx.y == 0) shared.coefs[threadIdx.x] = allcoefs[threadIdx.x];
__syncthreads();
for(int pos = 0; pos < frameSize - blockIdx.x - 1; pos += step)
{
shared.residual[tid] = tid;
__syncthreads();
for(unsigned int s=blockDim.y/2; s > 0; s>>=1)
{
if (threadIdx.y < s)
shared.residual[tid] += shared.residual[tid + s * blockDim.x];
__syncthreads();
}
__syncthreads();
int best = 0x7fffffff;
for (int k = 0; k < 15; k++)
best = min(best, step * (k + 1) + ((shared.residual[0] - (step >> 1)) >> k));
total += best;
__syncthreads();
}
output[blockIdx.x + blockIdx.y * gridDim.x] = total;
}
[/codebox]
This is how I invoke it:
dim3 dimBlock(32, 8);
dim3 dimGrid(31, 1, 1);
encodeResidualGPU<<<dimGrid, dimBlock>>>(d_res, d_data, d_coefs, d_shifts, 4608, 4608);
When running correctly, it fills d_res with the number 17640.
The bug causes it to fill d_res with garbage (with some occasional 17640 occurrences).
I'm using a GTS 250 on Vista x64, with CUDA toolkit 2.3 (x64).