Greetings.
Here is my problem: my kernel starts to produce garbage when I add a totally innocent line that should not affect anything.
I have double- and triple-checked: there's no error in the algorithm itself, and it runs OK in emulator mode, so I assume this is a compiler error.
The .ptx contains a very long, completely unrolled loop (this is only a small part):
min.s32 %r83, %r69, %r82;
min.s32 %r84, %r64, %r83;
min.s32 %r85, %r59, %r84;
min.s32 %r86, %r54, %r85;
min.s32 %r87, %r49, %r86;
Is it possible that the problem is related to the insane number of registers used by this code? I compile with --maxrregcount 10, but that doesn’t seem to make a difference.
Without --maxrregcount it says "reg = 33" in the cubin, but the .ptx declares a lot more:
.reg .u16 %rh<4>;
.reg .u32 %r<91>;
.reg .u64 %rd<18>;
.reg .pred %p<8>;
.shared .align 4 .b8 __cuda_shared40[1152];
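For reference, this is roughly how I checked the actual register usage (the file name encode.cu is just what I'm calling it here). The %r<91>-style counts in the .ptx are virtual registers that ptxas re-allocates, which is presumably why the cubin can report only 33:

```shell
# Ask ptxas to print per-kernel resource usage (registers, shared/local memory).
# The .ptx register declarations are virtual; ptxas does the real allocation.
nvcc -c encode.cu --ptxas-options=-v --maxrregcount 10
```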
I know how to get rid of the problem in this particular case, but I just want to understand what's happening and make sure it's not a compiler error.
Here is a minimal reduced version of the kernel in question:
[codebox]__global__ void encodeResidualGPU(
int *output,
int *samples,
int *allcoefs,
int *shifts,
int frameSize,
int frameOffset)
{
__shared__ struct {
int residual[256];
int coefs[32];
} shared;
const int tid = threadIdx.x + threadIdx.y * blockDim.x;
const int step = blockDim.x * (blockDim.y - 1);
int total = 0;
shared.residual[tid] = 0; // when this line is removed, or moved elsewhere, the bug disappears.
if (threadIdx.y == 0) shared.coefs[threadIdx.x] = allcoefs[threadIdx.x];
__syncthreads();
for(int pos = 0; pos < frameSize - blockIdx.x - 1; pos += step)
{
shared.residual[tid] = tid;
__syncthreads();
for(unsigned int s=blockDim.y/2; s > 0; s>>=1)
{
if (threadIdx.y < s)
shared.residual[tid] += shared.residual[tid + s * blockDim.x];
__syncthreads();
}
__syncthreads();
int best = 0x7fffffff;
for (int k = 0; k < 15; k++)
best = min(best, step * (k + 1) + ((shared.residual[0] - (step >> 1)) >> k));
total += best;
__syncthreads();
}
output[blockIdx.x + blockIdx.y * gridDim.x] = total;
}
[/codebox]
This is how I invoke it:
dim3 dimBlock(32, 8);
dim3 dimGrid(31, 1, 1);
encodeResidualGPU<<<dimGrid, dimBlock>>>(d_res, d_data, d_coefs, d_shifts, 4608, 4608);
When running correctly, it fills d_res with the number 17640.
The bug causes it to fill d_res with garbage (with some occasional 17640 occurrences).
I'm using a GTS 250 on Vista x64, with CUDA toolkit 2.3 (x64).