Parallel sum, arch sm_21, xid

Hi,

I am having a very strange problem on a GT460 1GB.
I am doing parallel sum of matrices of different sizes using the following kernel (instaziated with “float” for class T):

template static global void sum_kernel(const T g_idata, const int n, T sum)
{
SharedMemory shared;
T* temp = shared.getPointer();

    int thid = threadIdx.x; 

    int offset = 1; 

    int ai = thid; 
    int bi = thid + (n/2); 

    int bankOffsetA = CONFLICT_FREE_OFFSET(ai); 
    int bankOffsetB = CONFLICT_FREE_OFFSET(ai); 

    temp[ai + bankOffsetA] = g_idata[ai];  
    temp[bi + bankOffsetB] = g_idata[bi];  

    for (int d = n>>1; d > 0; d >>= 1) // build sum in place up the tree 
    { 
            __syncthreads(); 

            if (thid < d)    
            { 
                    int ai = offset*(2*thid+1)-1; 
                    int bi = offset*(2*thid+2)-1; 
                    ai += CONFLICT_FREE_OFFSET(ai); 
                    bi += CONFLICT_FREE_OFFSET(bi); 

                    temp[bi] += temp[ai];         
            } 
            offset *= 2; 
    } 
    
    //Save sum in array
    __syncthreads();
    sum[0] += temp[n-1];

}

which is derived from Mark Harris Parallel Prefix Scan code (april 2007)
The problem is that if I run the code compiling with “-arch sm_13” it runs fine, while if I compile
with “-arch sm_21” the execution stops with the following:

unspecified launch failure

and dmesg shows:

NVRM: Xid (0001:00): 13, 0006 00000000 000090c0 00001b0c 00000000 00000000

To make stuff funnier, I modified the above kernel as follows:

template static global void sum_kernel(const T g_idata, const int n, T sum)
{
SharedMemory shared;
T* temp = shared.getPointer();

    int thid = threadIdx.x; 

    int offset = 1; 

    int ai = thid; 
    int bi = thid + (n/2); 

    int bankOffsetA = CONFLICT_FREE_OFFSET(ai); 
    int bankOffsetB = CONFLICT_FREE_OFFSET(ai); 

    temp[ai + bankOffsetA] = g_idata[ai];  
    temp[bi + bankOffsetB] = g_idata[bi];  

    for (int d = n>>1; d > 0; d >>= 1) // build sum in place up the tree 
    { 
            __syncthreads(); 

            if (thid < d)    
            { 
                    int ai = offset*(2*thid+1)-1; 
                    int bi = offset*(2*thid+2)-1; 
                    ai += CONFLICT_FREE_OFFSET(ai); 
                    bi += CONFLICT_FREE_OFFSET(bi); 

                    temp[bi] += temp[ai];         
            } 
            offset *= 2; 
    } 
    
    //Save sum in array
    __syncthreads();
    sum[0] += temp[n-1];
    

if ( thid == 0 ) 
	temp [ n - 1 + CONFLICT_FREE_OFFSET(n - 1) ] = 0; 

for (int d = 1; d < n; d *= 2) // traverse down tree & build scan 
    { 
	offset >>= 1; 
	__syncthreads(); 
	
	if (thid < d) 
	{
		
		int ai = offset*(2*thid+1)-1; 
		int bi = offset*(2*thid+2)-1; 
		ai += CONFLICT_FREE_OFFSET(ai);bi += CONFLICT_FREE_OFFSET(bi);
		
		float t   = temp[ai]; 
		temp[ai]  = temp[bi]; 
		temp[bi] += t; 
	} 
} 

__syncthreads();

}

the part added belongs to Mark Harris Parallel Prefix Scan code, and if I compile and run
everything works correctly also with the “-arch sm_21” flag.

I do not understand why this happens since the second part of the code sorts the sum tree to obtain
an exclusive prefix scan (cumulative sum)

BTW, the code runs flawless on a gf9400m (mcbook pro), compiled