Hi there,
I am having a very strange problem on a GT460 1GB.
I am doing parallel sum of arrays of different sizes using the following kernel (instaziated with “float” for class T):
#define NUM_BANKS 16
#define LOG_NUM_BANKS 4
#define CONFLICT_FREE_OFFSET(n) ((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))
template static global void sum_kernel(const T g_idata, const int n, T sum)
{
SharedMemory shared;
T* temp = shared.getPointer();
int thid = threadIdx.x;
int offset = 1;
int ai = thid;
int bi = thid + (n/2);
int bankOffsetA = CONFLICT_FREE_OFFSET(ai);
int bankOffsetB = CONFLICT_FREE_OFFSET(ai);
temp[ai + bankOffsetA] = g_idata[ai];
temp[bi + bankOffsetB] = g_idata[bi];
for (int d = n>>1; d > 0; d >>= 1) // build sum in place up the tree
{
__syncthreads();
if (thid < d)
{
int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
ai += CONFLICT_FREE_OFFSET(ai);
bi += CONFLICT_FREE_OFFSET(bi);
temp[bi] += temp[ai];
}
offset *= 2;
}
//Save sum in array
__syncthreads();
sum[0] += temp[n-1];
}
which is derived from Mark Harris Parallel Prefix Scan code (april 2007)
The problem is that if I run the code compiling with “-arch sm_13” it runs fine, while if I compile
with “-arch sm_21” the execution stops with the following:
unspecified launch failure
and dmesg shows:
NVRM: Xid (0001:00): 13, 0006 00000000 000090c0 00001b0c 00000000 00000000
To make stuff funnier, I modified the above kernel as follows:
template static global void sum_kernel(const T g_idata, const int n, T sum)
{
SharedMemory shared;
T* temp = shared.getPointer();
int thid = threadIdx.x;
int offset = 1;
int ai = thid;
int bi = thid + (n/2);
int bankOffsetA = CONFLICT_FREE_OFFSET(ai);
int bankOffsetB = CONFLICT_FREE_OFFSET(ai);
temp[ai + bankOffsetA] = g_idata[ai];
temp[bi + bankOffsetB] = g_idata[bi];
for (int d = n>>1; d > 0; d >>= 1) // build sum in place up the tree
{
__syncthreads();
if (thid < d)
{
int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
ai += CONFLICT_FREE_OFFSET(ai);
bi += CONFLICT_FREE_OFFSET(bi);
temp[bi] += temp[ai];
}
offset *= 2;
}
//Save sum in array
__syncthreads();
sum[0] += temp[n-1];
//new part follows
if ( thid == 0 )
temp [ n - 1 + CONFLICT_FREE_OFFSET(n - 1) ] = 0;
for (int d = 1; d < n; d *= 2) // traverse down tree & build scan
{
offset >>= 1;
__syncthreads();
if (thid < d)
{
int ai = offset*(2*thid+1)-1;
int bi = offset*(2*thid+2)-1;
ai += CONFLICT_FREE_OFFSET(ai);bi += CONFLICT_FREE_OFFSET(bi);
float t = temp[ai];
temp[ai] = temp[bi];
temp[bi] += t;
}
}
__syncthreads();
}
the part added belongs (again) to Mark Harris Parallel Prefix Scan code, and if I compile and run
everything works correctly also with the “-arch sm_21” flag.
I do not understand why this happens since the second part of the code sorts the sum tree to obtain
an exclusive prefix scan (cumulative sum)
BTW, the code runs flawless on a gf9400m (mcbook pro), compiling with “-arch sm_11” and using “float”, without adding the
“spurious” part.
I attach the nvidia bug report
If somebody has a clue…
Cheers
c.