Suppose I have the following generic code for performing a prefix-sum across a warp:
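int t = threadIdx.x;                      // lane index; assumes one warp of 32 threads
Type Sum;
__shared__ Type pShared[16+32];           // 16 zero-padding entries followed by 32 scan entries
volatile Type* pScratch = pShared + 16;   // pScratch[-16..-1] is the padding
pScratch[t-16] = Type();                  // clear the padding (lanes 16-31 clear entries that are overwritten next)
pScratch[t] = Sum = pValue[t];            // load this lane's input value
pScratch[t] = Sum = Sum + pScratch[t-1];  // add the neighbour 1 lane below
pScratch[t] = Sum = Sum + pScratch[t-2];  // then 2, 4 and 8 lanes below
pScratch[t] = Sum = Sum + pScratch[t-4];
pScratch[t] = Sum = Sum + pScratch[t-8];
Sum = Sum + pScratch[t-16];               // final step; the result is only needed in Sum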
The pointer to shared memory must be volatile; otherwise the compiler is free to keep the partial sums in registers and won’t actually write them to shared memory at each step. This code works fine for primitive types (int, float, etc.). For vector types it is necessary to define the ‘+’ operator. For example, for int2 we have:
__device__ inline
int2 operator+( const int2& a, const int2& b )
{
    return make_int2( a.x + b.x, a.y + b.y );
}
The problem is that when Type is not a primitive type, the compiler complains that there is no operator for ‘volatile Type = Type’. I don’t think it is possible to define operator=() for the vector types without changing the CUDA header files (an assignment operator has to be a member function), which I don’t want to do. Even if I were to do that, I am not sure what the signature for that operator should look like. Nothing that I have tried has worked.
Does anyone know a clean way to deal with this problem?
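One possible direction, given here only as a rough sketch and not as a confirmed fix: instead of assigning through the volatile pointer with '=', route every shared-memory access through small overloaded helper functions, and give each vector type an overload that copies component by component. The names StoreVolatile, LoadVolatile and ScanStep are hypothetical (they are not part of CUDA), and the #ifndef block is only an assumption so that the sketch also builds as plain C++ without nvcc.

```cpp
#ifndef __CUDACC__
// Plain C++ stand-ins (assumption: only needed when building without nvcc;
// under nvcc, int2 and make_int2 come from the CUDA headers).
#define __device__
struct int2 { int x, y; };
inline int2 make_int2(int x, int y) { int2 v; v.x = x; v.y = y; return v; }
#endif

// The '+' operator from the post.
__device__ inline int2 operator+(const int2& a, const int2& b)
{
    return make_int2(a.x + b.x, a.y + b.y);
}

// Hypothetical helpers, generic versions: used for primitive types,
// where assignment to and from a volatile object is built in.
template <typename T>
__device__ inline void StoreVolatile(volatile T* p, const T& v) { *p = v; }

template <typename T>
__device__ inline T LoadVolatile(const volatile T* p) { return *p; }

// Hypothetical overloads for int2: the members are plain ints, so they can be
// read and written through a volatile pointer without any int2::operator=.
__device__ inline void StoreVolatile(volatile int2* p, const int2& v)
{
    p->x = v.x;
    p->y = v.y;
}

__device__ inline int2 LoadVolatile(const volatile int2* p)
{
    return make_int2(p->x, p->y);
}

// One scan step for lane t: read the value 'offset' lanes below, add it,
// write the new partial sum back, and return it.
template <typename T>
__device__ inline T ScanStep(volatile T* pScratch, int t, int offset, T Sum)
{
    Sum = Sum + LoadVolatile(pScratch + t - offset);
    StoreVolatile(pScratch + t, Sum);
    return Sum;
}
```

With helpers like these, a line such as pScratch[t] = Sum = Sum + pScratch[t-1]; would become Sum = ScanStep(pScratch, t, 1, Sum);, and the two initial stores would go through StoreVolatile in the same way. The trade-offs are that every vector type needs its own pair of overloads, and that an int2 is most likely written as two separate 32-bit stores rather than one 64-bit store.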