Hey there,
I couldn’t find out who the author of the SDK example “dwtHaar1D” is, so I’ll just post it here…
I think there is a small bug in the kernel when the variable dlevels is 1, which happens e.g. if the signal length is 2^11 or 2^21:
__global__ void
dwtHaar1D( float* id, float* od, float* approx_final,
           const unsigned int dlevels,
           const unsigned int slength_step_half,
           const int bdim )
{
    .
    .
    .
    // approximation coefficient
    // store in shared memory for further decomposition steps in this global step
    shared[atid] = (data0 + data1) * INV_SQRT_2;
    // all threads have to write approximation coefficient to shared memory before
    // next steps can take place
    __syncthreads();
    // early out if possible
    // the compiler removes this part from the source because dlevels is
    // a constant shader input
    // note: syncthreads in bodies of branches can lead to dead-locks unless
    // the condition evaluates the same way for ALL threads of a block, as in
    // this case
    if( dlevels > 1)
    {
        .
        .
        .
        // write the top most level element for the next decomposition steps
        // which are performed after an interblock synchronization on host side
        if( 0 == tid)
        {
            approx_final[bid] = shared[0];
        }
    } // end early out if possible
}
In this case the body of the outer if statement is not executed, so the last “if( 0 == tid)” is not executed either and the approximation coefficient in shared memory is never written back to global memory.
I have verified this and it indeed happens.
Moving the last if statement out one level, so that it is executed every time the kernel is called, should solve the problem.
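To make the proposed change concrete, here is a minimal sketch of it. This is not the SDK kernel: haarSketch and runSketch are hypothetical names, the signature is simplified, and the detail coefficients of the deeper levels are left out. The only point is that the write to approx_final sits after the "dlevels > 1" branch, so it also happens when dlevels is 1.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define INV_SQRT_2 0.70710678118654752440f

// Simplified stand-in for the SDK kernel (hypothetical name and signature).
// Each thread handles one input pair; deeper levels only keep the
// approximation coefficients, the detail coefficients are omitted here.
__global__ void haarSketch(const float* id, float* od, float* approx_final,
                           const unsigned int dlevels)
{
    extern __shared__ float shared[];
    const unsigned int tid  = threadIdx.x;
    const unsigned int bid  = blockIdx.x;
    const unsigned int bdim = blockDim.x;
    const unsigned int base = 2 * bid * bdim;

    const float data0 = id[base + 2 * tid];
    const float data1 = id[base + 2 * tid + 1];

    // first level: detail goes to global memory, approximation to shared memory
    od[bid * bdim + tid] = (data0 - data1) * INV_SQRT_2;
    shared[tid] = (data0 + data1) * INV_SQRT_2;
    __syncthreads();

    // dlevels is the same for all threads, so the barriers below are safe
    if (dlevels > 1)
    {
        unsigned int active = bdim >> 1;
        for (unsigned int lvl = 1; lvl < dlevels; ++lvl)
        {
            float a = 0.0f, b = 0.0f;
            if (tid < active) { a = shared[2 * tid]; b = shared[2 * tid + 1]; }
            __syncthreads();   // all reads done before anybody overwrites
            if (tid < active) { shared[tid] = (a + b) * INV_SQRT_2; }
            __syncthreads();
            active >>= 1;
        }
    }

    // moved out of the branch: executed on every call, also for dlevels == 1
    if (0 == tid)
    {
        approx_final[bid] = shared[0];
    }
}

// Hypothetical host helper: runs a single block over 2 * threads inputs
// and returns approx_final[0]. Error checking is omitted for brevity.
float runSketch(const float* h_in, unsigned int threads, unsigned int dlevels)
{
    assert(threads >= 1 && threads <= 1024);
    float *d_in = 0, *d_od = 0, *d_approx = 0;
    float result = -1.0f;
    cudaMalloc(&d_in, 2 * threads * sizeof(float));
    cudaMalloc(&d_od, threads * sizeof(float));
    cudaMalloc(&d_approx, sizeof(float));
    cudaMemcpy(d_in, h_in, 2 * threads * sizeof(float), cudaMemcpyHostToDevice);
    haarSketch<<<1, threads, threads * sizeof(float)>>>(d_in, d_od, d_approx, dlevels);
    cudaMemcpy(&result, d_approx, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_od);
    cudaFree(d_approx);
    return result;
}
```

Calling runSketch with two input values, one thread and dlevels = 1 mirrors the failing case from above: with the write inside the branch approx_final would stay untouched, with the write moved out it receives the single approximation coefficient.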
PS: Anybody got a 2D Wavelet implementation?