Bug in dwtHaar1D SDK example?

Hey there,

I couldn’t find out who the author of the SDK example “dwtHaar1D” is, so I’ll just post it here…

I think there is a small bug in the kernel when the variable dlevels is 1, which happens e.g. if the signal length is 2^11 or 2^21:

__global__ void 
dwtHaar1D( float* id, float* od, float* approx_final, 
           const unsigned int dlevels,
           const unsigned int slength_step_half,
           const int bdim ) 
{
    // [...]

    // approximation coefficient
    // store in shared memory for further decomposition steps in this global step
    shared[atid] = (data0 + data1) * INV_SQRT_2;

    // all threads have to write approximation coefficient to shared memory before 
    // next steps can take place
    __syncthreads();

    // early out if possible
    // the compiler removes this part from the source because dlevels is 
    // a constant shader input
    // note: syncthreads in bodies of branches can lead to dead-locks unless
    // the condition evaluates the same way for ALL threads of a block, as in 
    // this case
    if( dlevels > 1) 
    {
        // [...]

        // write the top most level element for the next decomposition steps
        // which are performed after an interblock synchronization on host side
        if( 0 == tid) 
        {
            approx_final[bid] = shared[0];
        }
    } // end early out if possible
}

In this case the body of the “if( dlevels > 1)” statement is not executed, so the inner “if( 0 == tid)” is not executed either, and the approximation coefficient held in shared memory is never written back to global memory.

I have verified this and it indeed happens.

Moving the last if statement out one level, so that it is executed on every kernel call, should solve the problem.
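To make the control-flow issue concrete, here is a minimal CPU sketch in C++ that models what a single thread block would store in approx_final[bid]. It is not the SDK kernel: the function name block_approx_final, the write_inside_branch flag and the sentinel value are all assumptions introduced for illustration, and the threads of the block are simulated by plain loops.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical CPU model of one thread block (not the SDK kernel).
// Runs `dlevels` Haar steps over `data` and returns the value the block
// would leave in approx_final[bid].
// write_inside_branch == true  : write-back sits inside `if (dlevels > 1)`,
//                                as in the code quoted above.
// write_inside_branch == false : write-back moved out one level, as proposed.
// `sentinel` stands in for global memory that was never written.
float block_approx_final(std::vector<float> data, unsigned int dlevels,
                         bool write_inside_branch, float sentinel)
{
    const float inv_sqrt_2 = 1.0f / std::sqrt(2.0f);
    float approx_final = sentinel;

    // First level: always executed, one approximation coefficient per pair.
    std::size_t n = data.size() / 2;
    std::vector<float> shared(n);
    for (std::size_t i = 0; i < n; ++i) {
        shared[i] = (data[2 * i] + data[2 * i + 1]) * inv_sqrt_2;
    }

    if (dlevels > 1) {
        // Remaining levels: keep halving the approximation coefficients.
        for (unsigned int level = 1; level < dlevels && n > 1; ++level) {
            n /= 2;
            for (std::size_t i = 0; i < n; ++i) {
                shared[i] = (shared[2 * i] + shared[2 * i + 1]) * inv_sqrt_2;
            }
        }
        if (write_inside_branch) {
            approx_final = shared[0];  // original placement
        }
    }

    if (!write_inside_branch) {
        approx_final = shared[0];      // proposed placement: always executed
    }
    return approx_final;
}
```

With dlevels equal to 1, the original placement returns the sentinel unchanged, while the proposed placement returns the first approximation coefficient. For dlevels greater than 1, both placements give the same result.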

PS: Anybody got a 2D Wavelet implementation?

Thanks for the post. I also think this is a bug.

I realize I am replying to this post a year late, but I found some additional bugs and thought this could help other people. In dwtHaar1D.cu:

@@ -297,28 +314,36 @@
 		// update level variables
-		if( dlevels_left < 10) 
+		if( dlevels_left <= 10) 
 		{
 			// approx_final = d_odata;
 			approx_is_input = 0;
 		}

 		// more global steps necessary
-		dlevels_step = (dlevels_left > 10) ? dlevels_left - 10 : dlevels_left;
+		dlevels_step = (dlevels_left > 10) ? 10 : dlevels_left;
 		dlevels_left -= 10; 

 		// after each step only half the threads are used any longer
 		// therefore after 10 steps 2^10 less threads
 		num_threads_total_left = num_threads_total_left >> 10;
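The effect of the changed ternary can be checked with a small host-side sketch in C++. It assumes that a single kernel launch handles at most 10 decomposition levels, which is what the ">> 10" shift in the diff suggests. The names original_step, corrected_step and levels_per_launch are hypothetical helpers and not part of the SDK sample; the last one is a simplified model of the host loop rather than a copy of it.

```cpp
#include <algorithm>
#include <vector>

// The expression before the patch: for more than 10 levels left it yields
// the remainder after this launch instead of this launch's level count.
unsigned int original_step(unsigned int dlevels_left)
{
    return (dlevels_left > 10) ? dlevels_left - 10 : dlevels_left;
}

// The expression after the patch: never more than 10 levels per launch.
unsigned int corrected_step(unsigned int dlevels_left)
{
    return (dlevels_left > 10) ? 10 : dlevels_left;
}

// Hypothetical model of the host loop: splits `total_levels` into
// per-launch chunks of at most `max_levels_per_launch` levels.
// Subtracting only `step` avoids the unsigned wrap-around that
// `dlevels_left -= 10` would cause on the last iteration.
std::vector<unsigned int> levels_per_launch(unsigned int total_levels,
                                            unsigned int max_levels_per_launch = 10)
{
    std::vector<unsigned int> schedule;
    unsigned int levels_left = total_levels;
    while (levels_left > 0) {
        const unsigned int step = std::min(levels_left, max_levels_per_launch);
        schedule.push_back(step);
        levels_left -= step;
    }
    return schedule;
}
```

For 11 levels left, original_step gives 1 where 10 levels are expected, and for 25 levels left it gives 15, which is more than one launch can handle under the assumption above. corrected_step gives 10 in both cases. In this model, a signal of length 2^21 (21 levels) is split into launches of 10, 10 and 1 levels, which is exactly the dlevels == 1 case described in the first post.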

I hope NVIDIA reviews these changes and commits them to the next version of the SDK.



FYI, the DWT in the NVIDIA samples is 8x8, whereas CUVI Lib offers a general 2D Haar DWT function. The calling pattern would be:

→ Initialize the data and copy it to the device.
→ Call the DWT context initialization function (to initialize the high-pass and low-pass filter taps, their length and anchor position).
→ Call the DWT function, which launches the kernel and computes the DWT on the GPU.
→ Free any self-allocated memory (all memory allocated by the function is freed by the function itself).

The procedure is the same for the inverse DWT. The library is free and works on its own as well as with NVPP.