DWT implementation on GPU

I have been trying to implement a DWT (discrete wavelet transform) on the GPU. It works well for a 512x512 image. For a 1k x 1k image, however, the forward DWT works well, but the inverse gives me a noisy image.
I copy the entire image to the GPU and do the processing there. For the 1k image, I launch 2 blocks of 512 threads each. If I use only 512 threads for the 1k implementation, it works well even for the 1k image.
No error is reported. I do not use any shared memory, only global memory. I am using a Quadro FX 4800. What could be the problem when working with more than 512 threads?
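
To make the setup concrete, here is a minimal sketch of the kind of launch described above: 1024 elements covered by 2 blocks of 512 threads, with explicit error checks after the launch. It is not the actual DWT code. The kernel `rowStep`, the helper `blocksFor`, and the buffer names are hypothetical placeholders, and the kernel body only copies data to show how the global index is formed across blocks.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of blocks needed to cover n elements.
static int blocksFor(int n, int threadsPerBlock) {
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Placeholder kernel (NOT the real DWT): each thread copies one element.
// It only illustrates indexing when more than one block is launched.
__global__ void rowStep(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global index across blocks
    if (i < n) {
        out[i] = in[i];
    }
    // Note: __syncthreads() synchronizes threads within one block only,
    // so it gives no ordering between block 0 and block 1.
}

int main() {
    const int n = 1024;                 // one row of a 1k x 1k image
    const int threadsPerBlock = 512;
    const int blocks = blocksFor(n, threadsPerBlock);  // 2 blocks here

    float* dIn = NULL;
    float* dOut = NULL;
    if (cudaMalloc((void**)&dIn, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc((void**)&dOut, n * sizeof(float)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dIn, 0, n * sizeof(float));

    rowStep<<<blocks, threadsPerBlock>>>(dIn, dOut, n);

    // Check for launch errors first, then for errors during execution.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }
    printf("blocks=%d status=%s\n", blocks, cudaGetErrorString(err));

    cudaFree(dIn);
    cudaFree(dOut);
    return err == cudaSuccess ? 0 : 1;
}
```

On older toolkits, `cudaThreadSynchronize()` takes the place of `cudaDeviceSynchronize()`. Writing to a separate output buffer, as this sketch does, avoids one block reading values that another block is still overwriting; whether that is the cause of the noisy inverse here is only a guess.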
Could anyone please help?
Thanks in advance.