DWT implementation on the GPU

I have been trying to implement a DWT on the GPU. It works well for a 512×512 image. For a 1k×1k image, however, the forward DWT works well, but the inverse gives me a noisy image.
I copy the entire image to the GPU and do the processing there. For the 1k image I have declared 2 blocks of 512 threads each. If I use only 512 threads for my 1k implementation, it works well even for the 1k image.
No error is reported. I have not used any shared memory or registers, only global memory. I am using a Quadro FX 4800.
I am working on Linux 5.2.
What could be the problem when running with more than 512 threads?
Could anyone please help?
Thanks in advance.

Your GPU has compute capability 1.3, so it supports a maximum of 512 threads per block.
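A launch that asks for more threads per block than the device supports fails without printing anything unless the error is checked explicitly. As a rough sketch of how to confirm the limit and catch such a failure: `dummyKernel` and `check` below are hypothetical names standing in for the real DWT kernel and an error helper, device 0 is assumed, and `cudaDeviceSynchronize` assumes a toolkit that provides it (older toolkits used `cudaThreadSynchronize`).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel; it stands in for the real DWT kernel.
__global__ void dummyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global element index
    if (i < n) data[i] *= 2.0f;
}

// Print a labelled CUDA error; returns true if the call succeeded.
static bool check(cudaError_t err, const char *what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

int main()
{
    // Ask the device (assumed to be device 0) for its per-block limit.
    cudaDeviceProp prop;
    if (!check(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties")) return 1;
    std::printf("compute capability %d.%d, max threads per block: %d\n",
                prop.major, prop.minor, prop.maxThreadsPerBlock);

    const int n = 1024;
    float *d_data = NULL;
    if (!check(cudaMalloc((void **)&d_data, n * sizeof(float)), "cudaMalloc")) return 1;
    if (!check(cudaMemset(d_data, 0, n * sizeof(float)), "cudaMemset")) return 1;

    // Stay within the limit and cover n elements with as many blocks as needed.
    const int threads = prop.maxThreadsPerBlock;
    const int blocks = (n + threads - 1) / threads;
    dummyKernel<<<blocks, threads>>>(d_data, n);

    // Launch-configuration errors show up here, not as a printed message.
    bool ok = check(cudaGetLastError(), "kernel launch");
    // Errors raised while the kernel runs show up after synchronizing.
    ok = check(cudaDeviceSynchronize(), "kernel execution") && ok;

    cudaFree(d_data);
    return ok ? 0 : 1;
}
```

If both checks pass with the real kernel and launch configuration, the block size is probably not the cause and the indexing inside the kernel is the next thing to look at.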


I am using only 512 threads per block. I am only trying to increase the number of blocks. No error is displayed, but the output is wrong.

It works fine with 512 threads.

Please help.
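Without seeing the kernel this is only a guess, but two things commonly break when going from one block to several: the index is computed from `threadIdx.x` alone, or the transform runs in place and relies on `__syncthreads()`, which only synchronizes threads within one block. The sketch below shows one way to avoid both, assuming a single-level 1D orthonormal Haar transform (the wavelet actually used is not stated in the thread) and a hypothetical kernel name `inverseHaar1D`. It reads from one buffer and writes to another, so no block depends on another block's writes.

```cuda
#include <cstdio>
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

// One level of a 1D inverse Haar transform, out of place (illustrative only).
// in  = [a_0 .. a_{half-1} | d_0 .. d_{half-1}], out holds 2*half samples.
__global__ void inverseHaar1D(const float *in, float *out, int half)
{
    // Global index: includes blockIdx.x so every block handles its own range.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= half) return;

    const float s = 0.70710678f;  // 1/sqrt(2)
    float a = in[i];              // approximation coefficient
    float d = in[half + i];       // detail coefficient
    out[2 * i]     = (a + d) * s;
    out[2 * i + 1] = (a - d) * s;
}

int main()
{
    const int n = 1024, half = n / 2;
    const float s = 0.70710678f;
    std::vector<float> x(n), coeffs(n), rec(n);

    // Test signal and its forward Haar transform, computed on the host.
    for (int i = 0; i < n; ++i) x[i] = (float)(i % 17);
    for (int i = 0; i < half; ++i) {
        coeffs[i]        = (x[2 * i] + x[2 * i + 1]) * s;
        coeffs[half + i] = (x[2 * i] - x[2 * i + 1]) * s;
    }

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, &coeffs[0], n * sizeof(float), cudaMemcpyHostToDevice);

    // 256 threads per block gives 2 blocks for half = 512.
    const int threads = 256;
    const int blocks = (half + threads - 1) / threads;
    inverseHaar1D<<<blocks, threads>>>(d_in, d_out, half);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(&rec[0], d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Compare the reconstruction against the original signal.
    float maxErr = 0.0f;
    for (int i = 0; i < n; ++i) maxErr = fmaxf(maxErr, fabsf(rec[i] - x[i]));
    std::printf("max reconstruction error: %g\n", maxErr);

    cudaFree(d_in);
    cudaFree(d_out);
    return maxErr < 1e-4f ? 0 : 1;
}
```

If an in-place transform is required, splitting each level into separate kernel launches is the usual way to get a grid-wide synchronization point, since all blocks of one launch finish before the next launch on the same stream starts.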