Execution failure in the ImageDenoising sample

In the ImageDenoising sample, I changed BLOCKDIM_X and BLOCKDIM_Y from 8 to 16 (a block of 8x8 becomes 16x16), so that a block has 256 threads instead of 64. When I run the changed sample, execution fails in the NLM2 kernel. I checked the program and do not see any logic error. I think the cause may lie in the shared memory usage. Can anyone tell me why this change causes the failure?

Any reply is welcome.

BLOCKDIM_X and BLOCKDIM_Y changed to 16 works fine for me.
Did you change anything else in the code?

No, I only changed that one place, and if I change it back to 8, everything works fine. With BLOCKDIM_X and BLOCKDIM_Y set to 16, the KNN and NLM algorithms work fine, but NLM2 causes the execution failure. After the failure, even the original program no longer runs successfully. Stranger still, if I then run another compute-heavy CUDA sample such as particles, the computer goes to a Blue Screen Of Death, which says that nv4_disp may have caused the error. After a reboot, the original program runs correctly again. The problem only comes back when I run the changed program.

BTW, my graphics card is a Quadro FX 5600, and my driver and SDK are both version 1.0.