Problem with nppiWarpAffine_32f_C1R on Fermi boards

I have a problem with version 4.2 of CUDA NPP running on 64-bit Linux (Ubuntu 10.04 LTS).
I have a host with two GPUs: a Tesla C1060 and a GeForce GTX 560 Ti.
I am developing an application that works on very large floating-point images and at a certain point needs to call nppiWarpAffine_32f_C1R. The application behaves correctly when run on the C1060, but writes nothing to the output image when run on the GTX 560 Ti.
To confirm that the problem lies in this specific function, I took the library sample program boxFilterNPP and modified it to call WarpAffine instead of the original function.
If I use the 8u version of WarpAffine, everything is correct on both boards. If I convert the input image to 32f format and run the 32f version of WarpAffine, I get the same result as in my application: the C1060 is fine, while the GTX 560 Ti does not write to the output image.
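One way to confirm that the call really leaves the destination untouched (rather than, say, writing zeros) is to pre-fill the destination with a sentinel value before the call and count how many pixels changed after copying the result back. A minimal host-side sketch, assuming the image is held in a std::vector<float>; the helper names and the sentinel value are hypothetical and are not part of NPP or of the sample:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sentinel: any value the warp is unlikely to produce.
const float kSentinel = -12345.0f;

// Fill the host buffer with the sentinel before uploading it
// as the destination image.
void fillWithSentinel(std::vector<float>& img) {
    for (std::size_t i = 0; i < img.size(); ++i) {
        img[i] = kSentinel;
    }
}

// After downloading the destination image, count the pixels that
// no longer hold the sentinel, i.e. pixels that were actually written.
std::size_t countWrittenPixels(const std::vector<float>& img) {
    std::size_t written = 0;
    for (std::size_t i = 0; i < img.size(); ++i) {
        if (img[i] != kSentinel) {
            ++written;
        }
    }
    return written;
}
```

If countWrittenPixels returns 0 on one board and a non-zero count on the other for the same input and coefficients, the difference is in the library call and not in the image conversion or the file output.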
Does anyone have an idea what the problem is? Is there a bug in nppiWarpAffine_32f_C1R, or am I missing something?
I really need to work on 32f images and I cannot find a workaround for the problem…
You can find my warpNPP code at http://sally.polito.it/cuda/warpNPP.cpp.
Thanks to anyone who can provide clues to the problem.

Could you try CUDA 5.0 Production?

Hi, I tried your sample code with both CUDA 4.2 and CUDA 5.0 Production on an sm_11 and an sm_30 GPU. Both GPUs work fine with 5.0, but only the sm_11 GPU works with 4.2. So I think this bug has already been fixed.