Has anyone implemented integral image calculation using multi-GPU programming?

I have implemented integral image calculation in CUDA on a single GPU in two ways.
One uses the NPP library and the other is my own implementation.
Both are slow, so I added a second GPU and am trying to implement a multi-GPU version.
But it is no faster than the single-GPU version.
Is there a library that calculates the integral image on multiple GPUs? Or has anyone implemented this?
Thanks in advance.

This is outside my area of expertise. What reference is your own implementation based on? How does its speed compare with NPP and what is NPP’s performance? What GPU did you use?

Here are two fairly recent publications that seem relevant, but since I am not familiar with this application domain, I don’t know whether this is of any help:

Berkin Bilgic, Berthold K.P. Horn, and Ichiro Masaki, “Efficient Integral Image Computation on the GPU”. 2010 IEEE Intelligent Vehicles Symposium, San Diego, CA, 21-24 June 2010, pp. 528-533

Marwa Chouchene, Fatma Ezahra Sayadi, Mohamed Atri, and Rached Tourki, “Integral Image Computation on GPU”. 10th International Multi-Conference on Systems, Signals & Devices (SSD), Hammamet, Tunisia, 18-21 March 2013, pp. 1-4

Thanks, njuffa.
My GPU is an NVIDIA GeForce GTX 750, the platform is CUDA 6.5, and the images I am processing are 5 MP.
NPP and my single-GPU implementation run at almost the same speed.
I need a multi-GPU implementation or library.
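
For reference, one way a multi-GPU split could be organized is by horizontal strips: each device computes the integral image of its own strip independently, and a second pass adds the last finished row of the strip above as a carry. The sketch below only illustrates that decomposition on the CPU in plain C++; it is not taken from NPP or from either implementation mentioned in this thread. The names `integrateStrip` and `integralImageStrips` are hypothetical, and `integrateStrip` stands in for whatever per-device kernel or library call would actually be run (for example after selecting a device with `cudaSetDevice`). Whether this beats a single GPU depends on host-device transfer cost, which the sketch does not model.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the per-device work: computes the inclusive
// integral image of rows [row0, row1) only, ignoring all rows above row0.
void integrateStrip(const std::vector<float>& img, std::vector<double>& out,
                    std::size_t width, std::size_t row0, std::size_t row1) {
    for (std::size_t y = row0; y < row1; ++y) {
        double rowSum = 0.0;  // running sum along the current row
        for (std::size_t x = 0; x < width; ++x) {
            rowSum += img[y * width + x];
            double above = (y > row0) ? out[(y - 1) * width + x] : 0.0;
            out[y * width + x] = rowSum + above;
        }
    }
}

// Hypothetical driver: splits the image into numStrips horizontal strips
// (one per device in a multi-GPU setting), integrates each independently,
// then stitches the strips together with a carry row.
std::vector<double> integralImageStrips(const std::vector<float>& img,
                                        std::size_t width, std::size_t height,
                                        std::size_t numStrips) {
    std::vector<double> out(width * height, 0.0);
    if (width == 0 || height == 0 || numStrips == 0) return out;

    // Strip s covers rows [bounds[s], bounds[s + 1]).
    std::vector<std::size_t> bounds(numStrips + 1);
    for (std::size_t s = 0; s <= numStrips; ++s)
        bounds[s] = s * height / numStrips;

    // Phase 1: strips are independent, so they could run concurrently.
    for (std::size_t s = 0; s < numStrips; ++s)
        integrateStrip(img, out, width, bounds[s], bounds[s + 1]);

    // Phase 2: top to bottom, add the last already-corrected row above
    // each strip to every row of that strip.
    for (std::size_t s = 1; s < numStrips; ++s) {
        if (bounds[s] == 0 || bounds[s] == bounds[s + 1]) continue;
        const std::size_t carryRow = bounds[s] - 1;
        for (std::size_t y = bounds[s]; y < bounds[s + 1]; ++y)
            for (std::size_t x = 0; x < width; ++x)
                out[y * width + x] += out[carryRow * width + x];
    }
    return out;
}
```

Calling `integralImageStrips(img, width, height, 1)` gives the plain single-pass result, which makes it easy to check a split version against a single-GPU baseline.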