Fast way to compute an image pyramid (downscaled images) and the integral image with CUDA

My CPU is a Core i7 at 3.4 GHz with 8 cores, and my GPU is a GeForce GTX 750.
I am implementing a face recognition algorithm in CUDA to reduce the processing time.
The bottleneck is the step that computes the image pyramid and the integral image.
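
For reference, here is a minimal CPU sketch in C++ of the integral image (summed-area table) that this step computes. It is not the CUDA code in question, only a plain reference that a GPU version could be checked against. The names `integral_image` and `box_sum` are illustrative, and an 8-bit grayscale input is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Builds a summed-area table with one extra zero row and column, so that
// sat[(y + 1) * (w + 1) + (x + 1)] holds the sum of all pixels in [0..x] x [0..y].
// Assumption: 8-bit grayscale input stored row by row.
// 32-bit sums are enough for a 5 MP image (5e6 * 255 is below 2^32).
std::vector<std::uint32_t> integral_image(const std::vector<std::uint8_t>& src,
                                          int w, int h) {
    assert(w > 0 && h > 0);
    assert(src.size() == static_cast<std::size_t>(w) * static_cast<std::size_t>(h));
    const std::size_t stride = static_cast<std::size_t>(w) + 1;
    std::vector<std::uint32_t> sat(stride * (static_cast<std::size_t>(h) + 1), 0);
    for (int y = 0; y < h; ++y) {
        std::uint32_t row_sum = 0;  // running sum of the current row
        for (int x = 0; x < w; ++x) {
            row_sum += src[static_cast<std::size_t>(y) * w + x];
            // value directly above plus the running sum of this row
            sat[(y + 1) * stride + (x + 1)] = sat[y * stride + (x + 1)] + row_sum;
        }
    }
    return sat;
}

// Sum of the pixels in the half-open rectangle [x0, x1) x [y0, y1),
// using four lookups in the table.
std::uint32_t box_sum(const std::vector<std::uint32_t>& sat, int w,
                      int x0, int y0, int x1, int y1) {
    const std::size_t s = static_cast<std::size_t>(w) + 1;
    return sat[y1 * s + x1] - sat[y0 * s + x1] - sat[y1 * s + x0] + sat[y0 * s + x0];
}
```

The structure (a running sum along each row, then an accumulation down each column) is the same one that GPU versions usually split into a row scan and a column scan.
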
The original image is 5 MP and the shrink rate between pyramid levels is 1.2.
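
As a rough sketch of the workload: with a shrink rate of 1.2, each level has about 1/1.44 of the pixels of the previous one, so the whole pyramid holds at most about 3.3 times the pixels of the original image. The C++ helper below lists the level sizes. The names `pyramid_levels` and `total_pixels` and the `min_side` stopping rule are assumptions for illustration only.

```cpp
#include <cassert>
#include <vector>

struct Level {
    int width;
    int height;
};

// Lists the pyramid level sizes, starting with the original image and dividing
// by `shrink` each time, until either side would drop below `min_side`.
// Assumption: `min_side` stands in for the detector window size (hypothetical).
std::vector<Level> pyramid_levels(int w, int h, double shrink, int min_side) {
    assert(w > 0 && h > 0 && shrink > 1.0 && min_side > 0);
    std::vector<Level> levels;
    double scale = 1.0;
    while (true) {
        const int lw = static_cast<int>(w / scale);
        const int lh = static_cast<int>(h / scale);
        if (lw < min_side || lh < min_side) break;
        levels.push_back({lw, lh});
        scale *= shrink;
    }
    return levels;
}

// Total number of pixels over all levels, i.e. how much data the
// resize and integral-image steps have to touch per frame.
long long total_pixels(const std::vector<Level>& levels) {
    long long total = 0;
    for (const Level& l : levels) {
        total += static_cast<long long>(l.width) * l.height;
    }
    return total;
}
```

For example, calling `pyramid_levels(2592, 1944, 1.2, 24)` (2592x1944 being one common 5 MP resolution, assumed here, with a hypothetical 24-pixel window) and passing the result to `total_pixels` gives the number of pixels that must be resized and integrated within the 50 ms budget.
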
This step has to take at most 50 ms, but it currently takes 150 ms.
Has anyone had experience with this, or do you have any suggestions?
Thanks in advance.
