Query on converting nested for loops into CUDA

I’m trying to port Viola-Jones face detection code to CUDA kernels (I’m using the OpenCV frontalface cascade). The flow of the algorithm is as follows:

Loop 1: scales the cascade window from 20x20 up to the image size (1280x1024) in steps of 1.1 (the scale factor). Runs approx. 41 times.

Loop 2: moves the detection window in the Y direction. Runs approx. 100 times.

Loop 3: moves the detection window in the X direction. Runs approx. 100 times.

Loop 4: evaluates the stages (22 in the frontalface cascade) for a single window.

Loop 5: evaluates the classifier filters (2135 across the 22 stages of the frontalface cascade) for a single window.
Combined, loops 4 and 5 run 2135 times per window.
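
The loop nest above can be sketched in plain C++ roughly as follows. This is a simplified illustration, not the actual detector: `LoopCounts`, `evaluateCascade`, `countLoops`, the `step` parameter and the rule that the window must fit inside the image are all assumptions made for the sketch, and the cascade stub only counts filters instead of evaluating features.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Iteration counters, so the cost of each loop level is visible.
struct LoopCounts {
    int scales = 0;         // iterations of loop 1
    long long windows = 0;  // iterations of loops 2 and 3, summed over all scales
    long long filters = 0;  // iterations of loops 4 and 5, summed over all windows
};

// Hypothetical stand-in for loops 4 and 5: walks every stage of one window and
// returns how many filters were visited. A real cascade would stop at the
// first stage that rejects the window; this stub never rejects.
long long evaluateCascade(const std::vector<int>& filtersPerStage) {
    long long visited = 0;
    for (int n : filtersPerStage) {      // loop 4: stages
        for (int f = 0; f < n; ++f) {    // loop 5: filters of this stage
            ++visited;                   // the feature evaluation would go here
        }
    }
    return visited;
}

// Loops 1 to 3 around the cascade evaluation. `step` (window stride in pixels)
// is an assumed parameter, not something stated in the description above.
LoopCounts countLoops(int width, int height, int baseWin, double scaleFactor,
                      int step, const std::vector<int>& filtersPerStage) {
    assert(step > 0 && scaleFactor > 1.0);
    LoopCounts c;
    const int limit = std::min(width, height);
    for (double s = 1.0; baseWin * s <= limit; s *= scaleFactor) {  // loop 1
        const int win = static_cast<int>(baseWin * s);
        ++c.scales;
        for (int y = 0; y + win <= height; y += step) {             // loop 2
            for (int x = 0; x + win <= width; x += step) {          // loop 3
                ++c.windows;
                c.filters += evaluateCascade(filtersPerStage);
            }
        }
    }
    return c;
}
```

Calling `countLoops` with the image size, a 20-pixel base window and a scale factor of 1.1 gives the number of windows and filter evaluations for a chosen stride, which is the total work that the GPU version has to cover.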

I have converted loops 2 and 3 into one kernel, kept loop 1 as it is, and call a __host__ __device__ evaluate function for loop 5. So the detection windows (loops 2 and 3) are now processed in parallel.
However, this doesn’t give me any speedup over my CPU code.
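
For reference, the mapping described above (one thread per window position, loop 1 left on the host) could look roughly like the sketch below. It is an illustration under assumptions, not my real code: `evaluateWindow` is a dummy that accepts windows whose origin is a multiple of 64, and `numPositions`, `detectOnHost`, `detectKernel`, `detectOnDevice`, the `step` stride and the 16x16 block size are hypothetical choices. The kernel part is only compiled by nvcc; with a plain C++ compiler only the CPU reference is built.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Let the file build with a plain C++ compiler too: nvcc defines __CUDACC__,
// otherwise the CUDA qualifiers expand to nothing.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// Hypothetical stand-in for the cascade evaluation of one window (loops 4-5).
// It "accepts" windows whose origin is a multiple of 64 in both x and y.
__host__ __device__ inline bool evaluateWindow(int x, int y, int /*win*/) {
    return x % 64 == 0 && y % 64 == 0;
}

// Number of window positions along one axis for a given window size and stride.
__host__ __device__ inline int numPositions(int extent, int win, int step) {
    return extent < win ? 0 : (extent - win) / step + 1;
}

// CPU reference: loops 2 and 3 written out sequentially for one scale.
std::vector<unsigned char> detectOnHost(int width, int height, int win, int step) {
    assert(step > 0);
    const int nx = numPositions(width, win, step);
    const int ny = numPositions(height, win, step);
    std::vector<unsigned char> hits(static_cast<std::size_t>(nx) * ny);
    for (int iy = 0; iy < ny; ++iy) {
        for (int ix = 0; ix < nx; ++ix) {
            hits[static_cast<std::size_t>(iy) * nx + ix] =
                evaluateWindow(ix * step, iy * step, win) ? 1 : 0;
        }
    }
    return hits;
}

#ifdef __CUDACC__
// GPU version: one thread per window position replaces loops 2 and 3.
__global__ void detectKernel(int win, int step, int nx, int ny, unsigned char* hits) {
    const int ix = blockIdx.x * blockDim.x + threadIdx.x;
    const int iy = blockIdx.y * blockDim.y + threadIdx.y;
    if (ix >= nx || iy >= ny) return;  // threads past the grid edge do nothing
    hits[iy * nx + ix] = evaluateWindow(ix * step, iy * step, win) ? 1 : 0;
}

// Called once per scale from the host-side loop 1. d_hits must point to at
// least nx * ny bytes of device memory (for example from cudaMalloc).
void detectOnDevice(int width, int height, int win, int step, unsigned char* d_hits) {
    const int nx = numPositions(width, win, step);
    const int ny = numPositions(height, win, step);
    if (nx == 0 || ny == 0) return;
    const dim3 block(16, 16);
    const dim3 grid((nx + block.x - 1) / block.x, (ny + block.y - 1) / block.y);
    detectKernel<<<grid, block>>>(win, step, nx, ny, d_hits);
}
#endif
```

With this layout each scale is a separate kernel launch, and every thread runs the whole cascade for its own window, which matches the structure I described: loops 2 and 3 in parallel, loop 1 and the per-window evaluation unchanged.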

Any help is greatly appreciated!