I'm trying to port Viola-Jones face detection code to CUDA kernels. The flow of the algorithm is as follows (I'm using the OpenCV frontalface cascades):
loop1: scale the cascade window from 20x20 up to the image size (1280x1024) in steps of 1.1 (the scale factor); runs approx. 41 times
loop2: move the detection window in the Y direction; runs approx. 100 times
loop3: move the detection window in the X direction; runs approx. 100 times
loop4: evaluate the stages (22 in the frontalface cascade) for a single window
loop5: evaluate the classifier filters (2135 across the 22 stages of the frontalface cascade) for a single window
Combined, loop4 and loop5 run 2135 times per window.
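For reference, the loop nest above can be sketched as plain host-side C++. This is only an illustration of the structure, not my actual code: the names Stage, evaluateWindow and scanImage, and the stepX/stepY parameters, are made up for the example, and the classifier evaluation is replaced by a counter so the amount of work per image is visible.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical stand-in for a cascade stage: only the classifier count matters here.
struct Stage { int numClassifiers; };

// loop4 + loop5 for one window. Returns how many classifiers were visited.
// A real cascade would compute Haar features and could reject the window
// after any stage; this sketch visits every classifier.
long evaluateWindow(const std::vector<Stage>& stages) {
    long evaluated = 0;
    for (const Stage& st : stages)                    // loop4: stages
        for (int c = 0; c < st.numClassifiers; ++c)   // loop5: classifiers
            ++evaluated;
    return evaluated;
}

// loop1..loop3: every scale, then every window position at that scale.
// stepX/stepY are assumed parameters; the post does not state the stride.
long scanImage(int imageW, int imageH, int baseWindow, double scaleFactor,
               int stepX, int stepY, const std::vector<Stage>& stages) {
    long total = 0;
    const int limit = std::min(imageW, imageH);
    for (double s = 1.0; baseWindow * s <= limit; s *= scaleFactor) {  // loop1
        const int win = static_cast<int>(baseWindow * s);
        for (int y = 0; y + win <= imageH; y += stepY)                 // loop2
            for (int x = 0; x + win <= imageW; x += stepX)             // loop3
                total += evaluateWindow(stages);
    }
    return total;
}
```

Calling scanImage(1280, 1024, 20, 1.1, stepX, stepY, stages) with the real stage list would give the total number of classifier visits for one frame under these assumptions.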
I have merged loop2 and loop3 into a single kernel, kept loop1 on the host as it is, and call a __host__ __device__ evaluate function for loop5. So the detection windows (loop2 and loop3) are now evaluated in parallel.
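To make the loop2/loop3 merge concrete, here is a minimal sketch of how one flat thread index can be mapped to a window origin. It is written as plain C++ so it can be checked on the host; the names WindowPos and windowFromIndex and the step parameter are hypothetical. Inside a kernel, idx would typically be blockIdx.x * blockDim.x + threadIdx.x, and the function could be marked __host__ __device__ when compiled with nvcc.

```cpp
// Hypothetical result type: window origin plus a flag for out-of-range indices.
struct WindowPos {
    int x;
    int y;
    bool valid;
};

// Maps a flat index to the (x, y) origin of one detection window at a given
// window size. One index corresponds to one iteration of the fused loop2/loop3.
WindowPos windowFromIndex(int idx, int imageW, int imageH, int win, int step) {
    if (win > imageW || win > imageH || step <= 0 || idx < 0)
        return {0, 0, false};
    const int nx = (imageW - win) / step + 1;  // window positions along X
    const int ny = (imageH - win) / step + 1;  // window positions along Y
    if (idx >= nx * ny)
        return {0, 0, false};                  // surplus threads do nothing
    return {(idx % nx) * step, (idx / nx) * step, true};
}
```

With this mapping, each scale of loop1 needs nx * ny threads, and any thread whose index comes back invalid simply returns.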
However, this does not run any faster than my CPU code.
Any help is appreciated!