Trying to use unified memory for face-detection cascades

I’m trying to keep all the cascades for Viola-Jones face detection on the GPU, so I allocated them with cudaMallocManaged (i.e. unified memory) and wrote a kernel that evaluates the cascades on the GPU. The kernel executes correctly when I print its values, but the algorithm has become tremendously slow and takes minutes to evaluate a single frame. Is this because of unified memory? Will it improve if I use cudaMalloc instead of cudaMallocManaged?
Any help would be much appreciated!
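
One way to check whether the allocation type matters is to time the same kernel on a managed buffer and on a cudaMalloc buffer. The following is only a minimal sketch: touchCascade, timeLaunch and compareAllocations are hypothetical names, and the kernel is a stand-in that just reads a table of floats, not the actual cascade evaluation from the post.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for cascade evaluation: every thread reads the whole table.
__global__ void touchCascade(const float* cascade, int n, float* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc = 0.0f;
    for (int k = 0; k < n; ++k) acc += cascade[k];
    out[i] = acc;
}

// Times one launch (64 blocks of 256 threads) with CUDA events; returns milliseconds.
static float timeLaunch(const float* cascade, int n, float* out)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    touchCascade<<<64, 256>>>(cascade, n, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

// Runs the same kernel on a managed copy and on a device-only copy of the table.
bool compareAllocations(int n, float* managedMs, float* deviceMs)
{
    std::vector<float> host(n, 1.0f);
    float *managed = 0, *device = 0, *out = 0;
    bool ok = cudaMallocManaged(&managed, n * sizeof(float)) == cudaSuccess
           && cudaMalloc(&device, n * sizeof(float)) == cudaSuccess
           && cudaMalloc(&out, 64 * 256 * sizeof(float)) == cudaSuccess;
    if (ok) {
        // The CPU fills the managed buffer before any kernel is launched.
        for (int k = 0; k < n; ++k) managed[k] = host[k];
        cudaMemcpy(device, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

        timeLaunch(device, n, out);               // warm-up launch, not measured
        *managedMs = timeLaunch(managed, n, out);
        *deviceMs  = timeLaunch(device, n, out);

        cudaDeviceSynchronize();
        ok = (cudaGetLastError() == cudaSuccess);
    }
    cudaFree(managed);   // cudaFree on a null pointer is a no-op
    cudaFree(device);
    cudaFree(out);
    return ok;
}
```

Calling compareAllocations(n, &a, &b) with n set to the real cascade size gives two numbers to compare. If they are close, the slowdown probably comes from the kernel itself (memory access pattern, divergence, launch configuration) rather than from the allocation type.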

Hi AkashNebhwani,

Memory allocated with cudaMallocManaged is page-locked, as opposed to the regular pageable host memory allocated by malloc(). Using page-locked host memory has several benefits, but consuming too much of it might reduce overall system performance.
You could refer to:
http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#page-locked-host-memory

Thanks

Thanks kayccc !!!

I have now also tried putting the cascades directly on the GPU using cudaMalloc, but I’m still not getting the timings I expected from profiling. I’m looking for optimised ways to evaluate Viola-Jones in parallel on the GPU.
Any ideas or links would be appreciated!
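
For reference, one common way to parallelise the evaluation is one thread per detection window, with each thread leaving the cascade at the first stage it fails. The following is only a rough sketch under assumptions: the Rect/Weak/Stage layout and the names evalCascade and runTinyExample are hypothetical, the window size is fixed, and variance normalisation and multi-scale scanning are left out.

```cuda
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical flattened cascade layout; a real trained cascade carries more fields.
struct Rect  { int x, y, w, h; float weight; };            // weight 0 marks an unused rectangle
struct Weak  { Rect r[3]; float threshold, left, right; }; // one Haar feature plus its two votes
struct Stage { int first, count; float threshold; };       // range of weak classifiers

// Sum of pixels in [x, x+w) x [y, y+h) from a (width+1) x (height+1) integral image.
__device__ inline int rectSum(const int* ii, int stride, int x, int y, int w, int h)
{
    return ii[(y + h) * stride + (x + w)] - ii[y * stride + (x + w)]
         - ii[(y + h) * stride + x]       + ii[y * stride + x];
}

// One thread per window origin (x, y); the stage loop stops at the first failed stage.
__global__ void evalCascade(const int* ii, int width, int height,
                            const Stage* stages, int numStages,
                            const Weak* weaks, int win, unsigned char* hits)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x + win > width || y + win > height) return;   // window does not fit
    int stride = width + 1;
    bool pass = true;
    for (int s = 0; s < numStages && pass; ++s) {
        float stageSum = 0.0f;
        for (int k = stages[s].first; k < stages[s].first + stages[s].count; ++k) {
            float f = 0.0f;
            for (int r = 0; r < 3; ++r)
                f += weaks[k].r[r].weight *
                     rectSum(ii, stride, x + weaks[k].r[r].x, y + weaks[k].r[r].y,
                             weaks[k].r[r].w, weaks[k].r[r].h);
            stageSum += (f < weaks[k].threshold) ? weaks[k].left : weaks[k].right;
        }
        pass = stageSum >= stages[s].threshold;
    }
    hits[y * width + x] = pass ? 1 : 0;
}

// Host driver on a tiny synthetic input: 4x4 image of ones, one stage, one feature.
int runTinyExample()
{
    const int W = 4, H = 4, win = 2, stride = W + 1;
    std::vector<int> ii((W + 1) * (H + 1), 0);
    for (int y = 1; y <= H; ++y)
        for (int x = 1; x <= W; ++x)
            ii[y * stride + x] = x * y;   // integral image of an all-ones image

    Weak wk;
    std::memset(&wk, 0, sizeof(wk));
    Rect r0 = {0, 0, 2, 2, 1.0f};         // covers the whole 2x2 window
    wk.r[0] = r0;
    wk.threshold = 3.0f; wk.left = 0.0f; wk.right = 1.0f;
    Stage st = {0, 1, 0.5f};

    int* dII; Weak* dWeak; Stage* dStage; unsigned char* dHits;
    cudaMalloc(&dII, ii.size() * sizeof(int));
    cudaMalloc(&dWeak, sizeof(Weak));
    cudaMalloc(&dStage, sizeof(Stage));
    cudaMalloc(&dHits, W * H);
    cudaMemcpy(dII, ii.data(), ii.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dWeak, &wk, sizeof(Weak), cudaMemcpyHostToDevice);
    cudaMemcpy(dStage, &st, sizeof(Stage), cudaMemcpyHostToDevice);
    cudaMemset(dHits, 0, W * H);          // positions where no window fits stay 0

    dim3 block(16, 16), grid((W + 15) / 16, (H + 15) / 16);
    evalCascade<<<grid, block>>>(dII, W, H, dStage, 1, dWeak, win, dHits);

    unsigned char hits[W * H];
    cudaMemcpy(hits, dHits, W * H, cudaMemcpyDeviceToHost);   // waits for the kernel
    int count = 0;
    for (int i = 0; i < W * H; ++i) count += hits[i];
    cudaFree(dII); cudaFree(dWeak); cudaFree(dStage); cudaFree(dHits);
    return count;
}
```

On this synthetic input every 2x2 window sums to 4, so all nine window positions pass and runTinyExample() returns 9. With a real cascade, neighbouring threads leave at different stages, so warp divergence and scattered integral-image reads are likely where most of the time goes. Keeping the stage and feature tables in constant or shared memory may be worth trying.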

Hi AkashNebhwani,

Some threads have discussed it, but I’m not sure whether anyone has successfully built a Viola-Jones face-detection demo in CUDA.

Besides, you could try putting the TK1 CPU and GPU into their maximum performance state to see whether that brings any improvement; see http://elinux.org/Jetson/Performance

Thanks