Accelerating an image filter with OpenACC

Hello Mat,


Sorry for the repost; I really appreciate your guidance here. I have this filter written with OpenCV. The program compiles for a multicore architecture and gives a ~4x speedup (with 4 cores). When I try to use GPU acceleration on my K40, execution slows down instead. This is the repo with the code: https://github.com/manu3193/DNLM-P/tree/opencv-fft-openacc/src.

It takes around 20 s to denoise a 256x256-pixel image on a quad-core Xeon with AVX2 at 3.3 GHz. The command I used is:

./nlmfilter_multicore_fft -w 7 -n 3 -s 0.5 256x256.png

The PGI compiler displays the following information:

DNLM_OpenACC(const float *, int, const float *, int, float *, int, int, int, int, int, int, int, int, int, float):
     34, Generating Multicore code
         36, #pragma acc loop gang
     66, Accelerator restriction: size of the GPU copy of pWindowIJCorr,pEuclDist is unknown
         Loop is parallelizable
     68, Loop is parallelizable
     84, Accelerator restriction: size of the GPU copy of pEuclDist is unknown
         Loop is parallelizable
     86, Loop is parallelizable
     97, Accelerator restriction: size of the GPU copy of pEuclDist is unknown
         Loop is parallelizable
     99, Loop is parallelizable

Thanks in advance.

Hi manu3193,

I’m mentoring an all-day GPU Hackathon today and tomorrow, and I’m booked with meetings on Thursday. I’ll try to find some time to take a look, but it may not be until Friday before I can do anything in depth.

In the meantime, have you tried profiling the code? It would be useful to understand where the performance bottlenecks are.

First, try the simple PGI runtime profiler by setting the environment variable PGI_ACC_TIME=1. This will show whether the bottleneck is data movement or compute.

Next, use Nsight Compute to get a detailed profile of the kernel to understand what’s happening.

Note that K40s have limited double-precision floating-point units. Later generations greatly improve on this, but if your application allows it, you could try converting to single precision.
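
On the "size of the GPU copy of ... is unknown" messages in your compiler output: with raw pointers the compiler generally cannot tell how many elements to transfer, so the usual remedy is to give the array extent explicitly in a data clause. Below is a minimal sketch of that idea, not code from your repo. The function scale_image and its parameters are hypothetical, and it assumes the buffers are contiguous with rows * cols floats. It also stays in single precision throughout, in line with the note above.

```cpp
#include <cassert>

// Hypothetical helper (not from the DNLM-P repo): multiplies every pixel
// of a contiguous single-precision image by a weight.
void scale_image(const float *in, float *out, int rows, int cols, float weight)
{
    assert(rows > 0 && cols > 0);
    const int n = rows * cols;  // total element count of each buffer

    // The array sections in[0:n] and out[0:n] state the start index and the
    // length, so the compiler knows the size of the device copies.
    // Without OpenACC enabled the pragma is ignored and the loop runs serially.
    #pragma acc parallel loop collapse(2) copyin(in[0:n]) copyout(out[0:n])
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            out[i * cols + j] = weight * in[i * cols + j];
        }
    }
}
```

If several kernels touch the same buffers, wrapping them in a single enclosing data region with the same array sections is the usual way to avoid copying the data back and forth for each kernel.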

-Mat