Hello Mat,
Sorry for the repost; I really appreciate your guidance here. I have this filter written with OpenCV. The program compiles for a multicore architecture, and there is a ~4x speedup with 4 cores. When I try to use GPU acceleration on my K40, the execution time gets slower instead. This is the repo with the code: DNLM-P/src at opencv-fft-openacc · manu3193/DNLM-P · GitHub.
It takes around 20 s to denoise a 256x256-pixel image on a quad-core Xeon with AVX2 at 3.3 GHz. The command I used is:
./nlmfilter_multicore_fft -w 7 -n 3 -s 0.5 256x256.png
The following is the information displayed by the PGI compiler:
DNLM_OpenACC(const float *, int, const float *, int, float *, int, int, int, int, int, int, int, int, int, float):
     34, Generating Multicore code
         36, #pragma acc loop gang
     66, Accelerator restriction: size of the GPU copy of pWindowIJCorr,pEuclDist is unknown
         Loop is parallelizable
     68, Loop is parallelizable
     84, Accelerator restriction: size of the GPU copy of pEuclDist is unknown
         Loop is parallelizable
     86, Loop is parallelizable
     97, Accelerator restriction: size of the GPU copy of pEuclDist is unknown
         Loop is parallelizable
     99, Loop is parallelizable
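My understanding is that the "size of the GPU copy ... is unknown" messages concern raw pointers whose extent the compiler cannot infer, and that the usual remedy is a data clause with an explicit array shape. Below is a minimal sketch of that pattern as I understand it. The function `square_elements` and its parameters are hypothetical and are not taken from the DNLM-P code; I have not verified that this is what the filter needs.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the DNLM-P repo): squares each element.
// The data clauses state an explicit shape [start:length] for each raw
// pointer, since the compiler cannot deduce how much memory they refer to.
// Built without OpenACC support, the pragma is ignored and the loop runs
// serially on the host.
void square_elements(const float *src, float *dst, int n)
{
    #pragma acc parallel loop copyin(src[0:n]) copyout(dst[0:n])
    for (int i = 0; i < n; ++i) {
        dst[i] = src[i] * src[i];
    }
}
```

If that reading is right, pWindowIJCorr and pEuclDist would need similar shaped clauses (or an enclosing data region) around the loops at lines 66, 84 and 97.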
Thanks in advance.