Tiled loop increases the MSE of an NLM denoise filter result compared to the result obtained with the collapse clause

Hello guys,

I am trying to accelerate an image filter with OpenACC. The accelerated code is shown below:

#include <algorithm> 
#include <cmath>

void DNLM_OpenACC(const float* pSrcBorder, int stepBytesSrcBorder, const float* pSqrIntegralImage, int stepBytesSqrIntegral, float* pDst, int stepBytesDst, int windowRadius, int neighborRadius, int imageWidth, int imageHeight, int windowWidth, int windowHeight, int neighborWidth, int neighborHeight, float sigma_r)

{

    //Array to store window matrix for euclidean distance
    float * restrict pEuclDist = (float*) malloc(windowHeight * windowWidth * sizeof(float));
    float * restrict pWindowIJCorr = (float*) malloc(windowHeight * windowWidth * sizeof(float)); 

    #pragma acc data deviceptr(pSrcBorder[(windowHeight+2*(windowRadius+neighborRadius))*(windowWidth+2*(windowRadius+neighborRadius))], pSqrIntegralImage[(windowHeight+2*(windowRadius+neighborRadius)+1)*(windowWidth+2*(windowRadius+neighborRadius)+1)], pDst[imageHeight*imageWidth]) create(pEuclDist[0:windowHeight*windowWidth], pWindowIJCorr[windowHeight*windowWidth]) 
    {
        #pragma acc parallel vector_length(32) num_workers(2)   
        {
    	    #pragma acc loop gang collapse(2) private(pEuclDist[0:windowHeight*windowWidth], pWindowIJCorr[0:windowHeight*windowWidth])   
            for(int j = 0; j < imageHeight; j++)
    	    {
                for (int i = 0; i < imageWidth; i++)
                {
                    //Compute base address for each array
                    const int indexPdstBase = j * (stepBytesDst/sizeof(float));
                    const int indexWindowStartBase = j * stepBytesSrcBorder/sizeof(float);
                    const int indexNeighborIJBase = (j + windowRadius) * stepBytesSrcBorder/sizeof(float);
                    const int indexIINeighborIJBase = (j + windowRadius) * stepBytesSqrIntegral/sizeof(float);
                    const int indexIINeighborIJBaseWOffset = (j + windowRadius + neighborWidth) * stepBytesSqrIntegral/sizeof(float);
                    //Get sum area of neighborhood IJ
                    const float sqrSumIJNeighborhood = pSqrIntegralImage[indexIINeighborIJBaseWOffset + (i + windowRadius + neighborWidth)]
                                                    + pSqrIntegralImage[indexIINeighborIJBase + (i + windowRadius)] 
                                                    - pSqrIntegralImage[indexIINeighborIJBase + (i + windowRadius + neighborWidth)]
                                                    - pSqrIntegralImage[indexIINeighborIJBaseWOffset + (i + windowRadius)];
                    //Compute start pointer for each array
                    const float * restrict pWindowStart = (float*) &pSrcBorder[indexWindowStartBase + i];
                    const float * restrict pNeighborhoodStartIJ = (float*) &pSrcBorder[indexNeighborIJBase + i + windowRadius];
                    
                    //Compute window correlation with IJ neighborhood
                    #pragma acc loop worker tile(8,8)
                    for(int row_w = 0; row_w < windowHeight; row_w++)
                    {
                        for(int col_w = 0; col_w < windowWidth; col_w++)
                        {
                            float neighborCorrSum = 0;
                            #pragma acc loop vector collapse(2) reduction(+:neighborCorrSum)
                            for(int row_n = 0; row_n < neighborHeight; row_n++)
                            {
                                for(int col_n = 0; col_n < neighborWidth; col_n++)
                                {
                                    neighborCorrSum += pWindowStart[(col_w+col_n)+((row_w+row_n)*stepBytesSrcBorder/sizeof(float))] * pNeighborhoodStartIJ[col_n + (row_n * stepBytesSrcBorder/sizeof(float))]; 
                                }
                            }
                        pWindowIJCorr[col_w + row_w * windowWidth] = neighborCorrSum;
                       }
                    }
                   
                    #pragma acc loop collapse(2) 
                    for (int n = 0; n < windowHeight; n++)
                    {
                        for (int m = 0; m < windowWidth; m++)
                        {
                            //Compute base address for each array
                            const int indexIINeighborMNBase = (j + n) * stepBytesSqrIntegral/sizeof(float);
                            const int indexIINeighborMNBaseWOffset = (j + n + neighborWidth) * stepBytesSqrIntegral/sizeof(float);
                            
                            const float sqrSumMNNeighborhood = pSqrIntegralImage[indexIINeighborMNBaseWOffset + (i + m  + neighborWidth)]
                                                            + pSqrIntegralImage[indexIINeighborMNBase + (i + m )] 
                                                            - pSqrIntegralImage[indexIINeighborMNBase + (i + m  + neighborWidth)]
                                                            - pSqrIntegralImage[indexIINeighborMNBaseWOffset + (i + m )];
                            pEuclDist[n*windowWidth + m]= sqrSumIJNeighborhood + sqrSumMNNeighborhood -2*pWindowIJCorr[n*windowWidth + m];
                        }
                    }

                    float sumExpTerm = 0;
                    #pragma acc loop collapse(2) reduction(+:sumExpTerm)
                    for(int row = 0; row < windowHeight; row++)
                    {
                        for(int col = 0; col < windowWidth; col++)
                        {
                            float filterWeight = expf(pEuclDist[col + row * windowWidth] *  1/(-2*(sigma_r * sigma_r)));
                            pEuclDist[col + row * windowWidth] = filterWeight;
                            sumExpTerm += filterWeight;
                        }
                    }

                    float filterResult = 0;            
                    //Reduce(+) 
                    #pragma acc loop collapse(2) reduction(+:filterResult) 
                    for(int row = 0; row < windowHeight; row++)
                    {
                        for(int col = 0; col < windowWidth; col++)
                        {
                            float filterRes = pEuclDist[col + row * windowWidth] * pWindowStart[(col+neighborRadius) + (row+neighborRadius) * stepBytesSrcBorder/sizeof(float)];
                            filterResult += filterRes;                    
                        }
                    }
                    pDst[indexPdstBase + i] = filterResult/sumExpTerm;
                }   
            }
        }   
    }
}

As you can see, the filter has six nested for-loops: the first two iterate over the image pixels, the next two over a search window, and the remaining two over a pixel neighborhood. The problem I am facing is that when I use a tile clause on the search-window loop, the filter result has an MSE of 1, whereas using a collapse(2) clause instead gives an MSE of 3.51e-06 (as expected). The tiled version of the search-window loop is about 10x faster than the collapsed version, which I use as the reference.
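For reference, the only difference between the two versions is the schedule on the window-correlation loop pair; the collapsed reference version uses the following instead (a sketch, the loop body is unchanged from the code above):

                    //Compute window correlation with IJ neighborhood (collapsed reference schedule)
                    #pragma acc loop worker collapse(2)
                    for(int row_w = 0; row_w < windowHeight; row_w++)
                    {
                        for(int col_w = 0; col_w < windowWidth; col_w++)
                        {
                            float neighborCorrSum = 0;
                            #pragma acc loop vector collapse(2) reduction(+:neighborCorrSum)
                            for(int row_n = 0; row_n < neighborHeight; row_n++)
                            {
                                for(int col_n = 0; col_n < neighborWidth; col_n++)
                                {
                                    neighborCorrSum += pWindowStart[(col_w+col_n)+((row_w+row_n)*stepBytesSrcBorder/sizeof(float))] * pNeighborhoodStartIJ[col_n + (row_n * stepBytesSrcBorder/sizeof(float))];
                                }
                            }
                            pWindowIJCorr[col_w + row_w * windowWidth] = neighborCorrSum;
                        }
                    }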

I am working with a K40c with CUDA 10.1, PGI 14.04 and driver version 418.87.01.

Here is the compiler output for each version:

Collapsed version

DNLM_OpenACC(const float *, int, const float *, int, float *, int, int, int, int, int, int, int, int, int, float):
     15, Generating create(pWindowIJCorr[:windowWidth*windowHeight],pEuclDist[:windowWidth*windowHeight])
     17, Generating Tesla code
         19, #pragma acc loop gang collapse(2) /* blockIdx.x */
         21,   /* blockIdx.x collapsed */
         40, #pragma acc loop worker(2) collapse(2) /* threadIdx.y */
         42,   /* threadIdx.y collapsed */
         46, #pragma acc loop vector(32) collapse(2) /* threadIdx.x */
         48,   /* threadIdx.x collapsed */
             Generating reduction(+:neighborCorrSum)
         50, Vector barrier inserted for vector loop reduction
         51, Vector barrier inserted due to potential dependence out of a vector loop
         58, #pragma acc loop worker(2), vector(32) collapse(2) /* threadIdx.y threadIdx.x */
         60,   /* threadIdx.y threadIdx.x collapsed */
         76, #pragma acc loop worker(2), vector(32) collapse(2) /* threadIdx.y threadIdx.x */
         78,   /* threadIdx.y threadIdx.x collapsed */
             Generating reduction(+:sumExpTerm)
         89, #pragma acc loop worker(2), vector(32) collapse(2) /* threadIdx.y threadIdx.x */
         91,   /* threadIdx.y threadIdx.x collapsed */
             Generating reduction(+:filterResult)
     40, Loop is parallelizable
     42, Loop is parallelizable
     46, Loop is parallelizable
     48, Loop is parallelizable
     58, Loop is parallelizable
     60, Loop is parallelizable
     76, Loop is parallelizable
     78, Loop is parallelizable
     89, Loop is parallelizable
     91, Loop is parallelizable

Tiled version

DNLM_OpenACC(const float *, int, const float *, int, float *, int, int, int, int, int, int, int, int, int, float):
     15, Generating create(pWindowIJCorr[:windowWidth*windowHeight],pEuclDist[:windowWidth*windowHeight])
     17, Generating Tesla code
         19, #pragma acc loop gang collapse(2) /* blockIdx.x */
         21,   /* blockIdx.x collapsed */
         40, #pragma acc loop worker(64) tile(8,8) /* threadIdx.y */
         42,   /* threadIdx.y tiled */
         46, #pragma acc loop vector(32) collapse(2) /* threadIdx.x */
         48,   /* threadIdx.x collapsed */
             Generating reduction(+:neighborCorrSum)
         50, Vector barrier inserted for vector loop reduction
         51, Vector barrier inserted due to potential dependence out of a vector loop
         58, #pragma acc loop worker(2), vector(32) collapse(2) /* threadIdx.y threadIdx.x */
         60,   /* threadIdx.y threadIdx.x collapsed */
         76, #pragma acc loop worker(2), vector(32) collapse(2) /* threadIdx.y threadIdx.x */
         78,   /* threadIdx.y threadIdx.x collapsed */
             Generating reduction(+:sumExpTerm)
         89, #pragma acc loop worker(2), vector(32) collapse(2) /* threadIdx.y threadIdx.x */
         91,   /* threadIdx.y threadIdx.x collapsed */
             Generating reduction(+:filterResult)
     40, Loop is parallelizable
     42, Loop is parallelizable
     46, Loop is parallelizable
     48, Loop is parallelizable
     58, Loop is parallelizable
     60, Loop is parallelizable
     76, Loop is parallelizable
     78, Loop is parallelizable
     89, Loop is parallelizable
     91, Loop is parallelizable

How would you explain the effect of each clause on how the accelerated code is mapped onto the GPU?
What could be the source of the error? Could the tiled loop be introducing race conditions?

Thanks.

Hi manu3193,

The tiled version of the search-window loop is about 10x faster than the collapsed version, which I use as the reference.

I’m guessing the compiler is generating bad code, or there is some other issue, so it may not actually be performing all of the computation. Hence, I’d suggest ignoring any performance difference until you are getting the expected answer.

If you can provide a reproducing example, I can take a look. If it is a compiler issue, I’ll write up a problem report and see what can be done.

Though it is legal, I’ve never seen code that uses tile on a worker-level loop before, so it may not be unexpected that the compiler trips over it. Basically, this blocks the loops into 8x8 tiles (giving (N1/8) x (N2/8) tiles in total), with each worker executing one 8x8 tile at a time.
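Conceptually (a manual sketch only, ignoring the remainder iterations when the bounds are not multiples of 8), the tiled schedule corresponds to something like:

    // tile loops -- these are what get spread across the workers
    for (int row_t = 0; row_t < windowHeight; row_t += 8)
    {
        for (int col_t = 0; col_t < windowWidth; col_t += 8)
        {
            // element loops -- one 8x8 tile, executed by a single worker
            for (int row_w = row_t; row_w < row_t + 8; row_w++)
            {
                for (int col_w = col_t; col_w < col_t + 8; col_w++)
                {
                    // ... original loop body for (row_w, col_w) ...
                }
            }
        }
    }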

The main difference between tile and collapse is how the workers group the data. With tile, each worker executes over a contiguous block of the iteration space, while with collapse, the flattened iterations are dealt out to the workers in sequence: for 2 workers, one executes the odd indices and the other the even. With 4 workers, worker 0 executes indices 0, 4, 8, 12, …, worker 1 executes 1, 5, 9, 13, …, and so forth.

In the case of this code, it probably doesn’t matter whether the workers are tiled or collapsed, since your fastest dimension is “col_n”, which is parallelized using vector, so the data accesses are coalesced.

Some other schedules to try are:

  1. Only using worker on the “col_w” loop so that “(col_w+col_n)” are grouped.
  2. Adjust the number of workers from 2 to 4, 8, 16, 32, or 64. Though I suspect this won’t gain much. Maybe 4, but probably not the others.
  3. Run the inner loops sequentially and replace “worker collapse(2)” with “vector collapse(2)” but use a larger vector length (like 128).

For #3, this may help when “neighborHeight” and “neighborWidth” are small (like <32), since there’s overhead associated with a reduction. Though you’d lose some memory coalescing, which may offset the gain. It’s easy to experiment, so worth a try.
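On the window-correlation loops, #3 would look roughly like this (a sketch only; vector_length would be raised to 128 on the parallel construct, and the other window loops would get the same treatment):

                    #pragma acc loop vector collapse(2)
                    for(int row_w = 0; row_w < windowHeight; row_w++)
                    {
                        for(int col_w = 0; col_w < windowWidth; col_w++)
                        {
                            float neighborCorrSum = 0;
                            // neighborhood loops now run sequentially within each vector lane
                            for(int row_n = 0; row_n < neighborHeight; row_n++)
                            {
                                for(int col_n = 0; col_n < neighborWidth; col_n++)
                                {
                                    neighborCorrSum += pWindowStart[(col_w+col_n)+((row_w+row_n)*stepBytesSrcBorder/sizeof(float))] * pNeighborhoodStartIJ[col_n + (row_n * stepBytesSrcBorder/sizeof(float))];
                                }
                            }
                            pWindowIJCorr[col_w + row_w * windowWidth] = neighborCorrSum;
                        }
                    }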

Let me know the approximate sizes for each of the loop bounds and I may be able to offer some other ideas to try.

-Mat

Hi mkcolg,

I tried what you suggested, and the image filter now executes as expected. However, the execution time increases. The best execution time I have managed so far is 136.9 s for a 4096x4096-pixel input image, using the collapse(2) clause on each of the for-loop pairs. The inner (neighborhood) loops are 7x7 (neighborWidth x neighborHeight) and the window loops are 21x21 (windowWidth x windowHeight).

You can find the code here: https://github.com/manu3193/DNLM-P/tree/npp-openacc-fft/src. You will need to modify the Makefile so that it points to your CUDA (I use CUDA 10.1) and OpenCV (3.4) installations for libraries and headers.

The image to filter is in the repo, along with the expected filtered result. The program prints the MSE after filtering the image. The binary is nlmfilter, and you can filter an image by giving its filename as an argument: ./nlmfilter 4096x4096.png

I would appreciate any insights on how to properly accelerate the code.

Thanks,

Hi manu3193,

I was able to get your program built and running after installing OpenCV.

Using a V100 with the PGI 19.10 noLLVM pgc++ compiler, I see a time of about 4.5 seconds.

Given the sizes you note, I thought it would be better to not vectorize the neighborhood loops, just the window loops. This got the time down to about 1 second.

I then hoisted the repeated “stepByte…/sizeof(float)” out of the loops and stored them into local variables. This improved the time to 0.715 seconds.

Next I ran the code through Nsight Compute (https://docs.nvidia.com/nsight-compute/NsightComputeCli/index.html). The achieved occupancy is good at 90%, as is the cache utilization at 91%. The main bottleneck is that warps are stalling on the “LG Throttle”:

From the NSight-Compute docs, LG Throttle means:

Warp was stalled waiting for the L1 instruction queue for local and global (LG) memory operations to be not full. Typically this stall occurs only when executing local or global memory instructions extremely frequently. If applicable, consider combining multiple lower-width memory operations into fewer wider memory operations and try interleaving memory operations and math instructions.

I’m not sure how to improve the performance further, given that the cache loads are already optimal, there are too few math operations to interleave, and there isn’t a good way to widen the memory operations.

Here’s a diff between your original file and my edits. Note that I also include “accelmath.h” to get the device prototype of “expf”. This is only needed with later GNU installs, due to a change in g++'s cmath header file.

% diff -u DNLMFilter_OpenACC.org.cpp DNLMFilter_OpenACC.cpp
--- DNLMFilter_OpenACC.org.cpp  2019-12-10 09:51:25.904371540 -0800
+++ DNLMFilter_OpenACC.cpp      2019-12-10 10:42:58.988334834 -0800
@@ -1,5 +1,6 @@
 #include <algorithm>
 #include <cmath>
+#include <accelmath.h>

 void DNLM_OpenACC(const float* pSrcBorder, int stepBytesSrcBorder, const float* pSqrIntegralImage, int stepBytesSqrIntegral, float* pDst, int stepBytesDst, int windowRadius, int neighborRadius, int imageWidth, int imageHeight, int windowWidth, int windowHeight, int neighborWidth, int neighborHeight, float sigma_r)

@@ -8,10 +9,13 @@
     //Array to store window matrix for euclidean distance
     float * restrict pEuclDist = (float*) malloc(windowHeight * windowWidth * sizeof(float));
     float * restrict pWindowIJCorr = (float*) malloc(windowHeight * windowWidth * sizeof(float));
+    const int stepBD = stepBytesDst/sizeof(float);
+    const int stepBSI = stepBytesSqrIntegral/sizeof(float);
+    const int stepBSB = stepBytesSrcBorder/sizeof(float);

-    #pragma acc data deviceptr(pSrcBorder[(windowHeight+2*(windowRadius+neighborRadius))*(windowWidth+2*(windowRadius+neighborRadius))], pSqrIntegralImage[(windowHeight+2*(windowRadius+neighborRadius)+1)*(windowWidth+2*(windowRadius+neighborRadius)+1)], pDst[imageHeight*imageWidth]) create(pEuclDist[0:windowHeight*windowWidth], pWindowIJCorr[windowHeight*windowWidth])
+    #pragma acc data deviceptr(pSrcBorder[(windowHeight+2*(windowRadius+neighborRadius))*(windowWidth+2*(windowRadius+neighborRadius))], pSqrIntegralImage[(windowHeight+2*(windowRadius+neighborRadius)+1)*(windowWidth+2*(windowRadius+neighborRadius)+1)], pDst[imageHeight*imageWidth])
     {
-        #pragma acc parallel vector_length(32)
+        #pragma acc parallel
         {
            #pragma acc loop gang collapse(2) private(pEuclDist[0:windowHeight*windowWidth], pWindowIJCorr[0:windowHeight*windowWidth])
             for(int j = 0; j < imageHeight; j++)
@@ -19,11 +23,11 @@
                 for (int i = 0; i < imageWidth; i++)
                 {
                     //Compute base address for each array
-                    const int indexPdstBase = j * (stepBytesDst/sizeof(float));
-                    const int indexWindowStartBase = j * stepBytesSrcBorder/sizeof(float);
-                    const int indexNeighborIJBase = (j + windowRadius) * stepBytesSrcBorder/sizeof(float);
-                    const int indexIINeighborIJBase = (j + windowRadius) * stepBytesSqrIntegral/sizeof(float);
-                    const int indexIINeighborIJBaseWOffset = (j + windowRadius + neighborWidth) * stepBytesSqrIntegral/sizeof(float);
+                    const int indexPdstBase = j * (stepBD);
+                    const int indexWindowStartBase = j * stepBSB;
+                    const int indexNeighborIJBase = (j + windowRadius) * stepBSB;
+                    const int indexIINeighborIJBase = (j + windowRadius) * stepBSI;
+                    const int indexIINeighborIJBaseWOffset = (j + windowRadius + neighborWidth) * stepBSI;
                     //Get sum area of neighborhood IJ
                     const float sqrSumIJNeighborhood = pSqrIntegralImage[indexIINeighborIJBaseWOffset + (i + windowRadius + neighborWidth)]
                                                     + pSqrIntegralImage[indexIINeighborIJBase + (i + windowRadius)]
@@ -34,45 +38,42 @@
                     const float * restrict pNeighborhoodStartIJ = (float*) &pSrcBorder[indexNeighborIJBase + i + windowRadius];

                     //Compute window correlation with IJ neighborhood
-                    #pragma acc loop collapse(2)
+                    #pragma acc loop vector collapse(2)
                     for(int row_w = 0; row_w < windowHeight; row_w++)
                     {
                         for(int col_w = 0; col_w < windowWidth; col_w++)
                         {
                             float neighborCorrSum = 0;
-                            #pragma acc loop vector collapse(2) reduction(+:neighborCorrSum)
                             for(int row_n = 0; row_n < neighborHeight; row_n++)
                             {
                                 for(int col_n = 0; col_n < neighborWidth; col_n++)
                                 {
-                                    neighborCorrSum += pWindowStart[(col_w+col_n)+((row_w+row_n)*stepBytesSrcBorder/sizeof(float))] * pNeighborhoodStartIJ[col_n + (row_n * stepBytesSrcBorder/sizeof(float))];
+                                    neighborCorrSum += pWindowStart[(col_w+col_n)+((row_w+row_n)*stepBSB)] * pNeighborhoodStartIJ[col_n + (row_n * stepBSB)];
                                 }
                             }
-                        pWindowIJCorr[col_w + row_w * windowWidth] = neighborCorrSum;
+                            pWindowIJCorr[col_w + row_w * windowWidth] = neighborCorrSum;
                        }
                     }

-                    #pragma acc loop collapse(2)
+                    #pragma acc loop vector collapse(2)
                     for (int n = 0; n < windowHeight; n++)
                     {
-
                         for (int m = 0; m < windowWidth; m++)
                         {
                             //Compute base address for each array
-                            const int indexIINeighborMNBase = (j + n) * stepBytesSqrIntegral/sizeof(float);
-                            const int indexIINeighborMNBaseWOffset = (j + n + neighborWidth) * stepBytesSqrIntegral/sizeof(float);
+                            const int indexIINeighborMNBase = (j + n) * stepBSI;
+                            const int indexIINeighborMNBaseWOffset = (j + n + neighborWidth) * stepBSI;

                             const float sqrSumMNNeighborhood = pSqrIntegralImage[indexIINeighborMNBaseWOffset + (i + m  + neighborWidth)]
                                                             + pSqrIntegralImage[indexIINeighborMNBase + (i + m )]
                                                             - pSqrIntegralImage[indexIINeighborMNBase + (i + m  + neighborWidth)]
                                                             - pSqrIntegralImage[indexIINeighborMNBaseWOffset + (i + m )];
-                            //#pragma acc loop seq
                             pEuclDist[n*windowWidth + m]= sqrSumIJNeighborhood + sqrSumMNNeighborhood -2*pWindowIJCorr[n*windowWidth + m];
                         }
                     }

                     float sumExpTerm = 0;
-                    #pragma acc loop collapse(2) reduction(+:sumExpTerm)
+                    #pragma acc loop vector collapse(2) reduction(+:sumExpTerm)
                     for(int row = 0; row < windowHeight; row++)
                     {
                         for(int col = 0; col < windowWidth; col++)
@@ -85,12 +86,12 @@

                     float filterResult = 0;
                     //Reduce(+)
-                    #pragma acc loop collapse(2) reduction(+:filterResult)
+                    #pragma acc loop vector collapse(2) reduction(+:filterResult)
                     for(int row = 0; row < windowHeight; row++)
                     {
                         for(int col = 0; col < windowWidth; col++)
                         {
-                            float filterRes = pEuclDist[col + row * windowWidth] * pWindowStart[(col+neighborRadius) + (row+neighborRadius) * stepBytesSrcBorder/sizeof(float)];
+                            float filterRes = pEuclDist[col + row * windowWidth] * pWindowStart[(col+neighborRadius) + (row+neighborRadius) * stepBSB];
                             filterResult += filterRes;
                         }
                     }

Note that I noticed an issue when compiling your code with pgc++ LLVM (a compiler segv) and have reported it to our engineers as TPR #27934.

Hope this helps,
Mat

Hello again Mat,


The filter version I am working with this time uses a mix of CUDA libraries and OpenACC accelerated regions, which I have successfully compiled using the “-ta=tesla:nordc” flag, as shown in the Makefile below:

# define the compilers to use
CC = nvcc
PGI = pgcc
PGI++ = pgc++
# define any compile-time flags
CFLAGS = -std=c++11 -ccbin=$(PGI++) -Xcompiler "-ta=tesla:cc35 -ta=tesla:nordc -Mcuda" -cudart static -use_fast_math -O3
CFLAGSPGI = -acc -Minfo=accel -ta=tesla:cc35 -ta=tesla:nordc -fast -Minline -Mvect -O3
LIBS = -lnppial_static -lnppitc_static -lnppidei_static -lnppif_static -lnppisu_static -lnppist_static -lnpps_static -lnppc_static -lculibos -lopencv_core -lopencv_highgui -lopencv_imgcodecs
LDFLAGS = -L$(CUDA_HOME)/lib64 -L$(OPENCV_HOME)/lib

# define include path flags 
INCLUDES= -I../include -I$(CUDA_HOME)/include -I$(OPENCV_HOME)/include

# define the executable file
MAIN = nlmfilter_k40_fft

all: clean $(MAIN)

.PHONY: clean

$(MAIN): fft.o DNLMFilter_OpenACC.o ParallelDNLM.o
        $(CC) $(CFLAGS) -o $@ $^ $(LDFLAGS) $(INCLUDES)  $(LIBS)  

ParallelDNLM.o: ParallelDNLM.cpp
        $(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@

DNLMFilter_OpenACC.o: DNLMFilter_OpenACC.c
        $(PGI) $(CFLAGSPGI) -c $< -o $@ $(INCLUDES)

fft.o: fft.c
        $(PGI) $(CFLAGSPGI) -c $< -o $@ $(INCLUDES)

clean:
        rm -rf $(MAIN) *.o

Now that I have added several device routines that perform the image correlation in the frequency domain with a DFT algorithm, I am facing problems with the compilation. I added routine pragmas above each routine prototype in the fft.h file, and the corresponding pragmas on the chosen for-loops in the implementation fft.c. If I try to compile using the same flags as above, I get this result:

https://github.com/manu3193/DNLM-P/issues/4
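For reference, the routine declarations in fft.h follow this pattern (the function name and signature below are placeholders to illustrate the pragma placement, not the actual routines):

    /* fft.h -- placeholder prototype showing where the routine pragma goes */
    #pragma acc routine seq
    void dft_correlation(const float *pIn, float *pOut, int length);

The matching #pragma acc routine seq is repeated above the corresponding definition in fft.c.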

I found some hints about separate compilation for device and host code, so I made modifications to the Makefile to add “-rdc=true -Xcompiler -ta=tesla:rdc -dc” to the host compile step, as you can see in the modified Makefile:

# define the compilers to use
CC = nvcc
PGI = pgcc
PGI++ = pgc++
# define any compile-time flags
CFLAGS_LINK = -acc -Minfo=all -ta=tesla:cc35  -Mcuda -O3
CFLAGS = -std=c++11 -arch=sm_35 -rdc=true -ccbin=$(PGI++) -Xcompiler "-ta=tesla:cc35 -ta=tesla:rdc -Mcuda" -cudart static -use_fast_math -O3
CFLAGSPGI = -acc -Minfo=all -ta=tesla:cc35 -fast -Minline -Mvect -O3
LIBS = -lnppial_static -lnppitc_static -lnppidei_static -lnppif_static -lnppisu_static -lnppist_static -lnpps_static -lnppc_static -lculibos -lopencv_core -lopencv_highgui -lopencv_imgcodecs
LDFLAGS = -L$(CUDA_HOME)/lib64 -L$(OPENCV_HOME)/lib

# define include path flags 
INCLUDES= -I../include -I$(CUDA_HOME)/include -I$(OPENCV_HOME)/include

# define the executable file
MAIN = nlmfilter_k40_fft

all: clean $(MAIN)

.PHONY: clean

$(MAIN): fft.o DNLMFilter_OpenACC.o ParallelDNLM.o
        $(PGI++) $(CFLAGS_LINK) -o $@ $^ $(LDFLAGS) $(INCLUDES)  $(LIBS)  

ParallelDNLM.o: ParallelDNLM.cpp
        $(CC) $(CFLAGS) $(INCLUDES) -dc $< -o $@

DNLMFilter_OpenACC.o: DNLMFilter_OpenACC.c
        $(PGI) $(CFLAGSPGI) -c $< -o $@ $(INCLUDES)

fft.o: fft.c
        $(PGI) $(CFLAGSPGI) -c $< -o $@ $(INCLUDES)

clean:
        rm -rf $(MAIN) *.o

Compilation result:
https://github.com/manu3193/DNLM-P/issues/5

The compilation now seems to succeed, but the program fails at run time: I am getting a lot of invalid writes to memory.


Could you take a look, in case I am missing something in the compilation or there is something wrong with the pragmas, data regions, or parallel regions? The code repo is here: https://github.com/manu3193/DNLM-P/tree/npp-openacc-fft/src

Thanks in advance.

Hi manu3193,

Glad I kept my OpenCV build around!

I assumed you were using the npp-openacc-fft branch, so that is what I am using. I had a few issues compiling your code with 20.1 but was able to work around them. Notably, in 20.1 we switched to a new C compiler (in 19.10 it’s called pgcc18) which isn’t picking up the C99 definition of “conj”, so it can’t find the device definition. I’ll look at this later, but for now I reverted to the older C compiler, pgcc11.

Also, I modified your Makefile so I could use RDC, by using pgcc11 to link. nordc is only needed if you use a non-PGI driver to link.

Here’s what I used:

CC = nvcc
PGI = pgcc11
PGI++ = pgc++
# define any compile-time flags
CFLAGS = -std=c++11 -ccbin=$(PGI++) -Xcompiler "-ta=tesla:cc70 -ta=tesla -Mcuda -mcmodel=medium" -cudart static -use_fast_math -O3
CFLAGSPGI = -acc -Minfo=accel -ta=tesla:cc70 -ta=tesla -fast -Minline -Mvect -O3 -Mcuda -mcmodel=medium
LIBS = -lnppial_static -lnppitc_static -lnppidei_static -lnppif_static -lnppisu_static -lnppist_static -lnpps_static -lnppc_static -lculibos -lopencv_core -lopencv_highgui -lopencv_imgcodecs
LDFLAGS = -L$(CUDA_HOME)/lib64 -L$(OPENCV_HOME)/lib -pgc++libs

# define include path flags
INCLUDES= -I../include -I$(CUDA_HOME)/include -I$(OPENCV_HOME)

# define the executable file
MAIN = nlmfilter_k40_fft

all: clean $(MAIN)

.PHONY: clean

$(MAIN): fft.o DNLMFilter_OpenACC.o ParallelDNLM.o
        $(PGI) $(CFLAGSPGI) -o $@ $^ $(LDFLAGS) $(INCLUDES)  $(LIBS)

ParallelDNLM.o: ParallelDNLM.cpp
        $(CC) $(CFLAGS) $(INCLUDES) -c $< -o $@

From there, I was able to get a build, and I now get an illegal memory error in the DNLM_OpenACC kernel. It looks like the memory error occurs anytime “pSqrIntegralImage” is accessed. At least in my case, “pSqrIntegralImage” appears to be a NULL pointer.

Tracing this back to where it’s allocated in filterDNLM, it seems that the call to nppiMalloc_32f_C1 is returning a NULL pointer.

.. adding print statements after the calls ...
     pSrcImage8u = nppiMalloc_8u_C1(imageROISize.width, imageROISize.height, &stepBytesSrc8u);
printf("imageROIwBorderSize.width=%d imageROIwBorderSize.height=%d\n",imageROIwBorderSize.width, imageROIwBorderSize.height);
    pSrcwBorderImage32f = nppiMalloc_32f_C1(imageROIwBorderSize.width, imageROIwBorderSize.height, &stepBytesSrcwBorder32f);
printf("pSrcwBorderImage32f = %p\n",pSrcwBorderImage32f);
    pIntegralImage32f = nppiMalloc_32f_C1(integralImageROISize.width, integralImageROISize.height, &stepBytesIntegral);
printf("pIntegralImage32f = %p\n",pIntegralImage32f);
... 
% nlmfilter_k40_fft 4096x4096.png
imageROIwBorderSize.width=8773090 imageROIwBorderSize.height=8773090
pSrcwBorderImage32f = (nil)
pIntegralImage32f = (nil)
launch CUDA kernel  file=DNLM-P/src/DNLMFilter_OpenACC.c function=DNLM_OpenACC line=34 device=0 threadid=1 num_gangs=65535 num_workers=1 vector_length=128 grid=65535 block=128
Failing in Thread:1
call to cuStreamSynchronize returned error 700: Illegal address during kernel execution

I’ve not used NPP myself, so I can’t really offer any advice, but assuming I’ve correctly replicated your issue, you may want to start there.

-Mat

Hi again Mat,

I have been struggling with the null pointer returned from the NPP allocation function for a week. I noticed that if I force no optimization in the compilation with -O0, the problem goes away and the program runs. However, the run time is much slower than the non-accelerated program. Do you have an idea why this is happening? Could you give some insights on how to optimize the parallelization?

Thanks in advance.

Hi manu3193,

I’m wondering if the problem is that you’re trying to allocate too large an array.

I updated my git repo to your latest version. The error I’m seeing is an “out-of-memory” error. While I might be running it incorrectly, when I print out the “paddedSize”, it’s defaulting to 32k, with larger values when using larger window values.

You then have 7 private arrays of size “paddedSize*paddedSize”, or 1GB each, 7 GB per gang. This is way too large.

Are you computing the paddedSize correctly? Should it really be 32K?

-Mat

Yes, you are right. It should be 32 if you are using the same parameters. It’s already fixed; now you should be able to run it.

It still fails in the same way, and paddedSize is still 32K. But I do now see the following errors:

problem alocating memory pSrcwBorderImage32f 0
problem alocating memory pIntegralImage32f 0

Which I assume is what you’re struggling with. Unfortunately, I’m not going to be much help here, since this has to do with how you’re using NPP, with which I have no experience. For OpenACC issues I’m your best source of help, but not for NPP.

I tried my best to research how to use these calls, but most NPP examples I could find use cudaMalloc rather than the nppiMalloc routines.

What I would suggest is to reduce your test case down to the smallest example that still reproduces the problem, then post your question and code on NVIDIA’s DevTalk forums, in particular the Accelerated Libraries forum: https://forums.developer.nvidia.com/c/accelerated-computing/gpu-accelerated-libraries/12

Wish I could do more, but NPP is out of scope for this forum. Once you have that working, we can come back and work on the OpenACC portion of the code.

-Mat

Hello again Mat,


I replaced the NPP calls with OpenCV functions so I can focus on the OpenACC parallel region. For now the program compiles for the multicore target; maybe it is better to start from there and then work toward an appropriate speedup on the target K40c GPU. I created a new git branch: https://github.com/manu3193/DNLM-P/tree/opencv-fft-openacc/src.

It takes 20 s to denoise a 256x256-pixel image on a quad-core Xeon with AVX2 at 3.3 GHz. The command I used is:

./nlmfilter_multicore_fft -w 7 -n 3 -s 0.5 256x256.png
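For reference, the OpenACC file is compiled for the multicore target; a typical pgcc invocation would be along these lines (the exact flags in my Makefile may differ slightly):

    pgcc -acc -ta=multicore -Minfo=accel -fast -O3 -c DNLMFilter_OpenACC.c -o DNLMFilter_OpenACC.o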

The following is the output reported by the PGI compiler:

DNLM_OpenACC(const float *, int, const float *, int, float *, int, int, int, int, int, int, int, int, int, float):
     34, Generating Multicore code
         36, #pragma acc loop gang
     66, Accelerator restriction: size of the GPU copy of pWindowIJCorr,pEuclDist is unknown
         Loop is parallelizable
     68, Loop is parallelizable
     84, Accelerator restriction: size of the GPU copy of pEuclDist is unknown
         Loop is parallelizable
     86, Loop is parallelizable
     97, Accelerator restriction: size of the GPU copy of pEuclDist is unknown
         Loop is parallelizable
     99, Loop is parallelizable

Hello Mat,

Just a quick follow-up to see whether I can get this filter properly accelerated.

Thanks for your help,