How to not parallelize inner loops in OpenACC ?

I am a beginner in doing GPU programming with OpenACC. I was trying to do a direct convolution. Convolution consists of 6 nested loops. I only want the first loop to be parallelized. I gave the pragma #pragma acc loop for the first loop and #pragma acc loop seq for the rest. But the output that I am getting is not correct. Is the approach taken by me to parallelize the loop correct ? Specifications for the convolution: Input channels-3, Input Size- 224X224X3, Output channels- 64, Output Size- 111X111X64, filter size- 3X3X3X64.

Code snippet is as follows:

# include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "squeezenet_params.h"
#include "dog.h"

void conv3x3(
  const int input_channels, const int input_size,
  const int pad, const int stride, const int start_channel,
  const int output_size, const float* restrict input_im, const float* restrict filter_weight,
  const float* restrict filter_bias, float* restrict output_im){
    #pragma acc data copyin (input_im[0:150527],filter_weight[0:1727],filter_bias[0:63]) copyout(output_im[0:788543])
    {
    #pragma acc parallel
    {
      #pragma acc loop
      for(int p=0;p<64;++p){
        filter_weight += p * input_channels * 9;
        float bias = filter_bias[p];
        output_im += (start_channel + p) * output_size * output_size;

        //loop over output feature map
        #pragma acc loop seq
        for(int i = 0; i < output_size; i++)
        {
          #pragma acc loop seq
          for(int j = 0; j < output_size; j++)
          {
            //compute one element in the output feature map
            float tmp = bias;

            //compute dot product of 2 input_channels x 3 x 3 matrix
            #pragma acc loop seq
            for(int k = 0; k < input_channels; k++)
            {
              #pragma acc loop seq
              for(int l = 0; l < 3; l++)
              {
                int h = i * stride + l - pad;
                #pragma acc loop seq
                for(int m = 0; m < 3; m++)
                {
                  int w = j * stride + m - pad;
                  if((h >= 0) && (h < input_size) && (w >= 0) && (w < input_size))
                  {
                    tmp += input_im[k * input_size * input_size + (i * stride + l - pad) * input_size + j * stride + m - pad] \
                                     * filter_weight[9 * k + 3 * l + m];
                  }
                }
              }
            }

            //add relu activation after conv
            output_im[i * output_size + j] = (tmp > 0.0) ? tmp : 0.0;
          }
        }
      }
    }
    }
  }

  void main(){
    float * result = (float*)malloc(sizeof(float) * (1 * 64 * 111 * 111));
    conv3x3(3,224,0,2,0,111,sample,conv1_weight,conv1_bias,result);
    for(int i=0;i<64 * 111 * 111;++i){
      //if(result[i]>0)
        printf("%f:%d\n",result[i],i);
    }
  }

Hi mayur2110,

I saw your post over on StackOverflow, but will respond here first and then respond over there once we have a better understanding of the issue.

Can you please provide the two header files, squeezenet_params.h and dog.h? I can’t compile your code since these are missing.

If you can’t provide the header files, can you please post the output from the compiler feedback? (i.e. the compilation output including the flag “-Minfo=accel”). Using “loop seq” should cause the compiler to not parallelize these loops and the compiler feedback messages will show this.

Though, in just looking at you’re code, the most likely problem is how you’re using “output_im”. This is a global shared pointer that you’re incrementing in the parallel loop. So when one thread increments it, it’s incremented for all threads. So most of the threads will be updating to the wrong section of “output_im”. Instead, you need to be using a pointer that’s private (local) to each thread who’s value is an offset into “output_im”.

-Mat

Hi mkcolg,

Thanks for your response. Following is the link for the header files.

https://drive.google.com/drive/folders/1a9XRjBTrEFIorrLTPFHS4atBOPrG886i

Hi mkcolg,

I followed your feedback and made following changes in the way I was accessing the matrices output_im and filter_weight in the kernel :

float * fw = (float*)(filter_weight + p * input_channels * 9);
float bias = filter_bias[p];
float * oi = (float*)(output_im + (start_channel + p) * output_size * output_size)

I am still not getting the correct output values. I made the OpenACC code (that I asked in the question) from my CUDA code. In CUDA, the code runs well and gives me the correct output values for the convolution.

Thanks.

Yes, the compiler feedback messages do show that only the loop at like 17 is getting parallelized:

% pgcc -ta=tesla -Minfo=accel test.c
"test.c", line 62: warning: return type of function "main" must be "int"
    void main(){
         ^

conv3x3:
     13, Generating copyin(filter_weight[:1727],filter_bias[:63],input_im[:150527]) [if not already present]
         Generating copyout(output_im[:788543]) [if not already present]
     15, Generating Tesla code
         17, #pragma acc loop gang, vector(64) /* blockIdx.x threadIdx.x */
         24, #pragma acc loop seq
         27, #pragma acc loop seq
         34, #pragma acc loop seq
         37, #pragma acc loop seq
         41, #pragma acc loop seq

Also as I suspected, your problem is with incrementing output_im. Again, this is being done in parallel so every thread is incrementing it so only really one thread will be using the correct offset.

Also, your code seg faults when run on host. Similar issue where incrementing output_im puts the pointer out of alignment when p=11.

The fix for both, is to compute an offset instead of incrementing output_im. Something like the following.

Note that I’ve also parallelized the i and j loops. The kernel is about 32x faster than if you only parallelize the outermost loop.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "squeezenet_params.h"
#include "dog.h"

void conv3x3(
  const int input_channels, const int input_size,
  const int pad, const int stride, const int start_channel,
  const int output_size, const float* restrict input_im, const float* restrict filter_weight,
  const float* restrict filter_bias, float* restrict output_im){

    #pragma acc data copyin (input_im[0:150527],filter_weight[0:1727],filter_bias[0:63]) copyout(output_im[0:788543])
    {
    #pragma acc parallel
    {
      #pragma acc loop gang
      for(int p=0;p<64;++p){
        filter_weight += p * input_channels * 9;
        float bias = filter_bias[p];
        int offset;
        offset = (start_channel + p) * output_size * output_size;

        //loop over output feature map
        #pragma acc loop vector collapse(2)
        for(int i = 0; i < output_size; i++)
        {
          for(int j = 0; j < output_size; j++)
          {
            //compute one element in the output feature map
            float tmp = bias;

            //compute dot product of 2 input_channels x 3 x 3 matrix
            #pragma acc loop seq
            for(int k = 0; k < input_channels; k++)
            {
              #pragma acc loop seq
              for(int l = 0; l < 3; l++)
              {
                int h = i * stride + l - pad;
                #pragma acc loop seq
                for(int m = 0; m < 3; m++)
                {
                  int w = j * stride + m - pad;
                  if((h >= 0) && (h < input_size) && (w >= 0) && (w < input_size))
                  {
                    tmp += input_im[k * input_size * input_size + (i * stride + l - pad) * input_size + j * stride + m - pad] \
                                     * filter_weight[9 * k + 3 * l + m];
                  }
                }
              }
            }

            //add relu activation after conv
            int idx = offset + (i * output_size + j);
            output_im[idx] = (tmp > 0.0) ? tmp : 0.0;
          }
        }
      }
    }
    }
  }

  void main(){
    float * result = (float*)malloc(sizeof(float) * (1 * 64 * 111 * 111));
    conv3x3(3,224,0,2,0,111,sample,conv1_weight,conv1_bias,result);
    for(int i=0;i<64 * 111 * 111;++i){
      //if(result[i]>0)
        printf("%f:%d\n",result[i],i);
    }
  }

Note that you’ll see slightly differing results (last digit) between the CPU and GPU due to rounding during the reduction.

-Mat

I compiled the code with the modifications that you mentioned with -acc flag and without -acc flag. The compilation without -acc flag (CPU) gave me the correct results which I got from my CUDA code. But still the output with -acc flag (GPU) is incorrect.

Ah, yes. I was only looking at the first part of the results. As they go further, they do diverge.

Looking again, I do see two additional issues.

I missed that “filter_weight” is also a pointer that you’re adding, and thus has the same race condition as output_im. This needs to be an offset as well. Note in the original version, your code will seg fault on the CPU since you are incrementing these pointers by a multiple of the index variable “p” rather than just the fixed value of “input_channels*9”. In other words, you’re incrementing the pointer by larger and larger chunks until the program starts reading off the end of the array and eventually seg faults.

The second issue is that you are off by one in your array shapes. In C/C++, the array shapes should be in the form “arr[start_element:number_of_elements]”. You have a range. To fix, add one to each of the number of elements.

% cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "squeezenet_params.h"
#include "dog.h"

void conv3x3(
  const int input_channels, const int input_size,
  const int pad, const int stride, const int start_channel,
  const int output_size, const float* restrict input_im, const float* restrict filter_weight,
  const float* restrict filter_bias, float* restrict output_im){

    #pragma acc data copyin (input_im[0:150528],filter_weight[0:1728],filter_bias[0:64]) copyout(output_im[0:788544])
    {
    #pragma acc parallel
    {
      #pragma acc loop gang
      for(int p=0;p<64;++p){
        float bias = filter_bias[p];
        int offset, foffset;
        offset = (start_channel + p) * output_size * output_size;
        foffset =  p * input_channels * 9;

        //loop over output feature map
        #pragma acc loop vector collapse(2)
        for(int i = 0; i < output_size; i++)
        {
          for(int j = 0; j < output_size; j++)
          {
            //compute one element in the output feature map
            float tmp = bias;

            //compute dot product of 2 input_channels x 3 x 3 matrix
            for(int k = 0; k < input_channels; k++)
            {
              for(int l = 0; l < 3; l++)
              {
                int h = i * stride + l - pad;
                for(int m = 0; m < 3; m++)
                {
                  int w = j * stride + m - pad;
                  if((h >= 0) && (h < input_size) && (w >= 0) && (w < input_size))
                  {
                    tmp += input_im[k * input_size * input_size + (i * stride + l - pad) * input_size + j * stride + m - pad] \
                                     * filter_weight[foffset + (9 * k + 3 * l + m)];
                  }
                }
              }
            }

            //add relu activation after conv
            int idx = offset + (i * output_size + j);
            output_im[idx] = (tmp > 0.0) ? tmp : 0.0;
          }
        }
      }
    }
    }
  }

  int main(){
    float * result = (float*)malloc(sizeof(float) * (1 * 64 * 111 * 111));
    conv3x3(3,224,0,2,0,111,sample,conv1_weight,conv1_bias,result);
    for(int i=0;i<64 * 111 * 111;++i){
      //if(result[i]>0)
        printf("%f:%d\n",result[i],i);
    }
    exit(0);
  }

% pgcc -O0 test.c -o cpu.out
% pgcc -ta=tesla -Minfo=accel test.c -o acc.out
conv3x3:
     14, Generating copyin(filter_weight[:1728],filter_bias[:64],input_im[:150528]) [if not already present]
         Generating copyout(output_im[:788544]) [if not already present]
     16, Generating Tesla code
         18, #pragma acc loop gang /* blockIdx.x */
         26, #pragma acc loop vector(128) collapse(2) /* threadIdx.x */
         28,   /* threadIdx.x collapsed */
         34, #pragma acc loop seq
         36, #pragma acc loop seq
         39, #pragma acc loop seq
     26, Loop is parallelizable
     28, Loop is parallelizable
     34, Loop is parallelizable
     36, Loop carried scalar dependence for tmp at line 53,44
     39, Loop carried scalar dependence for tmp at line 53,44
% ./cpu.out > cpu.log
% ./acc.out > acc.log
% diff cpu.log acc.log
%

-Mat

It worked Mat. Thank you very much for your help.