Mergesort in OpenACC

I am fairly new to OpenACC programming and I am a student. I am trying to write a mergesort using OpenACC, but I am running into dependencies that prevent parallelization. Can mergesort be parallelized in the way I have written it? I also wrote a bitonic sort using OpenACC, and that works fine without dependencies. I tried making the scalar values private, which suppressed the warnings, but the time remained the same: on about 10M elements I get about 5 seconds, while the bitonic version takes less than 0.3 s and the serial version without OpenACC takes about 1 second.


#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Utility function to find minimum of two integers
#pragma acc routine seq
static inline int min(int x, int y) {
    return (x < y) ? x : y;
}

// Iteratively sort array A[low..high] using temporary array
void mergesort(double *restrict A, double *restrict temp, int low, int high)
    int m=0, s=0;
    int i=0,j=0,k=0;
    // divide the array into blocks of size m
    // m = [1, 2, 4, 8, 16...]
    #pragma acc data copy(A[0:high]) copyin(temp[0:high], i, j, k)
                for (m = 1; m <= high - low; m = 2 * m) {

                        #pragma acc parallel loop gang worker vector independent \
                        present(A[0:high], temp[0:high]) \
                        for (s = low; s < high; s += 2 * m) {
                            int from = s;
                            int mid = s + m - 1;
                            int to = min(s + 2 * m - 1, high);

                            k = from;
                            i = from;
                            j = mid + 1;

                            #pragma acc loop independent
                            for (k = from; k <= to; k++) {
                                if (i <= mid && j <= to) {
                                    if (A[i] < A[j]) {
                                        temp[k] = A[i++];
                                    else {
                                        temp[k] = A[j++];
                                } else break;

                            #pragma acc loop independent
                            for(int f=i; f < high; f++){
                                if(f>mid) break;
                                temp[k++] = A[f];

                        #pragma acc parallel loop gang worker vector independent
                        for (int i = 0; i <= high; i++) {
                            A[i] = temp[i];

/* Driver program to test above functions */
int main(int argc, char **argv)
{
    clock_t ts, te;
    int n = atoi(argv[1]);

    double *arr = (double *)malloc(n * sizeof(double));
    double *tmp = (double *)malloc(n * sizeof(double));

    // Init Random Seed

    int i;
    /** Initialize array with random values **/
    for (i = 0; i < n; i++) {
        arr[i] = rand() % n;
        tmp[i] = arr[i];
    }
    /** End random initialization **/

    ts = clock();
    mergesort(arr, tmp, 0, n - 1);
    te = clock();

    if (check_sorted(arr, n)) {
        printf("Sorted\n");
    } else {
        printf("Not sorted\n");
    }
    printf("Time: %0.06fs\n", (double)(te - ts) / CLOCKS_PER_SEC);
    return 0;
}

The output during compilation is:

pgcc -O3 -acc -fast -ta=tesla:managed -Minfo=accel -lm -o mergesort_acc main.c
     50, Generating copyin(k,temp[:high]) [if not already present]
         Generating copy(A[:high]) [if not already present]
         Generating copyin(i,j) [if not already present]
     56, Generating present(temp[:high],A[:high])
     60, Generating Tesla code
         63, #pragma acc loop gang, worker(4), vector(32) /* blockIdx.x threadIdx.y threadIdx.x */
     75, Loop carried scalar dependence for j at line 76
         Loop carried scalar dependence for i at line 78
         Loop carried scalar dependence for j at line 77
         Loop carried scalar dependence for i at line 76,77
         Loop carried scalar dependence for j at line 81
     87, Loop carried dependence of temp-> prevents parallelization
         Loop carried backward dependence of temp-> prevents vectorization
     93, Generating Tesla code
         94, #pragma acc loop gang, worker(4), vector(32) /* blockIdx.x threadIdx.y threadIdx.x */

Hi Shmite,

The inner “k” and “f” loops aren’t parallelizable for two main reasons.

  1. You have “break” in there, which causes an early exit. All iterations need to be independent, but a break creates a dependency: the iterations after the break must not execute, and the only way to guarantee that is for each iteration to complete before the next one starts.
  2. You are incrementing the “i”, “j”, and “k” variables. Since these are shared within the “s” loop, this causes a race condition where one iteration may read a variable before another iteration increments it, and hence reads a stale value.

Note that “i”, “j”, and “k” can’t be both global and private variables. Hence, you should remove them from the data directive, since listing them there makes them global (though I think the compiler is ignoring that you have them there).

Also, “independent” is implied for “parallel” regions, so it isn’t needed (though it doesn’t hurt, beyond the extra typing). Likewise, you don’t need to specify scalars as “firstprivate” since this is the default.
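To make the scoping point concrete: one way to restructure the pass is to declare the counters inside the “s” loop body so each iteration gets fresh private copies, and drop the redundant clauses. This is just a sketch (the helper names “merge_pass” and “imin” are mine, not from your code), and it keeps the inner merge sequential, per point 1 above:

```c
#include <stdio.h>

#pragma acc routine seq
static int imin(int x, int y) { return (x < y) ? x : y; }

// One bottom-up pass: merge adjacent runs of length m from A into temp.
// i and j are declared inside the loop body, so they are automatically
// private per iteration -- no data-clause entries or private() lists
// needed, and scalars are firstprivate by default anyway.
void merge_pass(double *restrict A, double *restrict temp,
                int low, int high, int m) {
    // note: the arrays hold high+1 elements; [0:high] would miss the last one
    #pragma acc parallel loop present(A[0:high + 1], temp[0:high + 1])
    for (int s = low; s <= high; s += 2 * m) {
        int mid = imin(s + m - 1, high);
        int to  = imin(s + 2 * m - 1, high);
        int i = s, j = mid + 1;       // private: fresh copies per iteration
        #pragma acc loop seq          // the merge itself stays sequential
        for (int k = s; k <= to; k++) {
            if (j > to || (i <= mid && A[i] <= A[j]))
                temp[k] = A[i++];
            else
                temp[k] = A[j++];
        }
    }
}
```

The explicit “loop seq” states the sequential intent directly instead of asking the compiler to parallelize a loop it can’t.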

I couldn’t link your code since the definition for “check_sorted” is missing, so I commented it out in order to run it, and I do see the poor performance.
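In case anyone else wants to reproduce this, a minimal stand-in for the missing routine might look like the following (my guess at the intended semantics, not the original definition):

```c
// Hypothetical stand-in for the missing check_sorted: returns 1 if
// arr[0..n) is in non-decreasing order, 0 otherwise.
int check_sorted(const double *arr, int n) {
    for (int i = 1; i < n; i++)
        if (arr[i - 1] > arr[i]) return 0;
    return 1;
}
```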

You’ll want to run the code through a profiler like Nsight-Compute to understand what’s going on with the performance. But given that the code is mostly just moving bits, it’s most likely memory bound. And since the inner loops aren’t parallelizable, memory isn’t being accessed contiguously, so memory divergence is quite high.

While I don’t have any good suggestions on how to fix this, I did find a thread on StackOverflow about a CUDA implementation of merge sort, with the key point being:

The GPU is actually slower than the naive single-threaded CPU code when the memory organization is row major. But for the column-major organization (which tends to improve opportunities for coalesced access) the GPU code is about 10x faster than the CPU code for my test case. This ~10x speedup factor is roughly in the range (~10-20x) of the speedup factors shown in the paper for a GPU MergePath 32-bit integer speedup vs. x86 serial merge.

So while I don’t think you’ll be able to parallelize the inner loops directly (in the few CUDA implementations I looked at, they all do this part sequentially), you might be able to use a stride as part of the index to get better memory access.
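For reference, the partitioning trick those CUDA implementations use (often called “co-rank” or “MergePath”, e.g. in Kirk and Hwu’s GPU textbook) can be sketched in plain C with OpenACC directives. The names “co_rank” and “parallel_merge” are mine; treat this as an illustration of the idea, not a drop-in replacement. Each worker binary-searches its own input sub-ranges, so no counters are shared:

```c
#include <stdio.h>

// Co-rank search: for output position k of merging A[0..m) with B[0..n),
// return i such that the first k outputs consume exactly i elements of A
// and k - i elements of B.
#pragma acc routine seq
static int co_rank(int k, const double *A, int m, const double *B, int n) {
    int i = k < m ? k : m;
    int j = k - i;
    int i_low = k - n > 0 ? k - n : 0;
    int j_low = k - m > 0 ? k - m : 0;
    for (;;) {
        if (i > 0 && j < n && A[i - 1] > B[j]) {
            int d = (i - i_low + 1) / 2;   // step i down, j up
            j_low = j; j += d; i -= d;
        } else if (j > 0 && i < m && B[j - 1] >= A[i]) {
            int d = (j - j_low + 1) / 2;   // step j down, i up
            i_low = i; i += d; j -= d;
        } else {
            return i;
        }
    }
}

// Merge sorted A[0..m) and B[0..n) into C using nchunks independent
// workers. Each worker finds its private sub-ranges via co_rank, so the
// shared i/j/k dependency from the original inner loop is gone.
void parallel_merge(const double *restrict A, int m,
                    const double *restrict B, int n,
                    double *restrict C, int nchunks) {
    int total = m + n;
    int chunk = (total + nchunks - 1) / nchunks;
    #pragma acc parallel loop copyin(A[0:m], B[0:n]) copyout(C[0:total])
    for (int p = 0; p < nchunks; p++) {
        int k0 = p * chunk;
        int k1 = k0 + chunk < total ? k0 + chunk : total;
        if (k0 >= total) continue;
        int i = co_rank(k0, A, m, B, n), i_end = co_rank(k1, A, m, B, n);
        int j = k0 - i,                  j_end = k1 - i_end;
        // Sequential merge of this worker's private sub-ranges.
        for (int k = k0; k < k1; k++) {
            if (j >= j_end || (i < i_end && A[i] <= B[j])) C[k] = A[i++];
            else                                           C[k] = B[j++];
        }
    }
}
```

Because each chunk writes a contiguous slice of C, the stores coalesce much better than the single sequential merge does.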