Bitonic Sort

Hi all,
I have implemented Bitonic Sort in several versions based on the one in the CUDA SDK samples. For the ATOMIC version (bitonicSortAtomic), I tried to use __threadfence() in __syncblocks_atomic to maintain memory consistency, but it doesn't work (the output is incorrect). I have to call comparator_volatile instead of comparator to get the correct result. Any ideas?

// (C) Copyright 2013, University of Illinois. All Rights Reserved
#include <stdlib.h>
#include <stdio.h>
#include "parboil.h"

#define THREADS 256
#define BLOCKS 32
#define NUM_VALS (2*THREADS*BLOCKS)

__device__ volatile int mutex = 0;
__device__ inline void __syncblocks_atomic(int goal) {
	__syncthreads();
//	__threadfence();
	int tx = threadIdx.x;
	if (tx == 0) {
		atomicAdd((int *)&mutex, 1);
		while (mutex != goal) {}
	}
	__syncthreads();
}

__device__ inline void comparator(float &A, float &B, uint dir) {
    float t;
    if ((A > B) == dir) {
        t = A;
        A = B;
        B = t;
    }
}

__device__ inline void comparator_volatile(volatile float &A, volatile float &B, uint dir) {
    float t;
    if ((A > B) == dir) {
        t = A;
        A = B;
        B = t;
    }
}

#ifdef NAIVE
__global__ void bitonicSortNaive(float *src, int stride, int size) {
  unsigned int tid = threadIdx.x + blockDim.x * blockIdx.x;
  uint dir = (tid & (size / 2)) == 0;
  unsigned int pos = 2*tid - (tid & (stride - 1));
  comparator(src[pos], src[pos+stride], dir);
}
#endif

#ifdef ATOMIC
__global__ void bitonicSortAtomic(float *src, int length) {
  uint numBlocks = gridDim.x * gridDim.y * gridDim.z;
  uint goalVal = 0;
  uint tid = threadIdx.x + blockDim.x * blockIdx.x;
  for(uint size=2; size<=length; size<<=1) {
    for(uint stride=size>>1; stride>0; stride=stride>>1) {
      uint dir = (tid & (size / 2)) == 0;
      uint pos = 2*tid - (tid & (stride - 1));
      comparator_volatile(src[pos], src[pos+stride], dir);
      if(stride>THREADS || (stride==1 && size>=THREADS)) {
        goalVal += numBlocks;
        __syncblocks_atomic(goalVal);
      }
      else
        __syncthreads();
    } // end for stride
  } // end for size
}
#endif

int main() {
  printf("[BENCH] Bitonic Sort %d elements\n", NUM_VALS);
  printf("[BENCH] Xuhao Chen <cxh@illinois.edu>\n");
#ifdef NAIVE
  printf("[BENCH] Naive version\n");
#endif
#ifdef ATOMIC
  printf("[BENCH] Atomic Barrier\n");
#endif
  float *values = (float*) malloc(NUM_VALS * sizeof(float));
  array_init(values, NUM_VALS); // fill the input array (helper not shown)
  float *dev_values;
  size_t size = NUM_VALS * sizeof(float);
  cudaMalloc((void**) &dev_values, size);
  cudaMemcpy(dev_values, values, size, cudaMemcpyHostToDevice);
  dim3 blocks(BLOCKS,1);
  dim3 threads(THREADS,1);
  cudaDeviceSynchronize();

#ifdef NAIVE
  int j, k;
  for (k = 2; k <= NUM_VALS; k <<= 1) {
    for (j=k>>1; j>0; j=j>>1) {
      bitonicSortNaive<<<blocks, threads>>>(dev_values, j, k);
    }
  }
#endif

#ifdef ATOMIC
  bitonicSortAtomic<<<blocks, threads>>>(dev_values, NUM_VALS);
#endif

  cudaDeviceSynchronize();
  cudaMemcpy(values, dev_values, size, cudaMemcpyDeviceToHost);
  cudaFree(dev_values);
  free(values);
  return 0;
}

In summary, what is the difference between the following two functions?

__device__ inline void comparator_volatile(volatile float &A, volatile float &B, uint dir) {
    float t;
    if ((A > B) == dir) {
        t = A;
        A = B;
        B = t;
    }
}

__device__ inline void comparator(float &A, float &B, uint dir) {
    float t;
    if ((A > B) == dir) {
        t = A;
        A = B;
        B = t;
    }
    __threadfence();
}

Could anyone help me?

The CUDA programming guide states:

If a variable located in global or shared memory is declared as volatile, the compiler assumes that its value can be changed or used at any time by another thread and therefore any reference to this variable compiles to an actual memory read or write instruction.

This basically means that a store is issued to memory immediately when you assign a value to the variable, and a load is fetched directly from memory (bypassing the cache) whenever you read its value.
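For example, here is a minimal hypothetical sketch (the flag variable and kernel names are made up for illustration) of what this means for a spin loop; it is also why the mutex in your __syncblocks_atomic is declared volatile:

// Without volatile, the compiler may keep the value of flag in a register,
// so the loop spins on a stale copy forever. With volatile, every iteration
// compiles to an actual memory read, and a write by another thread is
// eventually observed.
__device__ int flag = 0;

__global__ void spinWithoutVolatile() {
    while (flag == 0) {}   // read may be hoisted out of the loop
}

__global__ void spinWithVolatile() {
    volatile int *p = &flag;
    while (*p == 0) {}     // re-read from memory on every iteration
}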

In your first code sample, since both A and B are volatile, 6 actual memory instructions are generated: one read or write each time you use either A or B. The upside is that other threads are able to see those modifications earlier, while they are being made. The downside is that execution is slower, because the caches are disabled.
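Annotating the first sample line by line makes the count explicit (assuming the compiler applies no further optimization):

__device__ inline void comparator_volatile(volatile float &A, volatile float &B, uint dir) {
    float t;
    if ((A > B) == dir) { // 2 reads (A, B)
        t = A;            // 1 read  (A)
        A = B;            // 1 read  (B), 1 write (A)
        B = t;            // 1 write (B)
    }                     // total: 4 reads + 2 writes = 6 memory instructions
}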

In your second code sample, on the other hand, the GPU is allowed to use its caches to accelerate execution until the end of the function, when it is forced to issue a memory write. If both A and B are already cached, only 2 memory writes are issued. The downside is that other threads might only be able to see the changed values after the fence.
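And the second sample, annotated the same way:

__device__ inline void comparator(float &A, float &B, uint dir) {
    float t;
    if ((A > B) == dir) { // A and B may be read once into registers
        t = A;
        A = B;            // updates can stay in registers/cache...
        B = t;
    }
    __threadfence();      // ...until the fence makes the (at most 2)
                          // pending writes visible to other threads
}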

Another thing you should consider is that these operations are not atomic. If other threads try to access A and B while your function is executing, they might see a partial execution of the function, in both cases. In the second code sample, this is a bit less likely to happen, because the thread will probably use its cached values and flush the final values at once (in any case, you should not rely on this).
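As a hypothetical sketch of that race (raceDemo and its launch configuration are made up for illustration), a reader running in another block can sample the pair mid-swap:

// Launch with two blocks: block 0 swaps, block 1 reads. Depending on
// timing, block 1 may record {5,3}, {3,5}, or the torn state {3,3},
// because the two writes of the swap are not atomic.
__global__ void raceDemo(volatile float *data, float *observed) {
    if (blockIdx.x == 0) {
        float t = data[0];     // e.g. data = {5, 3}
        data[0] = data[1];     // intermediate state: {3, 3}
        data[1] = t;           // final state: {3, 5}
    } else {
        observed[0] = data[0]; // may read any of the states above
        observed[1] = data[1];
    }
}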

Also, volatile works as a faster version of __threadfence() among threads in the same warp (because threads in a warp act synchronously).
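The classic reduction sample from the CUDA SDK relies on exactly this property: inside a single warp it drops __syncthreads() and uses a volatile shared array instead. A sketch of that pattern, assuming the lock-step warp execution this answer describes:

// Final warp of the classic CUDA SDK reduction: threads in one warp run
// in lock-step, so declaring sdata volatile is enough to make each
// partial sum visible to the next step, with no __syncthreads() or
// __threadfence() needed.
__device__ void warpReduce(volatile float *sdata, unsigned int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}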