My Syncthreads function seems to be a pain in the a**

ganeshprofess · July 25, 2023, 2:57pm

I first implemented Bitonic Sort on the GPU using CUDA and it worked. I also had a CUDA Merge Sort, and that worked well too. But when I combined the same functions into a single project, where an if/else choice picks which one runs, the __syncthreads() call in my Bitonic Sort GPU kernel always throws an error. I can’t figure out why.

This is the function
// GPU kernel for Bitonic Sort
__global__ void bitonicSortGPU(int* arr, int size) {
__shared__ int sharedArr[8192];

int tid = threadIdx.x;
int gid = threadIdx.x + blockIdx.x * blockDim.x;

// Load data from global memory to shared memory
if (gid < size) {
    sharedArr[tid] = arr[gid];
}
else {
    // Set out-of-range elements to a large value (sentinel)
    sharedArr[tid] = INT_MAX;
}

// Synchronize to ensure all threads have loaded the data
__syncthreads();

// Bitonic sort algorithm
for (int k = 2; k <= size; k *= 2) {
    for (int j = k / 2; j > 0; j /= 2) {
        int ixj = tid ^ j;

        // Check if the indices are within bounds
        if (ixj < size) {
            // Sort in ascending order
            if (tid < ixj) {
                if ((tid & k) == 0 && sharedArr[tid] > sharedArr[ixj]) {
                    int temp = sharedArr[tid];
                    sharedArr[tid] = sharedArr[ixj];
                    sharedArr[ixj] = temp;
                }
                if ((tid & k) != 0 && sharedArr[tid] < sharedArr[ixj]) {
                    int temp = sharedArr[tid];
                    sharedArr[tid] = sharedArr[ixj];
                    sharedArr[ixj] = temp;
                }
            }
        }

        // Synchronize after each comparison and swap
        __syncthreads();
    }
}

// Copy sorted data back to global memory
if (gid < size) {
    arr[gid] = sharedArr[tid];
}

}
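
One thing worth checking in the kernel above, independent of whatever the reported error turns out to be: the loops run up to size rather than up to the block size, so tid ^ j can point past the blockDim.x elements the block actually loaded into shared memory (and past the 8192-element array for large inputs). Below is a minimal sketch of a block-local variant, assuming a power-of-two block size of at most 1024. The kernel name bitonicSortTile and the host driver are hypothetical and not part of the original project, and the sketch only sorts each block's tile, not the whole array.

```cuda
#include <climits>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: each block sorts its own tile of blockDim.x elements.
// Assumes blockDim.x is a power of two and at most 1024.
__global__ void bitonicSortTile(int* arr, int size) {
    __shared__ int tile[1024];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Pad out-of-range slots with a sentinel so they sink to the end of the tile.
    tile[tid] = (gid < size) ? arr[gid] : INT_MAX;
    __syncthreads();

    // Loop bounds use the tile size, so tid ^ j always stays inside the tile.
    for (int k = 2; k <= (int)blockDim.x; k *= 2) {
        for (int j = k / 2; j > 0; j /= 2) {
            int ixj = tid ^ j;
            if (ixj > tid) {
                bool ascending = (tid & k) == 0;
                int a = tile[tid];
                int b = tile[ixj];
                if (ascending ? (a > b) : (a < b)) {
                    tile[tid] = b;
                    tile[ixj] = a;
                }
            }
            // Reached by every thread of the block, outside any divergent branch.
            __syncthreads();
        }
    }

    if (gid < size) {
        arr[gid] = tile[tid];
    }
}

int main() {
    const int size = 1000;            // deliberately not a multiple of the block size
    const int threadsPerBlock = 256;  // power of two, as the kernel assumes
    const int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;

    std::vector<int> host(size);
    for (int i = 0; i < size; ++i) host[i] = size - i;  // descending input

    int* gpuArr = nullptr;
    cudaMalloc(&gpuArr, size * sizeof(int));
    cudaMemcpy(gpuArr, host.data(), size * sizeof(int), cudaMemcpyHostToDevice);

    bitonicSortTile<<<blocksPerGrid, threadsPerBlock>>>(gpuArr, size);

    cudaMemcpy(host.data(), gpuArr, size * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(gpuArr);

    // Verify ordering inside each tile; tile boundaries are skipped on purpose.
    bool ok = true;
    for (int i = 1; i < size; ++i) {
        if (i % threadsPerBlock != 0 && host[i - 1] > host[i]) ok = false;
    }
    printf("each tile sorted: %s\n", ok ? "yes" : "no");
    return ok ? 0 : 1;
}
```

Sorting the full array would still need a merge step across tiles (or a global-memory bitonic pass), which this sketch leaves out.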

This is how I call the function. I allocate the necessary CUDA buffers before the else block and free them after it.
else
{
    // GPU variables
    int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;

    cudaEventRecord(startGPU);
    bitonicSortGPU <<<blocksPerGrid, threadsPerBlock >>> (gpuArr, size);
    cudaEventRecord(stopGPU);

    // Perform CPU Bitonic Sort and measure time
    startCPU = clock();
    bitonicSortCPU(carr, size);
    endCPU = clock();
}
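
To see what the runtime actually reports for the launch in the call above, one option is to check the launch result and wait on the stop event before reading the timing. This is a minimal sketch, assuming a hypothetical CUDA_CHECK macro and a placeholder kernel dummyKernel standing in for bitonicSortGPU. The variable names mirror the snippet above, but the sizes are made up for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: print the CUDA error string and abort on failure.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for bitonicSortGPU.
__global__ void dummyKernel(int* arr, int size) {
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < size) arr[gid] = gid;
}

int main() {
    const int size = 1 << 20;         // illustrative size
    const int threadsPerBlock = 256;  // illustrative block size
    const int blocksPerGrid = (size + threadsPerBlock - 1) / threadsPerBlock;

    int* gpuArr = nullptr;
    CUDA_CHECK(cudaMalloc(&gpuArr, size * sizeof(int)));

    cudaEvent_t startGPU, stopGPU;
    CUDA_CHECK(cudaEventCreate(&startGPU));
    CUDA_CHECK(cudaEventCreate(&stopGPU));

    CUDA_CHECK(cudaEventRecord(startGPU));
    dummyKernel<<<blocksPerGrid, threadsPerBlock>>>(gpuArr, size);
    // Reports launch-time problems such as an invalid configuration.
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaEventRecord(stopGPU));

    // Blocks until the kernel has finished; errors raised during execution
    // (for example an out-of-bounds access) surface here.
    CUDA_CHECK(cudaEventSynchronize(stopGPU));

    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, startGPU, stopGPU));
    printf("GPU time: %.3f ms\n", ms);

    CUDA_CHECK(cudaEventDestroy(startGPU));
    CUDA_CHECK(cudaEventDestroy(stopGPU));
    CUDA_CHECK(cudaFree(gpuArr));
    return 0;
}
```

If every check passes at run time and the complaint appears only in the editor or at build time, that points toward the project or compiler setup rather than the kernel itself.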
