I have some 2-dimension data. I need to sort each column of these data. I am using radix sort to finish this task.
I met a problem. I thought my sorting algorithm was not related to the data itself. But the test results suggest otherwise.
My code of sorting algorithm is:
__global__ void radix_sort(uint32_t* data, uint32_t cols, uint16_t rows, uint32_t* bucket) {
const int tid = blockDim.x * blockIdx.x + threadIdx.x;
uint16_t count1 = rows, count2 = rows;
uint16_t count3, count4;
if (tid < cols) {
for (uint8_t pos = 0; pos < 32; pos++) {
if (pos % 2 == 0) {
count3 = 0, count4 = rows - 1;
for (uint16_t i = 0; i < count1; i++) {
if ((data[tid + i * cols] & (1 << pos)) == 0) {
bucket[tid + count3 * cols] = data[tid + i * cols];
count3 += 1;
} else {
bucket[tid + count4 * cols] = data[tid + i * cols];
count4 -= 1;
}
}
for (uint16_t i = rows - 1; i > count2; i--) {
if ((data[tid + i * cols] & (1 << pos)) == 0) {
bucket[tid + count3 * cols] = data[tid + i * cols];
count3 += 1;
} else {
bucket[tid + count4 * cols] = data[tid + i * cols];
count4 -= 1;
}
}
} else {
count1 = 0, count2 = rows - 1;
for (uint16_t i = 0; i < count3; i++) {
if ((bucket[tid + i * cols] & (1 << pos)) == 0) {
data[tid + count1 * cols] = bucket[tid + i * cols];
count1 += 1;
} else {
data[tid + (count2 * cols)] = bucket[tid + i * cols];
count2 -= 1;
}
}
for (uint16_t i = rows - 1; i > count4; i--) {
if ((bucket[tid + i * cols] & (1 << pos)) == 0) {
data[tid + count1 * cols] = bucket[tid + i * cols];
count1 += 1;
} else {
data[tid + (count2 * cols)] = bucket[tid + i * cols];
count2 -= 1;
}
}
}
}
for (uint16_t i = 0; i < count1; i++) bucket[tid + i * cols] = data[tid + i * cols];
for (uint16_t i = count2 + 1; i < rows; i++) bucket[tid + i * cols] = data[tid + (count2 + rows - i) * cols];
}
}
In the parameters of kernel function, cols represents total number of columns in the data, rows is total number of rows. Bucket is the temporary space for sorting, and the data after sorting is also stored in bucket.
In my understanding, since I am using a non-comparison algorithm, the execution speed of the sorting kernel function should be the same regardless of how the data is distributed. However, in reality, the more irregular the data, the longer it takes to sort; the more regular the data, for example, if each row is the same, the faster the sorting.