Is the performance of CUDA radix sort related to the data itself?

I have some two-dimensional data, and I need to sort each column of it. I am using radix sort for this task.

I have run into a problem. I thought my sorting algorithm's runtime was independent of the data itself, but the test results suggest otherwise.

My sorting kernel is:

// Per-column LSD binary radix sort: thread tid sorts column tid of a
// rows x cols matrix (element (i, j) stored at data[j + i * cols]).
// Passes ping-pong between data and bucket: in each pass, keys whose
// current bit is 0 are placed at the front (counting up) and keys whose
// bit is 1 are placed from the back (counting down), so the back section
// is stored reversed and must be read back-to-front in the next pass to
// keep the sort stable.
__global__ void radix_sort(uint32_t* data, uint32_t cols, uint16_t rows, uint32_t* bucket) {
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    uint16_t count1 = rows, count2 = rows;  // front/back fill counters for data
    uint16_t count3, count4;                // front/back fill counters for bucket

    if (tid < cols) {
        // note: 1u << pos rather than 1 << pos, since 1 << 31 overflows signed int
        for (uint8_t pos = 0; pos < 32; pos++) {
            if (pos % 2 == 0) {  // even pass: data -> bucket
                count3 = 0, count4 = rows - 1;
                // front section of data, read in order
                for (uint16_t i = 0; i < count1; i++) {
                    if ((data[tid + i * cols] & (1u << pos)) == 0) {
                        bucket[tid + count3 * cols] = data[tid + i * cols];
                        count3 += 1;
                    } else {
                        bucket[tid + count4 * cols] = data[tid + i * cols];
                        count4 -= 1;
                    }
                }
                // back section of data, stored reversed, so read back-to-front
                for (uint16_t i = rows - 1; i > count2; i--) {
                    if ((data[tid + i * cols] & (1u << pos)) == 0) {
                        bucket[tid + count3 * cols] = data[tid + i * cols];
                        count3 += 1;
                    } else {
                        bucket[tid + count4 * cols] = data[tid + i * cols];
                        count4 -= 1;
                    }
                }
            } else {  // odd pass: bucket -> data
                count1 = 0, count2 = rows - 1;
                for (uint16_t i = 0; i < count3; i++) {
                    if ((bucket[tid + i * cols] & (1u << pos)) == 0) {
                        data[tid + count1 * cols] = bucket[tid + i * cols];
                        count1 += 1;
                    } else {
                        data[tid + count2 * cols] = bucket[tid + i * cols];
                        count2 -= 1;
                    }
                }
                for (uint16_t i = rows - 1; i > count4; i--) {
                    if ((bucket[tid + i * cols] & (1u << pos)) == 0) {
                        data[tid + count1 * cols] = bucket[tid + i * cols];
                        count1 += 1;
                    } else {
                        data[tid + count2 * cols] = bucket[tid + i * cols];
                        count2 -= 1;
                    }
                }
            }
        }
        // final copy into bucket: front section as-is, back section un-reversed
        for (uint16_t i = 0; i < count1; i++) bucket[tid + i * cols] = data[tid + i * cols];
        for (uint16_t i = count2 + 1; i < rows; i++) bucket[tid + i * cols] = data[tid + (count2 + rows - i) * cols];
    }
}

Regarding the kernel parameters: cols is the total number of columns in the data and rows is the total number of rows. bucket is the temporary space used during sorting, and the sorted data also ends up in bucket.
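For context, a minimal host-side launch might look like the following. This is a sketch only: the sizes, grid configuration, and the way the matrix is filled are assumptions for illustration, not part of the original code:

#include <cstdint>
#include <cuda_runtime.h>

int main() {
    const uint32_t cols = 1024;   // assumed problem size
    const uint16_t rows = 4096;
    const size_t bytes = (size_t)cols * rows * sizeof(uint32_t);

    uint32_t *d_data, *d_bucket;
    cudaMalloc(&d_data, bytes);
    cudaMalloc(&d_bucket, bytes);
    // ... copy the rows x cols matrix into d_data, element (i, j) at j + i * cols ...

    const int block = 256;                        // one thread per column
    const int grid = (cols + block - 1) / block;
    radix_sort<<<grid, block>>>(d_data, cols, rows, d_bucket);
    cudaDeviceSynchronize();                      // sorted columns are now in d_bucket

    cudaFree(d_data);
    cudaFree(d_bucket);
    return 0;
}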

In my understanding, since I am using a non-comparison sort, the execution speed of the kernel should be the same regardless of how the data is distributed. In reality, however, the more irregular the data, the longer the sort takes; the more regular the data (for example, if every row is identical), the faster it is.

Certainly the performance will be somewhat different when all the threads in a warp satisfy the if condition (or all satisfy the else condition) than when some threads in the warp satisfy the if and some do not. Here, for example:

                if ((data[tid + i * cols] & (1u << pos)) == 0) {
                    bucket[tid + count3 * cols] = data[tid + i * cols];
                    count3 += 1;
                } else {
                    bucket[tid + count4 * cols] = data[tid + i * cols];
                    count4 -= 1;
                }

And there are certainly other examples besides that one. Whether that answers your question or explains the time discrepancy you are observing, I don't know. But it does indicate that there may be some data dependency in the observed performance of the code you have written.

While I would normally expect the compiler to perform the following optimization, I have seen cases where it may not. So you might try manually hoisting the loads, as follows:

            uint32_t ival = data[tid + i*cols];
            if ((ival & (1u << pos)) == 0) {
                bucket[tid + count3 * cols] = ival;
                count3 += 1;
            } else {
                bucket[tid + count4 * cols] = ival;
                count4 -= 1;
            }

but that would not completely eliminate the potential performance discrepancy I described, because issuing two writes to two different locations in bucket is not equivalent to issuing a single write to that area. To address that remaining issue, we could refactor further to eliminate the two writes:

        uint32_t ival = data[tid + i*cols];
        uint32_t my_idx;
        if ((ival & (1u << pos)) == 0) {
            my_idx = tid + count3 * cols;
            count3 += 1;
        } else {
            my_idx = tid + count4 * cols;
            count4 -= 1;
        }
        bucket[my_idx] = ival;

You could try those kinds of refactorings to see whether the variability improves. This does not completely address it: in one case we have a coalesced write at the end, and in the other case we don't, and we still have the if/else, which might behave a bit differently in the two cases performance-wise. But that should be a smaller effect than inefficient use of the memory subsystem.
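Going one step further, one could remove the branch entirely and let the compiler emit predicated/select instructions. This is a sketch of the idea, not code from the original kernel:

        uint32_t ival = data[tid + i * cols];
        bool is_zero = (ival & (1u << pos)) == 0;
        // select the destination index without branching
        uint32_t my_idx = tid + (is_zero ? count3 : count4) * cols;
        count3 += is_zero;    // advances only when the bit is 0
        count4 -= !is_zero;   // retreats only when the bit is 1
        bucket[my_idx] = ival;

Whether this actually beats the if/else depends on what the compiler already does with the branchy version; inspecting the generated SASS would settle it.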

I wonder whether applying __restrict__ to the pointer arguments would induce the compiler to transform the code into the desired single-write variant.
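Since data and bucket never alias in this kernel, the qualifier would be legitimate here. Concretely, the signature would become (just the two qualifiers added; whether this actually enables the single-write transformation is up to the compiler):

__global__ void radix_sort(uint32_t* __restrict__ data, uint32_t cols,
                           uint16_t rows, uint32_t* __restrict__ bucket)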

I am still on my first coffee of the day, but it seems to me that without the use of restricted pointers this is not a valid transformation for the compiler to make.

I will reiterate what I learned decades ago from a seasoned SW engineer and have found to be true ever since: in C and C++, any integer wants to be of type int, unless there is a good reason for it to be some other type. Generally speaking, the use of unsigned integer types places more restrictions on the compiler due to the wrap-around behavior prescribed for them. Whether that impacts any possible loop transformations here I cannot say.
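Applied to this kernel, that advice would amount to declaring the loop counters as plain int, e.g. (a sketch of the change only; the original code uses uint16_t):

// A signed loop counter cannot wrap without undefined behavior, so the
// compiler may assume i is monotonically increasing, which can enable
// loop transformations that uint16_t wrap-around semantics would block.
for (int i = 0; i < count1; i++) {
    uint32_t ival = data[tid + i * cols];
    // ... partition ival into bucket as before ...
}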

Writing into different rows potentially influences the cache hit rate.

I would use Nsight Compute to determine whether the number of memory accesses, their coalescing, the cache hit rate, or something else is the actual reason for the performance differences.
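For example, a full metric capture from the command line could look like this (the binary name is a placeholder):

ncu --set full -o report ./column_sort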

As I understand it, you mostly want to speed up the processing (and not optimize it to always take the same, worst-case time, as cryptographic functions must in order not to open a timing side channel).

Thanks a lot for your valuable suggestion; I have tried it. While the sorting performance has improved slightly, the time discrepancy remains quite noticeable.
I suspect the reason is that I am sorting each column of two-dimensional data: even if the data in each row is the same, the data within each column differs, which means the data within the same warp is not identical. Essentially, in most cases some threads within a warp meet the if condition while others meet the else condition.
Thanks again for your patient explanation.

In the worst case, the diverging threads within a warp should slow the program down by a factor of 2 (and not more), as there can be two groups of threads, one in the if and one in the else. You could try adding a __syncwarp() after each such block, but I would assume ptxas does that by itself.
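In code, that suggestion would look something like the following (placement is illustrative, based on the refactored loop body from earlier in the thread):

if ((ival & (1u << pos)) == 0) {
    bucket[tid + count3 * cols] = ival;
    count3 += 1;
} else {
    bucket[tid + count4 * cols] = ival;
    count4 -= 1;
}
__syncwarp();   // explicit reconvergence point after the divergent branch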

Thanks for your interesting and constructive suggestions. I have tried it, and it has improved the performance of my program to some extent.

Thanks a lot. I am struggling with the speed of my program. One way to speed it up is to add the following code at the beginning of my kernel function:

if (tid < cols) {
    uint32_t data_max = 0;
    // scan the column once to find its maximum value
    for (uint16_t k = 0; k < rows; k++) {
        if (data[tid + k * cols] > data_max) data_max = data[tid + k * cols];
    }
    // highest set-bit position of the maximum value, plus one
    uint8_t idx_max = 32 - __clz(data_max);

    for (uint8_t pos = 0; pos < idx_max; pos++) {
............

With that change, the performance of the kernel is definitely related to the distribution of the data. However, the speed is still not enough, and I have no other ideas for the moment. There seems to be little discussion online about sorting each column of data.

Use Nsight Compute to find the limiting factor. Is it the computation speed? The memory accesses? Overall bandwidth, cache hit rate, accesses not being coalesced enough, or not enough threads/warps active?

Perhaps counter-intuitively, sorting columns does not lend itself to parallel acceleration as easily as sorting rows. There is considerable work on CUDA sorting where the data is adjacent in memory. If you can convert your column sort into a row sort, the problem can be handled with a single thrust call (a segmented sort, i.e. sort-by-key) or via cub, and it might be faster than what you have now.
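As a sketch of what the cub variant could look like, assuming the data has been stored (or restructured) row-major so that each segment to be sorted is contiguous in memory; the function and variable names here are illustrative:

#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

// Sort each of num_segments contiguous segments of length segment_len,
// reading from d_in and writing to d_out (both device pointers).
void segmented_sort(const uint32_t* d_in, uint32_t* d_out,
                    int num_segments, int segment_len) {
    const int num_items = num_segments * segment_len;

    // Segment i spans [i * segment_len, (i + 1) * segment_len).
    thrust::device_vector<int> offsets(num_segments + 1);
    thrust::sequence(offsets.begin(), offsets.end(), 0, segment_len);
    const int* d_offsets = thrust::raw_pointer_cast(offsets.data());

    // CUB is called twice: first to query the temp storage size, then to sort.
    void* d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceSegmentedRadixSort::SortKeys(d_temp, temp_bytes,
        d_in, d_out, num_items, num_segments, d_offsets, d_offsets + 1);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceSegmentedRadixSort::SortKeys(d_temp, temp_bytes,
        d_in, d_out, num_items, num_segments, d_offsets, d_offsets + 1);
    cudaFree(d_temp);
}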

I'm not suggesting you transpose the data to make this possible; but if you can reorganize your data storage concept to allow it, without hampering the other work you are doing, that may be another avenue to explore.

Although maybe not convenient to use here, cupy sort probably knows how to do a sensible job of sorting columns of an array.

Thanks a lot for your constructive suggestion. I have tried the profiling tool in Nsight Compute. When sorting different data, I didn't notice significant differences in the factors you mentioned. Of course, that could be a flaw in my analysis method; I am a beginner with CUDA and not yet familiar with performance analysis of CUDA programs.

Robert_Crovella suggested that I could convert sorting each column into a segmented sort. I believe that is indeed a promising approach; it would not be easy to optimize my kernel much further along the previous lines, even once I have identified the limiting factor.

Good idea👍
There is much discussion about segmented sort online. I believe this is a promising idea.

Robert's suggestion about sorting rows instead is really intriguing (counter-intuitive at first, yet it makes sense on second thought, which is where one learns the most), and seems to be the best approach to try for your case.

As general advice, when using Nsight Compute I would recommend not only comparing different implementations (or the same implementation with different data) to each other, but also estimating the best case for each performance metric and drilling further down:

For general comparison-based sorting you need O(n log n) comparisons; radix sort, on the other hand, uses O(n·w) operations, with n the number of keys and w the number of passes determined by the key length. For example, with 32-bit keys your bit-at-a-time implementation makes w = 32 passes, whereas sorting 8 bits per pass (radix 256) would need only 4.

Calculate how often you access the data, which memory you access, and which cache level you would want to use. What is the maximum bandwidth of that memory or cache level? How does that compare to the measured results: the accessed data size, the number of memory transactions, the cache hit rate? For example, a bit-at-a-time pass over 32-bit keys reads and writes every key once per bit, i.e. roughly 32 × 2 × 4 = 256 bytes of traffic per key if nothing stays in cache; dividing the total traffic by the device's DRAM bandwidth gives a lower bound on the runtime.

How many comparisons or computations are you executing; is your algorithm compute-bound? (For radix sort, the INT32 ALU performance could be critical.) Which ALU operations (index calculations, numerical calculations, logical instructions) are being executed? Perhaps even look into the SASS code of your innermost loops.
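The SASS for the kernel can be dumped from the compiled binary, e.g. (binary name assumed):

cuobjdump -sass ./column_sort > kernel.sass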

Is your algorithm reaching that best case in each category? What is the bottleneck? By which factor is your algorithm slower? Why?

Alternatively, if you already reach the best case, is there an alternative organization of your algorithm, or a different algorithm altogether, that fits better?

Of course your performance can depend on the data, but you should identify your main bottleneck first in order to analyze further how it is affected by different data.
