How do I sort using CUDA?

Hello? I am a student studying CUDA.

Since CUDA is different from C language coding method, it seems to be different from C language Merge sort, Bitonic sort and so on, but I do not know how to do it.

I have tried to refer to CUDA Sort Sample, but I do not understand it. Is there any coding or open source to refer to Sorting with CUDA?

Thank you for reading and would appreciate your reply.

There is a lot of good information on GPU based parallel sorting on moderngpu

Here is one example for merge sort:

(fast, efficient) sorting can be a challenging task in a thread-parallel architecture like CUDA. So although simple algorithms like bubble sort and merge sort are easy to understand in a naive serial implementation, if you want to write a fast, efficient sort in CUDA, the complexity is substantially higher than these trivial serial versions. So unless you have considerable expertise with CUDA, it is not surprising in my opinion that these sort methods seem hard to understand.

There are also scientific/technical papers written about sorting on the GPU. You are likely to find more complete treatments there than anything someone is likely to write for you in an ad-hoc fashion on this forum.

Here is one such example:

I am also very new to CUDA (maybe a couple weeks into it), but I’m one of the best when it comes to working with relational databases. I’ve found things that carry over to CUDA from databases and embedded to be very important to keep in mind.

Hopefully I’m correct on this. I’ve identified two fundamentals with CUDA that are foreign to standard C development:

  1. It is important to think in sets.

It’s my understanding that one CUDA warp processes 32 threads simultaneously, but not like a processor that is switching threads, and not like two CPU cores working independently at the same time. The CUDA warp has 32 copies of thread-specific registers and the instruction applies to every single thread at the same time. If you have an if…else block and one thread has the “if” portion as true and the other is false, the instructions for both branches are executed (only they are not applied to the false portion of the branch). This is called “Warp Divergence”. Any execution that is not a multiple of 32 threads leaves processing power on the table. Any execution that has a large number of branching if statements and control flow can create a penalty because of execution. I believe this is to increase density on the silicone chips. 32 is likely the Goldilocks number for the warp size.

This guy does a good job explaining this:

  1. Understand #1 and then it makes sense why external memory operations are so expensive. I believe, and I’m not yet sure on this, that when you have a bunch of threads in the same warp that need to access external memory randomly, and they’re not accessing the same page, that when external memory is being fetched that all other threads in a warp must wait for the fetch operation to complete.

I think of a warp like this machine. Each pancake is a thread. If there was no pancake in a position, the machine must still wait as all threads perform their instruction at the same time.