How do I sort using CUDA?

11friends1119 · July 9, 2019, 7:01am

Hello? I am a student studying CUDA.

Since CUDA is different from C language coding method, it seems to be different from C language Merge sort, Bitonic sort and so on, but I do not know how to do it.

I have tried to refer to CUDA Sort Sample, but I do not understand it. Is there any coding or open source to refer to Sorting with CUDA?

Thank you for reading and would appreciate your reply.

Robert_Crovella · July 9, 2019, 2:19pm

There is a lot of good information on GPU based parallel sorting on moderngpu

Here is one example for merge sort:

[url]https://moderngpu.github.io/mergesort.html[/url]

(fast, efficient) sorting can be a challenging task in a thread-parallel architecture like CUDA. So although simple algorithms like bubble sort and merge sort are easy to understand in a naive serial implementation, if you want to write a fast, efficient sort in CUDA, the complexity is substantially higher than these trivial serial versions. So unless you have considerable expertise with CUDA, it is not surprising in my opinion that these sort methods seem hard to understand.

There are also scientific/technical papers written about sorting on the GPU. You are likely to find more complete treatments there than anything someone is likely to write for you in an ad-hoc fashion on this forum.

Here is one such example:

[url]https://arxiv.org/pdf/1709.02520.pdf[/url]

FloridaMan · July 9, 2019, 3:30pm

I am also very new to CUDA (maybe a couple weeks into it), but I’m one of the best when it comes to working with relational databases. I’ve found things that carry over to CUDA from databases and embedded to be very important to keep in mind.

Hopefully I’m correct on this. I’ve identified two fundamentals with CUDA that are foreign to standard C development:

It is important to think in sets.

It’s my understanding that one CUDA warp processes 32 threads simultaneously, but not like a processor that is switching threads, and not like two CPU cores working independently at the same time. The CUDA warp has 32 copies of thread-specific registers and the instruction applies to every single thread at the same time. If you have an if…else block and one thread has the “if” portion as true and the other is false, the instructions for both branches are executed (only they are not applied to the false portion of the branch). This is called “Warp Divergence”. Any execution that is not a multiple of 32 threads leaves processing power on the table. Any execution that has a large number of branching if statements and control flow can create a penalty because of execution. I believe this is to increase density on the silicone chips. 32 is likely the Goldilocks number for the warp size.

This guy does a good job explaining this: [url]http://people.maths.ox.ac.uk/~gilesm/cuda/lecs/lec3-2x2.pdf[/url]

Understand #1 and then it makes sense why external memory operations are so expensive. I believe, and I’m not yet sure on this, that when you have a bunch of threads in the same warp that need to access external memory randomly, and they’re not accessing the same page, that when external memory is being fetched that all other threads in a warp must wait for the fetch operation to complete.

I think of a warp like this machine. Each pancake is a thread. If there was no pancake in a position, the machine must still wait as all threads perform their instruction at the same time.

Topic		Replies	Views
cuda integer operations and simt for sorting CUDA Programming and Performance	7	8873	July 25, 2009
CUDA Use Cases run serial algorithms on composite data CUDA Programming and Performance	14	4492	October 24, 2008
How do CUDA cores on a SM execute warps concurrently? CUDA Programming and Performance	8	28707	July 4, 2019
Programming issue How to convert for loops into threads CUDA Programming and Performance	4	2185	February 13, 2010
Random memory access and += Advice needed CUDA Programming and Performance	4	2413	August 17, 2008
CUDA basic discussions(Looping, branching) Why GPU faster than CPU when both use C? CUDA Programming and Performance	6	18682	August 22, 2007
execution time on CPU and GPU CUDA Programming and Performance	6	1131	February 26, 2015
Can CUDA do sequential processing? CUDA Programming and Performance	7	6584	August 24, 2011
Race condition within warp CUDA Programming and Performance	9	2982	September 20, 2016
Optimize - Many small operations (CPU is faster for now?) CUDA Programming and Performance	2	512	July 11, 2019

How do I sort using CUDA?

Related topics