Assume that I would like to implement bubble sort in CUDA,

using one thread per array index, similar in spirit to other sorting networks.

Now, what I am stumbling over is the question how to swap elements between adjacent threads that reside in different blocks.

For instance, the last thread in the first block and the first thread in the second block need to compare-and-swap their elements.

For more detail:

I would like to place the iteration *inside* the kernel, i.e., no inter-block synchronisation via the implicit sync between kernel launches.

(Each thread has to compare-and-swap its element with the neighbour N times.)
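What this describes is odd-even transposition sort, the sorting-network form of bubble sort. A minimal host-side C++ sketch of the pattern (function name is mine); each inner iteration corresponds to one thread's compare-and-swap in a phase, and all pairs within a phase are independent, which is why they could map to one CUDA thread per index:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Odd-even transposition sort: the sorting-network form of bubble sort.
// Each of the N phases compare-and-swaps disjoint neighbour pairs, so all
// pairs within a phase could run in parallel (one thread per index).
// After N phases the array is sorted.
void oddEvenTranspositionSort(std::vector<int>& a) {
    const std::size_t n = a.size();
    for (std::size_t phase = 0; phase < n; ++phase) {
        // Even phases pair (0,1),(2,3),...; odd phases pair (1,2),(3,4),...
        for (std::size_t i = phase % 2; i + 1 < n; i += 2) {
            if (a[i] > a[i + 1])
                std::swap(a[i], a[i + 1]);
        }
    }
}
```

The inter-block problem in the question shows up exactly at the pair that straddles a block boundary: within one phase that pair still has to be compared, so the two blocks need a consistent view of both elements before the next phase starts.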

Also, I don’t want to launch just one block of threads, because, for the sake of exercise/argument, I would like to bubble-sort millions of elements.

I have searched the web and this forum.

I also looked for inter-block synchronisation in CUDA, but all approaches I have seen seemed to me not really safe, i.e., it seemed to me that they could fail under some circumstances.

Any ideas how to achieve bubble sort in CUDA using many blocks will be appreciated.

I think your question is basically irrelevant. Learn how to turn any multi-dimensional problem into a linear, one-dimensional problem and then you can solve any problem.

Therefore, instead of thinking in terms of threads, blocks, grids, etc., think in terms of threads 0 to N-1.

I wish NVIDIA would offer an alternative API to launch threads, one that automatically calculates an optimal distribution among their sub-chips, so that these kinds of questions go away.

Your question is basically the same as asking: how do I swap two sectors on a hard disk that reside on different platters?

Since LBA addressing, that has been an irrelevant question.

I don’t think I can answer your question directly without resorting to a merge sort after per-block bubble sorts (which indirectly violates the bubble-sort requirement).

So, let’s start from the very basics.

First of all, I’m assuming that you have a very good reason to use a bubble sort (and that you know pre-existing libraries such as Thrust already have very good sort implementations that will vastly outperform a bubble sort). But I will take that for granted.

With floats you could use atomic operations to exchange the data between blocks. If the datatype is unsupported by atomics, you can use the atomic hack for double-precision values described on one of the CUDA pages (sorry, I don’t have the source handy), though the hack versions are obviously much slower.
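For reference, the "atomic hack" builds a missing atomic operation out of a compare-and-swap retry loop. The same shape can be sketched in host C++ with std::atomic (this is an illustration of the pattern as an atomic fetch-min on double, with a name of my choosing, not CUDA code):

```cpp
#include <atomic>

// Emulating an unsupported atomic with a CAS retry loop: read the old
// value, compute the new one, and compare-exchange it in. If another
// thread raced us, the CAS fails, reloads `old`, and we retry.
double atomicMinDouble(std::atomic<double>& target, double value) {
    double old = target.load();
    // compare_exchange_weak updates `old` on failure, so each retry
    // re-evaluates the condition against the latest value.
    while (value < old && !target.compare_exchange_weak(old, value)) {
        // retry
    }
    return old;
}
```

The CUDA variant follows the same loop, typically using atomicCAS on the bit pattern of the value; the retry loop is exactly why these emulated atomics are much slower than native ones.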

Even then you’re going to have synchronization problems.

You’ve found a very good example of something that is not well suited to GPUs. That is why people have developed equivalent (and much more efficient) sorting algorithms for the GPU.

sBc-Random is correct on all points. thrust::sort() is especially awesome for primitive types.

If you are trying to learn CUDA, start with reductions/scans, then move on to less typical applications.
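As a starting point for that suggestion: the work-efficient (Blelloch) scan is a classic first CUDA exercise. Below is a sequential C++ sketch of its up-sweep/down-sweep structure (naming is mine; assumes the length is a power of two). Each inner loop runs over independent pairs, which is what a kernel would parallelise one step at a time:

```cpp
#include <cstddef>
#include <vector>

// Work-efficient (Blelloch) exclusive scan, written sequentially to show
// the structure a CUDA kernel would parallelise: every inner loop is one
// parallel step over independent index pairs. Assumes n is a power of two.
std::vector<int> exclusiveScan(std::vector<int> a) {
    const std::size_t n = a.size();
    // Up-sweep (reduce): build partial sums up a binary tree.
    for (std::size_t d = 1; d < n; d *= 2)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d)
            a[i] += a[i - d];
    a[n - 1] = 0;  // identity element for the exclusive scan
    // Down-sweep: push prefixes back down the tree.
    for (std::size_t d = n / 2; d >= 1; d /= 2)
        for (std::size_t i = 2 * d - 1; i < n; i += 2 * d) {
            int t = a[i - d];
            a[i - d] = a[i];
            a[i] += t;
        }
    return a;
}
```

For example, `exclusiveScan({1, 2, 3, 4})` yields `{0, 1, 3, 6}`. Unlike bubble sort, every step here maps cleanly onto blocks of threads, which is why scans are the recommended warm-up.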

Types of algorithms which map well to GPUs:

brute force

image processing

radix/merge sort

linear algebra routines

reductions/scans

CFD

Monte Carlo simulations

FFTs

BFS-based graph problems

Types of problems which map poorly to GPUs:

hash tables (can do but not worth it generally)

tree-based traversals (binary or most trees)

DFS-based graph algorithms

bubble sort or similar sort implementations

server-side applications