Hi all,

I want to implement a bucket sort on an 8800 GTX card. Let me lay out the details of the bucket sort as my application needs it.

There is an array a[1000000]; all 1,000,000 elements fall into M buckets according to my situation. What I want to do is just divide all the data into M buckets while preserving the original order within each bucket, i.e. the sort must be stable.

For example: a[10]={1,2,5,6,3,4,8,7,9,10};

result:

bucket[1]={1,5,3,7,9};

bucket[2]={2,6,4,8,10};

Any tips or advice would be much appreciated! Thanks.

This looks like a stage of a radix sort. The most straightforward way is to use a two-pass strategy: the first pass does the counting, and in the second pass you actually move the data.

The counting part is easy to run in parallel: you just divide the data into several smaller chunks and count the elements per bucket in each chunk. Then you can use a scan to compute the “start address” of each bucket. In the second pass, each thread examines its own chunk and writes each element to the correct address.

The tricky part is that the writes in the second pass are somewhat random and can be very slow. You will probably have to use shared memory to buffer the writes and make them less random.

You are right. This is also what I thought of first, but I think it’s too complex.

So I should launch two kernels: one for counting, another for moving the data. I have read some articles on GPU bucket sorting, but they don’t fit my case! :wacko:

Can you give me more details about how to use shared memory to buffer the writes and make them less random?

Thanks for your reply. :rolleyes:

The basic idea is this: since each thread’s writes are contiguous with respect to each bucket, in theory you can allocate a small amount of shared memory per thread for its writes, and write to global memory only when the buffer is full. This way, all writes to global memory are contiguous and it’s easier to make them coalesced. Unfortunately, there are some practical difficulties. The first is that shared memory is quite small, so if you have a lot of threads, or a lot of buckets, or (worse) both, you probably don’t have enough space for the buffers. The second is related to how CUDA coalesces writes: consecutive words have to be written by different threads, which means every thread may have to check every buffer to decide whether it is full and needs to be written to global memory. Even so, it’s probably still faster than completely random memory access.

Agreed! But it’s much more complex.

Thank you for your explanation.

I wonder whether there are any better ideas? If the device supported pointers and could allocate memory dynamically, this would no longer be a problem.

You will find, when you start writing it, that even the simple counting kernel discussed so far is extremely challenging to implement efficiently enough to beat the host. I’ve been fiddling with a binning method (a bucket sort for particles in subdivided space) for two years now. Every method I came up with was slower than doing a device->host copy, binning on the CPU, a host->device copy, and continuing.

Until now, that is. I finally came up with a method that is equal in speed to the host on a single GPU of a 9800 GX2 and ~3-4x faster on a Tesla S1070 (it is so much faster compared to G92 because of the relaxed coalescing requirements). My method works in one pass, but you won’t like it because it means the sort is not stable.

To apply this method and get a stable bucket sort over the whole array, you would need to take what I do at the block level and do it globally (i.e. use radix sort, then a kernel to determine the 0/1 array, then scan the whole array, then stream compaction or some other method to put the particles into buckets).

Here is what I do:

```
launch 1 thread per element
determine bucket of each element
in each block individually:
bitonic sort elements using the bucket as a key
in each block individually, make a block_size-length array of 0's and 1's, with 1's at the points where the bucket key jumps up
scan this array to get destination indices
subtract destination indices to get the number of elements in each bucket local to this block
atomicInc the global memory location listing the size of the bucket (only 1 atomicInc per unique bucket)
particle id's are staged in shared memory: write them out at the start value returned from atomicInc and up (makes for decent coalescing on G200)
done!
```

Now, I should point out that this method is only faster than the host for somewhat ordered data where there are a lot of elements with the same bucket in the same block to cut down on the number of atomicIncs made. For completely random input data, it is about 4x **slower** than the host (on 9800 GX2).

It will be big trouble for me if the bucket sort on the GPU runs slower than on the CPU.

I will experiment with the simple two-pass method.

Thanks M42 for sharing this method. I will take the time to understand it properly.

There are two papers about bucket sort. I don’t know if they cover your case; maybe they can offer an approach or an idea.

Enclosed please find the papers.

Interesting post, Mister Anderson.

Note that this is basically what the particles sample in the SDK does, except it does a global radix sort on the bucket id and then compares each element in the sorted list with its neighbour to find the start of each bucket.

I do like the idea of doing a bitonic sort per block, though (to avoid the atomic writes).

In my experience it is definitely faster doing the sort on the GPU than doing it on the CPU and transferring the data back and forth.

Thanks.

Yep, that is where I got the idea from (that and SPWorley reminding me at least several times that sorting is how pretty much everyone else has tackled binning). I guess I forgot to give credit where it is due!

My experiences are only in a very narrow portion of parameter space. I’ve got a particular application in mind: in the range of ~50,000 particles, ~3,000 bins and a particular typical arrangement of these particles (they are sorted so particles near each other in space are near each other in memory). It could very well be that over a wider parameter space global sorting wins overall.

Result: using the simplest bucket sort method on 1,000,000 elements runs roughly twice as fast as the CPU version in my case.

Hello Austin, could you post your code for the bucket sort? I’m also trying to implement it.

Thanks.

The really easy way to do this would be to use thrust. See http://code.google.com/p/thrust/wiki/Tutorial for a general introduction. These slides: http://thrust.googlecode.com/files/IntroductionToThrust.pdf have, starting on page 21, pretty much the exact algorithm you want (except that instead of mapping a float2 to a bucket id, you would map from int to int).

Hello zarnick, I just used radixsort.cu, radixsort.cuh, and radixsort_kernel.cu, using the key in the KeyValuePair struct as the bucket id. Good luck!

It’s been a long time since this post, so maybe there are better methods now. Either way, good luck!

For a very simple bucket sort, I’ve made this program. It uses bubble sort to sort each bucket, and all numbers have to be greater than or equal to 0 and less than SIZE. What I’m failing to understand is why I can’t use a SIZE greater than 1024: if I double it, for instance, I get “cudaMalloc: out of memory”. However, I’ve made a bitonic sort that allocates memory exactly the same way (but it’s not a structure) and can allocate up to 2^25, which is 33,554,432 random numbers. Any thought on what could be wrong?

Here’s the full listing of the code:

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <cuda.h>

#define SIZE 1024

typedef struct bucket
{
    int amount;
    int numbers[SIZE]; /* the forum ate the brackets; a plain "int numbers;" would not compile */
} bucket;

__global__ void sortBucket(struct bucket *buckets, int maxBuckets)
{
    int idx = blockDim.x*blockIdx.x + threadIdx.x;
    int tmp;
    int swapped = 0;
    if(idx < maxBuckets)
    {
        do{
            swapped = 0;
            for(int i = 0; i < buckets[idx].amount-1; i++)
            {
                if(buckets[idx].numbers[i] > buckets[idx].numbers[i+1])
                {
                    tmp = buckets[idx].numbers[i+1];
                    buckets[idx].numbers[i+1]=buckets[idx].numbers[i];
                    buckets[idx].numbers[i]=tmp;
                    swapped = 1;
                }
            }
        }while(swapped);
    }
}

int main(int argc, char *argv[])
{
    //Host memory
    int *numbers;
    struct bucket *h_buckets;
    //Device memory
    struct bucket *d_buckets;
    size_t memSize = sizeof(int)*SIZE;
    dim3 blockSize(128);
    dim3 qtdBlocks(512);
    int bucketAmount = blockSize.x*qtdBlocks.x;
    int t = 0,k = 0;
    //Error handling and kernel time
    cudaError_t d_error = cudaSuccess;
    cudaEvent_t start,stop;
    float GPU_time;
    //Creating events (for timing)
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    //Allocate host memory
    if((numbers=(int*)malloc(memSize))==NULL)
    {
        perror("malloc");
        return 1;
    }
    memset(numbers,0,memSize);
    if((h_buckets=(struct bucket*)malloc(bucketAmount*sizeof(struct bucket)))==NULL)
    {
        perror("malloc");
        return 1;
    }
    memset(h_buckets,'\0',sizeof(struct bucket)*bucketAmount);
    //Allocate device memory
    d_error = cudaMalloc((void**)&d_buckets, sizeof(struct bucket)*bucketAmount);
    if(d_error != cudaSuccess)
    {
        fprintf(stderr,"cudaMalloc: %s\n",cudaGetErrorString(d_error));
        return 1;
    }
    //Create random numbers
    srand(time(NULL));
    for(int i = 0; i < SIZE; i++)
        numbers[i]=rand()%SIZE;
    //Put them into the buckets
    t=k=0;
    for(k=0;k<bucketAmount;k++)
    {
        h_buckets[k].amount=0;
        for(int i = 0; i < SIZE;i++)
        {
            if(numbers[i] >= k*blockSize.x && numbers[i] < (k+1)*blockSize.x)
            {
                h_buckets[k].amount++;
                h_buckets[k].numbers[t++]=numbers[i];
                numbers[i]=-1;
            }
        }
        t=0;
    }
    //Copy to device memory and call the kernel
    d_error = cudaMemcpy(d_buckets, h_buckets, sizeof(struct bucket)*bucketAmount,cudaMemcpyHostToDevice);
    if(d_error != cudaSuccess)
    {
        fprintf(stderr,"cudaMemcpy: %s\n",cudaGetErrorString(d_error));
        return 1;
    }
    cudaEventRecord(start,0);
    sortBucket<<<qtdBlocks, blockSize>>>(d_buckets,bucketAmount);
    cudaThreadSynchronize();
    d_error = cudaGetLastError();
    if(d_error != cudaSuccess)
    {
        fprintf(stderr,"sortBucket<<<%d, %d>>>: %s\n",qtdBlocks.x, blockSize.x, cudaGetErrorString(d_error));
        return 1;
    }
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    //Gather GPU time
    cudaEventElapsedTime( &GPU_time, start, stop);
    //Copy from device to host
    d_error = cudaMemcpy(h_buckets, d_buckets, sizeof(struct bucket)*bucketAmount,cudaMemcpyDeviceToHost);
    if(d_error != cudaSuccess)
    {
        fprintf(stderr,"cudaMemcpy: %s\n",cudaGetErrorString(d_error));
        return 1;
    }
    //Re-create numbers (now ordered)
    k=0;
    for(int i=0;i<bucketAmount;i++)
    {
        for(int j=0; j < h_buckets[i].amount; j++)
            numbers[k++]=h_buckets[i].numbers[j];
    }
    //Proof
    for(int i = 0; i < SIZE;i++)
        printf("%d ",numbers[i]);
    printf("\n");
    printf("GPU Time:%f ms\n",GPU_time);
    //Cleanup
    free(numbers);free(h_buckets);
    cudaFree(d_buckets);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Uhm… I’m trying to implement a simple tessellator, and I’ve always struggled with the problem of creating new triangles, because I was not able to assign each triangle a starting index in the micropolygon list according to the estimated mp count…

but this scan algorithm looks like exactly what I need.

Do you have any sources on how to implement a fast one?

What do you mean by the scan algorithm? The part that puts each number into its respective bucket?

This:

So in my case, each thread/triangle behaves like a bucket, computing the estimated number of primitives to be allocated; it then needs to save its data starting at its own index range.

To do this, obviously each triangle has to know how many indices are reserved by the previous triangles; implementing this with naive atomic operations leads to total serialization of the kernel… not very good.

Anyway, I don’t need a complete sort, because after the indices have been assigned, each primitive thread can find its triangle and its raster position from the index alone.

What I did is serial code, but it’s easily adapted to parallel. The main idea is to actually know what should go in each bucket; you should make this automatic (of course) based on the amount of things (in my case numbers, in your case triangles) you want to divide and scan, and then it’s just a matter of actually doing it. I remember I did something like this a very long time ago in MPI. The basic idea (if I’m not mistaken) was:

- Create a global placeholder for all the buckets
- Create n threads (n is arbitrary)
- Each thread scans part of the whole number vector and places its elements in the correct bucket (watching out for memory overwrites)
- When all threads are finished, all your numbers are correctly placed in their buckets
- Then you sort.

Hope this helps. I haven’t had the time to do this in CUDA, but it’s a good exercise.

==EDIT==

A little googling brought me to this PDF from NVIDIA: Parallel Prefix Sum (Scan) with CUDA
