Hi all,

I want to implement a bucket sort on an 8800 GTX card. Let me lay out the details of the bucket sort as my application needs it.

There is an array a[1000000]; all 1,000,000 elements fall into M buckets according to my situation. What I want to do is just divide all the data into M buckets while preserving the original order within each bucket, i.e. the sort must be stable.

For example: a[10]={1,2,5,6,3,4,8,7,9,10};

result:

bucket[1]={1,5,3,7,9};

bucket[2]={2,6,4,8,10};

Any tips or advice would be much appreciated! Thanks.

This looks like a stage of a radix sort. The most straightforward way is to use a two-pass strategy: the first pass does the counting, and in the second pass you actually move the data.

The counting part is easy to run in parallel: you just divide the data into several smaller chunks and count the elements per bucket in each chunk. Then you can use a scan to compute the “start address” of each bucket. In the second pass, each thread examines its own chunk and writes each element to the correct address.

The tricky part is that the writes in the second pass are somewhat random and can be very slow. You will probably have to use shared memory to buffer the writes and make them less random.

You are right. This is also what I thought of first, but I think it’s too complex.

So I should launch two kernels: one for counting, another for moving the data. I have read some articles on GPU bucket sorting, but they don’t fit my case! :wacko:

Can you give me more details about how to use shared memory to buffer the writes and make them less random?

Thanks for your reply. :rolleyes:

The basic idea is this: since each thread’s writes are contiguous with respect to each bucket, in theory you can allocate a small amount of shared memory per thread for its writes, and write to global memory only when the buffer is full. This way, all writes to global memory are contiguous and it’s easier to make them coalesced. Unfortunately, there are some practical difficulties. The first is that shared memory is quite small, so if you have a lot of threads, or a lot of buckets, or (worse) both, you probably don’t have enough space for the buffers. The second is related to how CUDA coalesces writes: consecutive words have to be written by different threads, which means every thread may have to check every buffer to decide whether it is full and needs to be written to global memory. Even so, it’s probably still faster than completely random memory access.

Agreed! But it’s much more complex.

Thank you for your explanation.

I wonder whether there are any better ideas? If the device supported pointers and could allocate memory dynamically, this would no longer be a problem.

You will find, when you start writing it, that even the simple counting kernel discussed so far is extremely challenging to implement efficiently enough to beat the host. I’ve been fiddling with a binning method (a bucket sort for particles in subdivided space) for two years now. Every method I came up with was slower than doing a device->host copy, binning on the CPU, a host->device copy, and continuing.

Until now, that is. I finally came up with a method that is equal in speed to the host on a single GPU of a 9800 GX2 and ~3-4x faster on a Tesla S1070 (it is so much faster compared to G92 because of the relaxed coalescing requirements). My method works in one pass, but you won’t like it because it means the sort is not stable.

To apply this method and get a stable bucket sort over the whole array, you would need to take what I do at the block level and do it globally (i.e. use radix sort, then a kernel to determine the 0/1 array, then scan the whole array, then stream compaction or some other method to put the particles into buckets).

Here is what I do:

```
launch 1 thread per element
determine bucket of each element
in each block individually:
bitonic sort elements using the bucket as a key
in each block individually, make a block_size-length array of 0's and 1's, with 1's at the points where the bucket key jumps up
scan this array to get destination indices
subtract destination indices to get the number of elements in each bucket local to this block
atomicInc the global memory location listing the size of the bucket (only 1 atomicInc per unique bucket)
particle id's are staged in shared memory: write them out at the start value returned from atomicInc and up (makes for decent coalescing on G200)
done!
```

Now, I should point out that this method is only faster than the host for somewhat ordered data where there are a lot of elements with the same bucket in the same block to cut down on the number of atomicIncs made. For completely random input data, it is about 4x **slower** than the host (on 9800 GX2).

It will be big trouble for me if the bucket sort on the GPU runs slower than on the CPU.

I will experiment with the simple two-pass method.

Thanks M42 for sharing this method. I will take the time to understand it properly.

There are two papers about bucket sort. I don’t know if they cover your case; maybe they can offer an approach or an idea.

Enclosed please find the papers.

Interesting post, Mister Anderson.

Note that this is basically what the particles sample in the SDK does, except it does a global radix sort on the bucket id and then compares each element in the sorted list with its neighbour to find the start of each bucket.

I do like the idea of doing a bitonic sort per block, though (to avoid the atomic writes).

In my experience it is definitely faster doing the sort on the GPU than doing it on the CPU and transferring the data back and forth.

Thanks.

Yep, that is where I got the idea from (that and SPWorley reminding me at least several times that sorting is how pretty much everyone else has tackled binning). I guess I forgot to give credit where it is due!

My experiences are only in a very narrow portion of parameter space. I’ve got a particular application in mind: in the range of ~50,000 particles, ~3,000 bins and a particular typical arrangement of these particles (they are sorted so particles near each other in space are near each other in memory). It could very well be that over a wider parameter space global sorting wins overall.

Result: using the simplest bucket sort method on 1,000,000 elements runs roughly twice as fast as the CPU version in my case.

Hello Austin, could you post your code for the bucket sort? I’m also trying to implement it.

Thanks.

The really easy way to do this would be to use thrust. See http://code.google.com/p/thrust/wiki/Tutorial for a general introduction. These slides: http://thrust.googlecode.com/files/IntroductionToThrust.pdf have, starting on page 21, pretty much the exact algorithm you want (except that instead of mapping a float2 to a bucket id, you would map from int to int).

Hello zarnick, I just used radixsort.cu, radixsort.cuh, and radixsort_kernel.cu, using the key in the KeyValuePair struct as the bucket id. Good luck!

It’s been a long time since this post, so maybe there are better methods now. Either way, good luck!

For a very simple bucket sort, I’ve made this program. It uses bubble sort to sort each bucket, and all numbers have to be greater than or equal to 0 and less than SIZE. What I’m failing to understand is why I can’t use a SIZE greater than 1024: if I double it, for instance, I get “cudaMalloc: out of memory”. However, I’ve made a bitonic sort that allocates memory exactly the same way (but it’s not a structure) and can allocate up to 2^25, which is 33,554,432 random numbers. Any thought on what could be wrong?

Here’s the full listing of the code:

```
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <time.h>
#include <cuda.h>

#define SIZE 1024

typedef struct bucket
{
    int amount;
    int numbers[SIZE]; /* the forum ate the brackets; a plain "int numbers;" would not compile */
} bucket;

__global__ void sortBucket(struct bucket *buckets, int maxBuckets)
{
    int idx = blockDim.x*blockIdx.x + threadIdx.x;
    int tmp;
    int swapped = 0;
    if(idx < maxBuckets)
    {
        do{
            swapped = 0;
            for(int i = 0; i < buckets[idx].amount-1; i++)
            {
                if(buckets[idx].numbers[i] > buckets[idx].numbers[i+1])
                {
                    tmp = buckets[idx].numbers[i+1];
                    buckets[idx].numbers[i+1]=buckets[idx].numbers[i];
                    buckets[idx].numbers[i]=tmp;
                    swapped = 1;
                }
            }
        }while(swapped);
    }
}

int main(int argc, char *argv[])
{
    //Host memory
    int *numbers;
    struct bucket *h_buckets;
    //Device memory
    struct bucket *d_buckets;
    size_t memSize = sizeof(int)*SIZE;
    dim3 blockSize(128);
    dim3 qtdBlocks(512);
    int bucketAmount = blockSize.x*qtdBlocks.x;
    int t = 0,k = 0;
    //Error handling and kernel time
    cudaError_t d_error = cudaSuccess;
    cudaEvent_t start,stop;
    float GPU_time;
    //Creating events (for timing)
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    //Allocate host memory
    if((numbers=(int*)malloc(memSize))==NULL)
    {
        perror("malloc");
        return 1;
    }
    memset(numbers,0,memSize);
    if((h_buckets=(struct bucket*)malloc(bucketAmount*sizeof(struct bucket)))==NULL)
    {
        perror("malloc");
        return 1;
    }
    memset(h_buckets,'\0',sizeof(struct bucket)*bucketAmount);
    //Allocate device memory
    d_error = cudaMalloc((void**)&d_buckets, sizeof(struct bucket)*bucketAmount);
    if(d_error != cudaSuccess)
    {
        fprintf(stderr,"cudaMalloc: %s\n",cudaGetErrorString(d_error));
        return 1;
    }
    //Create random numbers
    srand(time(NULL));
    for(int i = 0; i < SIZE; i++)
        numbers[i]=rand()%SIZE;
    //Put them into the buckets
    t=k=0;
    for(k=0;k<bucketAmount;k++)
    {
        h_buckets[k].amount=0;
        for(int i = 0; i < SIZE;i++)
        {
            if(numbers[i] >= k*blockSize.x && numbers[i] < (k+1)*blockSize.x)
            {
                h_buckets[k].amount++;
                h_buckets[k].numbers[t++]=numbers[i];
                numbers[i]=-1;
            }
        }
        t=0;
    }
    //Copy to device memory and call the kernel
    d_error = cudaMemcpy(d_buckets, h_buckets, sizeof(struct bucket)*bucketAmount,cudaMemcpyHostToDevice);
    if(d_error != cudaSuccess)
    {
        fprintf(stderr,"cudaMemcpy: %s\n",cudaGetErrorString(d_error));
        return 1;
    }
    cudaEventRecord(start,0);
    sortBucket<<<qtdBlocks, blockSize>>>(d_buckets,bucketAmount);
    cudaThreadSynchronize();
    d_error = cudaGetLastError();
    if(d_error != cudaSuccess)
    {
        fprintf(stderr,"sortBucket<<<%d, %d>>>: %s\n",qtdBlocks.x, blockSize.x, cudaGetErrorString(d_error));
        return 1;
    }
    cudaEventRecord(stop,0);
    cudaEventSynchronize(stop);
    //Gather GPU time
    cudaEventElapsedTime( &GPU_time, start, stop);
    //Copy from device to host
    d_error = cudaMemcpy(h_buckets, d_buckets, sizeof(struct bucket)*bucketAmount,cudaMemcpyDeviceToHost);
    if(d_error != cudaSuccess)
    {
        fprintf(stderr,"cudaMemcpy: %s\n",cudaGetErrorString(d_error));
        return 1;
    }
    //Re-create numbers (now ordered)
    k=0;
    for(int i=0;i<bucketAmount;i++)
    {
        for(int j=0; j < h_buckets[i].amount; j++)
            numbers[k++]=h_buckets[i].numbers[j];
    }
    //Proof
    for(int i = 0; i < SIZE;i++)
        printf("%d ",numbers[i]);
    printf("\n");
    printf("GPU Time:%f ms\n",GPU_time);
    //Cleanup
    free(numbers);free(h_buckets);
    cudaFree(d_buckets);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Uhm… I’m trying to implement a simple tessellator, and I’ve always struggled with the problem of creating new triangles, because I was not able to assign each triangle a starting index in the micropolygon list according to the estimated mp count…

but this scan algorithm looks like exactly what I need.

Do you have any sources on how to implement a fast one?

What do you mean by the scan algorithm? The part that puts each number into its respective bucket?

This:

So in my case, each thread/triangle behaves like a bucket, computing the estimated number of primitives to be allocated; it then needs to save its data starting at its own index range.

To do this, obviously each triangle has to know how many indices are reserved by the previous triangles; implementing this with naive atomic operations leads to total serialization of the kernel… not very good.

Anyway, I don’t need a complete sort, because after the indices have been assigned, each primitive thread can find its triangle and its raster position from the index alone.

What I did is serial code, but it’s easily adapted to parallel. The main idea is to actually know what should go in each bucket; you should make this automatic (of course) based on the amount of things (in my case numbers, in your case triangles) you want to divide and scan, and then it’s just a matter of actually doing it. I remember I did something like this a very long time ago in MPI. The basic idea (if I’m not mistaken) was:

- Create a global placeholder for all the buckets
- Create n threads (n is arbitrary)
- Each thread scans part of the whole number vector and places its elements in the correct bucket (watching out for memory overwrites)
- When all threads are finished, all your numbers are correctly placed in their buckets
- Then you sort.

Hope this helps. I haven’t had the time to do this in CUDA, but it’s a good exercise.

==EDIT==

A little googling brought me to this PDF from NVIDIA: Parallel Prefix Sum (Scan) with CUDA
