Newbie to CUDA - Help wanted Suggestions and help with implementing a parallel merge sort

shay_m · December 23, 2010, 12:09pm

Hi All,

I’m a newbie to CUDA, and I wrote my first application - which is far from being perfect.

As an exercise, I wrote a very simplistic parallel merge sort algorithm.

Here’s the most interesting bits of code:

These are the two kernels used:

// Merges array A (of size szA) & array B (of size szB) into array C (of size szA + szB)

__device__ void Merge(int A[], int B[], int C[], int szA, int szB)

{

	int a = 0;

	int b = 0;

	int c = 0;

	while (a < szA && b < szB)

	{

		C[c++] = (A[a] < B[b]) ? A[a++] : B[b++];

	}

	while (a < szA) { C[c++] = A[a++]; }

	while (b < szB) { C[c++] = B[b++]; }	

}

// Sorts array a (of size d) and saves the result into array b (of size d)

__global__ void Sort(int a[], int b[], int d)

{

	int i = blockDim.x * blockIdx.x + threadIdx.x;		

	int s = 2*d*i;

	Merge(a+s, a+s+d, b+s, d, d);

}

Here’s the host code which uses these kernels:

for (int d = 1; d <= N/2; d *= 2)

{

  int numberOfBlocks = N / (2*d);

  Sort<<<numberOfBlocks,1>>>(dev_a, dev_c, d);

  // Swap contents of dev_c & dev_a

  dev_b = dev_a;

  dev_a = dev_c;

  dev_c = dev_b;

}

As you can see, this implementation assumes that the array size is a power of 2.

I have some problems, which I would very much like for anyone to comment on:

This algorithm works only if array size is as large as 2^15.

The algorithm fails if the array size is 2^16 or larger.

Can anyone have any idea why this happens?
Is there anything I can do to make this work faster?

I was thinking of using the shared memory, at least for the first few iterations,

but I couldn’t find any way I can use it - since as far as I understand, shared memory is shared between threads on the same block,

while I’m using only a single thread per block.

Can anyone think of a better implementation?
I did a comparison between sequential merge sort and the parallel merge sort.

I got pretty disappointing results -

The best I could get is for the parallel merge sort to run 9 times(!) slower than the sequential merge sort.

I was thinking that perhaps the bottleneck might be copying memory between host & device (and vice-versa).

Do these results make sense? or should I expect better performance?

If anyone is interested, I can post the entire code.

Thanks,

Shay M.

sergeyn · December 23, 2010, 12:28pm

I’d really recommend you carefully read through the programming guide and best practices guide, then investigate how samples work.

max number of blocks allowed in a grid is 2^16
you use only 1 thread out of 32 threads in a warp, other 31 threads just do not do anything
really - read all the manuals first and study samples