Newbie to CUDA - Help wanted Suggestions and help with implementing a parallel merge sort

Hi All,

I’m a newbie to CUDA, and I wrote my first application - which is far from being perfect.

As an exercise, I wrote a very simplistic parallel merge sort algorithm.

Here’s the most interesting bits of code:

These are the two kernels used:

// Merges array A (of size szA) & array B (of size szB) into array C (of size szA + szB)

__device__ void Merge(int A[], int B[], int C[], int szA, int szB)


	int a = 0;

	int b = 0;

	int c = 0;

	while (a < szA && b < szB)


		C[c++] = (A[a] < B[b]) ? A[a++] : B[b++];


	while (a < szA) { C[c++] = A[a++]; }

	while (b < szB) { C[c++] = B[b++]; }	


// Sorts array a (of size d) and saves the result into array b (of size d)

__global__ void Sort(int a[], int b[], int d)


	int i = blockDim.x * blockIdx.x + threadIdx.x;		

	int s = 2*d*i;

	Merge(a+s, a+s+d, b+s, d, d);


Here’s the host code which uses these kernels:

for (int d = 1; d <= N/2; d *= 2)


  int numberOfBlocks = N / (2*d);

  Sort<<<numberOfBlocks,1>>>(dev_a, dev_c, d);

  // Swap contents of dev_c & dev_a

  dev_b = dev_a;

  dev_a = dev_c;

  dev_c = dev_b;


As you can see, this implementation assumes that the array size is a power of 2.

I have some problems, which I would very much like for anyone to comment on:

  1. This algorithm works only if array size is as large as 2^15.

    The algorithm fails if the array size is 2^16 or larger.

    Can anyone have any idea why this happens?

  2. Is there anything I can do to make this work faster?

    I was thinking of using the shared memory, at least for the first few iterations,

    but I couldn’t find any way I can use it - since as far as I understand, shared memory is shared between threads on the same block,

    while I’m using only a single thread per block.

    Can anyone think of a better implementation?

  3. I did a comparison between sequential merge sort and the parallel merge sort.

    I got pretty disappointing results -

    The best I could get is for the parallel merge sort to run 9 times(!) slower than the sequential merge sort.

    I was thinking that perhaps the bottleneck might be copying memory between host & device (and vice-versa).

    Do these results make sense? or should I expect better performance?

If anyone is interested, I can post the entire code.


Shay M.

I’d really recommend you carefully read through the programming guide and best practices guide, then investigate how samples work.

  1. max number of blocks allowed in a grid is 2^16
  2. you use only 1 thread out of 32 threads in a warp, other 31 threads just do not do anything
  3. really - read all the manuals first and study samples