Hi All,
I’m new to CUDA, and I wrote my first application, which is far from perfect.
As an exercise, I wrote a very simplistic parallel merge sort algorithm.
Here are the most interesting bits of the code:
These are the two kernels used:
// Merges array A (of size szA) and array B (of size szB) into array C (of size szA + szB)
__device__ void Merge(int A[], int B[], int C[], int szA, int szB)
{
    int a = 0;
    int b = 0;
    int c = 0;
    while (a < szA && b < szB)
    {
        C[c++] = (A[a] < B[b]) ? A[a++] : B[b++];
    }
    while (a < szA) { C[c++] = A[a++]; }
    while (b < szB) { C[c++] = B[b++]; }
}
// One merge pass: each thread merges a pair of adjacent sorted runs of
// length d from array a into a single sorted run of length 2*d in array b
__global__ void Sort(int a[], int b[], int d)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int s = 2*d*i; // start of this thread's pair of runs
    Merge(a+s, a+s+d, b+s, d, d);
}
Here’s the host code that uses these kernels:
for (int d = 1; d <= N/2; d *= 2)
{
    int numberOfBlocks = N / (2*d);
    Sort<<<numberOfBlocks, 1>>>(dev_a, dev_c, d);
    // Swap the dev_a & dev_c pointers, so the next pass reads the freshly merged runs
    dev_b = dev_a;
    dev_a = dev_c;
    dev_c = dev_b;
}
// After the loop, the sorted result is in dev_a
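Since kernel launches are asynchronous, maybe I’m silently missing a launch failure somewhere. A minimal way to check each pass would be something like this (just a sketch; CHECK is an illustrative macro name, the calls are the standard runtime API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative macro: abort on the first CUDA error.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);    \
            exit(1);                                                 \
        }                                                            \
    } while (0)

// Inside the loop, after each launch:
Sort<<<numberOfBlocks, 1>>>(dev_a, dev_c, d);
CHECK(cudaGetLastError());        // catches invalid launch configurations
CHECK(cudaDeviceSynchronize());   // catches errors during kernel execution
```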
As you can see, this implementation assumes that the array size is a power of 2.
I have a few questions I would very much appreciate comments on:
-
The algorithm works for array sizes up to 2^15, but fails once the size is 2^16 or larger.
Does anyone have any idea why this happens?
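One thing I thought of comparing is each pass’s grid size against the device’s limit, along these lines (a sketch using cudaGetDeviceProperties):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: print each pass's grid size next to the device's limit.
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
printf("max grid x-dimension: %d\n", prop.maxGridSize[0]);

for (int d = 1; d <= N / 2; d *= 2)
{
    int numberOfBlocks = N / (2 * d);
    printf("d = %6d -> %6d blocks of 1 thread\n", d, numberOfBlocks);
}
```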
-
Is there anything I can do to make this run faster?
I was thinking of using shared memory, at least for the first few iterations,
but I couldn’t find a way to use it: as far as I understand, shared memory is shared between the threads of a single block,
while I’m launching only a single thread per block.
Can anyone think of a better implementation?
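To make the question concrete, this is roughly what I had in mind: many threads per block, with each block staging its slice of the array in shared memory for the small-d passes. A rough sketch (THREADS_PER_BLOCK and SortShared are just illustrative names; it assumes the block’s slice fits in shared memory, e.g. d <= 32 with 128 threads and 48 KB):

```cuda
#define THREADS_PER_BLOCK 128   // illustrative choice

// Sketch: the same merge pass as Sort, but with many threads per block.
// Each block stages its 2*d*blockDim.x-element slice in shared memory,
// so each element is read from global memory only once per pass.
__global__ void SortShared(int a[], int b[], int d)
{
    extern __shared__ int tile[];               // 2*d*blockDim.x ints
    int elemsPerBlock = 2 * d * blockDim.x;
    int base = blockIdx.x * elemsPerBlock;      // this block's slice of the array

    // Cooperatively load the block's slice into shared memory.
    for (int j = threadIdx.x; j < elemsPerBlock; j += blockDim.x)
        tile[j] = a[base + j];
    __syncthreads();

    // Each thread merges one adjacent pair of d-length runs from shared
    // memory, writing its (non-overlapping) output run back to global memory.
    int s = 2 * d * threadIdx.x;
    Merge(tile + s, tile + s + d, b + base + s, d, d);
}

// Launch (host side), with the shared-memory size as the third parameter:
// SortShared<<<N / (2*d*THREADS_PER_BLOCK), THREADS_PER_BLOCK,
//              2*d*THREADS_PER_BLOCK*sizeof(int)>>>(dev_a, dev_c, d);
```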
-
I compared the sequential merge sort with the parallel one and got pretty disappointing results:
at best, the parallel merge sort ran 9 times(!) slower than the sequential version.
I suspect the bottleneck may be copying memory between host and device (and back).
Do these results make sense, or should I expect better performance?
If anyone is interested, I can post the entire code.
Thanks,
Shay M.