Bitonic-Sorting Networks CUDA Sample help.

Hello, I am trying to understand how the Sorting Networks sample works, but the comments in the code are minimal and do not help me much. Could anyone here upload the same code with comments, or post a link to an actual description of the code? (I know the theory of bitonic sort is described here: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/bitonic/bitonicen.htm but it does not help me understand the code.)

I am talking about this sample code:
http://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-sorting-networks

I am supposed to modify this code to sort an array of float2 structs, but I am really confused. Any help would be appreciated.
Thanks in advance.

I do not have a commented version of that algorithm.

Is it mandatory for you to write your own implementation of a sorting algorithm? An alternative would be to use thrust. The following thread on StackOverflow could be of interest to you:

http://stackoverflow.com/questions/7062058/bitonic-sorting-network-vs-thrustsort-by-key

Yes, I know that Thrust is optimized, but my work is not focused on sorting speed; that is not what I want to examine. I just want to customize the bitonic sorting network, along with the odd-even merge sort that is also implemented in the Sorting Networks sample.

I have modified the bitonic sort kernels to accept one more variable, so that I can sort by x or by y in the float2 array of structs that I need for a university project. The problem is that I get strange compile errors and I have no idea what the cause might be. I have modified the Comparator() function to take that extra variable as well, and I have added the variable to the function declarations. Here is part of the code:

__global__ void bitonicSortShared(
    float2 *d_P_out,
    float2 *d_P_in,
    uint arrayLength,
    uint dir,
	uint xy )
{
    //Shared memory storage for one or more short vectors
    __shared__ float2 s_key[SHARED_SIZE_LIMIT];

    //Offset to the beginning of subbatch and load data
    d_P_in  += blockIdx.x * SHARED_SIZE_LIMIT + threadIdx.x;
    d_P_out += blockIdx.x * SHARED_SIZE_LIMIT + threadIdx.x;
    s_key[threadIdx.x +                       0] = d_P_in[                      0];
    s_key[threadIdx.x + (SHARED_SIZE_LIMIT / 2)] = d_P_in[(SHARED_SIZE_LIMIT / 2)];

    for (uint size = 2; size < arrayLength; size <<= 1){
        //Bitonic merge
        uint ddd = dir ^ ((threadIdx.x & (size / 2)) != 0);

        for (uint stride = size / 2; stride > 0; stride >>= 1) {
            __syncthreads();
            uint pos = 2 * threadIdx.x - (threadIdx.x & (stride - 1));
            Comparator(	s_key[pos +  0], s_key[pos + stride], ddd, xy );
        }
    }

    //ddd == dir for the last bitonic merge step
    {
        for (uint stride = arrayLength / 2; stride > 0; stride >>= 1) {
            __syncthreads();
            uint pos = 2 * threadIdx.x - (threadIdx.x & (stride - 1));
            Comparator(	s_key[pos +  0], s_key[pos + stride], dir, xy );
        }
    }
	__syncthreads();
    d_P_out[                      0] = s_key[threadIdx.x +                       0];
    d_P_out[(SHARED_SIZE_LIMIT / 2)] = s_key[threadIdx.x + (SHARED_SIZE_LIMIT / 2)];
}
__device__ inline void Comparator(
    float2 &keyA,
    float2 &keyB,
    uint dir,
	uint xy )
{
    float2 t;

	if (xy == 0){
		if ((keyA.x > keyB.x) == dir) {
			t = keyA;
			keyA = keyB;
			keyB = t;
		}
	else{
		if ((keyA.y > keyB.y) == dir) {
			t = keyA;
			keyA = keyB;
			keyB = t;
		}
	}
}

At first I tried making xy a bool, and I suspected that the boolean might be the issue, but that does not seem to be the case. Some of the errors I get are:

Error	2	error : expected a ";"	
Error	4	error : explicit type is missing ("int" assumed)...	
Error	5	error : cannot overload functions distinguished by return type alone	...
Error	6	error : the size of an array must be greater than zero	...
Error	7	error : identifier "s_key" is undefined	...
Error	8	error : this declaration has no storage class or type specifier	...
Error	9	error : variable "d_P_out" has already been defined...

and a lot more… Any ideas what might be the problem here?
Thanks in advance!
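
One thing that stands out in the posted code: the braces in Comparator() do not balance. The "if (xy == 0){" block is never closed before "else{", so the else pairs with the inner if and the function body is still open at the end of the listing. A compiler will usually answer that with a cascade of unrelated-looking errors, much like the ones listed above. Separately, if Comparator() really is defined after bitonicSortShared() in the file, it would also need a declaration before the kernel. Below is a minimal sketch of a balanced comparator, not a tested drop-in replacement. The HD macro and the float2 stand-in are my own additions so that the snippet also builds with a plain C++ compiler, and treating any nonzero dir as "swap when A > B" is an assumption about how dir is passed.

```cpp
// Sketch only: HD and the float2 stand-in are illustrative additions,
// not part of the NVIDIA sample.
#ifdef __CUDACC__
#define HD __host__ __device__
#else
#define HD
struct float2 { float x, y; };  // stand-in for CUDA's float2 outside nvcc
#endif

// Swap keyA and keyB when (keyA > keyB) matches the requested direction.
// xy selects the sort component: 0 compares .x, anything else compares .y.
HD inline void Comparator(
    float2 &keyA,
    float2 &keyB,
    unsigned int dir,
    unsigned int xy)
{
    // Pick the component once, so there is a single swap path
    // and no nested if/else braces to mismatch.
    const float a = (xy == 0) ? keyA.x : keyA.y;
    const float b = (xy == 0) ? keyB.x : keyB.y;

    // Assumption: dir is 0 or 1, as in the kernels above;
    // any nonzero value is treated as 1 here.
    if ((a > b) == (dir != 0)) {
        float2 t = keyA;
        keyA = keyB;
        keyB = t;
    }
}
```

With this shape the kernel call sites stay the same, for example Comparator(s_key[pos + 0], s_key[pos + stride], ddd, xy), as long as the definition (or a declaration) appears before the kernel that uses it.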