Bitonic sort is slow on a user-defined struct

Hi, my bitonic sort runs slower than an Intel P4 3.2 GHz when sorting this struct:
{
    int a;
    int b[2];
};
Any suggestions to speed it up? Thanks!

Maybe try aligning the struct to 16 bytes, or try another sorting algorithm, such as the radix sort or merge sort in CUDPP.
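
To illustrate the alignment suggestion, here is a rough sketch of what padding the struct to 16 bytes could look like. The names Item, ItemAligned, pad and is_aligned16 are made up for this example, and it assumes a 4-byte int. It uses standard C++ alignas(16) so it builds with a plain host compiler; in CUDA code the __align__(16) qualifier plays the same role. Whether this actually helps the sort has to be measured.

```cpp
#include <cstdint>

// Layout from the question: three ints, so 12 bytes with 4-byte alignment
// (assuming a 4-byte int).
struct Item {
    int a;
    int b[2];
};

// Hypothetical padded variant: every element starts on a 16-byte boundary
// and occupies exactly 16 bytes. In CUDA code, __align__(16) is the
// equivalent of alignas(16).
struct alignas(16) ItemAligned {
    int a;
    int b[2];
    int pad;  // unused, only fills the struct out to 16 bytes
};

static_assert(sizeof(Item) == 3 * sizeof(int), "unpadded struct holds three ints");
static_assert(sizeof(int) != 4 || sizeof(ItemAligned) == 16,
              "padded struct is 16 bytes when int is 4 bytes");
static_assert(alignof(ItemAligned) == 16, "padded struct is 16-byte aligned");

// Returns true when the pointer sits on a 16-byte boundary.
inline bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

With the 12-byte layout, consecutive array elements start at offsets 0, 12, 24, ..., so most of them are not on a 16-byte boundary; with the padded layout every element is. The cost is 4 extra bytes per element.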