I am pretty new to this CUDA thing myself, so my answer may be off…
Initially, I assumed that you were scheduling more than 1 block simultaneously, i.e. that your array is bigger than 512 elements. But on closer inspection of your code this is apparently not the case: I am not seeing any use of the blockIdx variable.
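Just for reference, a kernel spanning multiple blocks would normally derive a global element index along these lines; without blockIdx, every block ends up touching the same elements:

    // Sketch: global 1-D index for a 1-D grid of 1-D blocks.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;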
Your definition of idx differs inside and outside the while() loop.
    int idx=threadIdx.x;
    int idx=threadIdx.y*blockDim.x+threadIdx.x;
This confuses me. The second definition of idx suggests a 2-dimensional thread block. The only way I see this working is if your blockDim.y is exactly 2 and blockDim.x is half the number of array elements per thread block.
But for each threadIdx.x there are two threads running, one of which would not be required for sorting, just for producing the out array. Isn’t this wasting a lot of GPU resources?
Also there might be race conditions between the two threads with identical threadIdx.x (between the comparison and the actual swap). For example, both instances may initially find a reason to swap, and both threads then swap the values — exchanging them twice and leaving them in the original, unsorted order. ;) This may be your sorting bug.
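One common way to rule out that kind of race, for what it's worth, is to give each thread exactly one disjoint pair per phase, so no element is ever written by two threads. A minimal sketch (my own names, not your code):

    // Sketch of one compare-and-swap phase: thread tid owns the disjoint
    // pair (2*tid + phase, 2*tid + phase + 1), so no two threads write
    // to the same element.
    int i = 2 * threadIdx.x + phase;    // phase is 0 (even) or 1 (odd)
    if (i + 1 < n && data[i] > data[i + 1]) {
        int tmp = data[i];
        data[i] = data[i + 1];
        data[i + 1] = tmp;
    }
    __syncthreads();                    // phase complete for the whole block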
Then there’s one more problem that concerns your boolean state variables. How do you make sure that you only terminate when ALL threads report no swapping? Right now you set the swap states to false quite independently in each individual thread, so every thread effectively decides on its own when to stop. You need some means of determining that none of the threads has swapped during the last iteration. Maybe some atomic counters instead of booleans, where you count the number of non-swap events? But my CUDA knowledge is not enough to help you with this. Also, it is my understanding that the termination variable must be in block-shared scope, not thread-local.
If swappedodd and swappedeven were shared variables AND you also did a __syncthreads() after setting them to false, this might work. Any expert opinions on this?
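To make that concrete, here is a minimal sketch of a single-block kernel with one block-shared flag, reset once per iteration and read by everyone only after a __syncthreads() (all identifiers are my own guesses, not your code):

    __global__ void oddEvenSortBlock(int *data, int n)
    {
        __shared__ int anySwap;                // block-shared termination flag

        do {
            if (threadIdx.x == 0) anySwap = 0; // one thread resets the flag
            __syncthreads();                   // everyone sees the reset

            // Even phase: thread tid owns the pair (2*tid, 2*tid + 1).
            int i = 2 * threadIdx.x;
            if (i + 1 < n && data[i] > data[i + 1]) {
                int t = data[i]; data[i] = data[i + 1]; data[i + 1] = t;
                anySwap = 1;                   // benign race: all writers store 1
            }
            __syncthreads();

            // Odd phase: thread tid owns the pair (2*tid + 1, 2*tid + 2).
            i = 2 * threadIdx.x + 1;
            if (i + 1 < n && data[i] > data[i + 1]) {
                int t = data[i]; data[i] = data[i + 1]; data[i + 1] = t;
                anySwap = 1;
            }
            __syncthreads();                   // flag now consistent block-wide
        } while (anySwap);                     // whole block agrees to terminate
    }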
The following applies if you intend to work on large arrays of >512 elements, which require sending many blocks to the GPU. Isn’t it that each thread block works separately and independently on its assigned Streaming Multiprocessor? Hence each block will evaluate the termination condition independently: each block would stop once its own elements are sorted, but element pairs that straddle a block boundary are never checked correctly.
When two independent thread blocks actually try to compare values across the block boundary, this will not happen in a synchronized manner, because the __syncthreads() intrinsic only synchronizes the threads belonging to the same thread block - never across blocks. So, for example, the neighboring block may not even have started sorting yet when you perform the comparison.
Termination of the entire kernel forces synchronization across all thread blocks, so the only solution I see is to launch the kernel anew. My take is that you will need an outer loop across several (possibly MANY) kernel invocations to correctly sort (and swap) element pairs that cross block boundaries. By offsetting the block boundaries on each kernel invocation (similar to the odd/even approach) you may get better results.
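A rough sketch of what I mean, with the grid-wide "did anything swap" flag living in global memory and the kernel launch itself acting as the cross-block synchronization point (again, all names are mine, not from your code):

    #include <cuda_runtime.h>

    // One odd or even phase over the whole array. Pairs of the odd phase
    // straddle the pair boundaries of the even phase, so elements migrate
    // across block boundaries over successive launches.
    __global__ void phaseKernel(int *data, int n, int phase, int *d_swapped)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        int i = 2 * tid + phase;
        if (i + 1 < n && data[i] > data[i + 1]) {
            int t = data[i]; data[i] = data[i + 1]; data[i + 1] = t;
            atomicExch(d_swapped, 1);   // any block can signal "not done yet"
        }
    }

    void oddEvenSort(int *d_data, int n)
    {
        int *d_swapped, h_swapped;
        cudaMalloc(&d_swapped, sizeof(int));

        int threads = 256;
        int blocks = (n / 2 + threads - 1) / threads;

        do {
            h_swapped = 0;
            cudaMemcpy(d_swapped, &h_swapped, sizeof(int),
                       cudaMemcpyHostToDevice);
            phaseKernel<<<blocks, threads>>>(d_data, n, 0, d_swapped); // even
            phaseKernel<<<blocks, threads>>>(d_data, n, 1, d_swapped); // odd
            cudaMemcpy(&h_swapped, d_swapped, sizeof(int),
                       cudaMemcpyDeviceToHost);                // implicit sync
        } while (h_swapped);            // stop only when NO pair swapped at all

        cudaFree(d_swapped);
    }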