Bitonic sort for an arbitrary number of threads

The bitonic sort example in the SDK is implemented for the case where the number of threads equals the number of elements to sort.

How would one go about sorting a list of N arbitrary elements where N = nbr_threads^x? For x a power of 2? For an arbitrary x? And
where N is just a multiple of nbr_threads (N = x*nbr_threads), is it still possible to get an efficient implementation, even if not an optimal one?

I tried the following, but the result is not sorted correctly:

int n_per_thread = TOTAL_NUM_ELEMS / blockDim.x;
for (i = 0; i < n_per_thread; i++) {
    /* Parallel bitonic sort. */
    int current_i = tid * n_per_thread + i; // current_i then replaces tid in the bitonic example of the SDK
}
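
One likely reason the attempt above sorts incorrectly is that the loop over elements wraps the whole sort, so each thread runs complete passes for one index before any other index is touched. A bitonic network needs every compare-exchange of a given stage to finish, followed by a barrier, before the next stage begins. The sketch below moves the per-thread loop inside each stage. It is a minimal illustration, not the SDK's code: the kernel name bitonicSortStrided, the helper sortAndVerify, and the sizes NUM_ELEMS and NUM_THREADS are assumptions made for this example, and it assumes a single block, a power-of-two NUM_ELEMS, and data that fits in shared memory.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Assumed sizes for illustration: NUM_ELEMS must be a power of two and fit in
// shared memory; NUM_THREADS may be any block size up to NUM_ELEMS.
#define NUM_ELEMS   1024
#define NUM_THREADS 128

__global__ void bitonicSortStrided(int *values)
{
    __shared__ int shared[NUM_ELEMS];
    const unsigned int tid = threadIdx.x;

    // Each thread loads several elements, striding by the block size.
    for (unsigned int i = tid; i < NUM_ELEMS; i += blockDim.x)
        shared[i] = values[i];
    __syncthreads();

    for (unsigned int k = 2; k <= NUM_ELEMS; k *= 2) {
        for (unsigned int j = k / 2; j > 0; j /= 2) {
            // The per-thread loop sits inside the stage, so all
            // compare-exchanges of this (k, j) stage complete before the barrier.
            for (unsigned int i = tid; i < NUM_ELEMS; i += blockDim.x) {
                unsigned int ixj = i ^ j;
                if (ixj > i) {  // only the lower index of each pair does the work
                    int a = shared[i];
                    int b = shared[ixj];
                    bool ascending = ((i & k) == 0);
                    if (ascending ? (a > b) : (a < b)) {
                        shared[i] = b;
                        shared[ixj] = a;
                    }
                }
            }
            __syncthreads();  // reached by every thread the same number of times
        }
    }

    for (unsigned int i = tid; i < NUM_ELEMS; i += blockDim.x)
        values[i] = shared[i];
}

// Hypothetical host helper: sorts pseudo-random data on the device and
// returns true if the result comes back in non-decreasing order.
bool sortAndVerify(unsigned int seed)
{
    static int host[NUM_ELEMS];
    srand(seed);
    for (int i = 0; i < NUM_ELEMS; i++)
        host[i] = rand() % 10000;

    int *dev = NULL;
    if (cudaMalloc(&dev, sizeof(host)) != cudaSuccess)
        return false;
    cudaMemcpy(dev, host, sizeof(host), cudaMemcpyHostToDevice);
    bitonicSortStrided<<<1, NUM_THREADS>>>(dev);
    cudaError_t err = cudaMemcpy(host, dev, sizeof(host), cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess)
        return false;

    for (int i = 1; i < NUM_ELEMS; i++)
        if (host[i - 1] > host[i])
            return false;
    return true;
}

int main()
{
    bool ok = sortAndVerify(1234u);
    printf("%s\n", ok ? "sorted" : "NOT sorted");
    return ok ? 0 : 1;
}
```

With this layout each stage costs NUM_ELEMS / NUM_THREADS iterations per thread, so N only has to be a power of two, not a particular power of nbr_threads. For an N that is merely a multiple of nbr_threads, one common workaround is to pad the array up to the next power of two with a maximum-value sentinel and ignore the padded tail after sorting.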

Thank you