The bitonic sort example in the SDK is implemented for the case where the number of threads equals the number of elements to sort.
How would one go about sorting a list of N arbitrary elements where N = nbr_threads^x? For x a power of 2? For x an arbitrary number? And
where N is just a multiple of nbr_threads (N = x*nbr_threads), is it still possible to get an efficient implementation, even if not an optimal one?
I tried the following, but the result is not sorted correctly:
int n_elems_per_thread = TOTAL_NUM_ELEMS/blockDim.x;
/* Parallel bitonic sort. */
int current_i = tid*n_per_thread + i; // current_i then replaces tid in the SDK bitonic example
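For comparison, here is a minimal sketch of one way to let each thread handle several elements. It is not the SDK code. It assumes a single block, N a power of two, and an array small enough to fit in shared memory. The names `bitonicSortStrided` and `sortOnDevice` are made up for this example. The idea is to keep the SDK's two outer loops (`k`, `j`) and put the per-thread loop inside them, so that the barrier still separates complete compare-exchange stages.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: one block sorts n ints, n a power of two,
// n may be any multiple of blockDim.x that fits in shared memory.
__global__ void bitonicSortStrided(int *values, int n)
{
    extern __shared__ int shared[];

    // Each thread loads n / blockDim.x elements, strided by blockDim.x.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        shared[i] = values[i];
    __syncthreads();

    for (int k = 2; k <= n; k <<= 1) {            // size of the bitonic sequences
        for (int j = k >> 1; j > 0; j >>= 1) {    // compare distance in this stage
            // i plays the role of tid in the SDK example.
            for (int i = threadIdx.x; i < n; i += blockDim.x) {
                int ixj = i ^ j;
                if (ixj > i) {                    // each pair is handled once
                    int a = shared[i], b = shared[ixj];
                    bool ascending = (i & k) == 0;
                    if (ascending ? (a > b) : (a < b)) {
                        shared[i] = b;
                        shared[ixj] = a;
                    }
                }
            }
            // Barrier after the whole stage, outside the per-thread loop.
            __syncthreads();
        }
    }

    for (int i = threadIdx.x; i < n; i += blockDim.x)
        values[i] = shared[i];
}

// Hypothetical host helper: copies h to the device, sorts it, copies it back.
void sortOnDevice(int *h, int n, int threads)
{
    int *d = NULL;
    cudaMalloc((void **)&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);
    bitonicSortStrided<<<1, threads, n * sizeof(int)>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
}

int main()
{
    const int n = 1024, threads = 128;            // 8 elements per thread
    int h[n];
    for (int i = 0; i < n; ++i) h[i] = rand() % 1000;

    sortOnDevice(h, n, threads);

    for (int i = 1; i < n; ++i)
        if (h[i - 1] > h[i]) { printf("not sorted at %d\n", i); return 1; }
    printf("sorted\n");
    return 0;
}
```

If the loop over a thread's elements is placed outside the `k`/`j` loops instead, a thread can start a later stage on one element while other elements are still in an earlier stage, which is one possible cause of incorrect output. For N that is not a power of two, a usual workaround is to pad the array up to the next power of two with a sentinel such as INT_MAX.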