Code that works in EmuDebug but not in Debug is a sign that, as it stands, it isn't fully parallelised.
In EmuDebug only one thread runs at a time, and the threads are run in order. So thread 0 will start and run until it reaches a __syncthreads() or some other point where it has to stop, then thread 1 will run to the same point, then thread 2, …
In Debug mode the threads run in parallel.
So you are actually moving all (idx < 16) values simultaneously. At some point two or more threads move the same value, and the result is that it gets duplicated (the different threads have different strides).
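As a rough sketch of how to keep threads from stepping on each other while moving values: I don't have your kernel in front of me, so this is a made-up example (the names ROTATE_N, rotate_right and rotate_right_on_device are mine). Each thread first reads its own slot into a register, then all threads wait at __syncthreads(), and only then does each thread write to a slot that no other thread writes. With the read and write phases separated like this, the result should not depend on the order the threads happen to run in.

```cuda
#include <cuda_runtime.h>

#define ROTATE_N 16  // assumed to match the idx < 16 range mentioned above

// Hypothetical kernel: moves every value one slot to the right, with wrap-around.
// Must be launched as a single block so that __syncthreads() covers all threads.
__global__ void rotate_right(int *data)
{
    int idx = threadIdx.x;
    int value = 0;

    if (idx < ROTATE_N)
        value = data[idx];                   // phase 1: each thread reads only its own slot

    __syncthreads();                         // every read is done before any write starts

    if (idx < ROTATE_N)
        data[(idx + 1) % ROTATE_N] = value;  // phase 2: each thread writes a distinct slot
}

// Hypothetical host helper: copy in, run one block of ROTATE_N threads, copy back.
void rotate_right_on_device(int *host_data)
{
    int *dev = nullptr;
    cudaMalloc((void **)&dev, ROTATE_N * sizeof(int));
    cudaMemcpy(dev, host_data, ROTATE_N * sizeof(int), cudaMemcpyHostToDevice);
    rotate_right<<<1, ROTATE_N>>>(dev);
    cudaMemcpy(host_data, dev, ROTATE_N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```

If you drop the __syncthreads() and read and write in one statement, one thread can overwrite a slot before its neighbour has read it, which is the kind of duplication described above.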
I think you need a substantially different algorithm for sorting in parallel.
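One candidate, just as a sketch and not necessarily the best fit for your data, is an odd-even transposition sort. In each phase every active thread owns one disjoint pair of neighbours, so no two threads ever touch the same value, and a __syncthreads() separates the phases. The names below (SORT_N, odd_even_sort, sort_on_device) are my own, and I am assuming 16 ints that fit in a single block.

```cuda
#include <cuda_runtime.h>

#define SORT_N 16  // assumed element count; one block of SORT_N threads

// Hypothetical kernel: odd-even transposition sort in shared memory, ascending order.
__global__ void odd_even_sort(int *data)
{
    __shared__ int s[SORT_N];
    int idx = threadIdx.x;

    s[idx] = data[idx];
    __syncthreads();

    // SORT_N phases are enough to sort SORT_N elements with this scheme.
    for (int phase = 0; phase < SORT_N; ++phase) {
        // Even phases compare (0,1), (2,3), ...; odd phases compare (1,2), (3,4), ...
        int left = 2 * idx + (phase & 1);
        if (left + 1 < SORT_N && s[left] > s[left + 1]) {
            int tmp = s[left];      // this pair belongs to this thread only
            s[left] = s[left + 1];
            s[left + 1] = tmp;
        }
        __syncthreads();            // reached by all threads, outside the branch
    }

    data[idx] = s[idx];
}

// Hypothetical host helper: copy in, sort on the device, copy back.
void sort_on_device(int *host_data)
{
    int *dev = nullptr;
    cudaMalloc((void **)&dev, SORT_N * sizeof(int));
    cudaMemcpy(dev, host_data, SORT_N * sizeof(int), cudaMemcpyHostToDevice);
    odd_even_sort<<<1, SORT_N>>>(dev);
    cudaMemcpy(host_data, dev, SORT_N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dev);
}
```

This takes as many phases as there are elements, so it only makes sense for small arrays; for anything bigger a bitonic sort or a library sort would be the usual choice.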