Reorder a vector

Hi all! I recently finished my project in CUDA(I do not have the correct output yet, but the code is all written). However there is only one function which I cannot “translate” to CUDA or Thrust.

The function takes a vector and reorders it starting from zero and putting all the elements consecutive, for example:

[45, 45 ,46, 50, 50, 101, 101] → [0,0,1,2,2,3,3]

The code I used for CPU code

int counter=binh[0],auxiliar=binh[0];

	for(int i=1;i<dim;i++){

		if(binh[i]>binh[i-1] && binh[i]>auxiliar){

			counter++;

			auxiliar=binh[i];

			binh[i]=counter-binh[0];

			

		}

		else

			binh[i]=counter-binh[0];

	}

	binh[0]=0;

Could somebody help me, as this makes my project 2 or 3 times slower =/. Thanks in advance

I don’t understand why you need the variable auxiliar.

I suppose the vector is already in sorted order. If this is true, you will just have to break your vector (if it’s long enough) into multiple segments. Let each block handle a segment. Consecutive threads handle consecutive elements. The code you’ve listed could probably be reused. You won’t get much trouble in terms of performance because of the L1 cache.

When all blocks are done with their own segments, you’ll let them all exit and launch another kernel to correct the offset that each segment has (find the value for binh[0], and then increment binh[1:n] by binh[0]).

I tried reworking the CUDA code for this program that I had made, but I cannot make it…

Example: with the vector [2,2,10,10,15] given that the program is parallel , when I am comparing the elements I can get 10, 10 to transform into 3,3, but then when I compare 15, it compares with 10 and not with 3, so 15 will end up being 11. And I can’t thnk of a way to make it work. I tried working around with shared memory to copy to a temporary vector, but I will have the same problems again :S

Could you give an example of an input vector and the output after the first kernel you mentioned? From what you have said I am assuming your program would work like this:

Kernel with 2 blocks of 5 threads. Input vector: [2,2,10,10,15,27,27,27,40,40]

Ouput after first kernel: [2,2,3,3,4] [27,27,27,28,28] and then in the second kernel one would make the transformations needed.

But I do not understand how I end up with 4 instead of 11 in the first block’s segment External Image