Hi.

I would like to implement MATLAB’s `repelem` function in CUDA, but I don’t have an idea how to do it.

It works like this.

v = [0, 2, 1, 3, -3];

r = [1, 4, 2, 5, 3];

repelem(v, r) = [0, 2, 2, 2, 2, 1, 1, 3, 3, 3, 3, 3, -3, -3, -3];
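For reference, the same semantics in plain C++ (a CPU sketch, assuming `int` data; the function name is mine):

```cpp
#include <vector>

// CPU reference of repelem's vector form: repeat v[i] exactly r[i] times.
std::vector<int> repelem(const std::vector<int>& v, const std::vector<int>& r) {
    std::vector<int> out;
    for (std::size_t i = 0; i < v.size(); ++i)
        out.insert(out.end(), r[i], v[i]);  // append r[i] copies of v[i]
    return out;
}
```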


Off the top of my head: to make the filling of the output array easier to parallelize, you would want to apply a prefix sum to the vector `r`.

Second thought: have you checked whether Thrust already offers the functionality you are seeking to implement?

For further discussion it might be good to have information regarding the typical length of vector `v`, and the typical length of runs specified via vector `r` — in other words, the “expansion factor” between `v` and `repelem(v, r)`.
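To illustrate the first suggestion: an inclusive prefix sum of `r` gives the end position of each run, and its last element is the total output length. A minimal CPU sketch (the function name is mine; on the GPU this would be done with a scan primitive such as `thrust::inclusive_scan`):

```cpp
#include <numeric>
#include <vector>

// Inclusive prefix sum of r: psum[i] = r[0] + ... + r[i].
std::vector<int> prefix_sum(const std::vector<int>& r) {
    std::vector<int> psum(r.size());
    std::partial_sum(r.begin(), r.end(), psum.begin());
    return psum;
}
```

For the example above, `prefix_sum({1, 4, 2, 5, 3})` yields `[1, 5, 7, 12, 15]`, so the expanded output has 15 elements.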

Thank you @njuffa for explaining the first steps for me as a beginner.

What I understand about this assignment is that prefix-sum is needed to find the output size, that v and r are the same size, and that their sizes are not constants.

I don’t know how to determine which element of `v` each output element `y[i]` should take in `y = repelem(v, r)`, i.e., how to find `j` in `y[i] = v[j]`.

A trivial approach: thread `i` computes `start = (i==0) ? 0 : prefix_sum(r[i-1])` and `stop = prefix_sum(r[i]) - 1`, grabs `v[i]`, and stores it to `output[start ... stop]`. The prefix sum vector for the example is [1, 5, 7, 12, 15].

So thread 0 grabs `v[0]` and stores it to `output[0]`, thread 1 grabs `v[1]` and stores it to `output[1...4]`, thread 2 grabs `v[2]` and stores it to `output[5...6]`, etc.

This would be functional and sort of OK from a performance perspective if there are plenty of elements in `v` and the expansion factor is small. A prefix sum is a commonly used coding idiom; you might want to Google for “prefix sum” plus “CUDA” if you are not familiar with it.
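The per-input-element scheme above, written out as a serial C++ sketch (each loop iteration plays the role of one thread; names are mine):

```cpp
#include <numeric>
#include <vector>

// Serial stand-in for the described kernel: iteration i plays thread i.
std::vector<int> repelem_per_input(const std::vector<int>& v,
                                   const std::vector<int>& r) {
    std::vector<int> psum(r.size());
    std::partial_sum(r.begin(), r.end(), psum.begin());  // inclusive prefix sum

    std::vector<int> output(psum.back());
    for (std::size_t i = 0; i < v.size(); ++i) {   // "thread" i
        int start = (i == 0) ? 0 : psum[i - 1];
        int stop  = psum[i] - 1;
        for (int k = start; k <= stop; ++k)        // fill output[start ... stop]
            output[k] = v[i];
    }
    return output;
}
```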

A better-performing approach, exposing more parallelism and with an improved memory access pattern for the output vector, would assign one thread to each element of the *output* vector. Left as an exercise to the reader.
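One way to work the exercise (not necessarily what was intended): output index `k` belongs to input element `j`, where `j` is the first position with `psum[j] > k`, which a binary search finds. A serial sketch, with names of my choosing:

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// One "thread" per *output* element: for output index k, binary-search the
// inclusive prefix sums for the first entry strictly greater than k.
std::vector<int> repelem_per_output(const std::vector<int>& v,
                                    const std::vector<int>& r) {
    std::vector<int> psum(r.size());
    std::partial_sum(r.begin(), r.end(), psum.begin());

    std::vector<int> output(psum.back());
    for (int k = 0; k < static_cast<int>(output.size()); ++k) {  // "thread" k
        auto it = std::upper_bound(psum.begin(), psum.end(), k);
        output[k] = v[it - psum.begin()];
    }
    return output;
}
```

Each output position is written exactly once and consecutive threads write consecutive addresses, which is why this pattern tends to map well to coalesced GPU stores.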

Thank you.

There are two ways to launch threads: either with as many threads as the input vector has elements, or with as many as the output vector has elements. The way you showed is the former, which is very simple.

I was looking for the latter method, since it offers more parallelism, but I will check the processing time with your method first.

It is always a good idea to state the approaches already tried when asking a question. Your original post stated you had no idea. I supplied an idea. Now it is up to you to work out the best possible solution for your assignment.

I’m sorry; I should have described the question exactly as you said. Still, your answer is very helpful — you have shown me solutions that would have been difficult to find on my own. Thank you.

The algorithm you are looking for is called run-length decoding. There is an example program in Thrust which shows how to do it in CUDA: thrust/run_length_decoding.cu at main · NVIDIA/thrust · GitHub

