Repeat vector element in cuda like MATLAB's repmat

tommyecguitar · October 16, 2021, 1:38am

Hi.

I would like to implement MATLAB’s repelem function in CUDA, but I have no idea how to do it.
It works like this:

v = [0, 2, 1, 3, -3];
r = [1, 4, 2, 5, 3];
repelem(v, r) = [0, 2, 2, 2, 2, 1, 1, 3, 3, 3, 3, 3, -3, -3, -3];

njuffa · October 16, 2021, 2:20am

Off the top of my head: To make the filling of the output array easier to parallelize, you would want to apply a prefix sum to the vector r. Second thought: Have you checked whether Thrust already offers the functionality you are seeking to implement?

For further discussion it might be good to have information regarding the typical length of vector v, and the typical length of runs specified via vector r. In other words, the “expansion factor” between v and repelem(v, r).
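As a minimal sketch of the prefix-sum step, assuming Thrust is available and the data fits in `int`: the helper name `scan_counts` is hypothetical, introduced only for illustration. It runs an inclusive scan over `r` on the device; the last entry of the result is the length of the output vector.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <vector>

// Hypothetical helper: inclusive prefix sum of the repeat counts r.
// The last element of the result equals the total output length.
std::vector<int> scan_counts(const std::vector<int>& r)
{
    // Copy the counts to the device.
    thrust::device_vector<int> d_r(r.begin(), r.end());

    // In-place inclusive scan on the device.
    thrust::inclusive_scan(d_r.begin(), d_r.end(), d_r.begin());

    // Copy the scanned counts back to the host.
    std::vector<int> ends(r.size());
    thrust::copy(d_r.begin(), d_r.end(), ends.begin());
    return ends;
}
```

For the example in the question, `scan_counts({1, 4, 2, 5, 3})` yields `{1, 5, 7, 12, 15}`, so the output has 15 elements.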

tommyecguitar · October 16, 2021, 3:34am

Thank you @njuffa for explaining the first steps for me as a beginner.

What I understand about this assignment is that prefix-sum is needed to find the output size, that v and r are the same size, and that their sizes are not constants.

What I don’t know is how to determine which element of v each output element should take in y = repelem(v, r), i.e., how to find j such that y[i] = v[j].

njuffa · October 17, 2021, 5:12am

A trivial approach: with p = prefix_sum(r), thread i computes start = (i == 0) ? 0 : p[i-1] and stop = p[i] - 1, grabs v[i], and stores it to output[start ... stop]. The prefix sum vector for the example is p = [1, 5, 7, 12, 15].

So thread 0 grabs v[0] and stores it to output[0], thread 1 grabs v[1] and stores it to output[1...4], thread 2 grabs v[2] and stores it to output [5...6], etc, etc.
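The approach just described might be sketched as follows. This is an illustration under stated assumptions, not code from the thread: the names `repelem_per_input` and `repelem_per_input_host` are hypothetical, the element type is fixed to `int`, Thrust computes the prefix sum, and error checking is omitted for brevity.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <vector>

// One thread per INPUT element. `ends` is the inclusive prefix sum of r,
// so thread i owns the output range [ends[i-1], ends[i]).
__global__ void repelem_per_input(const int* v, const int* ends, int n, int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    int start = (i == 0) ? 0 : ends[i - 1];
    int stop  = ends[i];          // exclusive upper bound
    int value = v[i];
    for (int k = start; k < stop; ++k)
        out[k] = value;           // serial fill of this thread's run
}

// Hypothetical host wrapper: scans r, launches the kernel, returns the result.
std::vector<int> repelem_per_input_host(const std::vector<int>& v,
                                        const std::vector<int>& r)
{
    const int n = static_cast<int>(v.size());
    if (n == 0) return {};

    thrust::device_vector<int> d_v(v.begin(), v.end());
    thrust::device_vector<int> d_ends(r.begin(), r.end());
    thrust::inclusive_scan(d_ends.begin(), d_ends.end(), d_ends.begin());

    const int total = d_ends.back();   // output length
    thrust::device_vector<int> d_out(total);

    const int block = 256;
    const int grid  = (n + block - 1) / block;
    repelem_per_input<<<grid, block>>>(thrust::raw_pointer_cast(d_v.data()),
                                       thrust::raw_pointer_cast(d_ends.data()),
                                       n,
                                       thrust::raw_pointer_cast(d_out.data()));

    std::vector<int> out(total);
    thrust::copy(d_out.begin(), d_out.end(), out.begin());
    return out;
}
```

Each thread writes its whole run serially, so a few long runs leave most threads idle, and neighboring threads write to addresses that are far apart.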

This would be functional and sort of OK from a performance perspective if there are plenty of elements in v, and the expansion factor is small. A prefix sum is a commonly used coding idiom; you might want to Google for “prefix sum” plus “CUDA” if not familiar with it.

A better performing approach exposing more parallelism and with improved memory access pattern for the output vector would assign one thread to each element of the output vector. Left as an exercise to the reader.
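One possible way to assign a thread to each output element, not necessarily the solution njuffa had in mind, is to have thread i binary-search the inclusive prefix sum for the first index j whose entry exceeds i; then output[i] = v[j]. The sketch below assumes `int` data, uses Thrust for the scan, and omits error checking. The names `repelem_per_output` and `repelem_per_output_host` are hypothetical.

```cuda
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <vector>

// One thread per OUTPUT element. `ends` is the inclusive prefix sum of r
// and `total` == ends[n-1]. Thread i finds the first j with ends[j] > i.
__global__ void repelem_per_output(const int* v, const int* ends,
                                   int n, int total, int* out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= total) return;

    // Binary search (upper bound). Because i < ends[n-1], an answer
    // always exists in [0, n-1]. Zero-length runs are skipped naturally.
    int lo = 0, hi = n - 1;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (ends[mid] > i) hi = mid;
        else               lo = mid + 1;
    }
    out[i] = v[lo];               // consecutive threads write consecutive addresses
}

// Hypothetical host wrapper: scans r, launches the kernel, returns the result.
std::vector<int> repelem_per_output_host(const std::vector<int>& v,
                                         const std::vector<int>& r)
{
    const int n = static_cast<int>(v.size());
    if (n == 0) return {};

    thrust::device_vector<int> d_v(v.begin(), v.end());
    thrust::device_vector<int> d_ends(r.begin(), r.end());
    thrust::inclusive_scan(d_ends.begin(), d_ends.end(), d_ends.begin());

    const int total = d_ends.back();
    if (total == 0) return {};    // nothing to write; avoid an empty launch

    thrust::device_vector<int> d_out(total);
    const int block = 256;
    const int grid  = (total + block - 1) / block;
    repelem_per_output<<<grid, block>>>(thrust::raw_pointer_cast(d_v.data()),
                                        thrust::raw_pointer_cast(d_ends.data()),
                                        n, total,
                                        thrust::raw_pointer_cast(d_out.data()));

    std::vector<int> out(total);
    thrust::copy(d_out.begin(), d_out.end(), out.begin());
    return out;
}
```

The search costs O(log n) reads per output element, but every thread does a similar amount of work regardless of run length.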

tommyecguitar · October 17, 2021, 5:29am

Thank you.

There are two ways to launch threads: one thread per element of the input vector, or one thread per element of the output vector. The way you showed is the former, which is very simple.

I was hoping for the latter method because it exposes more parallelism, but I will check the processing time with your method first.

njuffa · October 17, 2021, 5:35am

It is always a good idea to state the approaches already tried when asking a question. Your original post stated you had no idea. I supplied an idea. Now it is up to you to work out the best possible solution for your assignment.

tommyecguitar · October 17, 2021, 10:32pm

I’m sorry. I should have described the question exactly as you said. However, your answer is very helpful. You have shown me that other solutions are difficult to find. Thank you.

striker159 · October 18, 2021, 6:15am

The algorithm you are looking for is called run-length decoding. There is an example program in Thrust that shows how to do it in CUDA: thrust/run_length_decoding.cu at main · NVIDIA/thrust · GitHub
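For orientation, run-length decoding can be expressed with Thrust primitives alone. The following is a sketch written for this thread's example, not a copy of the linked program: it scans the run lengths, uses the vectorized `thrust::upper_bound` to find the source index of every output position, and then gathers. The function name `repelem_thrust` is hypothetical and the element type is assumed to be `int`.

```cuda
#include <thrust/binary_search.h>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/scan.h>
#include <vector>

// Hypothetical helper: run-length decode v by the counts in r using Thrust only.
std::vector<int> repelem_thrust(const std::vector<int>& v,
                                const std::vector<int>& r)
{
    if (v.empty()) return {};

    thrust::device_vector<int> d_v(v.begin(), v.end());
    thrust::device_vector<int> d_ends(r.begin(), r.end());

    // Inclusive scan of the run lengths; the last entry is the output size.
    thrust::inclusive_scan(d_ends.begin(), d_ends.end(), d_ends.begin());
    const int total = d_ends.back();

    // For each output position i in [0, total), find the first run whose
    // scanned end exceeds i. That run's index is the source index into v.
    thrust::device_vector<int> d_idx(total);
    thrust::upper_bound(d_ends.begin(), d_ends.end(),
                        thrust::counting_iterator<int>(0),
                        thrust::counting_iterator<int>(total),
                        d_idx.begin());

    // Gather the source elements into the output.
    thrust::device_vector<int> d_out(total);
    thrust::gather(d_idx.begin(), d_idx.end(), d_v.begin(), d_out.begin());

    std::vector<int> out(total);
    thrust::copy(d_out.begin(), d_out.end(), out.begin());
    return out;
}
```

This performs the same per-output binary search as a hand-written kernel would, at the cost of one temporary index vector.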

tommyecguitar · October 19, 2021, 1:25am

Thank you @striker159 .

