I have a small problem. I have two arrays of the same size n:

x1 x2 … xn (x can be any type)

N1 N2 … Nn (each Ni is an integer)

I want to make a new array of length N1 + N2 + … + Nn, with contents like

(x1 x1 … x1) (x2 … x2) … (xn … xn)

i.e., N1 copies of x1, N2 copies of x2, and so on. So my first attempt is something like this:

- Compute the prefix sum of the N array, call it pcN. That also gives the total sum of the Ni.
- Allocate memory for the output
- For i = 0 to n-1, in parallel {

      d = x[i]
      for (j = 0; j < N[i]; ++j)
          out[pcN[i] + j] = d;

  }

Everything inside { … } would go into the kernel; I assign one i to each thread, simple as that.

But this raises performance issues:

- Reads are coalesced but writes are not; it's horribly slow.
- Normally the Ni differ, which means unequal load per thread. As far as I know, a thread block is released only when all of its threads have completed, so if I have one big Ni and a lot of small Nj, all those other threads have to wait for the one with the big Ni to finish.
- Although CUDA can handle large loop counts, a single very large Ni is still unwanted.

So, any ideas on how to solve this performance problem? I suspect there are many more similar small problems that keep CUDA from being an efficient computational solution for many tasks.

Any idea is appreciated. Thank you!