I have a small problem. I have two arrays of the same size n:

x1 x2 … xn (x can be any type)

N1 N2 … Nn (each Ni is an integer)

I want to make a new array of length N1 + N2 + … + Nn, with contents like

(x1 x1 … x1) (x2 … x2) … (xn … xn)

i.e., N1 copies of x1, N2 copies of x2, and so on. So my first attempt is something like this:

- Compute the prefix sum of the N array, call it pcN. That also gives the total sum of the Ni.
- Allocate memory for the output
- For i = 0 to n-1, in parallel {

      d = x[i]
      for (j = 0; j < N[i]; ++j)
          out[pcN[i] + j] = d;

  }

Everything inside { … } would go into the kernel; I assign one i to each thread, simple as that.

But this raises performance issues:

- Reads are coalesced but writes are not; it's horribly slow.
- Normally the Ni differ, which means unequal load per thread. As far as I know, a thread block is released only when all of its threads have completed, so if I have one big Ni and a lot of small Nj, all those other threads have to wait for the one with the big Ni to finish.
- Although CUDA can handle large loop counts, a single very large Ni is still unwanted.

So, any ideas on how to solve this performance problem? I suspect there are many more similar small problems that keep CUDA from being an efficient computational solution for many tasks.

Any idea is appreciated. Thank you!