Small interesting problem: a performance concern

I have a small problem. I have two arrays of the same size n:
x1 x2 … xn (x can be any type)
N1 N2 … Nn (N is an integer)
I want to build a new array of length N1 + N2 + … + Nn, with contents like
(x1 x1 … x1) (x2 … x2) … (xn … xn)
with N1 copies of x1, N2 copies of x2, and so on. So I first do something like this:

  • Compute the prefix sum of the N array, called pcN. That also gives the total sum of the N_i.
  • Allocate memory for the output.
  • For i = 1 to n, in parallel {
    d = x[i];
    for (j = 0; j < N[i]; ++j)
      out[pcN[i] + j] = d;
    }

So everything inside { … } goes into the kernel, and I assign one i to each thread; that part is simple.
But this raises performance issues:

  • The input reads are coalesced but the output writes are not, so it's horribly slow.
  • Normally the Ni differ, which means unequal load across threads. As far as I know, a thread block is released only when all of its threads have completed, so if there is one big Ni and a lot of small Nj, all the j threads have to wait for thread i to finish.
  • Although CUDA can handle loops with many iterations, a very big Ni is still unwanted.

So, any ideas on how to solve this performance problem? I think there are still many similar small problems that prevent us from using CUDA as an efficient computational solution in many cases.

Any idea is appreciated. Thank you!

How about letting blocks do the implicit loop over i:

__shared__ float d;   // use the element type of x here

if (threadIdx.x == 0)
    d = x[blockIdx.x];
__syncthreads();      // make d visible to the whole block

int Ni = pcN[blockIdx.x + 1] - pcN[blockIdx.x];
if (threadIdx.x < Ni)
    out[pcN[blockIdx.x] + threadIdx.x] = d;

Now you just have to launch the kernel with n blocks and max(Ni) threads per block.
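Fleshed out as a full kernel, that might look like the following (an untested sketch; it assumes pcN has n+1 entries with the total sum in the last slot, and uses a fixed block size with a strided loop so Ni may exceed blockDim.x):

```cuda
typedef float ELEM;   // placeholder: x "can be any type"

// One block per input element; the block's threads write the copies.
__global__ void expandKernel(const ELEM *x, const int *pcN, ELEM *out)
{
    __shared__ ELEM d;            // the value this block replicates

    if (threadIdx.x == 0)
        d = x[blockIdx.x];
    __syncthreads();              // make d visible to all threads

    int start = pcN[blockIdx.x];
    int Ni    = pcN[blockIdx.x + 1] - start;  // run length from prefix sums

    // Strided loop: a fixed block size (a multiple of 32) handles any Ni.
    for (int j = threadIdx.x; j < Ni; j += blockDim.x)
        out[start + j] = d;
}

// Launch with n blocks and a fixed thread count, e.g.:
//   expandKernel<<<n, 256>>>(d_x, d_pcN, d_out);
```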

Thank you; I actually did think about this, but with a fixed number of threads (a multiple of 32, as usual), because we don't know how big Ni can be. However, the major problem remains: the non-coalesced writes are the main bottleneck. I still wonder whether there is a better way to approach the problem, because it sounds very simple; the CPU code is trivial and is never a bottleneck. With non-coalesced access, many parallel solutions become sluggish compared to the CPU one.