Handle Nested Loop With Variable Loop Bounds

UserHCC · July 15, 2020, 6:53pm

My scientific code has a nested loop with variable loop bounds in the inner loop:

for (i = 0 to 10 million - 1) {
    for (j = 0 to NestedLoopBounds[i] - 1) {
        LoopBody(i, j)
    }
}

I think the best way to parallelize this is to write a CUDA kernel that performs LoopBody(i, j). The only question is how to assign an (i, j) iteration to each thread based on its block and thread IDs. One way would be to create two integer arrays, each with one entry per LoopBody() call, so the length of each array is the total number of times LoopBody() executes. The first array holds the i-values and the second array holds the j-values. For example, if NestedLoopBounds[] equals (2, 3, 4, 5), then these arrays will look like this:
Array1: 0 0 1 1 1 2 2 2 2 3 3 3 3 3
Array2: 0 1 0 1 2 0 1 2 3 0 1 2 3 4
My question is this: How can I create the above arrays in CUDA?
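
For what it's worth, here is a minimal sketch of one possible way to build them, not a definitive implementation. It assumes Thrust is available (it ships with the CUDA toolkit) and that the total iteration count fits in an int. The idea: an exclusive prefix sum over NestedLoopBounds[] gives the position where each i starts in the flattened arrays, and then one thread per output element binary-searches those start positions to recover its i, with j being the distance from that start. The names buildIndexArrays and buildArrays are hypothetical, made up for this sketch.

```cuda
#include <vector>
#include <thrust/copy.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// One thread per flattened iteration k. offsets[i] is the position in the
// output arrays where the entries for outer index i begin.
__global__ void buildIndexArrays(const int* offsets, int numRows, int total,
                                 int* array1, int* array2)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k >= total) return;

    // Binary search for the last i with offsets[i] <= k.
    // Rows with a bound of 0 are skipped naturally by taking the last match.
    int lo = 0, hi = numRows - 1;
    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;
        if (offsets[mid] <= k) lo = mid; else hi = mid - 1;
    }
    array1[k] = lo;               // i-value
    array2[k] = k - offsets[lo];  // j-value (0-based)
}

// Hypothetical host helper: takes the bounds, returns both arrays on the host.
// Error checking is omitted; assumes the sum of all bounds fits in an int.
void buildArrays(const std::vector<int>& bounds,
                 std::vector<int>& array1, std::vector<int>& array2)
{
    array1.clear();
    array2.clear();
    int numRows = static_cast<int>(bounds.size());
    if (numRows == 0) return;

    // Exclusive scan: bounds (2, 3, 4, 5) -> offsets (0, 2, 5, 9).
    thrust::device_vector<int> dBounds(bounds.begin(), bounds.end());
    thrust::device_vector<int> dOffsets(numRows);
    thrust::exclusive_scan(dBounds.begin(), dBounds.end(), dOffsets.begin());

    // Total number of LoopBody() calls = last offset + last bound.
    int lastOffset = dOffsets.back();
    int lastBound = dBounds.back();
    int total = lastOffset + lastBound;
    if (total == 0) return;

    thrust::device_vector<int> dArray1(total), dArray2(total);
    int threads = 256;
    int blocks = (total + threads - 1) / threads;
    buildIndexArrays<<<blocks, threads>>>(
        thrust::raw_pointer_cast(dOffsets.data()), numRows, total,
        thrust::raw_pointer_cast(dArray1.data()),
        thrust::raw_pointer_cast(dArray2.data()));
    cudaDeviceSynchronize();

    // Copy back only so the result can be inspected on the host.
    array1.resize(total);
    array2.resize(total);
    thrust::copy(dArray1.begin(), dArray1.end(), array1.begin());
    thrust::copy(dArray2.begin(), dArray2.end(), array2.begin());
}
```

Calling buildArrays() with bounds (2, 3, 4, 5) should reproduce Array1 and Array2 exactly as listed above. In a real run the two arrays would probably stay on the device rather than being copied back. The same search could also be done directly inside the LoopBody kernel, which would avoid storing the two arrays at all and keep only the offsets.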