It looks like the dependencies run strictly along one dimension? That is, iv[j1][n] depends only on idum[n]? If so, you should be able to parallelize along n without any side effects by handing each iteration of the inner loop to its own thread, while leaving the outer loop over j1 as a sequential loop inside each thread. Otherwise, you’d have to synchronize at every step of the outer loop, which would only be practical for large values of PATH_N and small values of NTAB. In other words, go from

```
for (int j1 = NTAB + 7; j1 >= 0; j1--) {
    for (int n = 0; n < PATH_N; n++)
    {
        k[n] = idum[n] / IQ;
        idum[n] = IA * (idum[n] - k[n] * IQ) - IR * k[n];
        if (idum[n] < 0) idum[n] += IM;
    }
    if (j1 < NTAB)
    {
        for (int n = 0; n < PATH_N; n++)
        {
            iv[j1][n] = idum[n];
        }
    }
}
```

to something like this, with one thread per value of n:

```
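// threadIndex is a placeholder for the global thread id
// (in CUDA, e.g. blockIdx.x * blockDim.x + threadIdx.x).
int n = threadIndex;
if (n < PATH_N) {  // skip surplus threads if more than PATH_N are launched
    for (int j1 = NTAB + 7; j1 >= 0; j1--) {
        // Each thread reads and writes only column n, so no synchronization is needed.
        k[n] = idum[n] / IQ;
        idum[n] = IA * (idum[n] - k[n] * IQ) - IR * k[n];
        if (idum[n] < 0) idum[n] += IM;
        if (j1 < NTAB)
        {
            iv[j1][n] = idum[n];
        }
    }
}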
```
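
If you want to convince yourself that the reordering is safe before moving it to the GPU, here is a rough CPU-only sketch in plain C. It runs the original loop order and the one-path-at-a-time order on the same seeds and compares the results. The constant values are an assumption (the usual ran1-style values, which the thread never states), PATH_N is set to 8 just for the check, and all function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumed ran1-style constants; the thread does not give their values. */
enum { IA = 16807, IM = 2147483647, IQ = 127773, IR = 2836,
       NTAB = 32, PATH_N = 8 };

/* One update of a single state, written as in the loop body above. */
static int32_t lcg_step(int32_t x) {
    assert(x > 0 && x < IM);          /* the update assumes a state in (0, IM) */
    int32_t k = x / IQ;
    x = IA * (x - k * IQ) - IR * k;
    if (x < 0) x += IM;
    return x;
}

/* Original ordering: outer loop over j1, inner loops over n. */
static void init_reference(int32_t idum[PATH_N], int32_t iv[NTAB][PATH_N]) {
    for (int j1 = NTAB + 7; j1 >= 0; j1--) {
        for (int n = 0; n < PATH_N; n++) idum[n] = lcg_step(idum[n]);
        if (j1 < NTAB)
            for (int n = 0; n < PATH_N; n++) iv[j1][n] = idum[n];
    }
}

/* Work of one hypothetical thread: only column n is read or written. */
static void init_one_path(int n, int32_t idum[PATH_N], int32_t iv[NTAB][PATH_N]) {
    int32_t x = idum[n];
    for (int j1 = NTAB + 7; j1 >= 0; j1--) {
        x = lcg_step(x);
        if (j1 < NTAB) iv[j1][n] = x;
    }
    idum[n] = x;
}

/* Returns 1 if both orderings produce identical idum and iv. */
static int orderings_agree(void) {
    int32_t idum_a[PATH_N], idum_b[PATH_N];
    int32_t iv_a[NTAB][PATH_N], iv_b[NTAB][PATH_N];
    for (int n = 0; n < PATH_N; n++) idum_a[n] = idum_b[n] = 1 + 1000 * n;
    init_reference(idum_a, iv_a);
    /* Visit the paths in reverse order to show the order does not matter. */
    for (int n = PATH_N - 1; n >= 0; n--) init_one_path(n, idum_b, iv_b);
    return memcmp(idum_a, idum_b, sizeof idum_a) == 0 &&
           memcmp(iv_a, iv_b, sizeof iv_a) == 0;
}
```

Calling `orderings_agree()` from a test harness should return 1 if the columns really are independent. Keeping the state in a local variable, as `init_one_path` does, also removes the need for the k[] scratch array.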

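If you could get rid of the `if (idum[n] < 0) idum[n] += IM;` conditional, it might even be possible to compute idum[n] directly from j1, since the recurrence looks like it reduces to a geometric series or something similar: each step appears to multiply the previous value by a constant, modulo IM.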
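
To make the geometric-series idea concrete, here is a hedged sketch in C. It assumes the update really is idum = (IA * idum) mod IM, with the conditional being part of that modulo, and it assumes the usual ran1-style values for IA and IM, which the thread does not state. Under those assumptions the state after a given number of steps is IA^steps * seed mod IM, which square-and-multiply computes in a logarithmic number of multiplications. The function names are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ran1-style constants; the thread does not give their values. */
enum { IA = 16807, IM = 2147483647 };

/* (a * b) mod IM; both inputs are below 2^31, so the product fits in 64 bits. */
static int64_t mulmod(int64_t a, int64_t b) { return (a * b) % IM; }

/* State after `steps` updates: IA^steps * seed mod IM, by square-and-multiply. */
static int32_t jump_ahead(int32_t seed, unsigned steps) {
    assert(seed > 0 && seed < IM);
    int64_t result = seed, base = IA;
    while (steps > 0) {
        if (steps & 1u) result = mulmod(result, base);
        base = mulmod(base, base);
        steps >>= 1;
    }
    return (int32_t)result;
}

/* Same state obtained by applying the update one step at a time. */
static int32_t iterate(int32_t seed, unsigned steps) {
    int64_t x = seed;
    while (steps--) x = mulmod(x, IA);
    return (int32_t)x;
}
```

With the loop as posted, iv[j1][n] would then be `jump_ahead(seed_n, NTAB + 8 - j1)`, where `seed_n` is the starting value of idum[n]. For only NTAB + 8 steps per path the plain loop is probably just as fast, so this is mainly useful if you ever need to skip far ahead.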

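What’s with the IQ, though? At first glance (idum[n] - k[n] * IQ) = (idum[n] - (idum[n] / IQ) * IQ) looks like it should collapse to 0. That only holds for exact division, though: if idum and IQ are integers, the division truncates, so the expression is really idum[n] % IQ. That would make the whole update a way of computing (IA * idum[n]) mod IM without overflowing 32 bits (it looks like Schrage’s method). IQ == 0 would be an integer division by zero rather than a NaN.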
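
On the IQ question, a quick check in C suggests that with integer operands the expression is a remainder rather than 0, and that the full update agrees with a straightforward 64-bit (IA * x) mod IM. The constant values are again an assumption (the usual ran1-style values, for which IQ = IM / IA and IR = IM % IA), and the function names are invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ran1-style constants; the thread does not give their values. */
enum { IA = 16807, IM = 2147483647, IQ = 127773, IR = 2836 };

/* The sub-expression from the loop: x - (x / IQ) * IQ. */
static int32_t remainder_via_division(int32_t x) {
    int32_t k = x / IQ;   /* integer division truncates toward zero */
    return x - k * IQ;    /* equals x % IQ, which is generally not 0 */
}

/* The update as written in the loop, using 32-bit arithmetic only. */
static int32_t schrage_step(int32_t x) {
    assert(x > 0 && x < IM);
    int32_t k = x / IQ;
    x = IA * (x - k * IQ) - IR * k;
    if (x < 0) x += IM;
    return x;
}

/* The same update with a 64-bit product, for comparison. */
static int32_t direct_step(int32_t x) {
    return (int32_t)(((int64_t)IA * x) % IM);
}
```

If the two step functions agree for the seeds you care about, the division by IQ is doing real work: it keeps every intermediate value inside the 32-bit range.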