Hi all! I’ve been searching a lot, but nothing helps me with my problem. Please try to follow despite the length of my post (if tl;dr you can jump directly to the kernel function and see the explanation of it later).

I have arrays Z and Chi of length anywhere from 1 to several thousands. I need to solve equations for Z and Chi iteratively (that is, I have equation for Z = RHS1 and Chi = RHS2 where RHS1(2) contains both Z and Chi - coupled quantities) and the fixed-point method is ideal for it. The equations are of the form

Z (w) = (const.) sum_w’ (w’ Z(w’)) (g(w - w’) - g(w + w’)) integral_x dx / ((w’ Z (w’))^2 + (x + Chi(x))^2)

Chi(x) = (const.) integral_x’ dx’ (x’ + Chi(x’)) (mu(x-x’) - mu(x+x’)) sum_w 1 / ((w Z (w))^2 + (x’ + Chi(x’))^2)

So as you can see, there are nested loops involved - the outermost loop takes care of a single Z/Chi value, while inner loops calculate sum and integral (which can’t be separated, but they can be simplified as I hinted on moving integral and/or sum far to the right as it gets).

I defined double *Z, *Chi, *Z_new, *Chi_new. Additionally, I had to define double *g, *x, *mu1 and *mu2, as the equation for Z has a function g in it and Chi uses function mu.

I pre-calculated a discretization of the integral in x, I call it x[e] (array of values in which we calculate Chi). I also precalculated function g which has argument w - w’, and values of w are on a uniform grid (so g(w-w’) can be remembered as g[abs(w-w’)] and g(w+w’) as g[w+w’+1] - only 1D array needed).

I had to precalculate function mu which is needed for Chi but due to the uneven discretization of integral, I have to keep track of mu(e-e’) and mu (e+e’) for every combination of e and e’. So I have two arrays, mu1 and mu2 and to access mu(e-e’) I have mu1[M*e+e’] and mu(e+e’) is mu2[M*e+e’]. Maybe this sounds crazy…

If the number of w points (those in which Z is calculated) is N, array Z is of length N, number of e points is M, so array Chi is of length M. Array g is of length 2*N+1 (the largest needed index is 2*N as we have to calculate g[w+w’+1]). Finally, mu1 and mu2 are arrays of length M*M.

Now to the equations: the kernel for Z looks like

**device** double __pi = 3.14159265358979;

**global** void Iterate_Z_normal(double *Z_new, const double *Z, const double *Chi, const double *g, const double *x, const double T, const int N, const int M) {

int index = blockIdx.x * blockDim.x + threadIdx.x;

int stride = blockDim.x * gridDim.x;

```
for (int n = index; n < N; n += stride) {
double znew = 0.;
double wn = (2 * n + 1)*__pi*T;
for (int m = 0; m < N; m++) {
double wm = (2 * m + 1)*__pi*T;
double foo = 1. / (Z[m] * Z[m] * wm * wm);
double faZ = (2. * wm / wn)*(g[abs(n - m)] - g[n + m + 1])*foo*Z[m];
for (int e_prime = 1; e_prime < M; e_prime++) {
double dx = abs(x[e_prime] - x[e_prime - 1]);
foo = 1. / (Z[m] * Z[m] * wm * wm + (x[e_prime] + Chi[e_prime]) * (x[e_prime] + Chi[e_prime]));
double fbZ = (2 * wm / wn)*(g[abs(n - m)] - g[n + m + 1])*foo*Z[m];
znew += 0.5*T*dx*(faZ + fbZ);
faZ = fbZ;
}
}
Z_new[n] = 1 + znew;
```

}

}

and I call it with

Iterate_Z_normal << <numBlocks, threadsPerBlock >> > (Z_new, Z, Chi, g, x, T, N, M);

where threadsPerBlock is some multiply of 32 and numBlocks is ceil((double)(N + threadsPerBlock - 1) / threadsPerBlock).

Not that it matters, but after each iteration step I wait for the device synchronization and call another kernel which copies Z_new to Z and Chi_new to Chi. Iterations stop when the difference between new and old values gets below prescribed threshold.

By the way, I’m using unified memory, so I’m calling cudaMallocManaged for Z, Z_new (N*sizeof(double)), Chi, Chi_new, x (M*sizeof(double)), g ((2*N+1) sizeof(double)), mu1, mu2 (MM*sizeof(double)).

Example: if N = 160 and threadsPerBlock = 32 then numBlocks = 6. This all works well if M (size of Chi, so basically the size of the innermost loop) is not very large. But if M = ~2000 or so, after this kernel is executed (and cudaDeviceSynchronize() is called), all arrays are INVALID (even those that are passed as const double*), i.e. I’m getting read access violation for cout, or any reasonable further manipulation of either Z, Chi, Z_new etc…my suspicion is that something gets overflown in the kernel, because the inner loop is very long and/or arrays mu1, mu2 are very long (millions of elements). How can I reasonably tackle this very problem better? How many threads per block should I request and how do I prevent read access violation even if N and M are large (thousands, tens of thousands)?

Maybe the sheer size of arrays mu1 and mu2 is causing some memory problem, because if M = 2 000 than mu1 and mu2 has size M^2 = 4 000 000. Is this the case? But mu1 and mu2 are really needed, because calculating function mu is rather costly so I wanted to have it precalculated.

Is it possible to distribute the inner loops over threads to alleviate the load? But that could be a problem, as inner loops are collecting a sum so I would need a reduction for znew variable (which stores the sum for the RHS of the equation for Z).

If it helps, I have NVidia GeForce 970 GTX card with 1664 CUDA cores and 4GB video memory.

I’m very sorry, but I’m kinda of a noob and I just started with CUDA. I worked with simple CPU parallelizations with OMP before and I find this GPU parallelization somewhat more intriguing.

Thanks!

P.S.: I’m trying to emulate “2D” arrays mu1 and mu2 of dimensions dim x dim in 1D array like this: for array(n, m) I access element array[dim*n + m]. I hope it’s clear.