Hello everyone,

I’m working on porting some protein dielectric calculation code from CPU to GPU. The CPU code is up and running (or at least seems to be), but processing a single protein structure takes about 2 minutes. We want to run this on full simulation files, so roughly 1500 protein structures, and I’m hoping to speed things up by moving the CPU code to CUDA. The gist of the calculation is that, for each atom, I compute the product of its distances to every other atom in the system. The overall CPU code looks a bit like this:

```
// 2D layout is [atom index, (x, y, z)]
double[,] atoms = scrapeAtomData(pathtofile);
int atomCount = atoms.GetLength(0); // number of atoms (rows), not the total element count
double[] molDensity = new double[atomCount];
for (int i = 0; i < atomCount; i++)
{
    var density = 1.0;
    for (int j = 0; j < atomCount; j++)
    {
        if (i != j) // Skip the zero self-distance so the product does not collapse to 0
        {
            var diffx = atoms[i, 0] - atoms[j, 0];
            var diffy = atoms[i, 1] - atoms[j, 1];
            var diffz = atoms[i, 2] - atoms[j, 2];
            density *= Math.Sqrt((diffx * diffx) + (diffy * diffy) + (diffz * diffz));
        }
    }
    molDensity[i] = density;
}
```
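
Written out as a formula, the loop above computes the following for each atom. The symbols are my own notation rather than anything from a reference: $N$ is the number of atoms, $\mathbf{r}_i$ is the position of atom $i$, and $\rho_i$ is the value stored in `molDensity[i]`. The second form is just the logarithm of the first. It may be worth keeping in mind because a running product over a large number of distances can overflow or underflow a double, while a sum of logs stays in range.

```latex
\rho_i = \prod_{\substack{j = 1 \\ j \neq i}}^{N} \lVert \mathbf{r}_i - \mathbf{r}_j \rVert,
\qquad
\log \rho_i = \sum_{\substack{j = 1 \\ j \neq i}}^{N} \log \lVert \mathbf{r}_i - \mathbf{r}_j \rVert
```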

For GPU calculations, I convert the data to a 1D array in the format of {x1, y1, z1, x2, y2, z2…} and try something to the effect of

```
diffz = atoms[i + (stride_x * 3)] - atoms[j + (stride_y * 3)]
```

or all the different permutations of stride and position you can think of. I’ve tried just “sticking the code” into a kernel, using blockDim * blockId + threadId in x and y for i and j, but that produces “fun” results: the returned values change each time I run the test, which leads me to believe that some calculations start before all the threads have been populated. I’ve also tried getting rid of the for loops entirely and running the calculation that way, with no success.

Is there a “common structure” for running nested for loops like this on the GPU? I know that how you nest for loops depends heavily on the desired outcome, but I’m hoping the code above gives enough insight into what I’m attempting to do. Half the problem is that I still don’t fully understand what is going on at the architecture/“loading” level, but I’m hoping that figuring this out will fix that. Thank you for your time!
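
For reference, here is a minimal sketch of one common way to map this kind of all-pairs loop onto CUDA. It is not a tuned implementation. Only the outer loop over i is spread across threads, and each thread runs the whole inner loop over j itself. With a 2D grid, many (i, j) threads would all be multiplying into the same output element, and that unsynchronized update is one likely cause of results that change from run to run. The names `pairwiseDistanceProduct`, `computeMolDensity`, and `check`, and the block size of 256, are placeholders chosen for illustration. The flat {x1, y1, z1, x2, …} layout is the one described above, so atom i starts at index 3 * i.

```cuda
#include <cuda_runtime.h>
#include <cmath>
#include <stdexcept>
#include <vector>

// One thread per atom i. Each thread keeps its own running product in a
// local variable and writes exactly one output element, so no two threads
// ever write to the same location.
__global__ void pairwiseDistanceProduct(const double* atoms,   // flat {x1, y1, z1, x2, ...}
                                        double* molDensity,    // one value per atom
                                        int atomCount)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= atomCount) return;  // the last block may contain spare threads

    double xi = atoms[3 * i + 0];
    double yi = atoms[3 * i + 1];
    double zi = atoms[3 * i + 2];

    double density = 1.0;
    for (int j = 0; j < atomCount; ++j) {
        if (j == i) continue;  // skip the zero self-distance
        double dx = xi - atoms[3 * j + 0];
        double dy = yi - atoms[3 * j + 1];
        double dz = zi - atoms[3 * j + 2];
        density *= sqrt(dx * dx + dy * dy + dz * dz);
    }
    molDensity[i] = density;
}

// Hypothetical helper: turn a CUDA error code into a C++ exception.
static void check(cudaError_t err)
{
    if (err != cudaSuccess) throw std::runtime_error(cudaGetErrorString(err));
}

// Hypothetical host wrapper: copy the flat coordinates to the GPU, launch
// one thread per atom, and copy the per-atom products back.
std::vector<double> computeMolDensity(const std::vector<double>& flatAtoms)
{
    int atomCount = static_cast<int>(flatAtoms.size() / 3);
    std::vector<double> result(atomCount);
    if (atomCount == 0) return result;

    double* dAtoms = nullptr;
    double* dDensity = nullptr;
    check(cudaMalloc(&dAtoms, flatAtoms.size() * sizeof(double)));
    check(cudaMalloc(&dDensity, atomCount * sizeof(double)));
    check(cudaMemcpy(dAtoms, flatAtoms.data(), flatAtoms.size() * sizeof(double),
                     cudaMemcpyHostToDevice));

    int threadsPerBlock = 256;  // assumed block size, not tuned
    int blocks = (atomCount + threadsPerBlock - 1) / threadsPerBlock;  // round up
    pairwiseDistanceProduct<<<blocks, threadsPerBlock>>>(dAtoms, dDensity, atomCount);
    check(cudaGetLastError());

    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(result.data(), dDensity, atomCount * sizeof(double),
                     cudaMemcpyDeviceToHost));
    cudaFree(dAtoms);
    cudaFree(dDensity);
    return result;
}
```

With this layout the kernel needs no synchronization, because every thread owns its output slot. Splitting the j loop across threads as well is possible, but the partial products for each i would then have to be combined with a reduction rather than written directly to `molDensity[i]`.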