Newbie Question: Moving nested for loop CPU code to GPU.

Hello everyone,

I’m working on converting some protein dielectric calculation code from CPU to GPU. I have the CPU code up and running (it seems, at least), but processing a single protein structure takes about 2 minutes. We want to run this on full simulation files, roughly 1500 protein structures, so I’m hoping to get through the calculations faster by moving the CPU code to CUDA. The gist of the calculation is that, for each atom, I compute the product of its distances to every other atom in the system. The overall CPU code looks a bit like this:

//2D structure is [atom index, (x,y,z)]
double[,] atoms = scrapeAtomData(pathtofile);
int atomCount = atoms.GetLength(0);
double[] molDensity = new double[atomCount];
for (int i = 0; i < atomCount; i++)
{
    var density = 1.0;
    for (int j = 0; j < atomCount; j++)
    {
        if (i != j)  //Ensures product operator does not result in 0
        {
            var diffx = atoms[i, 0] - atoms[j, 0];
            var diffy = atoms[i, 1] - atoms[j, 1];
            var diffz = atoms[i, 2] - atoms[j, 2];
            density *= Math.Sqrt((diffx * diffx) + (diffy * diffy) + (diffz * diffz));
        }
    }
    molDensity[i] = density;
}
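In formula form (my notation), what the loop computes for each atom i is:

```latex
\mathrm{density}_i \;=\; \prod_{j \neq i} \lVert \mathbf{r}_i - \mathbf{r}_j \rVert
\;=\; \prod_{j \neq i} \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}
```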

For GPU calculations, I convert the data to a 1D array in the format of {x1, y1, z1, x2, y2, z2…} and try something to the effect of

diffz = atoms[i + (stride_x * 3)] - atoms[j + (stride_y * 3)]

or all the different permutations of stride and position you can think of. I’ve tried just “sticking the code” into a kernel, using blockDim * blockIdx + threadIdx in the x and y dimensions for i and j, but that produces “fun” results: the values returned change each time I run the test, leading me to believe I have an issue with the threads not all being populated before some calculations start. I’ve also tried getting rid of the for-loop structure entirely and running the calculation flat, with no success. Is there a “common structure” for running nested for loops in this manner? I know that nesting for loops is highly dependent on the desired outcome, but I’m hoping the above code gives enough insight into what I’m attempting to do. Half the problem is that I still don’t fully understand what is going on at the architecture/“loading” level, but I’m hoping figuring this out will fix that too. Thank you for your time!

I wrote a sample for you; moldensity_CPU and moldensity_GPU produce identical results.

It's not tuned/optimized.

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <iostream>
#include <vector>
#include <numeric>
#include <random>
#include <cmath>
#include <cassert>
using namespace std;

struct atom {
  double x;
  double y;
  double z;
};

const unsigned int N = 1500;

void moldensity_CPU(const vector<atom>& atoms, vector<double>& moldensity) {
  for ( int i = 0; i < atoms.size(); ++i ) {
    double density = 0.0;
    for ( int j = 0; j < atoms.size(); ++j ) {
      if ( i != j ) {
        double diffx = atoms[i].x - atoms[j].x;
        double diffy = atoms[i].y - atoms[j].y;
        double diffz = atoms[i].z - atoms[j].z;
        density += sqrt(diffx*diffx + diffy*diffy + diffz*diffz);
      }
    }
    moldensity[i] = density;
  }
}

// double-precision atomicAdd for pre-Pascal GPUs (built in from compute capability 6.0 on)
#if !defined(__CUDA_ARCH__) || __CUDA_ARCH__ < 600
__device__ double atomicAdd(double* address, double val) {
  unsigned long long int* address_as_ull = (unsigned long long int*)address;
  unsigned long long int old = *address_as_ull, assumed;
  do {
    assumed = old;
    old = atomicCAS(address_as_ull, assumed,
                    __double_as_longlong(val + __longlong_as_double(assumed)));
  } while (assumed != old);
  return __longlong_as_double(old);
}
#endif

__global__ void kernel_moldensity(const atom* atoms, double* moldensity, unsigned int size) {
  unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
  unsigned int j = blockDim.y * blockIdx.y + threadIdx.y;
  if ( i < size && j < size && i != j ) {
    double diffx = atoms[i].x - atoms[j].x;
    double diffy = atoms[i].y - atoms[j].y;
    double diffz = atoms[i].z - atoms[j].z;
    atomicAdd(moldensity+i, sqrt(diffx*diffx + diffy*diffy + diffz*diffz));
  }
}

void moldensity_GPU(const vector<atom>& atoms, vector<double>& moldensity) {
  assert( atoms.size() == moldensity.size() );
  unsigned int size = (unsigned int)atoms.size();
  atom* d_atoms;
  cudaMalloc(&d_atoms, size * sizeof(atom));
  double* d_moldensity;
  cudaMalloc(&d_moldensity, size * sizeof(double));
  cudaMemset(d_moldensity, 0, size * sizeof(double));
  cudaMemcpy(d_atoms, atoms.data(), size * sizeof(atom), cudaMemcpyHostToDevice);
  dim3 block(32,8);
  dim3 grid((size+31)/32, (size+7)/8);
  kernel_moldensity<<<grid, block>>>(d_atoms, d_moldensity, size);
  cudaMemcpy(moldensity.data(), d_moldensity, size * sizeof(double), cudaMemcpyDeviceToHost);
  cudaFree(d_atoms);
  cudaFree(d_moldensity);
}

int main() {
  vector<atom> atoms(N);
  vector<double> moldensity(N);

  // fill_random
  mt19937 gen;
  normal_distribution<double> dist;
  for ( int i = 0; i < N; ++i ) {
    atoms[i].x = dist(gen);
    atoms[i].y = dist(gen);
    atoms[i].z = dist(gen);
  }

  moldensity_CPU(atoms, moldensity);
  cout << "CPU : "
       << accumulate(begin(moldensity), end(moldensity), 0.0) << endl;

  moldensity_GPU(atoms, moldensity);
  cout << "GPU : "
       << accumulate(begin(moldensity), end(moldensity), 0.0) << endl;
}


In addition, moldensity_GPU takes 43 microseconds on a GeForce GT 720 (192 CUDA cores). That EXCLUDES the cudaMemcpy calls (those take a long time!).

It should not be too expensive to rewrite the input data in the following format:

atoms[x1, …, xn; y1, …, yn; z1, …, zn]

You can use a single 1D array, with 3 pointers into it, as well:

__global__ void the_useless_debugger(double* atoms, double* density, int arr_length)
{
    __shared__ double s_dbl[3];   // this block's atom: x, y, z
    __shared__ double* atom[3];   // pointers to the x, y and z sub-arrays
    __shared__ double* density_;  // this block's slice of the output

    if (threadIdx.x < 3)
    {
        int lint1 = (arr_length * threadIdx.x) + blockIdx.x;
        s_dbl[threadIdx.x] = atoms[lint1];
        atom[threadIdx.x] = &atoms[arr_length * threadIdx.x];
    }
    if (threadIdx.x == 0)
        density_ = &density[blockDim.x * blockIdx.x];  // slice length = block x dimension

    __syncthreads();  // shared setup must finish before the loop reads it

    int lint1;
    double dbl1, dbl2, dbl3, dbl4, dbl5;

    lint1 = threadIdx.x;
    dbl5 = 1;

    while (lint1 < arr_length)
    {
        if (lint1 != blockIdx.x)
        {
            dbl1 = atom[0][lint1] - s_dbl[0];
            dbl2 = atom[1][lint1] - s_dbl[1];
            dbl3 = atom[2][lint1] - s_dbl[2];
            dbl4 = sqrt((dbl1 * dbl1) + (dbl2 * dbl2) + (dbl3 * dbl3));
        }
        else
        {
            dbl4 = 1;  // self-distance: contribute a neutral factor to the product
        }
        dbl5 *= dbl4;

        lint1 += blockDim.x;
    }

    density_[threadIdx.x] = dbl5;
}

the kernel dimensions:

dim3 dGx(arr_length, 1, 1);
dim3 dBx(128, 1, 1); // can be a larger/smaller x dimension

in the above, arr_length = nr of atoms

in the end, issue another kernel (or kernels) to product-scan the resultant density array; keep in mind that the per-atom slice length would then be the block x dimension, not the nr of atoms.
there may be one or two errors in the code above, but the principle should be fine.
without arguing that the work distribution is optimal, you should already find it an improvement on the current version

Now that I’ve finally got some free time, I figured I’d post how I managed to fix things, plus a bit of an optimization so that each distance is only calculated once instead of twice (not sure if it really has any impact since this runs in parallel now, but whatever). I decided to split the procedure into two kernels and run them sequentially.

The first kernel calculates a distance matrix from all the points to each other. Since this would normally produce a symmetric matrix mirrored about the x = y line, I set the kernel up to calculate only half of the matrix (the optimization I mentioned). Once that is calculated and stored in memory, I run a second kernel that calculates the product operator for each atom index. Since the second half of the matrix is missing, that kernel moves along the row until it hits the x = y point, then “reflects” and travels down the column to properly accumulate the full product.

The successful kernel setup moves a 10000-atom hypothetical calculation from ~18.5 seconds (i7 @ 3.05 GHz, single thread) to ~675 ms (GeForce GTX 660 Ti). Below are the two kernels I made (this was done in C# using Alea, but the code should be easy to port to other languages.)


//Calculates the distance "half-matrix", which is stored at "doutputs"
private static void matrixKernel(deviceptr<double> doutputs, deviceptr<GPUAtom> dinputs, int n)
{
    var i = blockIdx.x * blockDim.x + threadIdx.x;
    var j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < j && j < n)  //Only fill the half of the matrix below the diagonal
    {
        var diffx = dinputs[i].x - dinputs[j].x;
        var diffy = dinputs[i].y - dinputs[j].y;
        var diffz = dinputs[i].z - dinputs[j].z;
        doutputs[(j * n) + i] = Math.Sqrt((diffx * diffx) + (diffy * diffy) + (diffz * diffz));
    }
}


//Takes the half-matrix from above and calculates the product operators
private static void productKernel(deviceptr<double> doutputs, deviceptr<double> dinputs, int n)
{
    var i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;
    doutputs[i] = 1.0;
    for (var x = 0; x < i; x++)      //Walk along row i up to the diagonal...
        doutputs[i] *= dinputs[(i * n) + x];
    for (var y = i + 1; y < n; y++)  //...then "reflect" and walk down column i
        doutputs[i] *= dinputs[(y * n) + i];
}

And the actual initialization code (probably not fully correct, as I’m still quite new to CUDA programming):

private void buttonTest_Click(object sender, EventArgs e)
{
    Worker worker = Worker.Default;  //Setup initial worker (from Alea)
    GPUAtom[] testatoms = new GPUAtom[(int)numericTest.Value]; //GPUAtom is a simple struct with x, y, z, and vdw fields
    double[] cpuout = new double[testatoms.Length];
    for (int i = 0; i < testatoms.Length; i++) //Fill atoms with test junk data
        testatoms[i] = new GPUAtom(i * 0.1, i * 0.1, i * 0.1, 1.0);
    using (var dinputs = worker.Malloc(testatoms))
    using (var ddensitymatrix = worker.Malloc<double>(testatoms.Length * testatoms.Length))
    using (var doutputs = worker.Malloc<double>(testatoms.Length))
    {
        //Attempt to make the program platform independent, probably bad.
        int blockDim = (int)Math.Sqrt(worker.Device.Attributes.MAX_THREADS_PER_BLOCK);

        var blockSize = new dim3(blockDim, blockDim);
        var gridSize = new dim3((blockDim - 1 + testatoms.Length) / blockDim, (blockDim - 1 + testatoms.Length) / blockDim);
        var lp = new LaunchParam(gridSize, blockSize);
        worker.Launch(matrixKernel, lp, ddensitymatrix.Ptr, dinputs.Ptr, testatoms.Length);
        var densitymatrix = ddensitymatrix.Gather();  //Gathered for debugging, not really necessary.

        var blockSize2 = worker.Device.Attributes.MAX_THREADS_PER_BLOCK;
        var gridSize2 = Math.Min(16 * worker.Device.Attributes.MULTIPROCESSOR_COUNT, Common.divup(testatoms.Length, blockSize2));  //Straight from the Alea getting-started guide.
        var lp2 = new LaunchParam(gridSize2, blockSize2);
        worker.Launch(productKernel, lp2, doutputs.Ptr, ddensitymatrix.Ptr, testatoms.Length);
        var gpuout = doutputs.Gather();
    }
}

Hopefully this helps some people out, if only as a potential “This works, but is bad” example.