After a lot of frustration I finally have my program running using CUDA, and it is significantly faster than my first CPU-only attempt. My next step is to find ways to optimize it. I have identified one area that I'm pretty sure would benefit from some optimization, but I'm unsure how to go about it.
My simulation takes a set of atomic coordinates {x,y,z} and plots a 2D potential by summing the contributions from nearby atoms. Each thread works out its location in 2D space, then for every atom checks whether that atom is nearby, and if it is, adds that atom's contribution.
At present I've not optimised anything, so everything is in global memory. My guess is that the best way to optimise this would be to put all the atom coordinates in constant memory, since every thread reads all of them. I just don't know how to do this. Specifically, I'm unsure how to proceed when I have more atoms than constant memory can hold. I presume it is possible to break them into chunks that fit and do multiple kernel launches, one per set.
That last problem is less important for now, as I'm currently only working on small data sets. I would just like to know how to get an array of data from the host into constant memory, and how to access it within a kernel.
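My rough understanding is that the pattern looks something like the sketch below (MAXATOMS and all of the names here are just placeholders, not my real code), but I'd appreciate confirmation that this is the right idea:

#include <cuda_runtime.h>

#define MAXATOMS 4096   // must fit within the 64 KB constant memory budget

__constant__ float constAtomX[MAXATOMS];
__constant__ float constAtomY[MAXATOMS];

// Host side: copy the host arrays into the constant symbols before launching.
void uploadAtoms(const float *hostX, const float *hostY, int noatoms)
{
    cudaMemcpyToSymbol(constAtomX, hostX, noatoms * sizeof(float));
    cudaMemcpyToSymbol(constAtomY, hostY, noatoms * sizeof(float));
}

// Device side: the kernel indexes the symbols directly, no pointer argument needed.
__global__ void exampleKernel(float *out, int noatoms)
{
    float sum = 0.0f;
    for (int v = 0; v < noatoms; v++)
        sum += constAtomX[v] + constAtomY[v];   // every thread reads the same
                                                // address, so it broadcasts
    out[threadIdx.x + blockIdx.x * blockDim.x] = sum;
}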
The following is my first, unoptimized version of the kernel. I passed in the atomic numbers and the x, y, and z positions as separate arrays in global memory.
#include <cuComplex.h>

// Forward declaration of the device helper that evaluates one atom's
// projected potential contribution.
__device__ float vzatom2(int Z, float radius, float * devfparams);

__global__ void secondKernel(cuComplex * diff, int samplex, int sampley,
                             int * atomz, float * atomx, float * atomy,
                             float * atompz, int noatoms, float pixelscale,
                             float topz, float dz, float * devfparams,
                             float sigma)
{
    float sumz  = 0;
    float xdist = 0;
    float ydist = 0;

    // This thread's pixel in the 2D output.
    int xIndex = threadIdx.x + blockIdx.x * blockDim.x;
    int yIndex = threadIdx.y + blockIdx.y * blockDim.y;
    if (xIndex >= samplex || yIndex >= sampley)
        return;
    int Index = (yIndex * samplex) + xIndex;

    // Every thread loops over every atom; all reads come from global memory.
    for (int v = 0; v < noatoms; v++)
    {
        // Only atoms within the current slice in z contribute.
        if (fabsf(atompz[v] - (topz - (dz / 2))) < dz)
        {
            xdist = xIndex * pixelscale - atomx[v];
            ydist = yIndex * pixelscale - atomy[v];

            // Cut-off on the squared in-plane distance.
            if (xdist * xdist + ydist * ydist < 5)
            {
                sumz += vzatom2(atomz[v], xdist * xdist + ydist * ydist, devfparams);
            }
        }
    }

    diff[Index].x = cosf(sigma * sumz / 1000);
    diff[Index].y = sinf(sigma * sumz / 1000);
}
EDIT: I think I have now sorted out the constant memory problem. I changed all the coordinate arrays to constant memory and it got approximately 7x faster. I also think I have a plan for handling large numbers of atoms that shouldn't be too difficult to implement.
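For the large-atom case, the plan is roughly the sketch below: copy one chunk of coordinates into the constant arrays at a time, launch an accumulation kernel per chunk that adds into a running sumz buffer in global memory, then apply the cos/sin in one final pass. CHUNK, the kernel names and their parameter lists are only placeholders for illustration, not code I've written yet:

#include <cuda_runtime.h>
#include <cuComplex.h>

#define CHUNK 4096

__constant__ float constAtomX[CHUNK];
__constant__ float constAtomY[CHUNK];
__constant__ float constAtomZ[CHUNK];

// Assumed kernels: accumulateKernel adds one chunk's contribution into
// devSumz; finalizeKernel turns the accumulated sum into the complex output.
__global__ void accumulateKernel(float *devSumz, int samplex, int sampley,
                                 int natomsInChunk, float pixelscale,
                                 float topz, float dz, float *devfparams);
__global__ void finalizeKernel(cuComplex *diff, const float *devSumz,
                               int samplex, int sampley, float sigma);

void runChunked(cuComplex *diff, float *devSumz,
                const float *hostX, const float *hostY, const float *hostZ,
                int noatoms, int samplex, int sampley, float pixelscale,
                float topz, float dz, float *devfparams, float sigma,
                dim3 grid, dim3 block)
{
    // Zero the running sum once, before the first chunk.
    cudaMemset(devSumz, 0, samplex * sampley * sizeof(float));

    for (int start = 0; start < noatoms; start += CHUNK)
    {
        int n = (noatoms - start < CHUNK) ? (noatoms - start) : CHUNK;

        // Copy just this chunk of coordinates into the constant arrays.
        cudaMemcpyToSymbol(constAtomX, hostX + start, n * sizeof(float));
        cudaMemcpyToSymbol(constAtomY, hostY + start, n * sizeof(float));
        cudaMemcpyToSymbol(constAtomZ, hostZ + start, n * sizeof(float));

        // Each launch adds this chunk's contribution to the running sum.
        accumulateKernel<<<grid, block>>>(devSumz, samplex, sampley, n,
                                          pixelscale, topz, dz, devfparams);
    }

    // One final pass applies the cos/sin to the accumulated sum.
    finalizeKernel<<<grid, block>>>(diff, devSumz, samplex, sampley, sigma);
}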
My new problem is that occupancy is below 0.5 because the kernel uses 21 registers per thread. All the branching probably doesn't help either, and there are no coalesced memory reads :(
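I haven't tried anything for the register problem yet, but as far as I know register usage can be capped either per kernel with __launch_bounds__ or for the whole compilation unit with the nvcc --maxrregcount flag; the numbers below are purely illustrative, not values I've tested:

// Cap the kernel at 256 threads per block and ask the compiler to allow at
// least 4 resident blocks per multiprocessor, which limits registers per thread.
__global__ void __launch_bounds__(256, 4)
secondKernel(cuComplex *diff, int samplex, int sampley, int *atomz,
             float *atomx, float *atomy, float *atompz, int noatoms,
             float pixelscale, float topz, float dz, float *devfparams,
             float sigma);

// Or cap registers for the whole file on the command line:
//   nvcc --maxrregcount=16 mykernel.cu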