Optimization Help constant memory

After alot of frustration I finally have my program running using CUDA, and it is significantly faster than my first attempt that only used the CPU. My next step is to find ways to optimize it. I have identified one are which I’m pretty sure would benefit from some optimization but i’m unsure how to go about it.

My simulation takes a bunch of atomic coordinates {x,y,z} and plots out a 2D potential by summing the contributions from atoms nearby. So every thread works out its location in 2D space then for every atom checks if its also nearby, and if it is adds a contribution for that atom.

At present I’ve not optimised anything so everything is in global memory. My guess is that the best way to optimise this would be to have all the atom coordinates in constant memory as every thread reads all of them. I just don’t know how to do this. Specifically I’m unsure how to proceed in the case that I have more atoms than I can hold in constant memory. I presume it is possible the break it down into chunks that fit and just do multiple kernel executions for each set.

The last problem is less important for know as currently I’m only working on small data sets. I would just like to know how i get an array of data on the host, into constant memory, and how I access it within a kernel.

The following is my first unoptimized version of the kernel. I passed in the atomic number, x , y , and z positions as seperate arrays in global memory.

__global__ void secondKernel(cuComplex * diff, int samplex, int sampley, int * atomz, float * atomx, float * atomy, float * atompz, int noatoms, float pixelscale, float topz, float dz, float * devfparams, float sigma) {

	float vzatom2(int Z, float radius, float * devfparams);

	float sumz = 0;

	float xdist = 0;

	float ydist = 0;


	int xIndex = threadIdx.x + blockIdx.x * blockDim.x;

	int yIndex = threadIdx.y + blockIdx.y * blockDim.y;


	if (xIndex >= samplex || yIndex >= sampley)

		return ;


	int Index = (yIndex * samplex) + xIndex;


		for ( int v = 0 ; v < noatoms ; v++ )



			if ( abs(atompz[v]-(topz-(dz/2))) < dz )


				xdist = xIndex * pixelscale - atomx[v];

				ydist = yIndex * pixelscale - atomy[v];

				if (xdist*xdist + ydist*ydist < 5 )


					sumz += vzatom2(atomz[v] , xdist*xdist + ydist*ydist , devfparams);




	diff[Index].x = cosf ( sigma * sumz  / 1000);

	diff[Index].y = sinf ( sigma * sumz  / 1000);


EDIT: I think I have sorted out this problem with regards to constant memory now. I changed all the coordinate arrays to const memory and it got approx. 7x faster. I think I have a plan for when I have large numbers of atoms too that should be too difficult to implement.

My new problem is occupancy < 0.5 because of 21 registers per thread. All the branching probably doesnt help anything either, and no coalesced memory reads :(

if you are doing something like

if (a < b) {
c += d

you could just as well do c += d * (a < b);

I’m not sure how viable this is as the value “d” from your example would be the result of a function call that determines the atomic potential at a given distance from a fairly complex function. Would this method not require this value to be calculated even when a > b which would increase the time it takes? If so I could just include contributions from all atoms regardless of distance, this would be more accurate but they don’t contribute much so i assumed it would be faster to only add the large contributions.

Also I’m not entirely sure what happens for the divergent branches but each block is roughly a square subregion of the image so in the majority of cases when the distance a > b for one thread in a block it will be for the others aswell.