non coelasced global memory access

I am learning how to use CUDA to speed up 3d plugins for Maya. A plugin that I am investigating smooths a 3d mesh: for each vertex on the mesh, its value is equal to the average value of it’s neighbors. I read the documentation and it warned me about the slowdowns with non coalesced memory access, but I am getting unexpected results (garbage meshes) back in Maya when I try to access the data in a non coalesced way.

Accessing the neighboring vertices may lead to a non-coalesced access. For example: the neighbor of vertex 0 may be vertex 1, vertex 11 and vertex 12. I use an in array to store these indices, and use a variable “index” to access the input points to the relevant neighbors.

This is how the relevant part of the kernel looks like :

//origPoints: input points as a float Array (size is 3 times that of number of points, one float each for x,y,z dimension)

//outPoints : output the results as a Vector (custom defined structure with 4 floats for x,y,z,w components)

//networkVerts : Neighbor indices for each point combined as a full array

//networkVertsIndexer: Helper structure to navigate through the networkVerts array

//data: pointer to a single object that has maya specific information

__global__ void deformOnGPU(float *origPoints, Vector *outPoints,int *networkVerts, arrayIndexer *networkVertsIndexer, smoothDeformerData *data)

{

	//index into the arrays

	int i = blockIdx.x * blockDim.x + threadIdx.x;

	Vector inPoint;

//Vector to store the average value

	Vector avg;

	avg.x = 0;

	avg.y = 0;

	avg.z = 0;

	avg.z = 0;

	// Lets assume (the original code has different structure for this) that each vertex has 3 neighbors

        for(int j = 0 ;j<data->3;j++)

	{

	        Vector neighbor;

				          		

                int index = networkVerts[networkVertsIndexer[i].startIndex+(j)];

		

                // If I use index to access origPoints (which gives me the neighbors, which is what is needed) it fails: non-coalesced access.

                // If I use i to access origPoints (which gives the same point back), it works :-> Coalesced access

                // I have verified that I am getting the proper values for index. 

neighbor.x = origPoints[3*index];   

		neighbor.y = origPoints[3*index+1]; 

		neighbor.z = origPoints[3*index+2];

		neighbor.w = 1;

		avg = avg.add(neighbor);

	}

}

My kernel parameters are:

for n points:

deformOnGPU <<< n_blocks, block_size >>> 

cudaDeviceProp prop;

cudaError_t error = cudaGetDeviceProperties(&prop,0);

n_blocks = data->numPoints/block_size + (data->numPoints%block_size == 0 ? 0:1);

block_size = nearest_pow(prop.regsPerBlock/32); // nearest pow returns nearest power of 2.

My questions are:

  1. Why should this be failing and not experiencing a slowdown?

  2. What should I do to make such an access coalesced, considering the number of points may vary from 4 to 6 million?