I am learning how to use CUDA to speed up 3d plugins for Maya. A plugin that I am investigating smooths a 3d mesh: for each vertex on the mesh, its value is equal to the average value of it’s neighbors. I read the documentation and it warned me about the slowdowns with non coalesced memory access, but I am getting unexpected results (garbage meshes) back in Maya when I try to access the data in a non coalesced way.
Accessing the neighboring vertices may lead to a non-coalesced access. For example: the neighbor of vertex 0 may be vertex 1, vertex 11 and vertex 12. I use an in array to store these indices, and use a variable “index” to access the input points to the relevant neighbors.
This is how the relevant part of the kernel looks like :
//origPoints: input points as a float Array (size is 3 times that of number of points, one float each for x,y,z dimension)
//outPoints : output the results as a Vector (custom defined structure with 4 floats for x,y,z,w components)
//networkVerts : Neighbor indices for each point combined as a full array
//networkVertsIndexer: Helper structure to navigate through the networkVerts array
//data: pointer to a single object that has maya specific information
__global__ void deformOnGPU(float *origPoints, Vector *outPoints,int *networkVerts, arrayIndexer *networkVertsIndexer, smoothDeformerData *data)
{
//index into the arrays
int i = blockIdx.x * blockDim.x + threadIdx.x;
Vector inPoint;
//Vector to store the average value
Vector avg;
avg.x = 0;
avg.y = 0;
avg.z = 0;
avg.z = 0;
// Lets assume (the original code has different structure for this) that each vertex has 3 neighbors
for(int j = 0 ;j<data->3;j++)
{
Vector neighbor;
int index = networkVerts[networkVertsIndexer[i].startIndex+(j)];
// If I use index to access origPoints (which gives me the neighbors, which is what is needed) it fails: non-coalesced access.
// If I use i to access origPoints (which gives the same point back), it works :-> Coalesced access
// I have verified that I am getting the proper values for index.
neighbor.x = origPoints[3*index];
neighbor.y = origPoints[3*index+1];
neighbor.z = origPoints[3*index+2];
neighbor.w = 1;
avg = avg.add(neighbor);
}
}
My kernel parameters are:
for n points:
deformOnGPU <<< n_blocks, block_size >>>
cudaDeviceProp prop;
cudaError_t error = cudaGetDeviceProperties(&prop,0);
n_blocks = data->numPoints/block_size + (data->numPoints%block_size == 0 ? 0:1);
block_size = nearest_pow(prop.regsPerBlock/32); // nearest pow returns nearest power of 2.
My questions are:
-
Why should this be failing and not experiencing a slowdown?
-
What should I do to make such an access coalesced, considering the number of points may vary from 4 to 6 million?