Hi everybody,
I’m developing a CUDA program that:

computes the height values of a 3D model along a given direction on a grid (the “dexel” structure of the model)

computes the intersections of this structure with another 3D model
The goal is to represent the 1st model with points, and to be able to update it (every frame) from its intersections with another, moving model.
This is my first CUDA program, so I have probably made a lot of mistakes and missed some optimizations. Could you please help me by giving me some advice or ideas to improve it? I think it is similar to ray casting combined with depth peeling.
Here’s how I do it:
 Computing the height values
Each block corresponds to a tile of the grid and each thread casts one ray (a kind of ray casting). This gives me the z values for each point of the grid (which then have to be sorted).
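To make the per-ray work concrete, here is a minimal sketch of a ray/triangle test in plain C++ (the same function body could be marked `__device__` in CUDA). It is not the code from the program: it assumes the common Möller-Trumbore formulation, and the names `Vec3`, `Ray` and `rayTriangleT` are made up for this illustration. The triangle is passed as one vertex `v0` and two edges `e1 = v1 - v0`, `e2 = v2 - v0`.

```cpp
#include <cmath>

struct Vec3 { float x, y, z; };
struct Ray  { Vec3 origin, dir; };   // hypothetical ray type

inline Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
inline Vec3 cross(Vec3 a, Vec3 b) {
    return {a.y * b.z - a.z * b.y, a.z * b.x - a.x * b.z, a.x * b.y - a.y * b.x};
}

// Moller-Trumbore test. Returns the ray parameter t of the hit point
// (origin + t * dir), or -1 when the ray misses the triangle or is
// parallel to it. t may be negative for hits behind the origin, so the
// caller still has to compare it against a small positive threshold.
inline float rayTriangleT(const Ray& r, Vec3 v0, Vec3 e1, Vec3 e2) {
    const float eps = 1e-7f;
    Vec3 p = cross(r.dir, e2);
    float det = dot(e1, p);
    if (std::fabs(det) < eps) return -1.0f;      // ray parallel to triangle
    float inv = 1.0f / det;
    Vec3 s = sub(r.origin, v0);
    float u = dot(s, p) * inv;                   // first barycentric coordinate
    if (u < 0.0f || u > 1.0f) return -1.0f;
    Vec3 q = cross(s, e1);
    float v = dot(r.dir, q) * inv;               // second barycentric coordinate
    if (v < 0.0f || u + v > 1.0f) return -1.0f;
    return dot(e2, q) * inv;                     // distance along the ray
}
```

Each thread would call this for every triangle of its tile and append the resulting z value when the returned parameter is positive.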
 Updating from the intersection
I do the same thing with the other model to get zmin and zmax (the minimum and maximum intersections of the 2nd model along each ray), and then I update the dexel structure.
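As an illustration of what such an update can look like, here is a small C++ sketch for a single ray. It rests on assumptions that are not stated above: that the sorted z values of a dexel come in (entry, exit) pairs delimiting material, and that the update removes the range [zmin, zmax] from that material. The function name `subtractInterval` is hypothetical.

```cpp
#include <cstddef>
#include <vector>

// ASSUMPTION: z holds sorted (entry, exit) pairs of material along one ray,
// and the update carves the range [zmin, zmax] out of that material.
std::vector<float> subtractInterval(const std::vector<float>& z,
                                    float zmin, float zmax) {
    std::vector<float> out;
    for (std::size_t i = 0; i + 1 < z.size(); i += 2) {
        float a = z[i];       // entry of this material segment
        float b = z[i + 1];   // exit of this material segment
        if (zmax <= a || zmin >= b) {
            // No overlap: keep the segment unchanged.
            out.push_back(a);
            out.push_back(b);
            continue;
        }
        if (zmin > a) {       // part of the segment survives below zmin
            out.push_back(a);
            out.push_back(zmin);
        }
        if (zmax < b) {       // part of the segment survives above zmax
            out.push_back(zmax);
            out.push_back(b);
        }
    }
    return out;
}
```

The three branches are the comparisons between the heights and zmin/zmax. Different rays take different branches, which is where the divergence in this step comes from.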
Optimizations:

Before the first computation, the 1st model’s triangles are classified into the blocks (tiles) they belong to. They are then loaded into shared memory so that each thread of the block can access them quickly.
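The classification step can be sketched on the host as follows. This is only an illustration under assumptions of my own: the rays are taken to run along z, so the grid lies in the xy plane, and a triangle is assigned to every tile that its 2D bounding box overlaps (conservative, so a tile may receive a triangle that does not actually cross it). `Triangle` and `binTrianglesToTiles` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

struct Vec3 { float x, y, z; };
struct Triangle { Vec3 a, b, c; };

// Returns, for each tile (row-major, tilesX * tilesY entries), the indices
// of the triangles whose xy bounding box overlaps that tile.
std::vector<std::vector<int>> binTrianglesToTiles(
        const std::vector<Triangle>& tris,
        float originX, float originY, float tileSize,
        int tilesX, int tilesY) {
    std::vector<std::vector<int>> bins(tilesX * tilesY);
    for (int i = 0; i < static_cast<int>(tris.size()); ++i) {
        const Triangle& t = tris[i];
        float minX = std::min({t.a.x, t.b.x, t.c.x});
        float maxX = std::max({t.a.x, t.b.x, t.c.x});
        float minY = std::min({t.a.y, t.b.y, t.c.y});
        float maxY = std::max({t.a.y, t.b.y, t.c.y});
        // Range of tiles covered by the bounding box, clamped to the grid.
        int tx0 = std::max(0, static_cast<int>(std::floor((minX - originX) / tileSize)));
        int tx1 = std::min(tilesX - 1, static_cast<int>(std::floor((maxX - originX) / tileSize)));
        int ty0 = std::max(0, static_cast<int>(std::floor((minY - originY) / tileSize)));
        int ty1 = std::min(tilesY - 1, static_cast<int>(std::floor((maxY - originY) / tileSize)));
        // Triangles entirely outside the grid give an empty range here.
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * tilesX + tx].push_back(i);
    }
    return bins;
}
```

Each block would then copy the list of its own tile into shared memory before its threads start casting rays.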

For the second computation (the update from the intersection with the 2nd model), I can’t classify the 2nd model’s triangles (I did not find a way to do it): the model is moving, so the classification would have to be redone every frame.
I do use the 2nd model’s bounding sphere as an acceleration, which lets me skip some threads (those whose ray does not intersect the sphere).
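For reference, the bounding-sphere rejection amounts to a distance test between the sphere center and the line carrying the ray. The sketch below is plain C++ with hypothetical names (`rayMayHitSphere`), not the program's own code, and it treats the ray as an infinite line, which is an assumption that fits rays crossing the whole model.

```cpp
struct Vec3 { float x, y, z; };

inline Vec3 sub(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// True when the infinite line through `origin` with direction `dir`
// (non-zero, not necessarily normalized) passes within `radius` of `center`.
inline bool rayMayHitSphere(Vec3 origin, Vec3 dir, Vec3 center, float radius) {
    Vec3 oc = sub(center, origin);
    float proj = dot(oc, dir);                        // |oc| * |dir| * cos(angle)
    float dist2 = dot(oc, oc) - proj * proj / dot(dir, dir);
    return dist2 <= radius * radius;                  // squared distance to the line
}
```

A thread whose ray fails this test can return before looping over the 2nd model's triangles.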
Here are some problems I have seen with cudaprof (there must be a lot of other things to improve):

Global memory loads are not coalesced: the mesh triangles are stored in global memory and every thread loops over the same triangles (even in the 1st computation, where they are copied to shared memory, the loads are poorly coalesced).

The intersection test is very divergent:
    float t = RayTriangleIntersection(r, v0, e1, e2);
    if (t > 0.001) {
        // add a height value
    }

The update from the intersection is very divergent:
 storing zmin and zmax from the 2nd model
 updating the height values from zmin and zmax (this depends on comparisons between the heights and zmin/zmax)

Global memory stores are not coalesced:
My height values are stored in a buffer of structures (one structure per thread), so I expected the stores to be coalesced. Each structure contains a buffer of z values, a size, and some other things… but cudaprof says they are not coalesced.
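One possible explanation for the store problem is the layout itself. With one structure per thread, the addresses written by consecutive threads are a whole structure apart instead of one float apart. The C++ sketch below contrasts that layout with a structure-of-arrays layout where slot k of all rays is stored contiguously. Everything in it is an assumption made for illustration: the capacity `kMaxZ`, the field names, and the helpers `soaIndex` and `addZ`.

```cpp
#include <cstddef>
#include <vector>

constexpr int kMaxZ = 8;   // hypothetical maximum number of z values per ray

// Array-of-structures layout: thread i writes into element i, so two
// neighbouring threads write addresses sizeof(DexelAoS) bytes apart.
struct DexelAoS {
    float z[kMaxZ];
    int count;
};

// Structure-of-arrays layout: slot k of every ray is stored contiguously,
// so neighbouring threads writing the same slot touch neighbouring floats.
struct DexelSoA {
    int numRays;
    std::vector<float> z;      // kMaxZ * numRays values
    std::vector<int> count;    // one counter per ray
    explicit DexelSoA(int n)
        : numRays(n), z(static_cast<std::size_t>(n) * kMaxZ, 0.0f), count(n, 0) {}
};

inline std::size_t soaIndex(int ray, int slot, int numRays) {
    return static_cast<std::size_t>(slot) * numRays + ray;
}

// Appends one z value for `ray`; returns false when the ray is already full.
inline bool addZ(DexelSoA& d, int ray, float zValue) {
    int slot = d.count[ray];
    if (slot >= kMaxZ) return false;
    d.z[soaIndex(ray, slot, d.numRays)] = zValue;
    d.count[ray] = slot + 1;
    return true;
}
```

On the GPU, `ray` would be the global thread index and the two vectors would be plain device buffers. The trade-off is that all z values of one ray are no longer contiguous, so sorting them per ray uses strided accesses.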
I hope you can help me and give me some ideas to improve the program.
If you have any questions, I can give more details.
Thanks!
Thibo