Deform Vertex Buffer Object using built-in texture filter hardware

Hi Everybody,

My thesis is roughly about “Accelerating real-time object deformation using CUDA”.
Imagine a regular 3D grid (not a CUDA grid), hopefully someday with springs along the grid connections. I chose a parametrization in which each thread is responsible for one cell of the grid. (For a mass-spring system it may be better to make each thread responsible for one vertex of the grid; I don’t know yet.)
The main point is that the threads of a block can communicate with each other via the fast on-chip shared memory.

So far so good. Right now I’m trying to deform the vertices of a VBO relative to the grid vertices (let’s call them control vertices). I chose trilinear interpolation to calculate the new vertex positions. Unfortunately, CUDA 1.0 does not support trilinear interpolation in hardware. My idea is to split the trilinear interpolation into two bilinear interpolations done by the texture filtering hardware and then interpolate linearly between the two results by hand.
I need to interpolate position values, and the hardware does not support 3-component textures. Is it the right approach to put the control-vertex positions into a 2D texture of 4-component floats and leave one component unused? Or is there a simpler way?

The kernel should get:
-A pointer to the undeformed position array of the VBO, as a reference.
-A pointer to the VBO float position array.
-The number of vertices in the VBO.
-Some kind of index array that tells CUDA which VBO vertex is attached to which grid cell. (This makes the deformation of one VBO vertex depend on only eight control vertices.)
-The 2D texture with the control-vertex positions.

The kernel calculates the new positions and writes to the VBO.

Has anyone had experience with a similar project?
What is a good way to set up the block dimensions?
Would you prefer 1D or 3D? (Maybe 2D is also possible.)
How would you set up the CUDA grid dimensions?
E.g. 1000 VBO vertices could be split into 100 blocks per CUDA grid, so that every block processes the positions of 10 VBO vertices.
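The arithmetic behind such a 1D launch can be sketched in a few lines of plain C++. The names `LaunchConfig`, `launchConfig1D`, `globalIndex` and `isActive` are hypothetical helpers of mine, not CUDA API; they only mirror what the launch configuration and the index computation inside a kernel would do.

```cpp
// Hypothetical helper types and functions, for illustration only.
struct LaunchConfig {
    int blocks;            // number of blocks in the 1D CUDA grid
    int threadsPerBlock;   // number of threads in each 1D block
};

// Round up so that blocks * threadsPerBlock covers every vertex.
LaunchConfig launchConfig1D(int numVertices, int threadsPerBlock) {
    int blocks = (numVertices + threadsPerBlock - 1) / threadsPerBlock;
    return { blocks, threadsPerBlock };
}

// Mirrors blockIdx.x * blockDim.x + threadIdx.x inside a kernel.
int globalIndex(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

// Threads past the end of the vertex array must not read or write anything.
bool isActive(int blockIdx, int blockDim, int threadIdx, int numVertices) {
    return globalIndex(blockIdx, blockDim, threadIdx) < numVertices;
}
```

With 1000 vertices and 10 threads per block this gives the 100 blocks from my example. Since a warp has 32 threads, 10 threads per block would leave most of each warp idle, so a multiple of 32 is probably the better choice: 256 threads per block gives 4 blocks, with the last 24 threads of the final block masked out by the bounds check.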

Any ideas for optimization? At the moment CUDA processes the deformation of all vertices, even those whose position has not changed.

The mass-spring part needs communication through shared memory; the deformation part does not.
So would some of you split this into two kernels, so that the deformation part can use a different kernel parametrization?

I would appreciate more opinions, further ideas and helpful suggestions.

Thank you and best regards