Texture blending (or matrix blending) is one of the most frequently used operations in computer vision and rendering. We are currently working on a video-processing algorithm that relies heavily on texture blending. Here is the problem:

We have 2048 textures, each with a dimension of 64x64. Each element of a texture is a 2-byte short int. Let's denote the textures as T(i) (1 <= i <= 2048), so each T(i) is 8K. We want to blend all these textures together with certain weights w(i). The equation is

V = w(1)T(1) + w(2)T(2) + … + w(2048)T(2048)

The result V is also a 64x64 matrix, but each element is a 4-byte float, so V is 16K.

We are currently doing the following:

- Put all T(i) in texture memory (or global memory)
- Allocate 8K of shared memory to hold the top half of the result V
- Loop i from 1 to 2048
  - Load the top half of T(i) (4K) into shared memory, then blend it into the top half of V (8K)
- End loop
- Write the top half of V to global memory
- Do the same for the bottom half

It seems the performance is NOT reaching our target right now: the frame rate is 2.5 fps on our GTS512.

We really appreciate every comment we can get. If we can make this 5x faster, it would be a significant contribution to the computer vision and rendering community.

Thank you!