Help needed with texture blending.

Texture blending (or matrix blending) is one of the most frequently used operations in computer vision and rendering. We are currently working on a video processing algorithm that needs intensive texture blending. Here is the problem:

We have 2048 textures, each with dimensions 64x64. Each element of a texture is a 2-byte short int. Let's denote the textures as T(i) (0 < i <= 2048); each T(i) is 8K. We want to blend all these textures together with certain weights w(i), so the equation is

V = w(1)T(1) + w(2)T(2) + … + w(2048)T(2048)

The result V is also a 64x64 matrix, where each element is a 4-byte float, so V is 16K.

We are currently doing the following (a rough sketch in code follows the list):

  1. Put all T(i) in texture memory (or global memory).
  2. Allocate 8K of shared memory to store the top half of the result V.
  3. Loop i from 1 to 2048:
  4. Load the top half of T(i) (4K) into shared memory, then blend it into the top half of V (8K).
  5. End loop.
  6. Write the top half of V to global memory.
  7. Do the same for the bottom half.
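
A minimal sketch of what this scheme might look like in CUDA (blend_half, d_T, d_V, d_w are illustrative names, and the layout of d_T as 2048 consecutive 64x64 short arrays is an assumption; launch with a single block per half, e.g. blend_half<<<1, 256>>>(d_V, d_T, d_w)):

__global__ void blend_half(float *d_V, const short *d_T, const float *d_w)
{
    __shared__ float v[32 * 64];   // half of V: 2048 floats = 8K
    __shared__ short t[32 * 64];   // half of one T(i): 2048 shorts = 4K

    // clear the accumulator
    for (int e = threadIdx.x; e < 32 * 64; e += blockDim.x)
        v[e] = 0.0f;
    __syncthreads();

    for (int i = 0; i < 2048; ++i) {
        // stage half of T(i) in shared memory
        for (int e = threadIdx.x; e < 32 * 64; e += blockDim.x)
            t[e] = d_T[i * 64 * 64 + e];
        __syncthreads();
        // blend it into the accumulator with weight w(i)
        float w = d_w[i];
        for (int e = threadIdx.x; e < 32 * 64; e += blockDim.x)
            v[e] += w * t[e];
        __syncthreads();
    }

    // write the finished half of V back to global memory
    for (int e = threadIdx.x; e < 32 * 64; e += blockDim.x)
        d_V[e] = v[e];
}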

It seems that the performance is NOT reaching our target right now. The frame rate is 2.5 fps on our GTS512.

We really appreciate every comment we can get. If we can make this 5 times faster, it will be a significant contribution to the computer vision and rendering community.

Thank you!

The thing that sticks out from your explanation (without seeing the code) is that you are putting T(i) into shared memory. Can you not blend it immediately into V?

But without seeing code it is always hard to comment.
If I calculate correctly you are moving (2048*8 + 16 + 16) KB per frame * 2.5 fps / 1024 = 40.0781 MB/s, which is way too low.

Now that I think of it, why do you put V into shared memory at all? It is not as if V(1,1) depends on V(1,2), is it?
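
To illustrate the point (a sketch of mine, not from your code): since every element of V is independent, each thread could keep its accumulator in a register, with no shared memory at all. The names and the layout of T are assumptions.

// Illustration only: one thread per element of V, accumulator in a register.
// Assumes T is stored as 2048 consecutive 64x64 short arrays.
// Launch e.g. as: blend_simple<<<64, 64>>>(d_V, d_T, d_w);
__global__ void blend_simple(float *V, const short *T, const float *w)
{
    int e = threadIdx.x + blockIdx.x * blockDim.x;   // 0 .. 4095
    float acc = 0.0f;
    for (int i = 0; i < 2048; ++i)
        acc += w[i] * T[i * 64 * 64 + e];
    V[e] = acc;
}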

I have done something like this in the past. I did it like this:

have a grid of blocks the same size as V (64x64)
have 256 threads per block.
have 256 shared floats per block.
let each thread in a block read in T(threadIdx + n*256) at position (blockIdx.x, blockIdx.y), and also read in w(threadIdx + n*256)
you can do this in a for loop, for (int t_index = threadIdx.x; t_index < 2048; t_index += blockDim.x), and if you put #pragma unroll 8 on the line before it, the loop will be unrolled
perform a reduction on the 256 shared floats
write the resulting value to V(blockIdx.x, blockIdx.y)

Like this you get a nice number of blocks (4096) that use not so much shared memory, so you will get 3 blocks per multiprocessor (3 * 256 = 768 threads, the hardware maximum), and as a result 100% occupancy.

Since I have some code lying around, I will post it later, adjusted to your problem.

Hmm, I have my code at work, so here it is from memory:

#define tex_def(n) T_#n   // NB: this macro is broken; see the note below the code

__global__ void blend_kernel(float *V, float *w)
{
  __shared__ float sdata[256];

  sdata[threadIdx.x] = 0.0f;

  #pragma unroll 8
  for (int t_index = threadIdx.x; t_index < 2048; t_index += blockDim.x)
    sdata[threadIdx.x] += w[t_index] * tex2Dfetch(tex_def(t_index), blockIdx.x, blockIdx.y);
  __syncthreads();

  // perform the reduction
  if (threadIdx.x < 128)
    sdata[threadIdx.x] += sdata[threadIdx.x + 128];
  __syncthreads();

  if (threadIdx.x < 64)
    sdata[threadIdx.x] += sdata[threadIdx.x + 64];
  __syncthreads();

  // the last 32 threads form a single warp and execute in lockstep,
  // so no __syncthreads() is needed between these steps
  if (threadIdx.x < 32)
  {
    sdata[threadIdx.x] += sdata[threadIdx.x + 32];
    sdata[threadIdx.x] += sdata[threadIdx.x + 16];
    sdata[threadIdx.x] += sdata[threadIdx.x +  8];
    sdata[threadIdx.x] += sdata[threadIdx.x +  4];
    sdata[threadIdx.x] += sdata[threadIdx.x +  2];
    sdata[threadIdx.x] += sdata[threadIdx.x +  1];
  }

  if (threadIdx.x == 0)
    V[blockIdx.x + 64 * blockIdx.y] = sdata[0];
}

And call your kernel like this:

blend_kernel<<<dim3(64,64,1), 256>>>(d_V, d_w);

I have assumed you put your T in textures named T_0 through T_2047, but I would not do that personally, since each block now only reads 1 value from each T.

Just realised that the macro does not work at all, so substitute the tex2Dfetch with whatever you really have in your code.

Below is a kernel that might do what you want. Each thread computes 2 elements of the matrix for tps (terms per slice) terms. No shared memory. The matrix elements are presumed to be adjacent to each other. I did a little testing, and it seems OK.

To call this you need special values for blockDim and gridDim.

blockDim.x * gridDim.x = MatrixSide * MatrixSide / 2

gridDim.y = NumTerms / tps

The problem you posed has MatrixSide = 64 and NumTerms = 2048, but these can be varied. Note that the threads/block (blockDim.x) can be varied, although blockDim.x = 64 gives pretty good results (even for tps = 2048).
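
For concreteness, here is how those constraints might translate into a launch of the kernel below (a sketch; d_dst and d_src are illustrative names, and tps = 256 is just one choice):

// blockDim.x * gridDim.x = MatrixSide * MatrixSide / 2, gridDim.y = NumTerms / tps
const int MatrixSide = 64;
const int NumTerms   = 2048;
int tps = 256;                                     // terms per slice; must divide NumTerms
dim3 block(64);                                    // 64 threads/block gives good results
dim3 grid(MatrixSide * MatrixSide / 2 / block.x,   // 32 blocks in x
          NumTerms / tps);                         // 8 term slices in y
doTerms<<<grid, block>>>(d_dst, d_src, tps);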

An 8800 GTX can crank these out at over 40 iterations/second, including CPU memory transfer time (using cudaMallocHost for faster transfers).

There is some bad news, though, which is the numerical stability. Unless your numbers are fairly well behaved you can lose a lot of precision with very long polynomials. It’s best to accumulate using double precision, but that’s not available in CUDA (unless you roll your own).
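
One standard workaround is Kahan (compensated) summation, which recovers most of the lost precision at the cost of a few extra single-precision operations per term. A sketch of the idea (my addition, not part of the kernel below):

// Kahan summation: c carries the low-order bits lost by each addition.
__device__ float kahan_sum(const float *term, int n)
{
    float sum = 0.0f, c = 0.0f;
    for (int i = 0; i < n; ++i) {
        float y = term[i] - c;      // re-apply the previously lost bits
        float t = sum + y;
        c = (t - sum) - y;          // what this addition actually lost
        sum = t;
    }
    return sum;
}

Note that the compiler must not be allowed to reassociate these operations (avoid fast-math style optimizations), or the compensation term cancels out.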

#define NumTerms 2048   // from the problem statement; can be varied

__device__ __constant__ float weights[NumTerms];

__global__ static void doTerms(float2 *dst, ushort2 *src, int tps)
{
     // each thread handles 2 elements of a matrix for tps terms
     float x = 0.0f;
     float y = 0.0f;
     int step = gridDim.x * blockDim.x;   // total threads = element pairs per matrix
     int gx   = threadIdx.x + blockIdx.x * blockDim.x;
     int dx   = step * blockIdx.y;        // output offset of this term slice
     int sx   = gx + tps * dx;            // first source pair: src[pair + step * term]
     int pos  = blockIdx.y * tps;         // first term of this slice
     int lim  = pos + tps;

     while (pos < lim) {
           float w = weights[pos];
           ushort2 u2 = src[sx];
           x = x + u2.x * w;
           y = y + u2.y * w;
           sx = sx + step;                // advance to the same pair of the next term
           pos = pos + 1;
     }

     dst[gx + dx] = make_float2(x, y);
}
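
Note that, as I read it, when tps < NumTerms each blockIdx.y slice writes its partial sums to a separate region of dst, so a short second pass (or a host-side sum) is still needed to combine the slices. A possible sketch, with illustrative names:

// Combine the per-slice partial sums produced by doTerms when gridDim.y > 1.
// step = MatrixSide * MatrixSide / 2, i.e. the number of float2 element pairs.
__global__ static void sumSlices(float2 *out, const float2 *partial,
                                 int step, int slices)
{
    int gx = threadIdx.x + blockIdx.x * blockDim.x;
    float2 acc = make_float2(0.0f, 0.0f);
    for (int s = 0; s < slices; ++s) {
        float2 p = partial[gx + s * step];
        acc.x += p.x;
        acc.y += p.y;
    }
    out[gx] = acc;
}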

Dennis,

Let me give you a little more information. Our texture blending currently runs 2048 times; that is, we need to do the 2048-texture blend 2048 times. The equation looks like this:

V(j) = w(0,j)T(0) + w(1,j)T(1) + … + w(2047,j)T(2047)

For j from 0 to 2047.

Our solution is to compute one result V(j) in one block at a time. That's why we put the result in shared memory: to accumulate all the blendings, and then write to global memory.

Your suggestion is actually interesting. We can compute one result V(j) using all multiprocessors (64*64 blocks) instead of one block, write out the result, and then have all the threads work together on the next result V(j+1). If we can manage this without handing control back to the host, it will be a great solution.
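
One way to picture that (a sketch; blend_one is a hypothetical kernel computing a single V(j)): kernel launches are asynchronous, so the host can simply queue one launch per result without taking control back in between.

// Hypothetical host loop: one launch per result V(j).
for (int j = 0; j < 2048; ++j)
    blend_one<<<dim3(64, 64), 256>>>(d_V + j * 64 * 64,   // output V(j)
                                     d_w + j * 2048,       // weights w(., j)
                                     d_T);                 // the shared textures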

Thank you.

Then this should do it :

__global__ void blend_kernel(float *V, float *w, float *T)
{
  __shared__ float sdata[256];

  sdata[threadIdx.x] = 0.0f;

  #pragma unroll 8
  for (int t_index = threadIdx.x; t_index < 2048; t_index += blockDim.x)
    sdata[threadIdx.x] += w[t_index + blockIdx.x * 2048] * T[t_index];
  // or: sdata[threadIdx.x] += w[t_index * 4096 + blockIdx.x] * T[t_index];
  // depending on column/row-major storage of w
  __syncthreads();

  // perform the reduction
  if (threadIdx.x < 128)
    sdata[threadIdx.x] += sdata[threadIdx.x + 128];
  __syncthreads();

  if (threadIdx.x < 64)
    sdata[threadIdx.x] += sdata[threadIdx.x + 64];
  __syncthreads();

  // final warp: lockstep execution, no __syncthreads() needed
  if (threadIdx.x < 32)
  {
    sdata[threadIdx.x] += sdata[threadIdx.x + 32];
    sdata[threadIdx.x] += sdata[threadIdx.x + 16];
    sdata[threadIdx.x] += sdata[threadIdx.x +  8];
    sdata[threadIdx.x] += sdata[threadIdx.x +  4];
    sdata[threadIdx.x] += sdata[threadIdx.x +  2];
    sdata[threadIdx.x] += sdata[threadIdx.x +  1];
  }

  if (threadIdx.x == 0)
    V[blockIdx.x] = sdata[0];
}

And call your kernel like this:

blend_kernel<<<4096, 256>>>(d_V, d_w, d_T);
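
For completeness, a sketch of the host-side setup this call assumes (sizes follow this reading of the problem; h_w and h_T are host arrays, and all names are illustrative):

// Hypothetical allocation for blend_kernel<<<4096, 256>>>:
// V has 4096 elements, w is 4096 x 2048 weights, T has 2048 values.
float *d_V, *d_w, *d_T;
cudaMalloc((void **)&d_V, 4096 * sizeof(float));
cudaMalloc((void **)&d_w, 4096 * 2048 * sizeof(float));   // 32 MB of weights
cudaMalloc((void **)&d_T, 2048 * sizeof(float));
cudaMemcpy(d_w, h_w, 4096 * 2048 * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_T, h_T, 2048 * sizeof(float), cudaMemcpyHostToDevice);
blend_kernel<<<4096, 256>>>(d_V, d_w, d_T);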

Please note that you very probably did not calculate one V(j) per block and put the result in shared memory. Shared memory is shared within a block, not between blocks. So given your explanation, your code cannot have been correct (or maybe it did the same thing 4096 times, because all blocks were doing the same thing).

So the above kernel should do what you want (within the numerical stability boundaries as stated by RussAtkinson)

Just to make it clear, because I think you misunderstood what the code does:

  • this kernel launches one block per element of V that needs to be calculated. block N calculates V(N)

  • all threads in a block calculate 8 w(k,j)*T(k) values each and add them up into one shared memory slot (256*8 = 2048)

  • then a reduction is performed to sum up all w*T pairs that were calculated by the threads and we end up with the result V(blockIdx.x)

  • that result is then written to one element of the output array (V)

When the kernel is finished, all blocks have run & all elements of V have been filled.

Dennis,

The code really works fine and looks neat. But we do have a problem with memory access.

The way this code accesses texture memory is not coalesced. I know texture memory is cached, so coalescing does not affect its performance. However, if we put the textures into global memory and use coalesced memory access, the performance should be better (if I remember correctly).

To use your code with coalesced memory access, we would need to reorganize all the textures T(i): put the first element of every texture into the first row, the second element of every texture into the second row, and so on. The problem with that is that we generate these textures one by one, and if we store them the previous way, the writes to global memory are not coalesced.
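
For reference, that reorganization amounts to a 2048 x 4096 transpose, which could be done once on the device with a standard shared-memory transpose after the textures are generated in their natural layout. A rough sketch (illustrative names, not from the thread):

// Transpose T from texture-major (2048 rows of 4096 elements) to
// element-major (4096 rows of 2048 elements) so later reads coalesce.
// Launch: transposeT<<<dim3(4096 / 16, 2048 / 16), dim3(16, 16)>>>(dst, src);
__global__ void transposeT(short *dst, const short *src)
{
    __shared__ short tile[16][16 + 1];          // padding to reduce bank conflicts
    int x = blockIdx.x * 16 + threadIdx.x;      // element index (0..4095)
    int y = blockIdx.y * 16 + threadIdx.y;      // texture index (0..2047)
    tile[threadIdx.y][threadIdx.x] = src[y * 4096 + x];    // coalesced read
    __syncthreads();
    x = blockIdx.y * 16 + threadIdx.x;          // transposed coordinates
    y = blockIdx.x * 16 + threadIdx.y;
    dst[y * 2048 + x] = tile[threadIdx.x][threadIdx.y];    // coalesced write
}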

Yong

Yong,

Find my comments/questions below

The access of T is coalesced as far as I can see. T is global memory in the code I posted, not a texture. You cannot have an array of textures, so you need to use global memory anyway.

Also, the access of w is coalesced if w[t_index + blockIdx.x*2048] is used, as far as I can see.
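
To illustrate what I mean (a sketch, not from the posted code):

// Coalesced vs. strided access to w, as discussed above.
__global__ void w_access(const float *w, float *out)
{
    // coalesced: consecutive threads read consecutive floats
    float a = w[threadIdx.x + blockIdx.x * 2048];
    // strided: consecutive threads read floats 4096 apart,
    // costing roughly one memory transaction per thread
    float b = w[threadIdx.x * 4096 + blockIdx.x];
    out[threadIdx.x + blockIdx.x * blockDim.x] = a + b;
}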

V(j) = w(0,j)T(0) + w(1,j)T(1) + … + w(2047,j)T(2047)

T is a vector of size 2048, right?

w is a matrix of 2048x4096; maybe you need to transpose w to get coalesced access in the code I wrote.

Can you elaborate why you think the accesses are not coalesced?