Performance issue using tex2D


My application contains a kernel that in essence implements a double sum over weighted inputs (16 rows, 32 columns = 512 threads in a block; the input is in shared memory, populated earlier in the kernel):

float2 sum = make_float2(0.0f, 0.0f);
int index = 0;
for (int row = 0; row < blockDim.y; ++row) {
    float2 colSum = make_float2(0.0f, 0.0f);
    // inner sum over the columns of the current input row
    for (int col = 0; col < blockDim.x; ++col) {
        float2 colFactor = tex2D(colTexture, col, threadIdx.x);
        colSum += colFactor * inputs[index++];
    }
    float2 rowFactor = tex2D(rowTexture, row, threadIdx.y);
    sum += rowFactor * colSum;
}
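
For checking any faster variant against the loop above, a plain host-side reference of the same double sum may be useful. This is only a sketch under assumptions that are not stated in my kernel snippet: `Float2` and `referenceSum` are made-up names, `inputs` is taken to hold scalar floats, the weight tables are flat row-major arrays, and `float2 * float2` is treated as a component-wise product (swap in a complex product if that is what the kernel's operator does).

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host stand-in for CUDA's float2.
struct Float2 { float x, y; };

// Reference for one thread (tx, ty) of the block.
// Assumed layouts (not taken from the original kernel):
//   colWeights[tx * cols + col]  mirrors tex2D(colTexture, col, threadIdx.x)
//   rowWeights[ty * rows + row]  mirrors tex2D(rowTexture, row, threadIdx.y)
//   inputs is read sequentially, rows * cols scalar values in total.
Float2 referenceSum(const std::vector<float>& inputs,
                    const std::vector<Float2>& colWeights,
                    const std::vector<Float2>& rowWeights,
                    int rows, int cols, int tx, int ty)
{
    Float2 sum{0.0f, 0.0f};
    std::size_t index = 0;
    for (int row = 0; row < rows; ++row) {
        Float2 colSum{0.0f, 0.0f};
        for (int col = 0; col < cols; ++col) {
            const Float2 w = colWeights[tx * cols + col];
            const float in = inputs[index++];
            colSum.x += w.x * in;   // weight scaled by a scalar input (assumption)
            colSum.y += w.y * in;
        }
        const Float2 r = rowWeights[ty * rows + row];
        sum.x += r.x * colSum.x;    // component-wise product (assumption)
        sum.y += r.y * colSum.y;
    }
    return sum;
}
```

Each thread's kernel result can then be compared with `referenceSum(..., threadIdx.x, threadIdx.y)` within a small floating-point tolerance.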

In the current implementation I am using textures to store the pre-calculated weights. This is faster than using constant memory, which in turn is faster than calculating the weights in the kernel.

I understand that constant memory does not perform well when the threads in a block access different elements of the 2D weight tables. I was hoping that textures would be much faster, but unfortunately either they, or the way I use them, do not deliver that.
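
To make the access pattern concrete, here is a small host-side tally of how many distinct table elements the 32 threads of one warp read in a single loop iteration. It is a sketch only: `WarpAccess` and `distinctReadsPerWarp` are made-up names, and it assumes the block shape above together with CUDA's x-first linearization of thread indices into warps.

```cpp
#include <set>
#include <utility>

// Number of distinct (x, y) coordinates one warp reads from each table.
struct WarpAccess { int colTable; int rowTable; };

// Hypothetical helper: enumerates the 32 lanes of warp `warpId` and records
// the coordinates they would pass to tex2D for the given loop counters.
WarpAccess distinctReadsPerWarp(int blockDimX, int warpId, int col, int row)
{
    const int warpSize = 32;
    std::set<std::pair<int, int>> colReads, rowReads;
    for (int lane = 0; lane < warpSize; ++lane) {
        const int linear = warpId * warpSize + lane; // x-first thread numbering
        const int tx = linear % blockDimX;           // threadIdx.x
        const int ty = linear / blockDimX;           // threadIdx.y
        colReads.insert({col, tx});  // tex2D(colTexture, col, threadIdx.x)
        rowReads.insert({row, ty});  // tex2D(rowTexture, row, threadIdx.y)
    }
    return { static_cast<int>(colReads.size()),
             static_cast<int>(rowReads.size()) };
}
```

With `blockDimX = 32` this gives 32 distinct column-weight elements but only 1 row-weight element per warp, so under these assumptions only the column table shows the divergent access described above.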

What could I do in the above loop to make the code execute faster? Or is there an alternative implementation, perhaps using more shared memory?

Kind regards,