Hello,
my application contains a kernel that implements, in essence, a double sum over weighted inputs (16 rows × 32 columns = 512 threads per block; the input array resides in shared memory and is populated earlier in the kernel):
float2 sum = make_float2(0.0f, 0.0f);
int index = 0;
for (int row = 0; row < blockDim.y; ++row)
{
    float2 colSum = make_float2(0.0f, 0.0f);
    for (int col = 0; col < blockDim.x; ++col)
    {
        // per-thread column weight (operator* and operator+= on float2
        // are overloads defined elsewhere in my code)
        float2 colFactor = tex2D(colTexture, col, threadIdx.x);
        colSum += colFactor * inputs[index++];
    }
    float2 rowFactor = tex2D(rowTexture, row, threadIdx.y);
    sum += rowFactor * colSum;
}
In the current implementation I store the pre-calculated weights in textures, which performs better than constant memory, which in turn performs better than computing the weights inside the kernel.
I understand that constant memory performs poorly when the threads of a block access different elements of the 2D weight tables. I was hoping textures would be much faster, but unfortunately they are not, at least not the way I am using them.
What could I do in the loop above to make the code execute faster? Or is there an alternative implementation, perhaps one that uses more shared memory?
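To make the shared-memory idea concrete, here is a minimal sketch of what I have in mind. It assumes the two weight tables have the 32×32 and 16×16 shapes implied by the texture indexing above (colTexture indexed by (col, threadIdx.x), rowTexture by (row, threadIdx.y)), and that the float2 operator overloads are defined elsewhere; the shared-array names and the padding column are made up for illustration:

```cuda
// Stage both weight tables in shared memory once per block, then read them
// from shared memory in the loop instead of going through the texture cache.
// The extra padding column (33 instead of 32) is meant to avoid shared-memory
// bank conflicts when threads of a warp read colW with varying threadIdx.x.
__shared__ float2 colW[32][33];   // colW[threadIdx.x][col]
__shared__ float2 rowW[16][16];   // rowW[threadIdx.y][row]

const int tx = threadIdx.x;       // 0..31
const int ty = threadIdx.y;       // 0..15

// 512 threads cooperatively load the 32x32 column table (two rows per thread)
colW[ty][tx]      = tex2D(colTexture, tx, ty);
colW[ty + 16][tx] = tex2D(colTexture, tx, ty + 16);
// the first 16 columns of threads load the 16x16 row table
if (tx < 16)
    rowW[ty][tx] = tex2D(rowTexture, tx, ty);
__syncthreads();

float2 sum = make_float2(0.0f, 0.0f);
int index = 0;
for (int row = 0; row < blockDim.y; ++row)
{
    float2 colSum = make_float2(0.0f, 0.0f);
    for (int col = 0; col < blockDim.x; ++col)
        colSum += colW[tx][col] * inputs[index++];
    sum += rowW[ty][row] * colSum;   // broadcast: whole warp shares ty
}
```

I have not measured this; together with the existing inputs array it would use roughly 14 KB of shared memory per block, so it would be tight on hardware with a 16 KB limit. Is this the kind of approach that would help, or is there something better?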
Kind regards,
peter