A global memory operation takes roughly 200-600 clock cycles, depending on the video card.
How does a simple if statement affect the clock cycle count?
What is the best practice here?
For example, in a few kernels I use a loop to multiply values in a global memory array, with an if statement that is meant to reduce global memory write operations:
__global__ void normColumn(float** inOutMat_g,
                           const unsigned int inOutputTileCount_s)
{
    //... do necessary stuff to calculate column vector reciprocal length
    float value = 0.0f;
    unsigned int idx1 = 0;
    // write as many row cells as this thread is responsible for
    for (unsigned int b = 0; b < inOutputTileCount_s; b++)
    {
        // set idx1 index and check range!
        value = inOutMat_g[idx1][blockId];
        // reduce global memory write operations: skip the store for non-positive values
        if (value > 0.0f)
            inOutMat_g[idx1][blockId] = value * rSqrtdotSum_s;
    }
}
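
To make the question measurable, here is a minimal, self-contained sketch that isolates the pattern above: one kernel guards the store with an if statement, the other always writes. It is not the original normColumn kernel. The names scaleConditional, scaleUnconditional, KernelFn and timeVariant, the flat float* layout and the launch configuration are all assumptions made for illustration. Note that in the guarded version the global load happens either way, so the branch can only save the store.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical variant A: store only when the loaded value is positive,
// mirroring the if statement in the question.
__global__ void scaleConditional(float* data, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float value = data[i];        // the global load is issued in both variants
    if (value > 0.0f)             // the branch only guards the store
        data[i] = value * scale;
}

// Hypothetical variant B: always store, no data-dependent branch.
__global__ void scaleUnconditional(float* data, int n, float scale)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    data[i] = data[i] * scale;
}

using KernelFn = void (*)(float*, int, float);

// Hypothetical helper: copies `host` to the device, times a single launch of
// `kernel` with CUDA events, copies the result back into `host`, and returns
// the elapsed time in milliseconds. Error checking is omitted for brevity.
float timeVariant(KernelFn kernel, std::vector<float>& host, float scale)
{
    const int n = static_cast<int>(host.size());
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;                          // assumed block size
    const int blocks = (n + threads - 1) / threads;   // enough blocks to cover n
    cudaEventRecord(start);
    kernel<<<blocks, threads>>>(dev, n, scale);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaMemcpy(host.data(), dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return ms;
}
```

To compare the two, call timeVariant once per kernel on identical copies of the same input and look at the returned times. The two variants only produce the same output when the input contains no negative values, since the guarded kernel leaves those untouched. The first launch may include one-time setup cost, so it is worth repeating each measurement several times and varying the fraction of zero entries, because the result is likely to depend on both the data and the card.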