If statement vs. global memory write for a multiply operation: which is faster?

A global memory operation takes around 200-600 clock cycles, depending on the video card used.

How does a simple if statement affect the number of clock cycles?

What is the best practice for this?

For example, in a few kernels I use a loop to multiply values in a global memory array, and I guard the write with an if statement to reduce global memory write operations:

__global__ void normColumn(float** inOutMat_g,
	const unsigned int inOutputTileCount_s)
{
	//... do necessary stuff to calculate column vector reciprocal length

	float value = 0.0f;
	unsigned int idx1 = 0;

	//output as many row cells as this thread is responsible for
	for (unsigned int b = 0; b < inOutputTileCount_s; b++)
	{
		//set idx1 index and check range!

		value = inOutMat_g[idx1][blockId];

		//reduce global memory write operations:
		//only write back when the value is positive
		if (value > 0.0f)
		{
			inOutMat_g[idx1][blockId] = value * rSqrtdotSum_s;
		}
	}
}
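
To make the comparison concrete, here is a minimal sketch of how I could time the two variants against each other with CUDA events. It is a simplification, not my real kernel: it assumes a flat float array instead of the float** layout above, and the names scaleAlways, scaleIfPositive and timeKernel are made up for this example.

```cuda
#include <cuda_runtime.h>

// Variant A: always read, multiply and write back.
__global__ void scaleAlways(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = data[i] * factor;
}

// Variant B: skip the global memory write when the value is not positive.
__global__ void scaleIfPositive(float* data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
    {
        float value = data[i];
        if (value > 0.0f)
            data[i] = value * factor;
    }
}

// Hypothetical helper: times a single launch of one variant.
// Returns the elapsed time in milliseconds, or -1.0f if CUDA reported an error.
float timeKernel(bool guarded, float* d_data, int n, float factor)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    cudaEventRecord(start);
    if (guarded)
        scaleIfPositive<<<blocks, threads>>>(d_data, n, factor);
    else
        scaleAlways<<<blocks, threads>>>(d_data, n, factor);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return (cudaGetLastError() == cudaSuccess) ? ms : -1.0f;
}
```

Calling timeKernel(false, d_data, n, factor) and timeKernel(true, d_data, n, factor) on a device buffer allocated with cudaMalloc gives one timing per variant. The first launch usually includes start-up overhead, so a warm-up call before measuring is probably needed. The outcome likely depends on how many elements are zero and on the card, so this only measures the difference, it does not answer which approach is best practice.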