Code 4 times slower with "arch=sm_20"

Just to clarify: numerical constants without a suffix are interpreted as doubles.

25.0 = double

25.0f = float

PI = most likely defined as double
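For example (a hypothetical snippet, not from the posted code):

	float a = x * 25.0;	// 25.0 is a double: x is promoted and the multiply runs in double precision on sm_20
	float b = x * 25.0f;	// 25.0f is a float: the multiply stays in single precision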

N.

I write all numerical constants with an f at the end.

And no math functions called…? I mean sqrt, tan - which in C are double by default… I don’t know what nvcc does with them - I would check it.
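For reference, the single-precision variants carry an f suffix (a generic illustration):

	float r = sqrt(x);	// in C, sqrt takes and returns double, so x is promoted
	float s = sqrtf(x);	// single-precision square root
	float t = tanf(x);	// single-precision tangent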

You know, without the code we are just guessing - you should try to post a piece of it.

How do you compile the baseline you are comparing to - with “arch=sm_13”, or without any arch option?

This is one kernel that becomes 4 times slower

__global__ void Calculate_A_matrix_and_h_vector_2D_values(float* A_matrix_2D_values, float* h_vector_2D_values, float4* Phase_Differences, float4* Phase_Gradients, float4* Certainties, int DATA_W, int DATA_H, int DATA_D, int DATA_W_times_DATA_H, int FILTER_SIZE)
{
	volatile int y = threadIdx.x;
	volatile int z = blockIdx.x;

	//volatile int y = blockIdx.x * blockDim.x + threadIdx.x;
	//volatile int z = blockIdx.y * blockDim.y + threadIdx.y;

	// Skip the filter border
	if (((y >= (FILTER_SIZE - 1)/2) && (y < DATA_H - (FILTER_SIZE - 1)/2)) && ((z >= (FILTER_SIZE - 1)/2) && (z < DATA_D - (FILTER_SIZE - 1)/2)))
	{
		float phase_difference, certainty, phase_gradient;
		float xf, yf, zf;
		int matrix_element_idx, vector_element_idx;
		float A_matrix_2D_value[10], h_vector_2D_value[4];
		int idx;

		yf = (float)y - ((float)DATA_H - 1.0f) * 0.5f;
		zf = (float)z - ((float)DATA_D - 1.0f) * 0.5f;

		// X
		A_matrix_2D_value[0] = 0.0f;
		A_matrix_2D_value[1] = 0.0f;
		A_matrix_2D_value[2] = 0.0f;
		A_matrix_2D_value[3] = 0.0f;
		A_matrix_2D_value[4] = 0.0f;
		A_matrix_2D_value[5] = 0.0f;
		A_matrix_2D_value[6] = 0.0f;
		A_matrix_2D_value[7] = 0.0f;
		A_matrix_2D_value[8] = 0.0f;
		A_matrix_2D_value[9] = 0.0f;

		h_vector_2D_value[0] = 0.0f;
		h_vector_2D_value[1] = 0.0f;
		h_vector_2D_value[2] = 0.0f;
		h_vector_2D_value[3] = 0.0f;

		// Accumulate along x
		for (int x = (FILTER_SIZE - 1)/2; x < (DATA_W - (FILTER_SIZE - 1)/2); x++)
		{
			xf = (float)x - ((float)DATA_W - 1.0f) * 0.5f;

			//idx = Calculate_3D_Index(x, y, z, DATA_W, DATA_W_times_DATA_H);
			idx = Calculate_3D_Index_Fermi(x, y, z, DATA_W, DATA_W_times_DATA_H);

			phase_difference = Phase_Differences[idx].x;
			phase_gradient = Phase_Gradients[idx].x;
			certainty = Certainties[idx].x;

			float c_pg_pg = certainty * phase_gradient * phase_gradient;
			float c_pg_pd = certainty * phase_gradient * phase_difference;

			A_matrix_2D_value[0] += c_pg_pg;
			A_matrix_2D_value[1] += xf * c_pg_pg;
			A_matrix_2D_value[2] += yf * c_pg_pg;
			A_matrix_2D_value[3] += zf * c_pg_pg;
			A_matrix_2D_value[4] += xf * xf * c_pg_pg;
			A_matrix_2D_value[5] += xf * yf * c_pg_pg;
			A_matrix_2D_value[6] += xf * zf * c_pg_pg;
			A_matrix_2D_value[7] += yf * yf * c_pg_pg;
			A_matrix_2D_value[8] += yf * zf * c_pg_pg;
			A_matrix_2D_value[9] += zf * zf * c_pg_pg;

			h_vector_2D_value[0] += c_pg_pd;
			h_vector_2D_value[1] += xf * c_pg_pd;
			h_vector_2D_value[2] += yf * c_pg_pd;
			h_vector_2D_value[3] += zf * c_pg_pd;
		}

		// Write the ten matrix elements, one (DATA_H * DATA_D)-sized plane apart
		matrix_element_idx = y + z * DATA_H;
		A_matrix_2D_values[matrix_element_idx] = A_matrix_2D_value[0];
		matrix_element_idx += DATA_H * DATA_D;
		A_matrix_2D_values[matrix_element_idx] = A_matrix_2D_value[1];
		matrix_element_idx += DATA_H * DATA_D;
		A_matrix_2D_values[matrix_element_idx] = A_matrix_2D_value[2];
		matrix_element_idx += DATA_H * DATA_D;
		A_matrix_2D_values[matrix_element_idx] = A_matrix_2D_value[3];
		matrix_element_idx += DATA_H * DATA_D;
		A_matrix_2D_values[matrix_element_idx] = A_matrix_2D_value[4];
		matrix_element_idx += DATA_H * DATA_D;
		A_matrix_2D_values[matrix_element_idx] = A_matrix_2D_value[5];
		matrix_element_idx += DATA_H * DATA_D;
		A_matrix_2D_values[matrix_element_idx] = A_matrix_2D_value[6];
		matrix_element_idx += DATA_H * DATA_D;
		A_matrix_2D_values[matrix_element_idx] = A_matrix_2D_value[7];
		matrix_element_idx += DATA_H * DATA_D;
		A_matrix_2D_values[matrix_element_idx] = A_matrix_2D_value[8];
		matrix_element_idx += DATA_H * DATA_D;
		A_matrix_2D_values[matrix_element_idx] = A_matrix_2D_value[9];

		// Write the four vector elements
		vector_element_idx = y + z * DATA_H;
		h_vector_2D_values[vector_element_idx] = h_vector_2D_value[0];
		vector_element_idx += 3 * DATA_H * DATA_D;
		h_vector_2D_values[vector_element_idx] = h_vector_2D_value[1];
		vector_element_idx += DATA_H * DATA_D;
		h_vector_2D_values[vector_element_idx] = h_vector_2D_value[2];
		vector_element_idx += DATA_H * DATA_D;
		h_vector_2D_values[vector_element_idx] = h_vector_2D_value[3];
	}
}

I gave it a try last night and compared the resulting PTX with the default sm_10 and with sm_20.
No important differences found, even though I believe there are some points suitable for optimization.

One possibility is that
idx = Calculate_3D_Index_Fermi(x, y, z, DATA_W, DATA_W_times_DATA_H);

goes really wrong - I can guess what it does, but if you post it I can try it too… the only pity is that I don’t have an sm_20-capable GPU at this time.
I would surely try to inline it, since the parameters go through global memory - this is evident in the PTX, in both cases: you are in the inner loop - you do not want that there.

Apart from some different arrangement, the main difference in the generated PTX is that with sm_10 the multiply-add operations have been implemented with “mad.f32”, and with sm_20 with “fma.rn.f32”. They are actually the same…

However, digging in the PTX manual I found this note (page 88, PTX ISA 2.0), though I don’t know whether it can explain the difference:
For .target sm_20:
mad.f32 computes the product of a and b to infinite precision and then adds c to
this product, again in infinite precision. The resulting value is then rounded to single
precision using the rounding mode specified by .rnd.
mad.f64 computes the product of a and b to infinite precision and then adds c to
this product, again in infinite precision. The resulting value is then rounded to
double precision using the rounding mode specified by .rnd.
mad.{f32,f64} is the same as fma.{f32,f64}.
For .target sm_1x:
mad.f32 computes the product of a and b at double precision, and then the
mantissa is truncated to 23 bits, but the exponent is preserved. Note that this is
different from computing the product with mul, where the mantissa can be rounded
and the exponent will be clamped. The exception for mad.f32 is when c = +/-0.0,
mad.f32 is identical to the result computed using separate mul and add
instructions. When JIT-compiled for SM 2.0 devices, mad.f32 is implemented as a
fused multiply-add (i.e., fma.rn.ftz.f32). In this case, mad.f32 can produce slightly
different numeric results and backward compatibility is not guaranteed in this case.
mad.f64 computes the product of a and b to infinite precision and then adds c to
this product, again in infinite precision. The resulting value is then rounded to
double precision using the rounding mode specified by .rnd. Unlike mad.f32, the
treatment of subnormal inputs and output follows IEEE 754 standard.
mad.f64 is the same as fma.f64.

So the instructions are the same, but they work differently in the two cases. Is it plausible that this is responsible for such a drop in performance?
(I’ll try to dig more, but I hope somebody in the forum is able to help)

But aren’t all device functions inlined by default? The index function only returns

idx = x + y * DATA_W + z * DATA_W * DATA_H;
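So presumably it looks like this (a sketch, reconstructed from the call site and the line above):

	__device__ int Calculate_3D_Index_Fermi(int x, int y, int z, int DATA_W, int DATA_W_times_DATA_H)
	{
		// flatten (x, y, z) to a linear index; DATA_W_times_DATA_H is the precomputed DATA_W * DATA_H
		return x + y * DATA_W + z * DATA_W_times_DATA_H;
	}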

I’m leaning more towards there being some problem with the memory transactions. I think I’ve read somewhere that memory transactions on Fermi are always 128 bytes, and thereby all my 32-byte reads would be 4 times slower?

It is not, because it gets translated to fma.rn.*.f32 either way as explained above. The only difference is the Flush-subnormals-To-Zero flag, but it has no impact on performance.

So it has to be something else…

They are inlined, but parameters are passed in global memory.

Just try writing it inlined… they were treated differently in the two cases: some parameters, DATA_*, were stored in registers outside the loop in one case… I don’t remember which one, because I tried with a dummy function - not your one, which I didn’t know last night.

But then the code should be equally slow without “arch=sm_20”…

No difference if it’s inlined.

It should… let’s try this empirical experiment then :)

This is probably a long shot, but could it be related to a difference in integer operators for 1.x and 2.0 devices?

I didn’t check the PTX, but is it possible that the end condition of the loop is recalculated for each iteration, so the division by 2 has to be performed every time (assuming the compiler didn’t optimize it into a bit shift)?

Could you try replacing DATA_H * DATA_D with a new precalculated constant DATA_HD, to avoid the integer multiplication? I’ve noticed it’s used a lot in your code (again, I’m assuming it wasn’t optimized away by the compiler…). Something like the sketch below.
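A sketch of both suggestions, applied to the loop from the posted kernel:

	const int half = (FILTER_SIZE - 1)/2;	// computed once instead of in the loop condition
	const int x_end = DATA_W - half;	// hoisted loop bound
	const int DATA_HD = DATA_H * DATA_D;	// precomputed product for the store offsets
	for (int x = half; x < x_end; x++)
	{
		// ... loop body unchanged ...
	}
	// and later: matrix_element_idx += DATA_HD; instead of DATA_H * DATA_D each time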

N.

Hmm, I moved the index calculation outside the loop, set it to y * DATA_W + z * DATA_W_times_DATA_H, and then just incremented it by one in each iteration - that made the kernel 2 times faster! Seems very strange…
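Roughly like this (a sketch of that change):

	idx = y * DATA_W + z * DATA_W_times_DATA_H;	// hoisted out of the loop
	for (int x = (FILTER_SIZE - 1)/2; x < (DATA_W - (FILTER_SIZE - 1)/2); x++)
	{
		// ... same body as before, without the call to Calculate_3D_Index_Fermi ...
		idx++;	// just one integer increment per iteration
	}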

I forgot to add (FILTER_SIZE - 1)/2 to the idx; when I added that back, the kernel got 2 times slower again. It seems like offset reads are much more expensive if the code is compiled with “arch=sm_20”… since I get a lot more offset reads if I add the offset to the idx.

Hmm again: I tried to loop from 0 instead, but then I got the slow kernel again - I can’t understand this.

If you think the problem is due to 128-byte loads where these used to be 32-byte transactions on GT200, then try the “-Xptxas -dlcm=cg” compiler flag (without the quotes). I think it’s described somewhere in the programming guide. It should essentially enable 32-byte transactions for loads.

Just another thought.
You’re certainly using a lot of per-thread variables, e.g. float A_matrix_2D_value[10], h_vector_2D_value[4];
I wouldn’t be surprised if some of them end up in local memory. I’d suggest putting them in shared memory if your block size isn’t too large - see the sketch below.
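For example, a rough sketch (BLOCK_SIZE is a hypothetical 1-D block size):

	#define BLOCK_SIZE 64

	// Accumulators in shared memory instead of per-thread arrays.
	// Laid out as [element][thread] so consecutive threads access
	// consecutive addresses (conflict-free shared memory banks).
	__shared__ float A_matrix_2D_value[10][BLOCK_SIZE];
	__shared__ float h_vector_2D_value[4][BLOCK_SIZE];

	for (int i = 0; i < 10; i++)
		A_matrix_2D_value[i][threadIdx.x] = 0.0f;
	for (int i = 0; i < 4; i++)
		h_vector_2D_value[i][threadIdx.x] = 0.0f;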

N.

I’ll rewrite the kernel in another way, such that each thread calculates one 3D value and then I do a reduction for each parameter.

From the code posted so far, I don’t see any obvious reason for a 4x slowdown. Would it be possible to post a self-contained, runnable mini app based on the kernel previously posted?

I messed up the registers in the PTX, so the global memory accesses I thought were the parameters of Calculate_3D_Index_Fermi were just the translation of:

	phase_difference = Phase_Differences[idx].x;
	phase_gradient = Phase_Gradients[idx].x;
	certainty = Certainties[idx].x;

Sorry for that!

Could it be responsible for a 4x difference? It could, if the three arrays are aligned differently from each other - giving fewer bank conflicts in one case, more in the other. Your idea of optimizing for coalesced accesses sounds very good, and would solve the issue.

However, there are no other memory accesses inside that loop in the PTX - good work!

I divided the kernel into three kernels (x, y, z) and changed the float4 pointers to float - now I get the same performance with sm_20…
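I.e., roughly this change per kernel (Phase_Differences_X is a hypothetical name for the x-component array):

	// Before: only the .x component of each float4 is used,
	// so at most 4 of every 16 bytes fetched are useful
	phase_difference = Phase_Differences[idx].x;

	// After: a plain float array per component - every fetched byte can be used
	phase_difference = Phase_Differences_X[idx];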