Is there a speed difference between div.approx and rcp.approx?

I ran some tests and it seems that there is no time difference between these two instructions. I generated the two with the following code (with fast math enabled):

volatile float x, y;

...

y = __fdividef(1.0f, x); //div

and

y = 1/x; //rcp

I was running code that took tens of milliseconds, so I am sure the loops weren’t being optimized away. I looked at the PTX and confirmed that both instructions were there. There was no real difference in timing. However, when I compiled the code for sm_20 (as opposed to sm_10), the register usage suddenly jumped up and the div.approx version was much faster. I am not sure what happened there, but I don’t think it is related to the question at hand.

Anyway, if div is no slower, what is the point of rcp?

Is it __frcp_[rn,rz,ru,rd] that you are using?

I am not familiar with that command. Here is the code I am compiling and the generated PTX (remember, I have the fast math option set):

The rcp.approx source code:

#include <cuda.h>

__global__ void kernel(float param)

{

	volatile float temp = param;

	

	for (int a = 0; a < 100000; a++)

		temp = 1/temp;

}

int main()

{

	kernel<<<dim3(1), dim3(1)>>>(3.14159f);

}

And the PTX code generated:

1>ptxas info : Compiling entry function ‘_Z6kernelf’ for ‘sm_10’

1>ptxas info : Used 1 registers, 4+16 bytes smem, 4 bytes cmem[1]

Runs in 6.089 milliseconds

.entry _Z6kernelf (

		.param .f32 __cudaparm__Z6kernelf_param)

	{

	.reg .u32 %r<4>;

	.reg .f32 %f<6>;

	.reg .pred %p<3>;

	.loc	16	11	0

$LDWbegin__Z6kernelf:

	.loc	16	13	0

	ld.param.f32 	%f1, [__cudaparm__Z6kernelf_param];

	mov.f32 	%f2, %f1;

	mov.s32 	%r1, 0;

$Lt_0_1794:

 //<loop> Loop body line 13, nesting depth: 1, iterations: 100000

	.loc	16	16	0

	mov.f32 	%f3, %f2;

	rcp.approx.f32 	%f4, %f3;    // <----------  This is the line of interest.

	mov.f32 	%f2, %f4;

	add.s32 	%r1, %r1, 1;

	mov.u32 	%r2, 100000;

	setp.ne.s32 	%p1, %r1, %r2;

	@%p1 bra 	$Lt_0_1794;

	.loc	16	17	0

	exit;

$LDWend__Z6kernelf:

	} // _Z6kernelf

The div.approx source code:

#include <cuda.h>

__global__ void kernel(float param)

{

	volatile float temp = param;

	

	for (int a = 0; a < 100000; a++)

		temp = __fdividef(1.0f, temp);

}

int main()

{

	kernel<<<dim3(1), dim3(1)>>>(3.14159f);

}

And the PTX code generated:

1>ptxas info : Compiling entry function ‘_Z6kernelf’ for ‘sm_10’

1>ptxas info : Used 1 registers, 4+16 bytes smem, 4 bytes cmem[1]

Runs in 6.089 milliseconds

.entry _Z6kernelf (

		.param .f32 __cudaparm__Z6kernelf_param)

	{

	.reg .u32 %r<4>;

	.reg .f32 %f<7>;

	.reg .pred %p<3>;

	.loc	16	11	0

$LDWbegin__Z6kernelf:

	.loc	16	13	0

	ld.param.f32 	%f1, [__cudaparm__Z6kernelf_param];

	mov.f32 	%f2, %f1;

	mov.s32 	%r1, 0;

$Lt_0_1794:

 //<loop> Loop body line 13, nesting depth: 1, iterations: 100000

	.loc	16	16	0

	mov.f32 	%f3, 0f3f800000;     	// 1

	mov.f32 	%f4, %f2;

	div.approx.f32 	%f5, %f3, %f4;    // <----------  This is the line of interest.

	mov.f32 	%f2, %f5;

	add.s32 	%r1, %r1, 1;

	mov.u32 	%r2, 100000;

	setp.ne.s32 	%p1, %r1, %r2;

	@%p1 bra 	$Lt_0_1794;

	.loc	16	17	0

	exit;

$LDWend__Z6kernelf:

	} // _Z6kernelf

The two programs are built from almost identical source code, save for the division part. They run in the same amount of time. I show all of the digits of their timing that don’t vary between runs (I don’t consider a variance of 0.1 microseconds to be significant). I compiled and ran both programs for sm_10, sm_11, sm_12, sm_13, and sm_20. They all do exactly the same thing. Note that my comment above about sm_20 not working was a mistake on my part; I accidentally left out the fast math option. So what is going on? Why do we have the rcp instruction if it doesn’t save any time?

In most instances, the reciprocal is faster than the equivalent division, which is why PTX provides a separate instruction for it. For example, compare the execution time of __frcp_rn() with __fdiv_rn(), or __drcp_rn() with __ddiv_rn().
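A minimal timing sketch of that comparison, reusing the loop structure from the kernels above (the kernel names are mine; __frcp_rn() and __fdiv_rn() are the documented CUDA intrinsics, assuming a device where they are available, e.g. compute capability 2.x):

#include <cuda.h>

__global__ void frcp_rn_kernel(float param)
{
	volatile float temp = param;

	for (int a = 0; a < 100000; a++)
		temp = __frcp_rn(temp);       // IEEE-rounded single-precision reciprocal
}

__global__ void fdiv_rn_kernel(float param)
{
	volatile float temp = param;

	for (int a = 0; a < 100000; a++)
		temp = __fdiv_rn(1.0f, temp); // IEEE-rounded single-precision division
}

Timed the same way as the original kernels, the reciprocal version should generally come out ahead.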

In the simplest case, div.approx maps to a reciprocal instruction followed by a multiplication with the dividend at the machine code level. It is likely that the compiler recognizes that the multiplication in this case is by 1.0f, and therefore reduces the code to just the reciprocal instruction, which means the same machine code is produced either way. You can check for yourself by disassembling the code with cuobjdump.
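For what it’s worth, a small sketch of such a check; the file name, architecture flag, and kernel names here are placeholders of mine:

// Build and disassemble, then compare the two function bodies:
//   nvcc -use_fast_math -arch=sm_20 -cubin rcp_check.cu -o rcp_check.cubin
//   cuobjdump -sass rcp_check.cubin
#include <cuda.h>

__global__ void via_operator(const float *in, float *out)
{
	*out = 1.0f / *in;            // rcp.approx at the PTX level (with fast math)
}

__global__ void via_fdividef(const float *in, float *out)
{
	*out = __fdividef(1.0f, *in); // div.approx at the PTX level
}

If PTXAS folds the multiplication by 1.0f away, both kernels should disassemble to the same reciprocal instruction.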

Your observations when switching from sm_1x to sm_2x are explained by the fact that when building for sm_1x, the single-precision division operator ‘/’ maps to an approximate division or reciprocal, either div.full.f32 or rcp.approx.f32 in PTX, depending on whether it is actually a reciprocal (meaning the dividend is 1.0f) or a true division. For sm_2x, on the other hand, the compiler defaults to single-precision division rounded according to IEEE-754 rules, meaning you will see div.rn.f32 and rcp.rn.f32 at the PTX level. The IEEE-754 compliant operations require additional instructions and registers compared to the corresponding approximate versions. When building for sm_2x you can use the compiler switch -prec-div={true|false} to control whether the single-precision division operator should be translated into approximate or IEEE-rounded operations.
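To restate those mappings in code form (my annotations, based on the description above; the exact PTX can vary with compiler version and flags):

__global__ void mappings(float x, float y, float *out)
{
	out[0] = 1.0f / x;  // sm_1x: rcp.approx.f32
	                    // sm_2x default (-prec-div=true): rcp.rn.f32
	out[1] = y / x;     // sm_1x: div.full.f32
	                    // sm_2x default (-prec-div=true): div.rn.f32
	// With -prec-div=false, sm_2x falls back to the approximate forms as well.
}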

Is there a way to call rcp.approx from CUDA? Is the throughput different between __frcp_rn and rcp.approx? Thanks!

__frcp_rn() maps to rcp.rn.f32, which is a single-precision reciprocal rounded according to IEEE-754 rules. rcp.approx.f32 is an approximate single-precision reciprocal, which in the case of the .ftz variant also happens to map to a single hardware instruction on current hardware. Since it takes a fair number of machine instructions to implement rcp.rn.f32 (regardless of FTZ mode), it follows that its throughput is generally lower than that of rcp.approx.f32.

rcp.approx.f32 is generated for 1.0f/x when compiling for sm_1x, or when compiling for sm_2x with -prec-div=false. Like any other PTX instruction it can be generated directly with inline PTX assembly as well.
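A minimal inline-PTX sketch (the wrapper name is mine; it is not a built-in intrinsic):

__device__ __forceinline__ float my_rcp_approx(float x)
{
	float r;
	// Emit the approximate reciprocal directly; use "rcp.approx.ftz.f32"
	// for the flush-to-zero variant mentioned above.
	asm("rcp.approx.f32 %0, %1;" : "=f"(r) : "f"(x));
	return r;
}

Calling my_rcp_approx() in device code then produces rcp.approx.f32 in the PTX regardless of the -prec-div setting.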

I appreciate the attention everyone has given to this topic. However, I do ask that you actually read the posts so that you don’t accidentally give answers that are clearly not related.

In general, the reciprocal is faster than division, but I am not asking about all cases in general. I am asking about one very specific function. The fact that some other instructions are faster than others is not relevant here. Why does rcp.approx exist when it offers no benefit? The question still stands.

Please read my second post above. I already checked the assembly and confirmed that the two versions weren’t simply compiling to the same PTX code. In fact, I even posted the generated code so there could be no mistake.

I’m sorry, but my observations when switching from sm_1x to sm_2x are explained by my forgetting the fast math option. The code was compiling to the wrong assembly. It was just a silly mistake on my part, nothing more. I explained this in my second post, which you really should read, as your explanation here is not at all related.

So, once again I am left asking. Why does rcp.approx exist? Is it just for the sake of completeness and nothing more? Or is there another more profound reason that I am missing?

Best I can tell, my answer with regard to -prec-div=false spoke to the issue, as -use_fast_math implies -prec-div=false for sm_2x. The fact that -use_fast_math also controls a number of other things did not seem relevant to the issue at hand.

I suggested looking at the machine code (which is not the same as the PTX code) to check whether rcp.approx and div.approx with a dividend of 1.0f both compile to the same machine code. It is likely that they do in this case, due to optimizations performed by PTXAS.

rcp.approx may not always result in the same machine code as div.approx with a dividend of 1.0f. It probably does in the case discussed here, but that is not guaranteed. For example, it probably does not happen when the program is compiled in debug mode, and it may not happen for sm_2x if the .ftz suffix is not used. Note: I have not checked either of those scenarios. Also, when the compiler decides to generate a reciprocal operation, it may not yet know whether it will be rcp.approx or rcp.rn, as that is not relevant until PTX-generation time. Having rcp.approx makes the PTX instruction set more orthogonal, which is something compilers (or rather, compiler writers) appreciate.