I am not familiar with that command. Here is the code I am compiling and the generated PTX (remember, I have the fast math option set):
The rcp.approx source code:
#include <cuda.h>

__global__ void kernel(float param)
{
    volatile float temp = param;
    for (int a = 0; a < 100000; a++)
        temp = 1/temp;
}

int main()
{
    kernel<<<dim3(1), dim3(1)>>>(3.14159f);
}
And the PTX code generated:
1>ptxas info : Compiling entry function '_Z6kernelf' for 'sm_10'
1>ptxas info : Used 1 registers, 4+16 bytes smem, 4 bytes cmem[1]
Runs in 6.089 milliseconds
.entry _Z6kernelf (
    .param .f32 __cudaparm__Z6kernelf_param)
{
    .reg .u32 %r<4>;
    .reg .f32 %f<6>;
    .reg .pred %p<3>;
    .loc 16 11 0
$LDWbegin__Z6kernelf:
    .loc 16 13 0
    ld.param.f32 %f1, [__cudaparm__Z6kernelf_param];
    mov.f32 %f2, %f1;
    mov.s32 %r1, 0;
$Lt_0_1794:
    //<loop> Loop body line 13, nesting depth: 1, iterations: 100000
    .loc 16 16 0
    mov.f32 %f3, %f2;
    rcp.approx.f32 %f4, %f3; // <---------- This is the line of interest.
    mov.f32 %f2, %f4;
    add.s32 %r1, %r1, 1;
    mov.u32 %r2, 100000;
    setp.ne.s32 %p1, %r1, %r2;
    @%p1 bra $Lt_0_1794;
    .loc 16 17 0
    exit;
$LDWend__Z6kernelf:
} // _Z6kernelf
The div.approx source code:
#include <cuda.h>

__global__ void kernel(float param)
{
    volatile float temp = param;
    for (int a = 0; a < 100000; a++)
        temp = __fdividef(1.0f, temp);
}

int main()
{
    kernel<<<dim3(1), dim3(1)>>>(3.14159f);
}
And the PTX code generated:
1>ptxas info : Compiling entry function '_Z6kernelf' for 'sm_10'
1>ptxas info : Used 1 registers, 4+16 bytes smem, 4 bytes cmem[1]
Runs in 6.089 milliseconds
.entry _Z6kernelf (
    .param .f32 __cudaparm__Z6kernelf_param)
{
    .reg .u32 %r<4>;
    .reg .f32 %f<7>;
    .reg .pred %p<3>;
    .loc 16 11 0
$LDWbegin__Z6kernelf:
    .loc 16 13 0
    ld.param.f32 %f1, [__cudaparm__Z6kernelf_param];
    mov.f32 %f2, %f1;
    mov.s32 %r1, 0;
$Lt_0_1794:
    //<loop> Loop body line 13, nesting depth: 1, iterations: 100000
    .loc 16 16 0
    mov.f32 %f3, 0f3f800000; // 1
    mov.f32 %f4, %f2;
    div.approx.f32 %f5, %f3, %f4; // <---------- This is the line of interest.
    mov.f32 %f2, %f5;
    add.s32 %r1, %r1, 1;
    mov.u32 %r2, 100000;
    setp.ne.s32 %p1, %r1, %r2;
    @%p1 bra $Lt_0_1794;
    .loc 16 17 0
    exit;
$LDWend__Z6kernelf:
} // _Z6kernelf
The two programs are almost identical, differing only in how the division is written, and they run in the same amount of time. The timings above show every digit that doesn't vary between runs (I don't consider a variance of 0.1 microseconds to be significant). I compiled and ran both programs for SM_10, SM_11, SM_12, SM_13, and SM_20, and the result is the same on every target. Note that my earlier comment about SM_20 not working was a mistake on my part: I had accidentally left out the fast math option. So what is going on? Why do we have the rcp instruction if it doesn't save time?
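The timing code itself isn't shown above. As a minimal sketch of one way such a measurement could be taken (not necessarily how the 6.089 millisecond figures were obtained), the following puts both loops in one file and times each single-thread launch with CUDA events. The names kernelRcp and kernelDiv are hypothetical renames of the two kernels, and it assumes compilation with nvcc and the -use_fast_math option:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Same loop as the first program: plain division.
// (Per the PTX above, this becomes rcp.approx.f32 when fast math is on.)
__global__ void kernelRcp(float param)
{
    volatile float temp = param;
    for (int a = 0; a < 100000; a++)
        temp = 1 / temp;
}

// Same loop as the second program: the __fdividef intrinsic (div.approx.f32).
__global__ void kernelDiv(float param)
{
    volatile float temp = param;
    for (int a = 0; a < 100000; a++)
        temp = __fdividef(1.0f, temp);
}

int main()
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float msRcp = 0.0f, msDiv = 0.0f;

    // Warm-up launch so one-time setup cost is not included in the timings.
    kernelRcp<<<dim3(1), dim3(1)>>>(3.14159f);

    // Time the plain-division kernel.
    cudaEventRecord(start, 0);
    kernelRcp<<<dim3(1), dim3(1)>>>(3.14159f);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the kernel has finished
    cudaEventElapsedTime(&msRcp, start, stop);

    // Time the __fdividef kernel.
    cudaEventRecord(start, 0);
    kernelDiv<<<dim3(1), dim3(1)>>>(3.14159f);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&msDiv, start, stop);

    printf("1/temp:     %.3f ms\n", msRcp);
    printf("__fdividef: %.3f ms\n", msDiv);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Because each launch uses a single thread and every iteration depends on the previous result, a harness like this measures the time of one long dependent chain of operations, so the two numbers are only comparable in that sense.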