sad(x,y,z) and the other integer functions: speed & use

I noticed that __mul24(x,y) is quite fast.

But I didn’t notice any difference with __sad(x,y,z) (sum of absolute difference). So is it useful to replace additions with it? Maybe even in loops?

for (int i = 10; i > 0; i--) => for (int i = 10; i > 0; i = __sad(i, 1, 0))

I am not a mathematician, so I don’t see much use for the other integer functions (maybe in counters?). Or is there a way to integrate __clz, __ffs, __popc, etc. easily, in order to speed up code?

Replacing an addition with sad() gains you nothing in terms of speed,

but in general sad(x,y,z) seems to be a very useful instruction (if one can find a good application for it), because it performs three arithmetic operations at once…

The programming guide does not say anything about its speed, but according to my tests it looks like it’s executed in 4 clock cycles per warp (special hardware?).

From actual measurements, sad() is a single-clock instruction, just like a simple add.

Thanks a lot to you both!

I am especially happy about this thread:

Nvidia should have published this information, because a lot of things are missing in their guides, and sometimes things are explained very badly (I am about to print the timings.txt!). But I also have to admit that I was too lazy to write something to find this out myself :rolleyes: . So thanks!

I had a look at the PTX, and it seems that sad(x, 0, y) results in at least two operations, because the “0” has to be created first. Therefore adding two numbers is likely to be slower with sad than with a normal addition. However, if you have three operands it seems to be faster. But one should not forget that __sad(a,b,c) is not a+b+c but |a-b|+c.

// at the beginning of the kernel

volatile int zero = 0;  // int literal, not 0.0f; volatile keeps it from being folded away

// inside your kernel

… __sad(x, zero, y) …

Then the zero is created only once and re-used over and over.