Since it is from 2016, I’d like to know:
1 - Is it already incorporated in the toolkit? I’m currently using 9.1.
2 - In either case, is sincosf() a direct replacement for __sinf() and __cosf() in this situation:
You can certainly replace the __sinf() and __cosf() in your code with a call to sincosf(). But the basic trade-offs regarding use of intrinsics remain:
1 - sincosf() provides accurate results across the entire possible range of ‘float’ inputs. __sinf() and __cosf() will provide somewhat less accurate results even on the unit circle, and quantization artifacts from the underlying fixed-point computation may be apparent in some use cases. The accuracy of the intrinsics worsens as the arguments increase in magnitude.
2 - Use of __sinf() and __cosf() will be much more efficient (it should be three instructions altogether: an RRO.SINCOS range-reduction instruction followed by MUFU.SIN and MUFU.COS). An accurate sincosf() implementation, on the other hand, requires on the order of ten times as many instructions. Since your code is memory bound, however, that computational efficiency should have little to no bearing on the performance of the kernel.
The easiest way to assess the trade-offs in the context of your use case is to simply try it and profile the resulting code, independent of the underlying implementation of sincosf(). You can also compare CUDA’s built-in sincosf() with the code I posted. That’s a ten-minute experiment altogether.
This kernel takes more or less the same time to run as the other kernels, which operate on exactly the same amount of data and are also memory-bound.
Is it something I should worry about, or is there nothing to be fixed here?
That’s a good indication that there’s nothing to worry about.
As for divergent branches, the sqrtf() implementation certainly uses some branches, although I would not expect that much divergence to occur with those unless your data is exercising the full spectrum of ‘float’ operands. In practice, most use cases involve operands to sqrt() that distribute fairly closely around 1.0, and very little divergence should occur.
I am using values between -1 and +1, and the imaginary part could, as you say, be exercising the full spectrum of float. At some point I get a lot of NaNs out of the computation, which could explain this much divergence from sqrtf().
My equation was also incorrect: the quadrature should be squared (fixed now), so there is certainly more to inspect. But now I have enough information to move forward.
That is a plausible explanation for the observed high percentage of divergent branches, because NaNs are handled by the “slow path” of the sqrtf() implementation.
Da nich’ für. [regional Northern German for: Don’t mention it]