Look-Up Table vs __sincosf for Large-Scale Random Phase Calculations in Radio Astronomy Pipeline

Memory Types Explored:

  1. Constant memory: Works well for small datasets (<2M points) but degrades significantly for 64M+ due to serialized access when threads read different addresses

  2. Texture memory: Similar performance to constant memory; no improvement for random access patterns

  3. Global memory: Standard cached reads; still slower than __sincosf at scale

  4. __ldg() intrinsic: Read-only cache; no measurable difference

Performance Results (K40 GPU):

Sample Size    LUT Time (μs)    __sincosf Time (μs)    Ratio (LUT / __sincosf)
1,024          18               36                     0.50x
2M             394              398                    0.99x
64M            11,907           8,038                  1.48x
400M           70,898           49,623                 1.43x
Questions:

  1. Is there a fundamental reason why __sincosf() outperforms the LUT for large random-access patterns? My hypothesis is that the intrinsic benefits from hardware-level optimizations (SFU units?) that outweigh memory latency at this scale.
  2. Are there any memory access patterns or caching strategies I haven’t considered that could make LUT competitive for this use case?

At what point should I accept that __sincosf() is simply the right tool for this job? NVIDIA has clearly invested significant optimization effort into these intrinsics. Any approach that achieves <0.9x the execution time of __sincosf() for 64M+ random phase values while maintaining <0.002 radian error would be valuable. I appreciate any insights from the community on whether I’m hitting fundamental performance limits or if there’s an approach I haven’t considered.

You don’t mention the size of your LUT.

You could try loading the LUT into shared memory. You will still run into serialisation there, but the impact will be less than that under constant memory.

Some idea of the difference can be seen comparing Figure 3.11 with Figure 3.9 for the K80 in this paper. The situation is somewhat better with later model architectures.

I have two lookup tables, one for sin and one for cos. Both store 1440 floats, i.e. 5760 bytes each. Both lookup tables are read inside the kernel. Below is the general structure of the LUT kernel:

__global__ void lut_kernel(const float *phases, float *sin_out,
                           float *cos_out, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx < N) {
        int lut_idx = (int)(phases[idx] * constant_multiplication_factor);
        sin_out[idx] = sin_lut[lut_idx];
        cos_out[idx] = cos_lut[lut_idx];
    }
}

whereas the other kernel simply calls __sincosf once for phases[idx].
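
For reference, a minimal sketch of that comparison kernel (same launch configuration assumed):

__global__ void sincos_kernel(const float *phases, float *sin_out,
                              float *cos_out, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        __sincosf(phases[idx], &sin_out[idx], &cos_out[idx]);
}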

Shared memory is actually performing worse than constant memory because we have 400M data points (phases), and hence a large number of blocks are created. The best performance is currently achieved at a block size of 128.

Sine and cosine act like linear or quadratic polynomials depending on the phase.

So you can directly calculate a first approximation and just load some correction value from the LUT.

Can you reduce the LUT size sufficiently to use shared memory and store the LUT 32 times to avoid bank conflicts? → One LUT per lane
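
A minimal sketch of what that could look like (untested; the table size, the cooperative fill, and the scale factor are illustrative assumptions). The layout lut[entry][lane] places lane l in bank l for every entry, so divergent indices no longer conflict, and cos is read from the same table at a quarter-period offset:

// scale is assumed to be LUT_SIZE / (2*pi); a larger table than this would
// need the dynamic shared-memory opt-in above the 48 KB static limit.
#define LUT_SIZE 256                  // power of 2; 256 * 32 * 4 B = 32 KB

__global__ void lut_per_lane(const float *phases, float *sin_out,
                             float *cos_out, int N, float scale)
{
    __shared__ float lut[LUT_SIZE][32];          // 32 copies, one per lane

    int lane = threadIdx.x & 31;                 // AND, not modulo
    // Cooperative fill: warp w writes rows w, w + nWarps, ... and each
    // thread writes only its own lane column.
    for (int i = threadIdx.x >> 5; i < LUT_SIZE; i += blockDim.x >> 5)
        lut[i][lane] = sinf(i * (6.2831853f / LUT_SIZE));
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {
        int k = (int)(phases[idx] * scale) & (LUT_SIZE - 1);   // wrap to table
        sin_out[idx] = lut[k][lane];
        cos_out[idx] = lut[(k + LUT_SIZE / 4) & (LUT_SIZE - 1)][lane];
    }
}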

I’ve tried several approaches to see if LUTs could compete with __sincosf on my NVIDIA A100 GPU (Ampere architecture):

  • Shared memory LUT (1440 points, ~5.6 KB): Fits easily in the 164 KB shared memory per SM, but performance was worse than __sincosf. The issue is that threads in a warp often access different LUT entries, which causes bank conflicts and serialization.

  • Constant memory LUT: Also tested, but since each thread reads a different entry, accesses serialize. Constant memory only shines when all threads read the same value (broadcast).

  • Taylor polynomial approximation (up to 3 terms): Tried to reduce LUT size by approximating sine/cosine with low‑order polynomials. However, the extra FMAs and range reduction overhead made this slower than __sincosf.

  • Replication trick (one LUT per lane): Considered replicating the LUT 32× in shared memory to avoid bank conflicts. But my 1440‑point table would balloon to ~180 KB, which exceeds the 164 KB per SM budget.

  • Result: Even though the LUT fits in fast memory, __sincosf consistently outperforms these methods on A100. The SFU units are deeply pipelined, so their latency is hidden across warps, while LUT approaches remain memory‑bound.

So far, every attempt (shared, constant, polynomial + LUT, replication) has been slower than simply calling __sincosf on A100. But replicating to avoid bank conflicts is a really clever trick! I appreciate the help, thank you!

You could also try out a smaller replication factor for shared memory. 16x replication (2x bank conflict) will easily fit on A100.

Or use ~1312 points (1312 × 32 copies × 4 bytes = 164 KB, the full shared-memory budget).

What throughput do you achieve per SM? How many sincos per cycle?

You could also use Tensor Cores for polynomials without LUT.

Thank you both for the suggestions. After analyzing the cycle costs, I believe I’ve identified the fundamental issue. Below are the throughput measurements:

  • __sincosf: ~0.72 sincos/cycle/SM (~8 cycles per operation)

  • LUT (global): ~0.50 sincos/cycle/SM (~12 cycles per operation)

I think the bottleneck is index computation, not memory access. Even with shared memory and banking tricks, each thread must compute:

int lane = threadIdx.x % 16;     // ~20 cycles (modulo)
int deg = (int)(c * rd[idx]);    // ~6 cycles (multiply + cast)

This alone exceeds the ~8 cycle cost of __sincosf. The hardware SFU units are simply faster than the arithmetic required to compute array indices. I appreciate the suggestions - they helped me understand the fundamental performance limits. For my radio astronomy application, it now seems __sincosf is the better option.

Each thread should compute more than one sincos so the modulo is done only once for all. The % 16 uses AND instead of division as it is a power of 2.

(there are 32 lanes, not 16)
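
Concretely, the hoisting might look like this inside the replicated-LUT sketch from earlier in the thread (names reused from that sketch; grid-stride loop assumed):

// Replace the per-element body of lut_per_lane: the lane index is computed
// once per thread, then amortized over many elements by a grid-stride loop.
int lane = threadIdx.x & 31;                 // compiles to AND, not IDIV
for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < N;
     idx += gridDim.x * blockDim.x) {
    int k = (int)(phases[idx] * scale) & (LUT_SIZE - 1);
    sin_out[idx] = lut[k][lane];
    cos_out[idx] = lut[(k + LUT_SIZE / 4) & (LUT_SIZE - 1)][lane];
}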

How do you compute cycles per operation? The A100 has 16 SFU/SM or 4 per SM partition.

It is very well possible that __sincosf is the fastest you can get.

If you have 2 ways to compute it, you can potentially use both in parallel for more performance than one way alone.

But your actual bottleneck should be loading the angles or phases. Or are they computed?

@Curefab

16 is used with 16 x LUT replication.

@saumyag9305

Have you tried Norbert’s sincospi() implementation, with “constant_multiplication_factor” replacing pi (assuming you aren’t using pi)?

The current __sincosf() version could have advanced since he wrote it of course.

16x LUT replication would mean each shared memory access needs 2 cycles. With 4 SM partitions, each partition could do one (4-byte) read every 8 cycles.

So to surpass __sincosf it should be 32xLUT.

Regarding “At what point should I accept that __sincosf() is simply the right tool for this job?”: any time your code is not bottlenecked by MUFU operations and is not negatively affected by the quantization effect on results near zero. It can be instructive to examine what __sincosf() does at the machine code (SASS) level:

FMUL.RZ R2, R2, 0.15915493667125701904 
MUFU.SIN R3, R2 
MUFU.COS R0, R2 

Minimal static and dynamic instruction count, minimal register pressure. All the MUFU instructions are based on table lookup with quadratic interpolation, but the table is in a dedicated ROM and therefore does not interact with normal memory operations. The interpolation uses dedicated fixed-point arithmetic that uses the narrowest width operations it can get away with. This approach creates an observable quantization effect for small results, e.g. sine near zero drops to zero before the argument even gets into subnormal territory, instead of smoothly approaching via sin(x) ~= x. Any custom-tailored LUT approach would suffer from similar quantization.

The dedicated hardware is obviously the main reason that MUFU operations cannot be offered at the full throughput of regular FP32 arithmetic: this would simply utilize too many transistors and thus silicon real estate. Silicon that would be mostly idle in many use cases. Traditionally MUFU instructions executed with a throughput of 1/4 of FP32 FMA, on the most recent architectures this has been reduced to 1/8 of FFMA throughput. Unless this lower throughput becomes a bottleneck (because of a dense accumulation of MUFU operations in hotspots), there is no point in trying to replace __sincosf() with a manually crafted solution.

If one got into that situation, it would be difficult to improve performance while maintaining the overall functionality of the intrinsic. If, however, there are special circumstances such as a restriction to certain angles known in advance, or evaluation for multiple angles in constant angle increments, or arguments multiplied by π, then it might be worthwhile to look into the issue.

Interesting. So the hardware implementation had not been introduced back when you wrote sincospif().

Not sure what you mean. On the earliest CUDA-capable GPUs MUFU was called SFU, but the table lookup with quadratic interpolation was there from the beginning (see 2005 publication by Oberman and Siu), as was the exposure of MUFU operations via device intrinsics.

Some of the details have changed: MUFU gained a SQRT sub-op around Pascal and a TANH sub-op around Ampere, and necessary argument reductions were (mostly) pulled from a separate RRO.{EX2|SINCOS} instruction into the MUFU operations themselves around Volta. Leaving the FMUL in the code shown as the left-over part that did not fit into the MUFU itself.

sincospif() is not a device intrinsic, just a regular CUDA math library function, and I added it fairly late, around 2011, I think? I forget which use case prompted that addition, but trig functions with argument multiplied by π are fairly common, and the reduction in code size and register usage (and improved accuracy, in some scenarios) warranted the addition.

We had a recent discussion here about the addition of a __sincospif() device-side intrinsic, which I suggested the relevant poster request from NVIDIA by RFE, but I don’t think that has been added yet? I am still on CUDA 12.8 and don’t know what has been added since then.


Stuart F. Oberman and Michael Y. Siu: “A High-Performance Area-Efficient Multifunction Interpolator.” In 17th IEEE Symposium on Computer Arithmetic, June 2005, pp. 272-279

A misunderstanding on my part. I confused the sincos() in:

“Compared with sin(), cos(), and sincos() performance and accuracy are improved, while code size and register pressure are reduced.”

with the intrinsic __sincosf().

while maintaining <0.002 radian error

I am assuming that is the acceptable error for the result, not for the function argument.

=======

Full Taylor approach:

If you stay within ±45°, this error bound can be achieved without a lookup:

sin x ≈ x − x³/6

cos x ≈ 1 − x²/2 + x⁴/24

You can expand to the full circle with swapping and negation.
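
A sketch of how that expansion might look (my illustration, not the poster’s code; the constants and quadrant selection are assumptions, and very large arguments would need stronger range reduction):

__device__ void sincos_taylor(float x, float *s, float *c)
{
    float q = rintf(x * 0.636619772f);           // nearest multiple of pi/2
    float r = fmaf(q, -1.57079633f, x);          // r = x - q*pi/2, in [-pi/4, pi/4]
    float r2 = r * r;
    float sr = r * fmaf(r2, -1.0f / 6.0f, 1.0f);               // r - r^3/6
    float cr = fmaf(fmaf(r2, 1.0f / 24.0f, -0.5f), r2, 1.0f);  // 1 - r^2/2 + r^4/24
    int quad = (int)q & 3;                       // quadrant modulo 4
    float s0 = (quad & 1) ? cr : sr;             // swap in odd quadrants
    float c0 = (quad & 1) ? sr : cr;
    *s = (quad & 2) ? -s0 : s0;                  // negate sin in quadrants 2, 3
    *c = ((quad + 1) & 2) ? -c0 : c0;            // negate cos in quadrants 1, 2
}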

=======

6 FMA operations should be enough to get both sin x and cos x from x in the interval.

=======

The requested accuracy would work with FP16 (half precision).

The A100 has double the FMA throughput for FP16 vs. 32-bit float.

=======

Even if you reach only half the speed of __sincosf you can run both approaches in parallel to possibly get 1.5x the speed.

=======

Small LUT approach:

With a small LUT 32x in shared memory you could use coarse angles and correct with fine angles (rotations) to massively reduce the needed LUT size.

sinfine = sincoarse + coscoarse * deltaphi

cosfine = coscoarse - sincoarse * deltaphi

That would be 2 FMAs in FP16.
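
A sketch of that idea (my illustration; the table size and the use of constant memory, chosen here only for brevity, are assumptions, and the replicated shared-memory layout discussed above would apply in practice). With 64 coarse steps the residual angle is at most π/64 ≈ 0.049 rad, so the first-order error is about 0.049²/2 ≈ 0.0012 rad, inside the 0.002 budget:

#define COARSE 64                       // coarse steps per 2*pi period
__constant__ float2 coarse_lut[COARSE]; // {sin, cos} pairs, filled once from
                                        // the host via cudaMemcpyToSymbol

__device__ void sincos_coarse_fine(float phi, float *s, float *c)
{
    float t  = phi * (COARSE / 6.2831853f);        // phase in table units
    float tn = rintf(t);                           // nearest coarse angle
    int   k  = (int)tn & (COARSE - 1);             // wrap to one period
    float dphi = (t - tn) * (6.2831853f / COARSE); // residual angle, radians
    float2 sc = coarse_lut[k];                     // sc.x = sin, sc.y = cos
    *s = fmaf( sc.y, dphi, sc.x);                  // sinfine = sincoarse + coscoarse * dphi
    *c = fmaf(-sc.x, dphi, sc.y);                  // cosfine = coscoarse - sincoarse * dphi
}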

Taking it from a Taylor approximation to a near-minimax approximation, one would have something like this:

// x in [-pi/4, +pi/4] 
float x2 = x * x;
float sin_x = fmaf (-1.62427915e-1f * x2, x, x);
float cos_x = fmaf (fmaf (4.08139193e-2f, x2, -4.99934056e-1f), x2, 1.0f);

This would result in 3 FFMAs, 2 FMULs, and 1 MOV-type op. The mapping to the full circle would probably more than double this instruction count, but I have not worked out the details of that.

Or __hfma2 for computing two half2 fma at the same time.
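
For illustration, the cosine polynomial above evaluated for two packed half-precision angles at once (a sketch only; packing and unpacking at the interface are omitted):

#include <cuda_fp16.h>

// Evaluates 1 + c2*x^2 + c4*x^4 for two angles packed in a half2,
// with the scalar coefficients broadcast into both halves.
__device__ half2 cos_poly_h2(half2 x)
{
    half2 x2 = __hmul2(x, x);
    half2 p  = __hfma2(__half2half2(__float2half( 4.08139193e-2f)), x2,
                       __half2half2(__float2half(-4.99934056e-1f)));
    return __hfma2(p, x2, __half2half2(__float2half(1.0f)));
}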

If the surrounding code is already using packed FP16 data, that’s the way to go, but otherwise it creates additional overhead for packing and unpacking at the interface. I am not a friend of explicit SIMD these days, but if one wants to go down that path for Ninja-style code …

Thank you all SO much for this incredibly detailed discussion! Special thanks to @njuffa - having the actual author of __sincosf explain the implementation is beyond what I could have hoped for.

Addressing all the points:

The phases are loaded from global memory (rd[idx]), not computed. They represent geometric delays that are pre-calculated by the telescope control system. So yes, memory bandwidth could be a limiting factor.

On measurement methodology, I used cudaEventRecord() with cudaEventElapsedTime() for kernel timing across K40, A100, and T4 GPUs. Tested various memory types (constant/texture/global) with sample sizes from 1K to 400M data points.
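
The harness looked essentially like this, using the __sincosf kernel sketched earlier (buffer names are placeholders; block size 128 as noted above):

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
sincos_kernel<<<(N + 127) / 128, 128>>>(d_phases, d_sin, d_cos, N);
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);   // milliseconds between the events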

On the small LUT + fine angle correction, this is fascinating! If I understand correctly: store ~32 coarse angles in shared memory, then apply 2-FMA rotation corrections for the fine angle difference. This would avoid the large-LUT memory access patterns entirely. I would need to implement and benchmark it, but the memory access pattern would be much friendlier. However, I still suspect __sincosf would win on speed at sizes as large as 400M samples.

Thank you all again for the detailed technical insights. I now have a much better understanding of GPU architecture and performance optimization. Even though the LUT approach didn’t work out, the investigation was valuable, both for my project and for documenting why __sincosf is the right choice for this type of workload.

While LUTs work well on the CPU (a LUT-to-sincos time ratio of about 0.556x, i.e. roughly a 1.8x speedup) and for smaller datasets on GPUs, GPU architecture fundamentally changes the tradeoffs: my takeaway is that dedicated hardware beats memory-based approaches.