Flops counter for number-theoretic transforms


I have a fresh implementation of a number-theoretic transform (NTT) on CUDA (i.e., an FFT over a finite field),
which I want to benchmark.

I looked through Vasily Volkov’s code and found that he computes Gflop/s as follows:

(5 * n log n / 10^9) * batch / sec

where n is the FFT size, batch is the number of parallel runs, and sec is the elapsed time in seconds.
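
As a rough sketch, the formula above could be written as a small host-side helper. This assumes the logarithm is base 2, which is the usual convention for the 5 n log n FFT operation count; the function name `nominal_fft_gflops` is hypothetical and is not taken from Volkov's code.

```cpp
#include <cassert>
#include <cmath>

// Nominal FFT throughput in Gflop/s: (5 * n * log2(n) / 1e9) * batch / sec.
// Assumption: the logarithm is base 2. The function name is hypothetical.
double nominal_fft_gflops(double n, double batch, double sec) {
    // All inputs must be positive for the rate to be meaningful.
    assert(n > 0.0 && batch > 0.0 && sec > 0.0);
    // Nominal floating-point operation count of one length-n transform.
    const double flops_per_transform = 5.0 * n * std::log2(n);
    return flops_per_transform / 1e9 * batch / sec;
}
```

For example, n = 1024, batch = 1000 and sec = 0.001 gives 51.2 Gflop/s under this convention. Note that this is a nominal count for a complex FFT, not a count of the operations an NTT actually performs.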

Given that the NTT is a real-valued transform, I wonder whether any of the FFT “gurus” know the correct theoretical bound for
NTTs that can be used to estimate performance.