Hi!
I’m about to write a benchmark suite for CUDA that includes FFT, LU decomposition, a sparse solver, and bitonic sort. I’m wondering where all these GFLOP counts come from. How do I measure GFLOPS for each of these tasks? Do I have to know the implementation of each algorithm to calculate it?
Many of the functions you mention have well-known (or at least bounded) theoretical operation counts. That is how classic benchmarks like LINPACK work: the reported FLOP rate is simply the theoretical operation count divided by the measured execution time. The actual implementation is not considered, so an implementation that performs extra or redundant operations is still credited only with the theoretical count.
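To make that concrete, here is a minimal sketch of the calculation. The operation-count formulas below are the conventional ones (LINPACK credits a dense solve with 2/3·n³ + 2·n² flops; a length-n complex FFT is conventionally counted as 5·n·log₂(n) flops); the timing uses NumPy's LU-based solver purely as a stand-in for whatever kernel you actually benchmark:

```python
import time
import numpy as np

def gflops(flop_count, seconds):
    """GFLOPS = theoretical flop count / elapsed seconds / 1e9."""
    return flop_count / seconds / 1e9

# Conventional theoretical counts (independent of the implementation):
def lu_solve_flops(n):
    return (2.0 / 3.0) * n**3 + 2.0 * n**2   # LINPACK convention

def fft_flops(n):
    return 5.0 * n * np.log2(n)              # common complex-FFT convention

# Example: time a dense solve and report GFLOPS against the
# theoretical count, ignoring how the solver works internally.
n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n)

start = time.perf_counter()
np.linalg.solve(a, b)
elapsed = time.perf_counter() - start

print(f"n={n}: {gflops(lu_solve_flops(n), elapsed):.2f} GFLOPS")
```

The same pattern applies to your CUDA kernels: pick (or derive) the accepted operation count for each algorithm, time the kernel, and divide. Note that bitonic sort is comparison-based, so it is usually reported in elements/second or sorted keys/second rather than GFLOPS.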