cuFFT profiling and anomalies in actual time factor

| N | Tavg, ns | N*log2(N) | Expected time factor (per step, N to 2N) | Actual time factor (per step, N to 2N) | Expected time factor (cumulative, relative to N = 64) | Actual time factor (cumulative, relative to N = 64) |
|---|---|---|---|---|---|---|
| 64 | 2.138 | 384 | | | 1.000 | 1.00 |
| 128 | 2.329 | 896 | 2.333 | <b>1.09</b> | 2.333 | 1.09 |
| 256 | 4.320 | 2048 | 2.286 | <b>1.85</b> | 5.333 | 2.02 |
| 512 | 21.912 | 4608 | 2.250 | <b>5.07</b> | 12.000 | 10.25 |
| 1024 | 29.925 | 10240 | 2.222 | <b>1.37</b> | 26.667 | 14.00 |
| 2048 | 61.366 | 22528 | 2.200 | <b>2.05</b> | 58.667 | 28.71 |
| 4096 | 80.729 | 49152 | 2.182 | <b>1.32</b> | 128.000 | 37.77 |
| 8192 | 166.432 | 106496 | 2.167 | <b>2.06</b> | 277.333 | 77.86 |

This table shows the average time each FFT size takes to compute, its complexity term N*log2(N), the expected time factor between consecutive sizes (calculated as 2N*log2(2N) / (N*log2(N))), the actual time factor (calculated as Tavg(2N)/Tavg(N)), and the cumulative time factors relative to N = 64.
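For anyone who wants to re-check the per-step columns, here is a minimal Python sketch that recomputes them from the N and Tavg values in the table. The function names `work` and `step_factors` are made up for this post, and the script only re-derives ratios from the numbers already shown; it does not measure anything.

```python
import math

# (N, Tavg) pairs copied from the table above; Tavg is in the table's units.
measurements = [
    (64, 2.138), (128, 2.329), (256, 4.320), (512, 21.912),
    (1024, 29.925), (2048, 61.366), (4096, 80.729), (8192, 166.432),
]

def work(n):
    """Complexity term N*log2(N), used as the expected cost of one FFT."""
    return n * math.log2(n)

def step_factors(data):
    """Return (N, expected, actual) for each step from the previous size to N."""
    rows = []
    for (n_prev, t_prev), (n, t) in zip(data, data[1:]):
        expected = work(n) / work(n_prev)  # ratio of N*log2(N) terms
        actual = t / t_prev                # ratio of measured average times
        rows.append((n, expected, actual))
    return rows

for n, expected, actual in step_factors(measurements):
    # Mark the steps where the measured ratio exceeds the N*log2(N) ratio.
    flag = "  <-- above expected" if actual > expected else ""
    print(f"{n:5d}  expected {expected:.3f}  actual {actual:.2f}{flag}")
```

With the table's values, the step into N = 512 is the only one where the actual factor comes out above the expected one.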

The time would be expected to increase by a little more than a factor of 2 from each block size N to the next larger block size 2N, since the calculations for an FFT are proportional to N*log2(N). However, the results show that at most steps it does not increase this much. So for some reason, the GPU is more efficient at the larger FFT block sizes than at the smaller ones. What is the reason for this pattern?

Does this have something to do with the CUDA block size and grid size it is using? I noticed that the time increased by 5x from FFT size 256 to FFT size 512, and this is the same FFT size at which the CUDA block size changed from (16,16,1) to (64,2,1).

It’s typical that the GPU becomes more efficient on almost any workload as you get closer to saturation. Closer to saturation means the GPU has more threads, more exposed parallelism, and more opportunity to hide latency.

Timing the GPU on ridiculously small operations such as an N=64 FFT isn’t going to be very instructive if you want to evaluate performance based on problem size. You are almost certainly running into some other performance limiter, such as latency.
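To illustrate the latency point, here is a toy model in Python, assuming the time per FFT is a fixed per-call overhead plus a term proportional to N*log2(N). Both constants are hypothetical and picked only for illustration; they are not fitted to the table and say nothing about what cuFFT actually does internally.

```python
import math

# Hypothetical constants, chosen only for illustration (not measured):
FIXED_OVERHEAD = 2.0   # per-call cost that does not depend on N
COST_PER_UNIT = 0.001  # cost per unit of N*log2(N) work

def modeled_time(n):
    """Toy model: fixed overhead plus work proportional to N*log2(N)."""
    return FIXED_OVERHEAD + COST_PER_UNIT * n * math.log2(n)

def modeled_step_factor(n):
    """Time ratio when going from size n to size 2n under the toy model."""
    return modeled_time(2 * n) / modeled_time(n)

def expected_step_factor(n):
    """Ratio of the pure N*log2(N) work terms for the same step."""
    return (2 * n * math.log2(2 * n)) / (n * math.log2(n))

for n in (64, 1024, 2**20):
    # At small n the overhead dominates and the modeled ratio stays near 1;
    # at large n it approaches the N*log2(N) ratio.
    print(n, round(modeled_step_factor(n), 3), round(expected_step_factor(n), 3))
```

Under these assumptions the step factor at small sizes sits well below the N*log2(N) ratio and only approaches it once the work term dominates. This toy model only ever produces ratios below the expected factor, so it does not by itself account for the step from 256 to 512.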

But why would the actual time factor for the 512-point FFT be more than what is expected?