| N | Tavg (ns) | N·log2(N) | Expected factor (N → 2N) | Actual factor (N → 2N) | Expected factor (vs N = 64) | Actual factor (vs N = 64) |
|------|---------|--------|-------|---------------|---------|-------|
| 64   | 2.138   | 384    |       |               | 1.000   | 1.00  |
| 128  | 2.329   | 896    | 2.333 | <b>1.09</b>   | 2.333   | 1.09  |
| 256  | 4.320   | 2048   | 2.286 | <b>1.85</b>   | 5.333   | 2.02  |
| 512  | 21.912  | 4608   | 2.250 | <b>5.07</b>   | 12.000  | 10.25 |
| 1024 | 29.925  | 10240  | 2.222 | <b>1.37</b>   | 26.667  | 14.00 |
| 2048 | 61.366  | 22528  | 2.200 | <b>2.05</b>   | 58.667  | 28.71 |
| 4096 | 80.729  | 49152  | 2.182 | <b>1.32</b>   | 128.000 | 37.77 |
| 8192 | 166.432 | 106496 | 2.167 | <b>2.06</b>   | 277.333 | 77.86 |
Sorry for the misaligned table; I couldn't find an option to insert one.
This table shows the average time each FFT size takes to compute, its order of complexity, the expected time factor between consecutive sizes (calculated as 2N·log2(2N) / (N·log2(N))), the actual time factor (calculated as Tavg(2N) / Tavg(N)), and the cumulative expected and actual factors relative to N = 64.
The time would be expected to increase by slightly more than a factor of 2 from each block size N to the next larger block size 2N, since the amount of computation for an FFT is proportional to N·log2(N). However, the results show a much smaller increase, so for some reason the GPU is more efficient at the larger FFT block sizes than at the smaller ones. What is the reason for this pattern?
Does this have something to do with the CUDA block size and grid size being used? I noticed that the time increased by about 5x from FFT size 256 to FFT size 512, which is the same FFT size at which the CUDA block size changed from (16,16,1) to (64,2,1).