While studying the convolutionFFT2D sample project, I found that with a matrix size of 1000x1000 the GPU time is 7.44 ms (134.33 Mpix/s). When I change the matrix size to 2000x2000, with the padded size still a multiple of 1024, the GPU time rises sharply to 44 ms and the throughput is only 90.7 Mpix/s. I tried changing the grid and block parameters, but that had no obvious effect and did not bring the throughput above 100 Mpix/s. So I want to know what factors affect the throughput, and what I should do to improve it.

The first factor is general: for power-of-two signal sizes the total number of memory/math operations is O(W * H * log2(W * H)), so the work per signal element is O(log2(W * H)), which grows as the sizes increase. log2(4M) / log2(1M) == 1.1, i.e. 10% more operations per signal element.
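To make that arithmetic concrete, here is a small Python sketch. It assumes each dimension is padded up to the next power of two (1000 -> 1024, 2000 -> 2048), which matches the 1M and 4M figures above; the helper names are made up for illustration, and the real sample may pad differently once the kernel size is taken into account.

```python
import math

def next_pow2(n):
    """Smallest power of two that is >= n (assumed padding rule)."""
    p = 1
    while p < n:
        p *= 2
    return p

def log2_work_per_element(width, height):
    """log2(W * H) for the padded size: the per-element factor in O(W*H*log2(W*H))."""
    w, h = next_pow2(width), next_pow2(height)
    return math.log2(w * h)

small = log2_work_per_element(1000, 1000)  # padded to 1024 x 1024 = 1M points
large = log2_work_per_element(2000, 2000)  # padded to 2048 x 2048 = 4M points

# Ratio of per-element work between the two sizes (22 / 20).
ratio = large / small
print(f"log2(1M) = {small:.0f}, log2(4M) = {large:.0f}, ratio = {ratio:.2f}")
```

So the operation count alone accounts for only about 10% of the slowdown per element; the rest has to come from the memory-related factor described next.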

The second factor is CUFFT and CUDA-specific: there are different kinds of memory in CUDA. Thanks to fast on-chip shared memory, the elementary in-shared memory 1D transformations do not involve intermediate global memory loading and storing at all. But the maximum that G80’s shared memory can fit is 1024 complex elements (in the case of power-of-two vectors).

Transformations of larger power-of-two-sized vectors are combined from the ‘elementary’ 1024-point transformations: first the same elementary kernel is invoked on the elementary subvectors of the large input vector, next a “finalizing” kernel is invoked, reading and writing the whole vector from/to global memory at each iteration.

Very long 1D vectors (up to 8M points) are processed by means of Cooley-Tukey decomposition.
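As an illustration of how a long transform can be assembled from shorter ones, here is a minimal Python sketch of a single Cooley-Tukey step for a length N = N1 * N2 vector. It is not CUFFT's actual code: the function names are hypothetical, a naive DFT stands in for the "elementary" in-shared-memory transform, and the sizes are tiny so it can be checked against a direct DFT.

```python
import cmath

def dft(x):
    """Naive O(n^2) DFT; stands in for the 'elementary' transform and serves as reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def cooley_tukey(x, n1, n2):
    """Length n1*n2 DFT built from n1 DFTs of length n2 and n2 DFTs of length n1."""
    n = n1 * n2
    assert len(x) == n
    # Pass 1: transform each strided subvector x[j1], x[j1 + n1], x[j1 + 2*n1], ...
    inner = [dft(x[j1::n1]) for j1 in range(n1)]
    out = [0j] * n
    for k2 in range(n2):
        # Multiply by twiddle factors exp(-2*pi*i*j1*k2/n), then
        # Pass 2: combine the partial results with a length-n1 transform.
        col = [inner[j1][k2] * cmath.exp(-2j * cmath.pi * j1 * k2 / n)
               for j1 in range(n1)]
        col_f = dft(col)
        for k1 in range(n1):
            out[k1 * n2 + k2] = col_f[k1]
    return out

# Check the decomposition against the direct DFT on a small test vector.
x = [complex(i % 7, (3 * i) % 5) for i in range(32)]
direct = dft(x)
split = cooley_tukey(x, 4, 8)
max_error = max(abs(a - b) for a, b in zip(direct, split))
print("max error:", max_error)
```

Every pass in this scheme reads and writes the whole vector, which on the GPU means a round trip through global memory once the vector no longer fits in shared memory.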