What is your system configuration?
How big is the L2 norm reported before the “Test FAILED” message?
As a general note, most current FFT implementations, including CUFFT, have varying precision and performance for different transform sizes, and this depends not only on the absolute size (small vs. huge) but also on how the size factors into the library’s radix primes. A priori, different FFT ‘plans’ mean different computation paths.
Your problems might be due to these reasons:
The CPU carries out operations in double or long double precision by default, converting results back to float only on memory writes. Double precision is not available on current NVIDIA GPUs.
Even non-transcendental floating-point computations, such as convolution, can be carried out with “ultimate” precision only on a limited range of input data, whether in single, double, or long double precision. Too much “resolution” in the input data can be another cause: RAND_MAX is only 32767 in Windows (Visual Studio) but can be much larger with GCC, so single vs. double precision in the CPU convolution calculation makes a visible difference.
For #1, you can use GCC’s “-ffloat-store” flag to make it truncate intermediate results back to single precision after each operation.
For #2, you can try to reduce the “resolution” of input data:
h_filter_kernel[i].x = rand() % 256;
h_filter_kernel[i].y = 0;
h_signal[i].x = rand() % 256;
h_signal[i].y = 0;
This relies on a property of single-precision floating point: if additions and multiplications are performed on single-precision data holding integer values, and the result and operands of every operation stay below 2^24 = 16,777,216 in absolute value, the end result is computed exactly.