In my code, I make heavy use of out-of-place real-to-complex and
complex-to-real FFTs at many different sizes. Motivated by the
release highlights, which announce significantly improved FFT
performance, I updated the CUDA toolkit from 3.1 to 3.2.
However, I found that CUFFT 3.2 performs considerably worse than
release 3.1. As a test, I directly compared the runtimes of both
releases in a toy program with array sizes up to 8192 elements,
which confirmed the finding.
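
For reference, a toy comparison of this kind could look roughly like the sketch below. It is not the exact program used for the measurements above: the helper name `time_roundtrip`, the power-of-two sizes, the repetition count, and the single warm-up pass are all assumptions chosen for illustration. It reports both the host wall-clock time and the GPU time measured with CUDA events, so the two can be compared per size.

```cuda
// Build (assumed): nvcc bench.cu -lcufft -o bench
#include <cstdio>
#include <cstdlib>
#include <sys/time.h>
#include <cuda_runtime.h>
#include <cufft.h>

// Host wall-clock time in milliseconds.
static double wall_ms() {
    timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1e3 + tv.tv_usec * 1e-3;
}

// Hypothetical helper: runs `reps` out-of-place R2C + C2R round trips of
// length n. Returns the GPU time in ms (CUDA events) and stores the host
// wall-clock time in ms in *host_ms. Plan creation is NOT included in either.
static float time_roundtrip(int n, int reps, double *host_ms) {
    cufftReal *d_real = NULL;
    cufftComplex *d_cplx = NULL;
    cudaMalloc((void **)&d_real, sizeof(cufftReal) * n);
    cudaMalloc((void **)&d_cplx, sizeof(cufftComplex) * (n / 2 + 1));
    // All-zero input stays zero through the unnormalized round trips.
    cudaMemset(d_real, 0, sizeof(cufftReal) * n);

    cufftHandle fwd, inv;
    if (cufftPlan1d(&fwd, n, CUFFT_R2C, 1) != CUFFT_SUCCESS ||
        cufftPlan1d(&inv, n, CUFFT_C2R, 1) != CUFFT_SUCCESS) {
        fprintf(stderr, "plan creation failed for n=%d\n", n);
        exit(EXIT_FAILURE);
    }

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up pass, then wait until the device is idle.
    cufftExecR2C(fwd, d_real, d_cplx);
    cufftExecC2R(inv, d_cplx, d_real);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    double t0 = wall_ms();
    cudaEventRecord(start, 0);
    for (int i = 0; i < reps; ++i) {
        cufftExecR2C(fwd, d_real, d_cplx);
        cufftExecC2R(inv, d_cplx, d_real);
    }
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait for all queued FFTs to finish
    *host_ms = wall_ms() - t0;

    float gpu_ms = 0.0f;
    cudaEventElapsedTime(&gpu_ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cufftDestroy(fwd);
    cufftDestroy(inv);
    cudaFree(d_real);
    cudaFree(d_cplx);
    return gpu_ms;
}

int main() {
    const int reps = 1000;  // assumed repetition count
    for (int n = 64; n <= 8192; n *= 2) {
        double host_ms = 0.0;
        float gpu_ms = time_roundtrip(n, reps, &host_ms);
        printf("n=%5d  gpu=%8.3f ms  host=%8.3f ms\n", n, gpu_ms, host_ms);
    }
    return 0;
}
```

Building the same file against each toolkit and comparing the two columns per size should show whether the difference sits in the GPU execution itself or in host-side overhead.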
Interestingly, when I profile the application using the NVIDIA
Visual Profiler, a slightly improved runtime is reported with
toolkit 3.2, in contrast to what I measure when I time the
application myself.
Furthermore, though less of a problem for me, I’ve noticed
somewhat higher memory consumption with 3.2.
Has anyone seen similar behavior?
GeForce GTX 480
Ubuntu, 2.6.32 kernel, 64-bit