I’m using the cuFFTDx library and noticed that the amount of shared memory an FFT requires increased between versions 0.3.0 and 1.1.0.
The FFT is declared as follows:
#include <cufftdx.hpp>
#include <stdio.h>
using namespace cufftdx;
int main() {
    // base FFT for comparison
    using FFT = decltype(Size<8192>() + Precision<float>() + Type<fft_type::r2c>()
                         + Direction<fft_direction::forward>() + FFTsPerBlock<1>() + ElementsPerThread<32>()
                         + SM<700>() + Block());
    ...
    // shared_memory_size is an unsigned value in bytes, so print it with %u
    printf("%u\n", FFT::shared_memory_size);
}
Compiling with cuFFTDx 0.3.0, I get the following shared memory sizes (in bytes):
Reference memory usage (SM<700>, Size<8192>, ElementsPerThread<32>, Type<fft_type::r2c>)
32768
ElementsPerThread<x> (SM<700>, Size<8192>, Type<fft_type::r2c>)
ElementsPerThread<32>: **32768**
ElementsPerThread<16>: 32768
ElementsPerThread<8> : 32768
Size<x> (SM<700>, ElementsPerThread<32>, Type<fft_type::r2c>)
Size<2048> : 16384
Size<4096> : 16384
Size<8192> : 32768
Size<16384>: 65536
Size<x> (SM<800>, ElementsPerThread<32>, Type<fft_type::r2c>)
Size<2048> : 16384
Size<4096> : 16384
Size<8192> : 32768
Size<16384>: 65536
Size<32768>: 131072
With cuFFTDx 1.1.0 (MathDx 22.1), I get the following shared memory sizes (in bytes):
Reference memory usage (SM<700>, Size<8192>, ElementsPerThread<32>, Type<fft_type::r2c>)
65536
ElementsPerThread<x> (SM<700>, Size<8192>, Type<fft_type::r2c>)
ElementsPerThread<32>: **65536**
ElementsPerThread<16>: 32768
ElementsPerThread<8> : 65536
Size<x> (SM<700>, ElementsPerThread<32>, Type<fft_type::r2c>)
Size<2048> : 16384
Size<4096> : 32768
Size<8192> : 65536
Size<16384>: 65536
Size<x> (SM<800>, ElementsPerThread<32>, Type<fft_type::r2c>)
Size<2048> : 16384
Size<4096> : 32768
Size<8192> : 65536
Size<16384>: 131072
Size<32768>: 131072
The doubling of shared memory for the size-8192 FFT (bolded above) was my initial concern. While investigating, I also found that in cuFFTDx 1.1.0 the shared memory size depends on ElementsPerThread in an unexpected way (ElementsPerThread<16> needs 32768 bytes, while both ElementsPerThread<8> and ElementsPerThread<32> need 65536) and on the target architecture (Size<16384> needs 65536 bytes on SM<700> but 131072 on SM<800>). We’d like to better understand what causes the increase across versions, as well as the unexpected behavior with ElementsPerThread.
I isolated the issue in a small test program that generates the numbers above; the .cu file is attached. I built it against each cuFFTDx version separately using a Makefile.
BUILD INFO:
gcc 9.2.0
nvcc 11.2
shared_mem.cu (3.5 KB)