Shared memory size of cuFFTDx: 0.3.0 vs 1.1.0

I’m using the cuFFTDx library and noticed an increase in the amount of shared memory from version 0.3.0 to 1.1.0.

The FFT is declared as follows:

#include <cufftdx.hpp>
#include <stdio.h>

using namespace cufftdx;

int main() {
    // base FFT for comparison
    using FFT = decltype(Size<8192>() + Precision<float>() + Type<fft_type::r2c>()
        + Direction<fft_direction::forward>() + FFTsPerBlock<1>() + ElementsPerThread<32>()
        + SM<700>() + Block());
    ...
    // shared_memory_size is an unsigned value, hence %u
    printf("%u\n", FFT::shared_memory_size);
}

Compiling with cuFFTDx 0.3.0 I get the following memory usages:

Reference memory usage (SM<700>, Size<8192>, ElementsPerThread<32>, Type<fft_type::r2c>)
        32768
ElementsPerThread<x> (SM<700>, Size<8192>, Type<fft_type::r2c>)
        ElementsPerThread<32>: **32768**
        ElementsPerThread<16>: 32768
        ElementsPerThread<8> : 32768
Size<x> (SM<700>, ElementsPerThread<32>, Type<fft_type::r2c>)
        Size<2048> : 16384
        Size<4096> : 16384
        Size<8192> : 32768
        Size<16384>: 65536
Size<x> (SM<800>, ElementsPerThread<32>, Type<fft_type::r2c>)
        Size<2048> : 16384
        Size<4096> : 16384
        Size<8192> : 32768
        Size<16384>: 65536
        Size<32768>: 131072

With cuFFTDx 1.1.0 (MathDx 22.1) I get the following memory usages:

Reference memory usage (SM<700>, Size<8192>, ElementsPerThread<32>, Type<fft_type::r2c>)
        65536
ElementsPerThread<x> (SM<700>, Size<8192>, Type<fft_type::r2c>)
        ElementsPerThread<32>: **65536**
        ElementsPerThread<16>: 32768
        ElementsPerThread<8> : 65536
Size<x> (SM<700>, ElementsPerThread<32>, Type<fft_type::r2c>)
        Size<2048> : 16384
        Size<4096> : 32768
        Size<8192> : 65536
        Size<16384>: 65536
Size<x> (SM<800>, ElementsPerThread<32>, Type<fft_type::r2c>)
        Size<2048> : 16384
        Size<4096> : 32768
        Size<8192> : 65536
        Size<16384>: 131072
        Size<32768>: 131072

The doubled memory usage of size 8192 FFTs was the initial cause of concern (bolded). From this I also found unexpected differences depending on the number of elements per thread in cuFFTDx 1.1.0 as well as differences depending on the architecture. We’d like to better understand the cause of the increase in memory usage across versions, as well as the unexpected behavior with ElementsPerThread.

I isolated the issue and generated these tests; the .cu file is attached. I compiled against the separate versions of cuFFTDx using a Makefile.

BUILD INFO:
gcc 9.2.0
nvcc 11.2

shared_mem.cu (3.5 KB)

It’s likely that the optimal configuration, and thus the default one, changed when we went from 0.3.0 to 1.0.0.

And something definite? What do you mean by “optimal configuration”? Given elements per thread, FFT size and SM, what other variables are in play here?
Is there a way to get the previous behaviour?

I would expect there to be a very good reason to double the shared memory usage from one version to another, as it can mean halving the number of kernels, or thread blocks of the same kernel, that execute concurrently.

Just to make sure we are on the same page, I think this is the biggest issue:
Size<8192> (SM<700>, ElementsPerThread<32>, Type<fft_type::r2c>)
        v0.3.0: 32768
        v1.1.0: 65536

The SMEM increase is due to a database optimization that selects the optimal implementation. The expectation is that the new configuration is more performant. If you want to reduce the amount of shared memory used, you should modify the number of ElementsPerThread.

By “modify” you mean reduce it to 16 in this case. But this has its own implication: it will need twice as many threads per thread block.

How do I get the old behaviour with the new version? Clearly the library is making a decision about a trade-off that might not be beneficial in my case (and for other device functions that are part of the same kernel).

Choosing the implementation is not configurable by the user at this moment. I’ll pass this request along to product management.

cuFFTDx will not offer the ability to choose an implementation in the “traditional” sense (we may change the algorithm as long as it stays within the parameters prescribed by the user). We will add an operator that lets you express a preference or limit for shared memory usage (no ETA).

As a workaround to get the previous behavior, please swap lines 292 and 293 in 1.1.0/include/database/records/700/database_fp32_fwd.hpp.inc

Currently the file reads:

    block_fft_implementation<32,   32,  256,     1, 65536,   110>,
    block_fft_implementation<32,   32,  256,     1, 32768,   111>,

It should read as follows to get the previous behavior:

    block_fft_implementation<32,   32,  256,     1, 32768,   111>,
    block_fft_implementation<32,   32,  256,     1, 65536,   110>,

Best regards,
Lukasz Ligowski
