FFT runs slower than expected on Jetson Orin Nano Super

ajones65fq0 · July 18, 2025, 3:45pm

FFT takes about 40% longer than expected on the Jetson Orin Nano Super. It has 1024 cores running at 1020 MHz, so by core count times clock rate it should be about 2.6x slower than my RTX 2060, which has 1920 cores running at 1395 MHz. In fact I find it runs 3.6x slower. What might be the underlying reason for this? Is there anything we could do to tune it? It makes a big difference in our time-constrained application.
Test code (I am timing only the loop, not the setup):

#include <chrono>
#include <iostream>
#include <cuda_runtime.h>
#include <cufft.h>

// Build a batched 2D plan over contiguous, tightly packed data.
cufftHandle getCufftPlan(size_t rows, size_t cols, int nbatch, cufftType type) {
    cufftHandle handle;
    constexpr int rank = 2;
    constexpr int stride = 1;
    int dims[rank] = { (int)rows, (int)cols };
    int vol = dims[0] * dims[1];  // elements per transform (distance between batches)
    cufftPlanMany(&handle, rank, dims, dims, stride, vol, dims, stride, vol, type, nbatch);
    return handle;
}

int main(int argc, char** argv) {
    int w = 4912/2;
    int h = 3684/2;
    cufftComplex* input;
    cufftComplex* output;
    cudaMalloc(&input, w * h * sizeof(cufftComplex));
    cudaMalloc(&output, w * h * sizeof(cufftComplex));
    cudaMemset(input, 0x00, w * h * sizeof(cufftComplex));
    cudaMemset(output, 0x00, w * h * sizeof(cufftComplex));
    cufftHandle plan = getCufftPlan(h, w, 1, CUFFT_C2C);

    // Make sure the setup work has finished before starting the clock.
    cudaDeviceSynchronize();
    auto time0 = std::chrono::high_resolution_clock::now();
    for (int i=0; i<100; ++i) {
        cufftExecC2C(plan, input, output, CUFFT_FORWARD);
    }
    // cufftExecC2C is asynchronous; wait for the GPU before stopping the clock.
    cudaDeviceSynchronize();
    auto time1 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(time1 - time0);
    std::cout << duration.count() << std::endl;

    cufftDestroy(plan);
    cudaFree(output);
    cudaFree(input);
    return 0;
}
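One thing that can be checked from the test code alone is how the two transform sizes factor. The sketch below is plain host-side C++ with hypothetical helper names. It assumes the general guidance in the cuFFT documentation that sizes of the form 2^a·3^b·5^c·7^d are the best optimized and that larger prime factors tend to run slower. It does not explain the gap between the two GPUs by itself, since the same sizes are used on both.

```cpp
#include <cstdio>
#include <vector>

// Return the prime factors of n in ascending order, with repetition.
std::vector<int> primeFactors(int n) {
    std::vector<int> factors;
    for (int p = 2; (long long)p * p <= n; ++p) {
        while (n % p == 0) {
            factors.push_back(p);
            n /= p;
        }
    }
    if (n > 1) factors.push_back(n);  // whatever remains is itself prime
    return factors;
}

// True if every prime factor of n is 2, 3, 5, or 7.
bool hasOnlySmallPrimes(int n) {
    for (int f : primeFactors(n)) {
        if (f > 7) return false;
    }
    return true;
}

// Smallest size >= n whose prime factors are all <= 7
// (a candidate padded size, if padding is acceptable).
int nextSmallPrimeSize(int n) {
    while (!hasOnlySmallPrimes(n)) ++n;
    return n;
}

// Print the factorization of one dimension and a candidate padded size.
void reportSize(const char* name, int n) {
    std::printf("%s = %d =", name, n);
    for (int f : primeFactors(n)) std::printf(" %d", f);
    std::printf("  (candidate padded size: %d)\n", nextSmallPrimeSize(n));
}
```

For the sizes in the test code, `primeFactors(4912/2)` yields 2, 2, 2, 307 and `primeFactors(3684/2)` yields 2, 3, 307, so both dimensions contain the prime factor 307. Whether padding or cropping to a size with small prime factors is acceptable depends on the application, because it changes the transform being computed.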

AastaLLL · July 21, 2025, 5:19am

Hi,

Do you want to compare the performance of the RTX 2060 and the Orin Nano?
For FFT, you need to compare FP32 performance rather than GPU core count and clock rate.

Please note that floating-point operations tend to be slow on Jetson devices,
since the platform is designed for edge AI, which tends to use low-precision operations like int8.

Thanks.

system · August 13, 2025, 1:46am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.
