Challenges in Achieving Optimal GPU Performance for FFT on NVIDIA Jetson AGX Orin

yahya.girgin · August 14, 2024, 6:24pm

Hello NVIDIA Community,

I’m working on optimizing an FFT algorithm on the NVIDIA Jetson AGX Orin for signal processing applications, particularly in the context of radar data analysis for my company. While GPUs are generally considered advantageous for parallel processing tasks, I’m encountering some unexpected performance results in my benchmarks.

Algorithm:FFT, implemented using cuFFT with both single and dual GPU setups.
Dataset Sizes: Ranging from small to large (up to 33 million points)
Performance Observations:

The CPU consistently outperforms the GPU, especially for smaller datasets.
Single GPU execution is significantly slower than CPU, even for larger datasets.
Dual GPU setup shows better performance than single GPU but still does not consistently outperform the CPU.

Key Challenges & Findings:

Memory Transfer Overhead:The time taken to transfer data between CPU and GPU seems to negate the benefits of parallel processing on the GPU, particularly for smaller datasets.
Kernel Launch and Synchronization Overheads:These overheads appear to be substantial, especially when the dataset size does not fully utilize the GPU’s parallel capabilities.
Algorithm Characteristics:The FFT algorithm might not be fully leveraging the GPU’s architecture, particularly in terms of parallelization and memory management.
Suboptimal Scaling:As the dataset size increases, the expected scaling benefits on the GPU are not realized, suggesting possible issues with how the FFT is implemented or parallelized.

Questions:

GPU vs. CPU for FFT: Why might the GPU, particularly on the Jetson AGX Orin, underperform compared to the CPU for FFT operations? Are there inherent limitations or overheads in GPU-based FFT that could explain this?
Optimization Techniques: What specific optimization techniques would you recommend to improve GPU performance for FFT, especially in the context of signal processing and radar applications? Are there advanced features in cuFFT that could better leverage the GPU for large-scale FFTs?
Best Practices: Could you provide guidance on best practices for minimizing memory transfer overheads and optimizing kernel execution for FFT on the Jetson platform?

AastaLLL · August 15, 2024, 2:52am

Hi,

1. We don’t find FFT runs slower on Orin’s GPU.
Could you share a reproducible sample so we can check it further?

2. Let’s double-check the question.1 first.
3. Jetson is a shared memory system so it’s possible to have some zero-copy memory.
You can find more details in this document.

For your reference, we also have some cuFFT samples below:

Thanks.

yahya.girgin · August 15, 2024, 6:31am

Hi AastaLLL,

Thank you for your suggestions. As requested, I’ve included a sample of the code that I’m using for the FFT operations on the Jetson AGX Orin. The code covers the initialization, execution on both single and dual GPUs, and verification steps.

include <cuda_runtime.h>
include <cufft.h>
include
include
include
include

define N 16777216 // Example larger dataset size

// Function to initialize data
void initializeData(cufftComplex* data, int n) {
for (int i = 0; i < n; i++) {
data[i].x = cos(2 * M_PI * i / n);
data[i].y = 0.0f;
}
}

// Function to normalize data after inverse FFT
global void normalizeData(cufftComplex* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
data[idx].x /= n;
data[idx].y /= n;
}
}

// Function to verify results
bool verifyResult(cufftComplex* original, cufftComplex* result, int n) {
for (int i = 0; i < n; i++) {
if (fabs(original[i].x - result[i].x) > 1e-5 || fabs(original[i].y - result[i].y) > 1e-5) {
std::cout << "Verification FAILED at index " << i << std::endl;
std::cout << “Expected: (” << original[i].x << ", " << original[i].y << “) Got: (” << result[i].x << ", " << result[i].y << “)” << std::endl;
return false;
}
}
return true;
}

int main() {
// Allocate pinned host memory for faster transfers
cufftComplex* h_data;
cufftComplex* h_result;
cudaMallocHost(&h_data, sizeof(cufftComplex) * N);
cudaMallocHost(&h_result, sizeof(cufftComplex) * N);

// Initialize data
initializeData(h_data, N);

// Allocate device memory
cufftComplex *d_data1, *d_data2, *d_data_single;
cudaSetDevice(0);
cudaMalloc((void**)&d_data1, sizeof(cufftComplex) * (N / 2));
cudaMalloc((void**)&d_data_single, sizeof(cufftComplex) * N);
cudaSetDevice(1);
cudaMalloc((void**)&d_data2, sizeof(cufftComplex) * (N / 2));

// Create FFT plans
cufftHandle plan1, plan2, plan_single;
cudaSetDevice(0);
cufftPlan1d(&plan1, N / 2, CUFFT_C2C, 1);
cufftPlan1d(&plan_single, N, CUFFT_C2C, 1);
cudaSetDevice(1);
cufftPlan1d(&plan2, N / 2, CUFFT_C2C, 1);

// CUDA streams for overlapping memory transfers and computation
cudaStream_t stream1, stream2, stream_single;
cudaSetDevice(0);
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream_single);
cudaSetDevice(1);
cudaStreamCreate(&stream2);

// Measure dual GPU execution time
double totalDualGpuTime = 0.0;
const int runs = 10;

for (int i = 0; i < runs; ++i) {
    auto start = std::chrono::high_resolution_clock::now();

    // Execute FFT on dual GPUs
    cudaSetDevice(0);
    cudaMemcpyAsync(d_data1, h_data, sizeof(cufftComplex) * (N / 2), cudaMemcpyHostToDevice, stream1);
    cufftExecC2C(plan1, d_data1, d_data1, CUFFT_FORWARD);
    cufftExecC2C(plan1, d_data1, d_data1, CUFFT_INVERSE);
    normalizeData<<<(N / 2 + 255) / 256, 256>>>(d_data1, N / 2);

    cudaSetDevice(1);
    cudaMemcpyAsync(d_data2, h_data + (N / 2), sizeof(cufftComplex) * (N / 2), cudaMemcpyHostToDevice, stream2);
    cufftExecC2C(plan2, d_data2, d_data2, CUFFT_FORWARD);
    cufftExecC2C(plan2, d_data2, d_data2, CUFFT_INVERSE);
    normalizeData<<<(N / 2 + 255) / 256, 256>>>(d_data2, N / 2);

    cudaSetDevice(0);
    cudaMemcpyAsync(h_result, d_data1, sizeof(cufftComplex) * (N / 2), cudaMemcpyDeviceToHost, stream1);

    cudaSetDevice(1);
    cudaMemcpyAsync(h_result + (N / 2), d_data2, sizeof(cufftComplex) * (N / 2), cudaMemcpyDeviceToHost, stream2);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    auto end = std::chrono::high_resolution_clock::now();
    totalDualGpuTime += std::chrono::duration<double>(end - start).count();
}

// Average dual GPU execution time
std::cout << "Average Dual GPU Execution Time over " << runs << " runs: " << (totalDualGpuTime / runs) << " seconds." << std::endl;

// Measure single GPU execution time
double totalSingleGpuTime = 0.0;
for (int i = 0; i < runs; ++i) {
    auto start = std::chrono::high_resolution_clock::now();

    cudaMemcpyAsync(d_data_single, h_data, sizeof(cufftComplex) * N, cudaMemcpyHostToDevice, stream_single);
    cufftExecC2C(plan_single, d_data_single, d_data_single, CUFFT_FORWARD);
    cufftExecC2C(plan_single, d_data_single, d_data_single, CUFFT_INVERSE);
    normalizeData<<<(N + 255) / 256, 256>>>(d_data_single, N);
    cudaMemcpyAsync(h_result, d_data_single, sizeof(cufftComplex) * N, cudaMemcpyDeviceToHost, stream_single);
    cudaStreamSynchronize(stream_single);

    auto end = std::chrono::high_resolution_clock::now();
    totalSingleGpuTime += std::chrono::duration<double>(end - start).count();
}

// Average single GPU execution time
std::cout << "Average Single GPU Execution Time over " << runs << " runs: " << (totalSingleGpuTime / runs) << " seconds." << std::endl;

// Verify result
bool isCorrect = verifyResult(h_data, h_result, N);
if (isCorrect) {
    std::cout << "FFT verification PASSED" << std::endl;
} else {
    std::cout << "FFT verification FAILED" << std::endl;
}

// Measure CPU execution time
double totalCpuTime = 0.0;
for (int i = 0; i < runs; ++i) {
    auto start = std::chrono::high_resolution_clock::now();

    // Simple FFT on CPU (using a naive implementation)
    for (int j = 0; j < N; j++) {
        h_result[j].x = h_data[j].x;  // Placeholder for CPU computation
        h_result[j].y = h_data[j].y;  // Placeholder for CPU computation
    }

    auto end = std::chrono::high_resolution_clock::now();
    totalCpuTime += std::chrono::duration<double>(end - start).count();
}

std::cout << "Average CPU FFT Execution Time over " << runs << " runs: " << (totalCpuTime / runs) << " seconds." << std::endl;

// Cleanup
cufftDestroy(plan1);
cufftDestroy(plan2);
cufftDestroy(plan_single);
cudaFree(d_data1);
cudaFree(d_data2);
cudaFree(d_data_single);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
cudaStreamDestroy(stream_single);
cudaFreeHost(h_data);
cudaFreeHost(h_result);

return 0;

}

I’m particularly interested in any recommendations you might have for optimizing this code, especially in terms of:

Minimizing the memory transfer overhead between CPU and GPU. Ensuring that the FFT plan is optimally configured for the Jetson AGX Orin architecture.
Improving the overall execution time on the GPU, particularly in dual GPU mode.

every time I applied the fft algorithm on the nvidia jetson agx orin in my code, the performance of the cpu was better than the gpu, although the performance of the gpu is better in the code I just added, but after the different methods I tried, the performance results are very close.

Thank you again for your assistance. I look forward to your feedback.

Best regards,

AastaLLL · August 19, 2024, 6:46am

Hi,

Thanks for sharing the sample.
We will give it a check and provide more info to you later.

AastaLLL · August 28, 2024, 8:45am

Hi,

Thanks for your patience.

We tried to reproduce your issue but the source you shared combines image and source and is not easy to test.
Would you mind attaching the file directly?

When running convolutionFFT2D sample on Orin + JetPack 6, we can get 3091 MPix/s with GPU.
While the reference CPU version is only 3.721550 MPix/s.

Thanks.

system · October 9, 2024, 2:54am

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.

Topic		Replies	Views
Multiple batches of 1D FFT using cuFFT GPU-Accelerated Libraries	10	4932	October 29, 2019
cuFFT Timing Jetson TX2	14	2436	October 18, 2021
2D-FFT Benchmarks on Jetson AGX with various precisions Jetson AGX Xavier cuda	6	2705	October 18, 2021
cufft doubt comparing r2c and c2c 2D FFTs CUDA Programming and Performance	28	13479	October 27, 2010
cufftXt batch 1D GPU-Accelerated Libraries	12	2128	October 15, 2019
Poor CUFFT Performance? Am I doing something wrong? CUDA Programming and Performance	15	15484	May 4, 2010
Trouble with cuFFT on multiple GPUs GPU-Accelerated Libraries	13	3663	August 26, 2017
FFT Speed vs. x86 CUDA Programming and Performance	14	24639	July 27, 2008
How to show CuFFT routines show higher performance than normal MATLAB fft() in terms of time taken. CUDA Programming and Performance	13	3154	July 10, 2014
Bad Performance of CUFFT library? compilation flags for optimizing fft performance CUDA Programming and Performance	11	13482	February 17, 2012

Challenges in Achieving Optimal GPU Performance for FFT on NVIDIA Jetson AGX Orin

Related topics