Challenges in Achieving Optimal GPU Performance for FFT on NVIDIA Jetson AGX Orin

Hello NVIDIA Community,

I’m working on optimizing an FFT algorithm on the NVIDIA Jetson AGX Orin for signal processing applications, particularly in the context of radar data analysis for my company. While GPUs are generally considered advantageous for parallel processing tasks, I’m encountering some unexpected performance results in my benchmarks.

Algorithm:FFT, implemented using cuFFT with both single and dual GPU setups.
Dataset Sizes: Ranging from small to large (up to 33 million points)
Performance Observations:

  • The CPU consistently outperforms the GPU, especially for smaller datasets.
  • Single GPU execution is significantly slower than CPU, even for larger datasets.
  • Dual GPU setup shows better performance than single GPU but still does not consistently outperform the CPU.

Key Challenges & Findings:

  1. Memory Transfer Overhead:The time taken to transfer data between CPU and GPU seems to negate the benefits of parallel processing on the GPU, particularly for smaller datasets.
  2. Kernel Launch and Synchronization Overheads:These overheads appear to be substantial, especially when the dataset size does not fully utilize the GPU’s parallel capabilities.
  3. Algorithm Characteristics:The FFT algorithm might not be fully leveraging the GPU’s architecture, particularly in terms of parallelization and memory management.
  4. Suboptimal Scaling:As the dataset size increases, the expected scaling benefits on the GPU are not realized, suggesting possible issues with how the FFT is implemented or parallelized.

Questions:

  1. GPU vs. CPU for FFT: Why might the GPU, particularly on the Jetson AGX Orin, underperform compared to the CPU for FFT operations? Are there inherent limitations or overheads in GPU-based FFT that could explain this?
  2. Optimization Techniques: What specific optimization techniques would you recommend to improve GPU performance for FFT, especially in the context of signal processing and radar applications? Are there advanced features in cuFFT that could better leverage the GPU for large-scale FFTs?
  3. Best Practices: Could you provide guidance on best practices for minimizing memory transfer overheads and optimizing kernel execution for FFT on the Jetson platform?

Hi,

1. We don’t find FFT runs slower on Orin’s GPU.
Could you share a reproducible sample so we can check it further?

2. Let’s double-check the question.1 first.
3. Jetson is a shared memory system so it’s possible to have some zero-copy memory.
You can find more details in this document.

For your reference, we also have some cuFFT samples below:

Thanks.

Hi AastaLLL,

Thank you for your suggestions. As requested, I’ve included a sample of the code that I’m using for the FFT operations on the Jetson AGX Orin. The code covers the initialization, execution on both single and dual GPUs, and verification steps.
image

include <cuda_runtime.h>
include <cufft.h>
include
include
include
include

define N 16777216 // Example larger dataset size

// Function to initialize data
void initializeData(cufftComplex* data, int n) {
for (int i = 0; i < n; i++) {
data[i].x = cos(2 * M_PI * i / n);
data[i].y = 0.0f;
}
}

// Function to normalize data after inverse FFT
global void normalizeData(cufftComplex* data, int n) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < n) {
data[idx].x /= n;
data[idx].y /= n;
}
}

// Function to verify results
bool verifyResult(cufftComplex* original, cufftComplex* result, int n) {
for (int i = 0; i < n; i++) {
if (fabs(original[i].x - result[i].x) > 1e-5 || fabs(original[i].y - result[i].y) > 1e-5) {
std::cout << "Verification FAILED at index " << i << std::endl;
std::cout << “Expected: (” << original[i].x << ", " << original[i].y << “) Got: (” << result[i].x << ", " << result[i].y << “)” << std::endl;
return false;
}
}
return true;
}

int main() {
// Allocate pinned host memory for faster transfers
cufftComplex* h_data;
cufftComplex* h_result;
cudaMallocHost(&h_data, sizeof(cufftComplex) * N);
cudaMallocHost(&h_result, sizeof(cufftComplex) * N);

// Initialize data
initializeData(h_data, N);

// Allocate device memory
cufftComplex *d_data1, *d_data2, *d_data_single;
cudaSetDevice(0);
cudaMalloc((void**)&d_data1, sizeof(cufftComplex) * (N / 2));
cudaMalloc((void**)&d_data_single, sizeof(cufftComplex) * N);
cudaSetDevice(1);
cudaMalloc((void**)&d_data2, sizeof(cufftComplex) * (N / 2));

// Create FFT plans
cufftHandle plan1, plan2, plan_single;
cudaSetDevice(0);
cufftPlan1d(&plan1, N / 2, CUFFT_C2C, 1);
cufftPlan1d(&plan_single, N, CUFFT_C2C, 1);
cudaSetDevice(1);
cufftPlan1d(&plan2, N / 2, CUFFT_C2C, 1);

// CUDA streams for overlapping memory transfers and computation
cudaStream_t stream1, stream2, stream_single;
cudaSetDevice(0);
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream_single);
cudaSetDevice(1);
cudaStreamCreate(&stream2);

// Measure dual GPU execution time
double totalDualGpuTime = 0.0;
const int runs = 10;

for (int i = 0; i < runs; ++i) {
    auto start = std::chrono::high_resolution_clock::now();

    // Execute FFT on dual GPUs
    cudaSetDevice(0);
    cudaMemcpyAsync(d_data1, h_data, sizeof(cufftComplex) * (N / 2), cudaMemcpyHostToDevice, stream1);
    cufftExecC2C(plan1, d_data1, d_data1, CUFFT_FORWARD);
    cufftExecC2C(plan1, d_data1, d_data1, CUFFT_INVERSE);
    normalizeData<<<(N / 2 + 255) / 256, 256>>>(d_data1, N / 2);

    cudaSetDevice(1);
    cudaMemcpyAsync(d_data2, h_data + (N / 2), sizeof(cufftComplex) * (N / 2), cudaMemcpyHostToDevice, stream2);
    cufftExecC2C(plan2, d_data2, d_data2, CUFFT_FORWARD);
    cufftExecC2C(plan2, d_data2, d_data2, CUFFT_INVERSE);
    normalizeData<<<(N / 2 + 255) / 256, 256>>>(d_data2, N / 2);

    cudaSetDevice(0);
    cudaMemcpyAsync(h_result, d_data1, sizeof(cufftComplex) * (N / 2), cudaMemcpyDeviceToHost, stream1);

    cudaSetDevice(1);
    cudaMemcpyAsync(h_result + (N / 2), d_data2, sizeof(cufftComplex) * (N / 2), cudaMemcpyDeviceToHost, stream2);

    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    auto end = std::chrono::high_resolution_clock::now();
    totalDualGpuTime += std::chrono::duration<double>(end - start).count();
}

// Average dual GPU execution time
std::cout << "Average Dual GPU Execution Time over " << runs << " runs: " << (totalDualGpuTime / runs) << " seconds." << std::endl;

// Measure single GPU execution time
double totalSingleGpuTime = 0.0;
for (int i = 0; i < runs; ++i) {
    auto start = std::chrono::high_resolution_clock::now();

    cudaMemcpyAsync(d_data_single, h_data, sizeof(cufftComplex) * N, cudaMemcpyHostToDevice, stream_single);
    cufftExecC2C(plan_single, d_data_single, d_data_single, CUFFT_FORWARD);
    cufftExecC2C(plan_single, d_data_single, d_data_single, CUFFT_INVERSE);
    normalizeData<<<(N + 255) / 256, 256>>>(d_data_single, N);
    cudaMemcpyAsync(h_result, d_data_single, sizeof(cufftComplex) * N, cudaMemcpyDeviceToHost, stream_single);
    cudaStreamSynchronize(stream_single);

    auto end = std::chrono::high_resolution_clock::now();
    totalSingleGpuTime += std::chrono::duration<double>(end - start).count();
}

// Average single GPU execution time
std::cout << "Average Single GPU Execution Time over " << runs << " runs: " << (totalSingleGpuTime / runs) << " seconds." << std::endl;

// Verify result
bool isCorrect = verifyResult(h_data, h_result, N);
if (isCorrect) {
    std::cout << "FFT verification PASSED" << std::endl;
} else {
    std::cout << "FFT verification FAILED" << std::endl;
}

// Measure CPU execution time
double totalCpuTime = 0.0;
for (int i = 0; i < runs; ++i) {
    auto start = std::chrono::high_resolution_clock::now();

    // Simple FFT on CPU (using a naive implementation)
    for (int j = 0; j < N; j++) {
        h_result[j].x = h_data[j].x;  // Placeholder for CPU computation
        h_result[j].y = h_data[j].y;  // Placeholder for CPU computation
    }

    auto end = std::chrono::high_resolution_clock::now();
    totalCpuTime += std::chrono::duration<double>(end - start).count();
}

std::cout << "Average CPU FFT Execution Time over " << runs << " runs: " << (totalCpuTime / runs) << " seconds." << std::endl;

// Cleanup
cufftDestroy(plan1);
cufftDestroy(plan2);
cufftDestroy(plan_single);
cudaFree(d_data1);
cudaFree(d_data2);
cudaFree(d_data_single);
cudaStreamDestroy(stream1);
cudaStreamDestroy(stream2);
cudaStreamDestroy(stream_single);
cudaFreeHost(h_data);
cudaFreeHost(h_result);

return 0;

}

I’m particularly interested in any recommendations you might have for optimizing this code, especially in terms of:

Minimizing the memory transfer overhead between CPU and GPU. Ensuring that the FFT plan is optimally configured for the Jetson AGX Orin architecture.
Improving the overall execution time on the GPU, particularly in dual GPU mode.

every time I applied the fft algorithm on the nvidia jetson agx orin in my code, the performance of the cpu was better than the gpu, although the performance of the gpu is better in the code I just added, but after the different methods I tried, the performance results are very close.

Thank you again for your assistance. I look forward to your feedback.

Best regards,

Hi,

Thanks for sharing the sample.
We will give it a check and provide more info to you later.

Hi,

Thanks for your patience.

We tried to reproduce your issue but the source you shared combines image and source and is not easy to test.
Would you mind attaching the file directly?

When running convolutionFFT2D sample on Orin + JetPack 6, we can get 3091 MPix/s with GPU.
While the reference CPU version is only 3.721550 MPix/s.

Thanks.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.