Jetson AGX Xavier Shared Memory GPU Signal Processing

I’m working on a real-time pulse detection and parameter extraction algorithm (PDW extractor) that processes IQ data. I’ve implemented a version that uses GPU acceleration (using CuPy and cuSignal) for parts of the pipeline (IQ conversion and magnitude calculation) and CPU for FFT processing. However, my current code still has bottlenecks because the FFT is performed on the CPU and there are several GPU–CPU transfers per pulse.

I’d like to accelerate the entire pipeline by performing as much work as possible on the GPU and fully parallelizing the processing. Specifically, I’m interested in advice on designing a GPU-side parallelization pipeline to:

  • Keep data resident in GPU memory using shared memory (via cusignal.get_shared_mem),
  • Process multiple pulses in parallel (batch extraction and FFT on the GPU), and
  • Minimize data transfers between GPU and CPU.

Below is my offline test version of the code, for reference:

import numpy as np
import cupy as cp
import cusignal
import time
from scipy import signal  # (optional, if you need additional signal processing)

# --------------------------
# 1. Load Data and Set Up GPU Shared Memory
# --------------------------

# Load 1M samples from file (assumes file "pri_1ms" exists)
sc16q11_data = np.fromfile("pri_1ms", dtype=np.int16)[:1_000_000]

# Create a GPU shared memory array using cuSignal
gpu_buffer = cusignal.get_shared_mem(1_000_000, dtype=np.int16)
gpu_buffer[:] = sc16q11_data

# --------------------------
# 2. GPU IQ Conversion and Magnitude Calculation
# --------------------------

# Convert the shared memory array to a CuPy array and reshape into IQ pairs
iq = cp.asarray(gpu_buffer).astype(cp.float32).reshape(-1, 2)
iq /= 2048.0  # Q11 scaling

# Convert to complex64 (IQ format) and compute magnitude on GPU
complex64_data = iq[:, 0] + 1j * iq[:, 1]
magnitude_data = cp.abs(complex64_data)

# --------------------------
# 3. Pulse Detection on GPU then Transfer Mask to CPU
# --------------------------

# Define threshold and other parameters
threshold = 0.1
samp_rate = 10e6       # Sample rate: 10 MHz
center_freq = 1e9      # Center frequency: 1 GHz

# Create a threshold mask on GPU
threshold_mask = magnitude_data >= threshold

# Transfer the threshold mask to CPU for edge detection using np.diff
threshold_mask_cpu = cp.asnumpy(threshold_mask)

# Find rising and falling edges of the pulses
diff = np.diff(threshold_mask_cpu.astype(np.int8))
rising = np.where(diff == 1)[0] + 1
falling = np.where(diff == -1)[0] + 1

# Handle edge cases if the first or last sample is above threshold
if threshold_mask_cpu[0]:
    rising = np.insert(rising, 0, 0)
if threshold_mask_cpu[-1]:
    falling = np.append(falling, len(threshold_mask_cpu) - 1)

# Ensure we have valid rising and falling pairs
if rising.size > 0 and falling.size > 0 and rising[0] > falling[0]:
    falling = falling[1:]

min_len = min(len(rising), len(falling))
rising = rising[:min_len]
falling = falling[:min_len]

# --------------------------
# 4. Set Up FFT Parameters (to Run on CPU)
# --------------------------

fft_size = 4096

# Compute a Hann window on CPU (since FFT will run on CPU)
window_cpu = np.hanning(fft_size).astype(np.float32)

# --------------------------
# 5. Process Each Pulse: GPU Extraction, Transfer to CPU, and CPU FFT
# --------------------------

pulse_count = 0
total_start = time.perf_counter()

for i in range(min_len):
    start = rising[i]
    end = falling[i]
    
    # Skip if indices are invalid or out-of-bounds
    if start >= end or end > len(complex64_data):
        continue
    
    pulse_length = end - start
    # Ignore pulses that are too short (e.g., noise)
    if pulse_length < 10:
        continue

    # --------------------------
    # a) Transfer Pulse Samples from GPU to CPU
    # --------------------------
    # We only transfer the small segment (the pulse) to CPU memory
    pulse_samples_cpu = cp.asnumpy(complex64_data[start:end])
    
    # --------------------------
    # b) Prepare the FFT Input on CPU with Windowing and Zero-Padding
    # --------------------------
    fft_input = np.zeros(fft_size, dtype=np.complex64)
    length = int(min(pulse_length, fft_size))
    # Multiply the pulse with the window before the FFT
    fft_input[:length] = pulse_samples_cpu[:length] * window_cpu[:length]
    
    # --------------------------
    # c) Perform FFT on CPU
    # --------------------------
    fft_start = time.perf_counter()
    fft_result = np.fft.fft(fft_input)
    fft_mag = np.abs(fft_result)
    peak_bin = np.argmax(fft_mag)
    fft_time = (time.perf_counter() - fft_start) * 1000  # in ms
    
    # --------------------------
    # d) Compute Pulse Parameters
    # --------------------------
    # Compute average amplitude (transfer computed value from GPU to CPU)
    amp = cp.mean(magnitude_data[start:end]).get()
    
    # Compute frequency offset from the FFT bin (bins above Nyquist map to negative offsets)
    freq_offset = peak_bin * samp_rate / fft_size
    if freq_offset <= samp_rate / 2:
        freq = center_freq + freq_offset
    else:
        freq = center_freq - (samp_rate - freq_offset)
    
    # Compute pulse width in microseconds
    pulse_width = pulse_length / samp_rate * 1e6
    
    # Compute pulse repetition interval (PRI) if not the first pulse
    pri = None if i == 0 else (start - rising[i - 1]) / samp_rate * 1e6
    
    pulse_count += 1

    # For demonstration, print pulse details
    print(f"Pulse {pulse_count}: Amp = {amp:.3f}, "
          f"Freq = {freq/1e6:.3f} MHz, PW = {pulse_width:.2f} µs, "
          f"FFT Time = {fft_time:.2f} ms, PRI = {pri if pri is None else f'{pri:.2f} µs'}")

total_time = (time.perf_counter() - total_start) * 1000  # in ms
print(f"Total extraction time: {total_time:.2f} ms")
print(f"{pulse_count} pulses found")

My current pipeline does the following:

  1. Data Loading & Shared Memory:
    I load the data (1M samples) and copy it to GPU shared memory (using cusignal.get_shared_mem), then convert it to a CuPy array.
  2. GPU Processing:
    I perform IQ conversion, scaling, forming complex data, and computing the magnitude on the GPU.
  3. Pulse Detection:
    I apply a threshold on the GPU to create a boolean mask, then transfer that mask to the CPU for detecting rising and falling edges.
  4. FFT Processing (CPU):
    For each pulse detected, I transfer the pulse segment from GPU to CPU, apply windowing, zero-padding, and perform FFT on the CPU.
  5. Pulse Parameter Extraction:
    I compute amplitude, frequency offset, pulse width, and pulse repetition interval (PRI) from the FFT and the data.

My Goals and Questions:

  • Accelerating the Pipeline:
    How can I accelerate this code further by performing more work on the GPU (e.g., moving FFT processing to the GPU, batching pulse processing, reducing GPU–CPU transfers)? A rough, untested sketch of one candidate change (GPU-side edge detection) follows this list.
  • GPU Parallelization Pipeline:
    I’m particularly interested in ideas for a GPU-side parallelization pipeline. For instance, processing multiple pulses in parallel entirely on the GPU (e.g., by batching pulses into a 2D array for batched FFT computation) and minimizing host-device transfers.
    • Has anyone implemented a similar approach?
    • What techniques or frameworks (e.g., CuPy’s batched FFT, custom CUDA kernels, etc.) would you recommend?
    • How can I best manage concurrency using CUDA streams or other mechanisms to fully utilize GPU resources?
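For reference, here is the rough, untested sketch mentioned above: moving the edge detection onto the GPU with cp.diff and cp.where (variable names as in my code above), so that only the small edge-index arrays are copied to the host instead of the full boolean mask:

# GPU-side edge detection: avoid transferring the full 1M-element mask
mask = magnitude_data >= threshold
diff_gpu = cp.diff(mask.astype(cp.int8))
rising_gpu = cp.where(diff_gpu == 1)[0] + 1
falling_gpu = cp.where(diff_gpu == -1)[0] + 1
if bool(mask[0]):   # signal already above threshold at the first sample
    rising_gpu = cp.concatenate((cp.zeros(1, dtype=rising_gpu.dtype), rising_gpu))
if bool(mask[-1]):  # signal still above threshold at the last sample
    falling_gpu = cp.concatenate(
        (falling_gpu, cp.asarray([mask.size - 1], dtype=falling_gpu.dtype)))
# Only these small index arrays are copied back to host memory
rising, falling = cp.asnumpy(rising_gpu), cp.asnumpy(falling_gpu)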

Any guidance, sample code, or resources would be greatly appreciated.

Thank you in advance for your help!

Hi,

Do you need a Python sample?

GPU implementation of FFT

We have a cuFFT library and you can find some examples below:

cuda-samples/Samples/4_CUDA_Libraries/simpleCUFFT at master · NVIDIA/cuda-samples · GitHub

CUDALibrarySamples/cuFFT at master · NVIDIA/CUDALibrarySamples · GitHub
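Since your code already uses CuPy, note that cupy.fft is backed by cuFFT, so a batched FFT over many pulses can be a single call. Below is a minimal sketch, assuming the pulse segments have already been gathered into a zero-padded 2D batch on the GPU (num_pulses is just a placeholder):

import cupy as cp

fft_size = 4096
num_pulses = 32   # placeholder batch size
# One row per pulse, zero-padded to fft_size (gathering step not shown)
batch = cp.zeros((num_pulses, fft_size), dtype=cp.complex64)
window = cp.hanning(fft_size).astype(cp.float32)

batch *= window                                 # broadcast window over all rows
spectra = cp.fft.fft(batch, axis=1)             # one batched cuFFT execution
peak_bins = cp.argmax(cp.abs(spectra), axis=1)  # per-pulse peak bin, still on GPU
peak_bins_cpu = cp.asnumpy(peak_bins)           # only this tiny array goes to CPU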

Streams

With multiple streams, tasks can be executed concurrently on the GPU.
Some examples can be found here.
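For example, with CuPy streams (a minimal sketch; the batches here are placeholder data):

import cupy as cp

# Placeholder work items: independent batches of pulse rows
batches = [cp.random.random((32, 4096)).astype(cp.complex64) for _ in range(4)]

streams = [cp.cuda.Stream(non_blocking=True) for _ in range(2)]
results = []
for i, batch in enumerate(batches):
    with streams[i % 2]:                        # alternate between the two streams
        results.append(cp.abs(cp.fft.fft(batch, axis=1)))
for s in streams:
    s.synchronize()                             # wait for all queued work to finish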

Buffer

To access a buffer from both the CPU and the GPU, you can use unified memory or pinned memory.
Unified memory allocates two copies of the buffer, one on the GPU and the other on the CPU.
Users don't need to manage the copies themselves; the GPU driver auto-synchronizes them, which induces some overhead.

Pinned memory allocates page-locked host memory whose address won't change, so the GPU is allowed to access it.
However, GPU access to pinned memory is expected to be slightly slower than access to device memory.
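Both are available from CuPy directly. A minimal sketch (the buffer size is just an example):

import numpy as np
import cupy as cp

# Pinned (page-locked) host buffer: enables fast, async-capable copies
count = 1_000_000
mem = cp.cuda.alloc_pinned_memory(count * np.dtype(np.int16).itemsize)
host_buf = np.frombuffer(mem, dtype=np.int16, count=count)
# Fill host_buf from file/SDR, then cp.asarray(host_buf) copies it to the device

# Unified (managed) memory: subsequent CuPy allocations become visible
# to both CPU and GPU, synchronized by the driver
cp.cuda.set_allocator(cp.cuda.malloc_managed)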

Below is a document about memory on Jetson for your reference:

Thanks.
