Jetson AGX Xavier Shared Memory GPU Signal Processing

I’m working on a real-time pulse detection and parameter extraction algorithm (PDW extractor) that processes IQ data. I’ve implemented a version that uses GPU acceleration (using CuPy and cuSignal) for parts of the pipeline (IQ conversion and magnitude calculation) and CPU for FFT processing. However, my current code still has bottlenecks because the FFT is performed on the CPU and there are several GPU–CPU transfers per pulse.

I’d like to accelerate the entire pipeline by performing as much work as possible on the GPU and fully parallelizing the processing. Specifically, I’m interested in advice on designing a GPU-side parallelization pipeline to:

  • Keep data resident in GPU memory using shared memory (via cusignal.get_shared_mem),
  • Process multiple pulses in parallel (batch extraction and FFT on the GPU), and
  • Minimize data transfers between GPU and CPU.

Below is my offline test version of the code, for reference:

import numpy as np
import cupy as cp
import cusignal
import time
from scipy import signal  # (optional, if you need additional signal processing)

# --------------------------
# 1. Load Data and Set Up GPU Shared Memory
# --------------------------

# Load 1M samples from file (assumes file "pri_1ms" exists)
sc16q11_data = np.fromfile("pri_1ms", dtype=np.int16)[:1_000_000]

# Create a GPU shared memory array using cuSignal
gpu_buffer = cusignal.get_shared_mem(1_000_000, dtype=np.int16)
gpu_buffer[:] = sc16q11_data

# --------------------------
# 2. GPU IQ Conversion and Magnitude Calculation
# --------------------------

# Convert the shared memory array to a CuPy array and reshape into IQ pairs
iq = cp.asarray(gpu_buffer).astype(cp.float32).reshape(-1, 2)
iq /= 2048.0  # Q11 scaling

# Convert to complex64 (IQ format) and compute magnitude on GPU
complex64_data = iq[:, 0] + 1j * iq[:, 1]
magnitude_data = cp.abs(complex64_data)

# --------------------------
# 3. Pulse Detection on GPU then Transfer Mask to CPU
# --------------------------

# Define threshold and other parameters
threshold = 0.1
samp_rate = 10e6       # Sample rate: 10 MHz
center_freq = 1e9      # Center frequency: 1 GHz

# Create a threshold mask on GPU
threshold_mask = magnitude_data >= threshold

# Transfer the threshold mask to CPU for edge detection using np.diff
threshold_mask_cpu = cp.asnumpy(threshold_mask)

# Find rising and falling edges of the pulses
diff = np.diff(threshold_mask_cpu.astype(np.int8))
rising = np.where(diff == 1)[0] + 1
falling = np.where(diff == -1)[0] + 1

# Handle edge cases if the first or last sample is above threshold
if threshold_mask_cpu[0]:
    rising = np.insert(rising, 0, 0)
if threshold_mask_cpu[-1]:
    falling = np.append(falling, len(threshold_mask_cpu) - 1)

# Ensure we have valid rising and falling pairs
if rising.size > 0 and falling.size > 0 and rising[0] > falling[0]:
    falling = falling[1:]

min_len = min(len(rising), len(falling))
rising = rising[:min_len]
falling = falling[:min_len]

# --------------------------
# 4. Set Up FFT Parameters (to Run on CPU)
# --------------------------

fft_size = 4096

# Compute a Hann window on CPU (since FFT will run on CPU)
window_cpu = np.hanning(fft_size).astype(np.float32)

# --------------------------
# 5. Process Each Pulse: GPU Extraction, Transfer to CPU, and CPU FFT
# --------------------------

pulse_count = 0
total_start = time.perf_counter()

for i in range(min_len):
    start = rising[i]
    end = falling[i]
    
    # Skip if indices are invalid or out-of-bounds
    if start >= end or end > len(complex64_data):
        continue
    
    pulse_length = end - start
    # Ignore pulses that are too short (e.g., noise)
    if pulse_length < 10:
        continue

    # --------------------------
    # a) Transfer Pulse Samples from GPU to CPU
    # --------------------------
    # We only transfer the small segment (the pulse) to CPU memory
    pulse_samples_cpu = cp.asnumpy(complex64_data[start:end])
    
    # --------------------------
    # b) Prepare the FFT Input on CPU with Windowing and Zero-Padding
    # --------------------------
    fft_input = np.zeros(fft_size, dtype=np.complex64)
    length = int(min(pulse_length, fft_size))
    # Multiply the pulse with the window before the FFT
    fft_input[:length] = pulse_samples_cpu[:length] * window_cpu[:length]
    
    # --------------------------
    # c) Perform FFT on CPU
    # --------------------------
    fft_start = time.perf_counter()
    fft_result = np.fft.fft(fft_input)
    fft_mag = np.abs(fft_result)
    peak_bin = np.argmax(fft_mag)
    fft_time = (time.perf_counter() - fft_start) * 1000  # in ms
    
    # --------------------------
    # d) Compute Pulse Parameters
    # --------------------------
    # Compute average amplitude (transfer computed value from GPU to CPU)
    amp = cp.mean(magnitude_data[start:end]).get()
    
    # Compute frequency offset from the FFT bin (bins above Nyquist map to negative offsets)
    freq_offset = peak_bin * samp_rate / fft_size
    if freq_offset <= samp_rate / 2:
        freq = center_freq + freq_offset
    else:
        freq = center_freq - (samp_rate - freq_offset)
    
    # Compute pulse width in microseconds
    pulse_width = pulse_length / samp_rate * 1e6
    
    # Compute pulse repetition interval (PRI) if not the first pulse
    pri = None if i == 0 else (start - rising[i - 1]) / samp_rate * 1e6
    
    pulse_count += 1

    # For demonstration, print pulse details
    print(f"Pulse {pulse_count}: Amp = {amp:.3f}, "
          f"Freq = {freq/1e6:.3f} MHz, PW = {pulse_width:.2f} µs, "
          f"FFT Time = {fft_time:.2f} ms, PRI = {pri if pri is None else f'{pri:.2f} µs'}")

total_time = (time.perf_counter() - total_start) * 1000  # in ms
print(f"Total extraction time: {total_time:.2f} ms")
print(f"{pulse_count} pulses found")

My current pipeline does the following:

  1. Data Loading & Shared Memory:
    I load the data (1M samples) and copy it to GPU shared memory (using cusignal.get_shared_mem), then convert it to a CuPy array.
  2. GPU Processing:
    I perform IQ conversion, scaling, forming complex data, and computing the magnitude on the GPU.
  3. Pulse Detection:
    I apply a threshold on the GPU to create a boolean mask, then transfer that mask to the CPU for detecting rising and falling edges.
  4. FFT Processing (CPU):
    For each pulse detected, I transfer the pulse segment from GPU to CPU, apply windowing, zero-padding, and perform FFT on the CPU.
  5. Pulse Parameter Extraction:
    I compute amplitude, frequency offset, pulse width, and pulse repetition interval (PRI) from the FFT and the data.

My Goals and Questions:

  • Accelerating the Pipeline:
    How can I accelerate this code further by performing more work on the GPU (e.g., moving FFT processing to the GPU, batching pulse processing, reducing GPU–CPU transfers)? A rough, untested sketch of one candidate change (GPU-side edge detection) follows this list.
  • GPU Parallelization Pipeline:
    I’m particularly interested in ideas for a GPU-side parallelization pipeline. For instance, processing multiple pulses in parallel entirely on the GPU (e.g., by batching pulses into a 2D array for batched FFT computation) and minimizing host-device transfers.
    • Has anyone implemented a similar approach?
    • What techniques or frameworks (e.g., CuPy’s batched FFT, custom CUDA kernels, etc.) would you recommend?
    • How can I best manage concurrency using CUDA streams or other mechanisms to fully utilize GPU resources?
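For reference, here is the rough, untested sketch mentioned above: moving the edge detection onto the GPU with cp.diff and cp.where (variable names as in my code above), so that only the small edge-index arrays are copied to the host instead of the full boolean mask:

# GPU-side edge detection: avoid transferring the full 1M-element mask
mask = magnitude_data >= threshold
diff_gpu = cp.diff(mask.astype(cp.int8))
rising_gpu = cp.where(diff_gpu == 1)[0] + 1
falling_gpu = cp.where(diff_gpu == -1)[0] + 1
if bool(mask[0]):   # signal already above threshold at the first sample
    rising_gpu = cp.concatenate((cp.zeros(1, dtype=rising_gpu.dtype), rising_gpu))
if bool(mask[-1]):  # signal still above threshold at the last sample
    falling_gpu = cp.concatenate(
        (falling_gpu, cp.asarray([mask.size - 1], dtype=falling_gpu.dtype)))
# Only these small index arrays are copied back to host memory
rising, falling = cp.asnumpy(rising_gpu), cp.asnumpy(falling_gpu)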

Any guidance, sample code, or resources would be greatly appreciated.

Thank you in advance for your help!

Hi,

Do you need a Python sample?

GPU implementation of FFT

We have a cuFFT library and you can find some examples below:

cuda-samples/Samples/4_CUDA_Libraries/simpleCUFFT at master · NVIDIA/cuda-samples · GitHub

CUDALibrarySamples/cuFFT at master · NVIDIA/CUDALibrarySamples · GitHub
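Since your code already uses CuPy, note that cupy.fft is backed by cuFFT, so a batched FFT over many pulses can be a single call. Below is a minimal sketch, assuming the pulse segments have already been gathered into a zero-padded 2D batch on the GPU (num_pulses is just a placeholder):

import cupy as cp

fft_size = 4096
num_pulses = 32   # placeholder batch size
# One row per pulse, zero-padded to fft_size (gathering step not shown)
batch = cp.zeros((num_pulses, fft_size), dtype=cp.complex64)
window = cp.hanning(fft_size).astype(cp.float32)

batch *= window                                 # broadcast window over all rows
spectra = cp.fft.fft(batch, axis=1)             # one batched cuFFT execution
peak_bins = cp.argmax(cp.abs(spectra), axis=1)  # per-pulse peak bin, still on GPU
peak_bins_cpu = cp.asnumpy(peak_bins)           # only this tiny array goes to CPU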

Streams

With multiple streams, tasks can be executed concurrently on the GPU.
Some examples can be found here.
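For example, with CuPy streams (a minimal sketch; the batches here are placeholder data):

import cupy as cp

# Placeholder work items: independent batches of pulse rows
batches = [cp.random.random((32, 4096)).astype(cp.complex64) for _ in range(4)]

streams = [cp.cuda.Stream(non_blocking=True) for _ in range(2)]
results = []
for i, batch in enumerate(batches):
    with streams[i % 2]:                        # alternate between the two streams
        results.append(cp.abs(cp.fft.fft(batch, axis=1)))
for s in streams:
    s.synchronize()                             # wait for all queued work to finish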

Buffer

To access a buffer from both the CPU and the GPU, you can use unified memory or pinned memory.
Unified memory allocates two copies of the buffer, one on the GPU and the other on the CPU.
Users don't need to manage the copies themselves; the GPU driver auto-synchronizes them, which induces some overhead.

Pinned memory allocates page-locked host memory whose address won't change, so the GPU is allowed to access it.
However, GPU access to pinned memory is expected to be slightly slower than access to device memory.
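Both are available from CuPy directly. A minimal sketch (the buffer size is just an example):

import numpy as np
import cupy as cp

# Pinned (page-locked) host buffer: enables fast, async-capable copies
count = 1_000_000
mem = cp.cuda.alloc_pinned_memory(count * np.dtype(np.int16).itemsize)
host_buf = np.frombuffer(mem, dtype=np.int16, count=count)
# Fill host_buf from file/SDR, then cp.asarray(host_buf) copies it to the device

# Unified (managed) memory: subsequent CuPy allocations become visible
# to both CPU and GPU, synchronized by the driver
cp.cuda.set_allocator(cp.cuda.malloc_managed)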

Below is a document about memory on Jetson for your reference:

Thanks.
