Sliding least-squares signal filter over 1D data: 1D or 2D for speed?

michael123 · October 2, 2018, 7:59pm

Question: How to performance-optimize this CUDA (numba) code?

Background: I am coding a least-squares fitter to determine the best signal shape in 1D data. Due to the non-converging nature of the problem, algorithms such as Levenberg–Marquardt can not be used. Instead, I compare a set of 1D signal shapes to the 1D data by sliding (signals are shorter than the data) and least squares calculations, to determine the best combination of position and signal. The code runs nice with Python + numba on the CPU. The amount of data to be analyzed, however, is astronomical. So I try to leverage my GTX 1060 using CUDA. This works in 1D:

@cuda.jit
def cuda1(data, signal, dy, chi2):
    i = cuda.grid(1)
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    i = tx + bx * bw
    xstride = cuda.gridsize(1)
    for i in range(i, chi2.shape[0], xstride):
        value = 0
        for j in range(len(signal)):
            value = value + ((data[i+j]-signal[j])**2) * dy[i+j]
        chi2[i] = value

And this version in 2D

@cuda.jit
def cuda2(data, signal, dy, chi2map):
    startX = cuda.blockDim.x * cuda.blockIdx.x + cuda.threadIdx.x
    startY = cuda.blockDim.y * cuda.blockIdx.y + cuda.threadIdx.y
    gridX = cuda.gridDim.x * cuda.blockDim.x;
    gridY = cuda.gridDim.y * cuda.blockDim.y;
    for x in range(startX, chi2map.shape[0], gridX):
        for y in range(startY, chi2map.shape[1], gridY):
            chi2map[x,y] = ((data[x+y]-signal[y])**2) * dy[x+y]

For reference, the CPU version:

@numba.jit(fastmath=True, parallel=False, cache=True, nopython=True)  
def cpu(data, signal, dy):
    chi2 = numpy.zeros(outer_loop_length)
    for i in range(outer_loop_length):
        value = 0
        for j in range(len(signal)):
            value = value + ((data[i+j]-signal[j])**2) * dy[i+j]
        chi2[i] = value
    return chi2

When I benchmark these with typical problem sizes

data_length = 50000
signal_length = 5000

excluding the time to transfer data back and forth (this will be re-used and thus negligible) I find that the CPU is faster by a factor of ~2:

Time cuda1: 0.069
Time cuda2: 0.084
Time CPU: 0.042

Are there obvious performance blockers? Have I already reached the limit? For my real data, I will have to run such an algorithm ~500 times on the same data (with changing “signal”), and then again ~10^3 times on different data for all signals. So speed is really important.

Thank you so much in advance for your help. The code to run the benchmark is as follows:

data = numpy.full(data_length, 3, dtype='float32')
data_cuda = cuda.to_device(data)
dy = numpy.full(data_length, 2, dtype='float32')
dy_cuda = cuda.to_device(dy)
signal = numpy.full(signal_length, 1, dtype='float32')
signal_cuda = cuda.to_device(signal)
outer_loop_length = data_length - signal_length + 1
C_global_mem = cuda.device_array(outer_loop_length, dtype='float32')

# Run cuda1 once to compile
cuda1[32,32](data_cuda, signal_cuda, dy_cuda, C_global_mem)
t1 = time.perf_counter()
cuda1[32,32](data_cuda, signal_cuda, dy_cuda, C_global_mem)
chi2 = numpy.array(C_global_mem, dtype=numpy.float32)
t2 = time.perf_counter()
print('Time cuda1', t2-t1)
print('Result to check:', numpy.sum(chi2))

# Run cuda2
chi2map = numpy.zeros((outer_loop_length, signal_length), dtype='float32')
cuda2[32,32](data_cuda, signal_cuda, dy_cuda, chi2map)
C_global_mem = cuda.device_array((outer_loop_length, signal_length), dtype='float32')
t1 = time.perf_counter()
cuda2[32,32](data_cuda, signal_cuda, dy_cuda, C_global_mem)
t2 = time.perf_counter()
chi2map = numpy.array(C_global_mem, dtype=numpy.float32)
row_sums = numpy.sum(chi2map, axis=1)
print('Time cuda2', t2-t1)
print('Result to check:', numpy.sum(row_sums))

# CPU version
result = cpu(data, signal, dy)
t1 = time.perf_counter()
result = cpu(data, signal, dy)
t2 = time.perf_counter()
print('CPU Time', t2-t1)
print('Result to check:', numpy.sum(result))

cbuchner1 · October 4, 2018, 2:15pm

Low hanging fruits:

You could experiment with your for loops. Like partially unrolling them (manually). Swapping the i and j loops, to see if there is an effect on computation speed.

Shared memory use:

I found a code example that uses shared memory to reduce global memory accesses and speed up a matrix multiplication kernel. See if your predefined signal shapes could be preloaded into shared memory before correlating them with the signal. The shared memory size limit is 48kb per thread block on recent consumer nVidia GPUs. But even if the full signal shapes do not fit at once, consider correlating e.g. 16kb segments of the signal shapes with 32kb pieces of the signal. Each thread block could be working on a different part of the shape, or a different part of the signal. Partial results for the squared error can be accumulated in global memory using atomic operations.

Example shared memory code here:
https://numba.pydata.org/numba-doc/dev/cuda/examples.html

Advanced stuff (textures…):

You might want to see if you can place your signal arrays into a texture as shown in some pycuda examples
https://wiki.tiker.net/PyCuda/Examples/SobelFilter

this one uses a 2D texture of data type unsigned char. Your signal is probably a 1D signal of type float.

there is a discussion thread showing how to use both pycuda and numba together, to allow texture access inside numba
[url]Redirecting to Google Groups

Not sure if that is useful to you, as this seems like some pretty advanced hackery.

cbuchner1 · October 4, 2018, 3:24pm

A general optimization idea that would apply to both the CPU as well as a GPU version:

Would a multi-resolution approach result in a speedup? i.e. slide a 16x downsampled version of the signal shapes over a 16x downsampled signal. There will be a bit of an overhead to successively downsample the signal a couple of times (applying a lowpass to prevent aliasing)

The initial signal matching should be up to 256 times faster than the computation at full resolution (the downsampled signal is 16x shorter and your signal shape is 16x shorter).

Make a list of the N best candidates of matched positions and signal shapes.

Then refine the candidate using the 8x downsampled signal and shapes… then with 4x, 2x, 1x downsampling.

Finally pick the overall best candidate after refinement.

Maybe you can also try skipping some powers of two, e.g. apply 64x, 16x, 4x, 1x downsampling.

michael123 · October 4, 2018, 6:55pm

Thank you for your effort - I appreciate it!

Some tests:

Loop unrolling: +-20%
Swapping i,j loops: +-10%
Shared memory: Usually "signal" fits in, but not "data" and not "dy". If I put only "signal" in shared memory, speed does not change
Advanced stuff: This is beyond my skill set :_(
General optimization using binning: Yes, this idea is in principle great. This can be implemented for a subset of the signals. Some signal types might qualify for a Levenberg-Marquard-type algorithm as well. Unfortunately a subset of the signals does not allow for binning, as there are very sharp features that get smeared even with 2x binning. But it is certainly something to keep in mind.

I’m a bit at a loss here. Granted, I have not much experience with CUDA. But the problem as such is very parallel in nature. In a simplified version, I have two 1D arrays of sizes 50k and 5k, and want to subtract and multiply these at all their shift positions. I very much believe that there is an algorithmic way to do this nearly perfectly in parallel which should be a dream job for a GPU. At the moment, I’m not sure if my understanding is totally wrong, or if I’m just lacking the skill to code it. Do you think hiring a professional for a day or two would help? As this is from academia (exoplanet transits), money is scarce, but a freelancer might work…

njuffa · October 4, 2018, 7:46pm

From the looks of it, your code is limited entirely by memory bandwidth, not by computational throughput (note the minimal amount of computation performed on each piece of data). You could use a profiler to confirm but it seems obvious.

Data sets this small are an excellent fit for the CPU’s caches. Caches provide extremely high bandwidth at very low latency, easily more bandwidth than the global memory of your GPU (~ 200 GB/sec theoretical, ~ 160 GB/sec effective).

The GPU’s shared memory may offer similar bandwidth, but your data set is too large to fit into shared memory. You could look into tiling the data where each tile fits into shared memory. I have no idea how to do that in Python, though. You could also try a GPU with significantly higher memory bandwidth. Professional models range up to 800 GB/sec or so, but may be prohibitively expensive. Use of a GTX 1080 Ti with about 480 GB/sec (theoretical) may be more practical.

In terms of performance on a CPU, I would look for a recent CPU model (generally better cache bandwidth) with high core clocks, then parallelize across cores if Python supports that.

cbuchner1 · October 4, 2018, 9:23pm

For having a bigger subset of CUDA API features available, consider switching to pycuda or even CUDA C/C++ (using the runtime API)

There are nVidia GPUs such as the Tesla P4 and Tesla P40 that offer fast integer performance for up to 4x the throughput of floating point computations (at reduced accuracy). This can be useful if going to an 8 or 16 bit quantized signal is an option.

Astronomy buffs have been using this hardware feature for power efficient and fast correlation of radio signals, as hinted in this article https://devblogs.nvidia.com/mixed-precision-programming-cuda-8/

There is also the option to go to with half float precision. Some nVidia SKUs (notably NOT most consumer models) can do this with twice the arithmetic throughput of single precision computation.

[url]https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF2__ARITHMETIC.html#group__CUDA__MATH____HALF2__ARITHMETIC[/url]

I think that the memory bandwidth limits can be mitigated somewhat, but the numba package does not really provide the necessary feature set to do so.

Would you mind sharing a test dataset with known good reference results, allowing us to run the above numba scripts to reproduce the timings and verify correctness of any optimizations we might try?

njuffa · October 4, 2018, 9:38pm

If Python supports importing the data as half precision (fp16), that would certainly allow the code to process twice the number of data items within the same bandwidth limitations, even if the processing itself uses single precision (fp32), which is the only really fast arithmetic on reasonably-priced consumer GPUs.

This is a scenario that often works very well for processing physical, sensor-derived data: many sensors do not deliver more than 10-bit resolution anyhow, and the use of single precision for subsequent computation prevents harmful accumulation of round-off errors relative to the low-precision signals. I am highly skeptical of computing with half precision, as losing just a few bits to round-off error may not leave enough “good” bits in the final result to be useful.

I don’t know whether this works with Python, but in general GPU optimizations should be profiler driven. The CUDA profiler is pretty good at pointing out the bottlenecks in code running on the GPU.

cbuchner1 · October 5, 2018, 7:36am

Have you considered to train a Neural network to detect the signals you are looking for? Not too long ago there were media reports about radio astronomers finding previously undetected radio bursts in old data sets using AI. http://blpd0.ssl.berkeley.edu/frb-machine/CNNFRB_eprint.pdf

Topic		Replies	Views
What can't you do in CUDA that you'd like? Requests for the future CUDA Programming and Performance	407	134572	May 26, 2010
CUDA image processing Accelaration tips anyone? CUDA Programming and Performance	20	6063	November 16, 2010
GPU/CPU precision comparison and Kernel instructions question CUDA Programming and Performance	5	679	April 4, 2017
CUDA 2.1 FAQ Please read before posting CUDA Programming and Performance	10	210990	January 18, 2014
Cuda code performance CUDA Programming and Performance	14	3145	December 16, 2014
What are you guys doing with cuda? just wanna find a way to go CUDA Programming and Performance	81	56090	February 7, 2013
Nonlinear least-squares? CUDA Programming and Performance	19	13416	October 12, 2021
Is CUDA right for me? (FDTD) FDTD user needs fast computations while handling massive 3-D arrays CUDA Programming and Performance	17	22418	December 29, 2008
CUDA very slow performance CUDA Programming and Performance	21	16720	March 6, 2020
Cuda Latency problems Slow Cuda CUDA Programming and Performance	15	13932	September 5, 2008

Sliding least-squares signal filter over 1D data: 1D or 2D for speed?

Related topics