Sliding least-squares signal filter over 1D data: 1D or 2D for speed?

Question: How to performance-optimize this CUDA (numba) code?

Background: I am coding a least-squares fitter to determine the best signal shape in 1D data. Due to the non-converging nature of the problem, algorithms such as Levenberg–Marquardt can not be used. Instead, I compare a set of 1D signal shapes to the 1D data by sliding (signals are shorter than the data) and least squares calculations, to determine the best combination of position and signal. The code runs nice with Python + numba on the CPU. The amount of data to be analyzed, however, is astronomical. So I try to leverage my GTX 1060 using CUDA. This works in 1D:

@cuda.jit
def cuda1(data, signal, dy, chi2):
    i = cuda.grid(1)
    tx = cuda.threadIdx.x
    bx = cuda.blockIdx.x
    bw = cuda.blockDim.x
    i = tx + bx * bw
    xstride = cuda.gridsize(1)
    for i in range(i, chi2.shape[0], xstride):
        value = 0
        for j in range(len(signal)):
            value = value + ((data[i+j]-signal[j])**2) * dy[i+j]
        chi2[i] = value

And this version in 2D

@cuda.jit
def cuda2(data, signal, dy, chi2map):
    startX = cuda.blockDim.x * cuda.blockIdx.x + cuda.threadIdx.x
    startY = cuda.blockDim.y * cuda.blockIdx.y + cuda.threadIdx.y
    gridX = cuda.gridDim.x * cuda.blockDim.x;
    gridY = cuda.gridDim.y * cuda.blockDim.y;
    for x in range(startX, chi2map.shape[0], gridX):
        for y in range(startY, chi2map.shape[1], gridY):
            chi2map[x,y] = ((data[x+y]-signal[y])**2) * dy[x+y]

For reference, the CPU version:

@numba.jit(fastmath=True, parallel=False, cache=True, nopython=True)  
def cpu(data, signal, dy):
    chi2 = numpy.zeros(outer_loop_length)
    for i in range(outer_loop_length):
        value = 0
        for j in range(len(signal)):
            value = value + ((data[i+j]-signal[j])**2) * dy[i+j]
        chi2[i] = value
    return chi2

When I benchmark these with typical problem sizes

data_length = 50000
signal_length = 5000

excluding the time to transfer data back and forth (this will be re-used and thus negligible) I find that the CPU is faster by a factor of ~2:

  • Time cuda1: 0.069
  • Time cuda2: 0.084
  • Time CPU: 0.042

Are there obvious performance blockers? Have I already reached the limit? For my real data, I will have to run such an algorithm ~500 times on the same data (with changing “signal”), and then again ~10^3 times on different data for all signals. So speed is really important.

Thank you so much in advance for your help. The code to run the benchmark is as follows:

data = numpy.full(data_length, 3, dtype='float32')
data_cuda = cuda.to_device(data)
dy = numpy.full(data_length, 2, dtype='float32')
dy_cuda = cuda.to_device(dy)
signal = numpy.full(signal_length, 1, dtype='float32')
signal_cuda = cuda.to_device(signal)
outer_loop_length = data_length - signal_length + 1
C_global_mem = cuda.device_array(outer_loop_length, dtype='float32')

# Run cuda1 once to compile
cuda1[32,32](data_cuda, signal_cuda, dy_cuda, C_global_mem)
t1 = time.perf_counter()
cuda1[32,32](data_cuda, signal_cuda, dy_cuda, C_global_mem)
chi2 = numpy.array(C_global_mem, dtype=numpy.float32)
t2 = time.perf_counter()
print('Time cuda1', t2-t1)
print('Result to check:', numpy.sum(chi2))

# Run cuda2
chi2map = numpy.zeros((outer_loop_length, signal_length), dtype='float32')
cuda2[32,32](data_cuda, signal_cuda, dy_cuda, chi2map)
C_global_mem = cuda.device_array((outer_loop_length, signal_length), dtype='float32')
t1 = time.perf_counter()
cuda2[32,32](data_cuda, signal_cuda, dy_cuda, C_global_mem)
t2 = time.perf_counter()
chi2map = numpy.array(C_global_mem, dtype=numpy.float32)
row_sums = numpy.sum(chi2map, axis=1)
print('Time cuda2', t2-t1)
print('Result to check:', numpy.sum(row_sums))

# CPU version
result = cpu(data, signal, dy)
t1 = time.perf_counter()
result = cpu(data, signal, dy)
t2 = time.perf_counter()
print('CPU Time', t2-t1)
print('Result to check:', numpy.sum(result))

Low hanging fruits:

You could experiment with your for loops. Like partially unrolling them (manually). Swapping the i and j loops, to see if there is an effect on computation speed.

Shared memory use:

I found a code example that uses shared memory to reduce global memory accesses and speed up a matrix multiplication kernel. See if your predefined signal shapes could be preloaded into shared memory before correlating them with the signal. The shared memory size limit is 48kb per thread block on recent consumer nVidia GPUs. But even if the full signal shapes do not fit at once, consider correlating e.g. 16kb segments of the signal shapes with 32kb pieces of the signal. Each thread block could be working on a different part of the shape, or a different part of the signal. Partial results for the squared error can be accumulated in global memory using atomic operations.

Example shared memory code here:
https://numba.pydata.org/numba-doc/dev/cuda/examples.html

Advanced stuff (textures…):

You might want to see if you can place your signal arrays into a texture as shown in some pycuda examples
https://wiki.tiker.net/PyCuda/Examples/SobelFilter

this one uses a 2D texture of data type unsigned char. Your signal is probably a 1D signal of type float.

there is a discussion thread showing how to use both pycuda and numba together, to allow texture access inside numba
https://groups.google.com/a/continuum.io/forum/#!topic/numba-users/IKlwf1Kwmvc

Not sure if that is useful to you, as this seems like some pretty advanced hackery.

A general optimization idea that would apply to both the CPU as well as a GPU version:

Would a multi-resolution approach result in a speedup? i.e. slide a 16x downsampled version of the signal shapes over a 16x downsampled signal. There will be a bit of an overhead to successively downsample the signal a couple of times (applying a lowpass to prevent aliasing)

The initial signal matching should be up to 256 times faster than the computation at full resolution (the downsampled signal is 16x shorter and your signal shape is 16x shorter).

Make a list of the N best candidates of matched positions and signal shapes.

Then refine the candidate using the 8x downsampled signal and shapes… then with 4x, 2x, 1x downsampling.

Finally pick the overall best candidate after refinement.

Maybe you can also try skipping some powers of two, e.g. apply 64x, 16x, 4x, 1x downsampling.

Thank you for your effort - I appreciate it!

Some tests:

  • Loop unrolling: +-20%
  • Swapping i,j loops: +-10%
  • Shared memory: Usually "signal" fits in, but not "data" and not "dy". If I put only "signal" in shared memory, speed does not change
  • Advanced stuff: This is beyond my skill set :_(
  • General optimization using binning: Yes, this idea is in principle great. This can be implemented for a subset of the signals. Some signal types might qualify for a Levenberg-Marquard-type algorithm as well. Unfortunately a subset of the signals does not allow for binning, as there are very sharp features that get smeared even with 2x binning. But it is certainly something to keep in mind.

I’m a bit at a loss here. Granted, I have not much experience with CUDA. But the problem as such is very parallel in nature. In a simplified version, I have two 1D arrays of sizes 50k and 5k, and want to subtract and multiply these at all their shift positions. I very much believe that there is an algorithmic way to do this nearly perfectly in parallel which should be a dream job for a GPU. At the moment, I’m not sure if my understanding is totally wrong, or if I’m just lacking the skill to code it. Do you think hiring a professional for a day or two would help? As this is from academia (exoplanet transits), money is scarce, but a freelancer might work…

From the looks of it, your code is limited entirely by memory bandwidth, not by computational throughput (note the minimal amount of computation performed on each piece of data). You could use a profiler to confirm but it seems obvious.

Data sets this small are an excellent fit for the CPU’s caches. Caches provide extremely high bandwidth at very low latency, easily more bandwidth than the global memory of your GPU (~ 200 GB/sec theoretical, ~ 160 GB/sec effective).

The GPU’s shared memory may offer similar bandwidth, but your data set is too large to fit into shared memory. You could look into tiling the data where each tile fits into shared memory. I have no idea how to do that in Python, though. You could also try a GPU with significantly higher memory bandwidth. Professional models range up to 800 GB/sec or so, but may be prohibitively expensive. Use of a GTX 1080 Ti with about 480 GB/sec (theoretical) may be more practical.

In terms of performance on a CPU, I would look for a recent CPU model (generally better cache bandwidth) with high core clocks, then parallelize across cores if Python supports that.

For having a bigger subset of CUDA API features available, consider switching to pycuda or even CUDA C/C++ (using the runtime API)

There are nVidia GPUs such as the Tesla P4 and Tesla P40 that offer fast integer performance for up to 4x the throughput of floating point computations (at reduced accuracy). This can be useful if going to an 8 or 16 bit quantized signal is an option.

Astronomy buffs have been using this hardware feature for power efficient and fast correlation of radio signals, as hinted in this article https://devblogs.nvidia.com/mixed-precision-programming-cuda-8/

There is also the option to go to with half float precision. Some nVidia SKUs (notably NOT most consumer models) can do this with twice the arithmetic throughput of single precision computation.

https://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH____HALF2__ARITHMETIC.html#group__CUDA__MATH____HALF2__ARITHMETIC

I think that the memory bandwidth limits can be mitigated somewhat, but the numba package does not really provide the necessary feature set to do so.

Would you mind sharing a test dataset with known good reference results, allowing us to run the above numba scripts to reproduce the timings and verify correctness of any optimizations we might try?

If Python supports importing the data as half precision (fp16), that would certainly allow the code to process twice the number of data items within the same bandwidth limitations, even if the processing itself uses single precision (fp32), which is the only really fast arithmetic on reasonably-priced consumer GPUs.

This is a scenario that often works very well for processing physical, sensor-derived data: many sensors do not deliver more than 10-bit resolution anyhow, and the use of single precision for subsequent computation prevents harmful accumulation of round-off errors relative to the low-precision signals. I am highly skeptical of computing with half precision, as losing just a few bits to round-off error may not leave enough “good” bits in the final result to be useful.

I don’t know whether this works with Python, but in general GPU optimizations should be profiler driven. The CUDA profiler is pretty good at pointing out the bottlenecks in code running on the GPU.

Have you considered to train a Neural network to detect the signals you are looking for? Not too long ago there were media reports about radio astronomers finding previously undetected radio bursts in old data sets using AI. http://blpd0.ssl.berkeley.edu/frb-machine/CNNFRB_eprint.pdf