ML + Time series

I have coded a machine-learning application in C++ that makes use of various time series, for example temperature readings. The application currently runs entirely on the CPU, but it is excruciatingly slow for large time series (say 500k elements), which is the usual size I work with. When I say slow, think months to finish!

After having read through the programming guide and looked at my code, I can see that some things are not ideal for GPU computing, although I think a good speedup can be achieved. Since I am new to this, I wonder, for example, how I can optimize code like the following for GPUs.

Naive implementation:

void TestFunc(float alpha, float beta, float *in, float *out, int numPoints)
{
  // assumes out[0] has been initialized by the caller
  for (int i = 1; i < numPoints; i++) {
    out[i] = alpha * in[i] + beta * out[i-1];
  }
}

My problem is the reference to the previous value of out. I have seen prefix sum implementations that look similar, but they are well above my current level. I reference previous values a lot in my calculations, so solving this problem is important to me.

Any ideas welcome!


I do not know if this is possible in your specific computational domain, but perhaps you can process hundreds of independent time series in parallel. Then the structure of the above loop hardly changes, except that it runs on many threads. Also consider interleaving the time series in GPU memory, so that you get fully coalesced read access.

There are parallel solutions to recurrence relations, which may be relevant. If a parallel prefix sum is “above your level”, then such an investigation may be as well. I think the seminal paper here is by Kogge and Stone. As cbuchner1 has pointed out, if you have a lot of these to solve, then coarse-grained parallelism may be a better (easier) approach than fine-grained parallelism.

Great, got some pointers from this for further investigation. Unfortunately, cbuchner1's approach won't work for me. I think I can solve this using a Kogge–Stone adder. A good look at a prefix sum implementation should help as well.

Another question I have is related to ordered execution of threads in CUDA. I have one for loop where I need the iterations to execute in parallel but in an ordered or sequential fashion. Is this possible in CUDA?


Probably not in the way that you have in mind. After all, if such a thing were trivially possible and efficient, we wouldn’t be discussing things like parallel prefix sums, etc.

You certainly can, in an ugly fashion, “force” CUDA threads to execute in any order you wish. Doing so in an attempt to run a naive serial algorithm such as a prefix sum or recurrence relation, where each thread would depend on the result from the “previous” thread, would completely eliminate any efficiencies from parallel execution and would likely run slower than a naive serial single-threaded implementation.

It is for this reason that parallel prefix sums and parallel reductions exist, distinctly from their naive serial cousins.

Again, to extend the idea put forth by cbuchner1, a certain form of coarse-grained sequential activity could work, as long as there is some fine-grained parallel work to keep a warp busy. So if you had a group of threads that could compute their work in parallel (independently, let's say), but you also wanted to impose some coarse-grained sequential order, such that those groups of threads collectively iterate through a coarse-grained for-loop, that might be possible, i.e. efficient.

As has been pointed out, a parallel prefix scan is essential (there is an easy-to-read white paper by Nvidia included with the sample code). Parallel prefix scan is a very common operation, so it may be a good idea to learn it anyway.

To break the problem down into simple steps:
A = alpha, B = beta, N = number of points.

Create an Aux array = [B^(N-1), B^(N-2), …, B^0] in ceil(lg(N)) iterations.
Multiply (pointwise) the Aux array with the Input array, yielding the Input array.
Multiply (scalar) the Input array with A, yielding the Input array.
Perform a prefix sum (forward-inclusive) on the Input array, yielding the Output array.
Divide (pointwise) the Output array by the Aux array, undoing the weighting from the second step.

Thanks for your answers. Having thought about how to execute my code better, perhaps a combined coarse-grained + fine-grained parallel execution could be possible. I am still getting used to thinking in parallel.

Prefix sum seems very complex at this point, but it is probably necessary to understand if I want to proceed.


Actually, my general view is that for operations more complicated than a reduction, I would encourage general CUDA programmers to avoid writing their own if a library function is available. In the case of prefix sums, reductions, and various other hard-to-write parallel operations, there are library functions in thrust, cub, and other libraries we could use.

Here’s a worked example of:

  1. Your original test function
  2. the recurrence algorithm proposed by Vectorizer in ordinary serial C-code
  3. a parallel realization of 2 in thrust
$ cat t12.cu
#include <stdio.h>

#include <math.h>
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
#include <thrust/functional.h>

#define N 10
#define B 2
#define A 3

typedef int mytype;

void TestFunc(mytype alpha, mytype beta, mytype *in, mytype *out, int numPoints)
{
  for (int i = 1; i < numPoints; i++) {
    out[i] = alpha * in[i] + beta * out[i-1];
  }
}

struct my_pow_func
{
  mytype r;
  my_pow_func(mytype _r) : r(_r) {}
  __host__ __device__
  mytype operator()(mytype p){
    return powf(r,p);
  }
};

struct my_scale_func
{
  mytype r;
  my_scale_func(mytype _r) : r(_r) {}
  __host__ __device__
  mytype operator()(mytype p){
    return p*r;
  }
};

int main(){

  mytype in[N], out[N], test[N], aux[N];

  // test
  for (int i = 0; i < N; i++)
    in[i] = 1;
  thrust::device_vector<mytype> d_in(in, in+N);
  test[0] = A*in[0];
  TestFunc(A, B, in, test, N);

  //serial method
  aux[N-1] = 1;                // aux[i] = B^(N-1-i)
  for (int i = N-2; i >= 0; i--)
    aux[i] = aux[i+1]*B;
  for (int i = 0; i < N; i++)
    in[i] = aux[i]*in[i];
  for (int i = 0; i < N; i++)
    in[i] = A*in[i];
  out[0] = in[0];
  for (int i = 1; i < N; i++)
    out[i] = out[i-1] + in[i];
  for (int i = 0; i < N; i++)
    out[i] = out[i]/aux[i];    // undo the weighting (exact for these integers)

  //using thrust
  thrust::device_vector<mytype> d_aux(N);
  thrust::device_vector<mytype> d_out(N);

  thrust::sequence(d_aux.rbegin(), d_aux.rend());  // d_aux[i] = N-1-i
  thrust::transform(d_aux.begin(), d_aux.end(), d_aux.begin(), my_pow_func(B));
  thrust::transform(d_aux.begin(), d_aux.end(), d_in.begin(), d_in.begin(), thrust::multiplies<mytype>());
  thrust::transform(d_in.begin(), d_in.end(), d_in.begin(), my_scale_func(A));
  thrust::inclusive_scan(d_in.begin(), d_in.end(), d_out.begin());
  thrust::transform(d_out.begin(), d_out.end(), d_aux.begin(), d_out.begin(), thrust::divides<mytype>());
  for (int i = 0; i < N; i++){
    mytype tres = d_out[i];
    printf("%d: %f, %f, %f\n", i, (float)test[i], (float)out[i], (float)tres);
  }
  return 0;
}

$ nvcc -o t12 t12.cu
$ ./t12
0: 3.000000, 3.000000, 3.000000
1: 9.000000, 9.000000, 9.000000
2: 21.000000, 21.000000, 21.000000
3: 45.000000, 45.000000, 45.000000
4: 93.000000, 93.000000, 93.000000
5: 189.000000, 189.000000, 189.000000
6: 381.000000, 381.000000, 381.000000
7: 765.000000, 765.000000, 765.000000
8: 1533.000000, 1533.000000, 1533.000000
9: 3069.000000, 3069.000000, 3069.000000


Thanks a lot for your example! Now it's starting to look like a feasible endeavor. Besides Thrust, I also looked into ArrayFire. Thrust looks really easy to use, but as always it will be a question of performance vs. convenience. At least now I feel that this is possible; I just want to get the execution time of my code down from months to days.

Thanks a lot for your help!