I’m making a cuda version of a type of recurrent neural network and I’ve run into a parallelization problem. In essence the system looks like this:
[Feedback Memory]->[MLP Neural Net]
For those unfamiliar with the subject, implementing a standard neural net is very simple: it’s just a bunch of matrix multiplications with a non-linear function applied here and there. It’s quite parallelizable and no problem for CUDA, you just use all the data directly and you get two orders of magnitude faster results than on a CPU. As the neural net is static, there are no time dependencies in the data. Sample n+1 is decoupled from sample n-1.
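To make the contrast concrete, here is a minimal sketch (plain C++ rather than CUDA, and the function and variable names are my own) of why the static net parallelizes so well: every (sample, unit) output is independent of every other, so a GPU can assign one thread per output.

```cpp
#include <cmath>
#include <vector>

// Minimal sketch of a one-layer forward pass. Every (sample, unit)
// output is independent of every other, so on a GPU each one could
// be computed by its own thread.
std::vector<std::vector<float>> forward(
    const std::vector<std::vector<float>>& X,  // samples x features
    const std::vector<std::vector<float>>& W)  // features x hidden units
{
    std::size_t n = X.size(), d = W.size(), h = W[0].size();
    std::vector<std::vector<float>> Y(n, std::vector<float>(h, 0.0f));
    for (std::size_t i = 0; i < n; ++i)        // parallel over samples
        for (std::size_t j = 0; j < h; ++j) {  // parallel over units
            float s = 0.0f;
            for (std::size_t k = 0; k < d; ++k)
                s += X[i][k] * W[k][j];
            Y[i][j] = std::tanh(s);            // the non-linearity
        }
    return Y;
}
```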
The problem is in the feedback memory. With feedback memories, sample n+1 is a function of sample n. A typical data set has perhaps about a hundred features (columns) and tens of thousands of samples (rows). If you are forced to feed the GPU one sample at a time, you can at most get ~100 threads working simultaneously (equal to the number of features in the data), which is truly a misuse of the GPU and the performance will be comparable to CPU performance.
To give another, similar problem: suppose you have a few differential equations to solve, but need to do it over a large number of time steps.
This seems to me to be a very fundamental problem that either 1) has no solution (i.e. the problem is completely serial in nature and can't be parallelized) or 2) has a standard solution or approach.
So my question is whether anybody has dealt with a similar problem, or has any ideas on how to solve it?
What kind of differential equations do you need to solve? I’m fairly sure that people have done both ODE and PDE solvers on CUDA. Whether they will fit into your specific NN algorithm is another issue altogether, but you may be able to find some code if you search the forums here.
If the samples are decoupled, couldn’t you then run multiple identical (in terms of inputs, hidden layers, etc.) neural networks in parallel on different parts of the data (i.e. split your source data into chunks), then combine the nets together at the end?
I mentioned the differential equations just as an example of a similar problem - for my particular problem they are not needed.
As for the samples being decoupled, the problem is that they’re not. As we’re talking about time series data (hence the feedback memories) - each sample depends on past samples. In the illustration above I used a simple recurrent connection which forms an infinite impulse response filter. In that simplified case the influence of a past sample decays exponentially. In reality I use a structure with digital gates so that the memory depth is really unlimited and exact. In short the samples have to be processed in strict order.
The second part (the traditional neural net) is no problem for parallelization, but it is of little use if the feedback memory is the bottleneck.
Many apparently non-parallel problems can be made parallel with the use of the scan algorithm. It is not immediately apparent whether it could be applied in this case, but if you read up on the scan (there is a nice whitepaper in the SDK) and look at it as a big hammer you may find out that your problem is a nail.
Here is a short paper by someone else taking a stab at a GPU-based neural network. Perhaps it will give you some hints, or at least have some useful references to pre-GPU parallel implementations.
GPU Neural Net
Thank you, I’ll look into it. It may be possible to do some variation on it, but I’ll have to read up more and think a bit about it.
Do you work on stored or real-time streaming data? You might consider doing a sort of software pipelining perhaps?
Stored data, at least for now. I did consider a pipeline approach, but the depth of the serial process isn’t sufficient to make it worth implementing. I can at best process a few samples in parallel, but that is two or three orders of magnitude less than what would be needed to speed it up.
There is one aspect to this problem that I haven’t mentioned, and that is that I will be running an optimization algorithm (genetic or similar) on top of it. So in essence I’m not really interested in training one neural net at a time but a large number of them. This makes it possible (at least in theory) to get a fair amount of processing done even if it is sample-by-sample, as this could be done for many systems at once. It is still perhaps an order of magnitude short of the type of speedup you can get from running a standard static neural net (one at a time), but I may have to live with it.
I still have some experimentation and math to do for a modified scan algorithm, but it is likely it won’t work out. The scan algorithm can do this very well: output[i] = output[i-1] + f(input[i-1]), but my problem is (simplified) of the form output[i] = f(output[i-1]) - input[i]. This may seem like a trivial difference, but when you break down the scan algorithm it becomes apparent that it’s really not.