Another CUDA implementation of a Dynamic Programming problem

This is a useful problem I converted into a CUDA implementation, and this prototype can be applied to a variety of DP or matching problems.

Problems of this type cannot be solved quickly using expensive software such as MATLAB, and also are even more difficult to map to multi-core CPU implementations.

But the CUDA model does work here and runs about 115-120 times faster than an 3.9 GHZ CPU implementation(including all memory copies and memsets).

Apx running time is (NumBoxes+1)x(2^(NumColors)) x NumColors + (NumBoxes+1)x(2^(NumColors)) + NumBoxes x NumColors x NumBoxes.

link to README:

link to both CPU code and GPU code of the problem:

I wish more GPU educational material(this includes you Nvidia) would stop with the GPU pseudocode examples and post CPU and GPU codes next to each other for comparison. With GPU programming implementation does matter, which is why clear uncluttered examples are needed. Since most people learn serial CPU programming first, it only makes sense to show GPU code next to its CPU equivalent.

This CUDA implementation has good examples of how to use atomic(s), shared memory, and multi-dimensional thread blocks.

In addition I have a slightly faster version which also returns not only the minimum number of moves needed, but also the specific moves made to accomplish that optimal answer. PM me for that version.

I may not have a PHd, but I still can make fast correct code.

There are libraries which handle the FFT, sorting, and Linear Algebra routines, but none for Graph/Brute Force/Dynamic Programming problems. This is why such work is important, because these types of problems are derived from real-life examples.