noob questions their intended 'use case' for parallel programming with CUDA


I lately completed Lesson 1 of NVIDIA’s Udacity course, ‘Intro to Parallel Programming’, which has led me to question my intended use case for CUDA. I seek to confirm whether my use case is appropriate, here.

I would take take arbitrarily many different mathematical functions and evaluate each function in parallel on the same set of values (domain) to value the functions (range).

As I understand it, this intention turns the standard usage of CUDA and parallel processing on its head: instead of a vector of values, I have a vector of functions for evaluation.

Thank you for any comment, including, as appropriate, suggested alternative approaches.


It is difficult to give specific feedback based on a pretty vague scenario.

How similar are these mathematical functions? Are they sets of polynomials with different coefficients for example? Is there a unifying representation such as matrix exponentiation that could be used to express these functions? How many different functions are there and how many different values do they operate on? Does each function operate on more than one value?

Can you describe the use case at a somewhat higher (more abstract) level? Is this in the context of approximation, or integration, for example? Are you implementing a well-known or named algorithm? If so, what is that algorithm?

For a GPU implementation, you need good parallelism, as ideally you want to be running in the high thousands to tens of thousands threads in parallel, with as little control flow divergence between threads as possible. Depending on the specifics of a particular use case, it may be possible to trade-off uniformity of control flow with redundancy of computation.

Thank you for your response, njuffa.

I do not seek to solve a system of equations. I am not implementing any approximation, integration or known algorithm.

My goal is to understand the computing constraints that apply given a goal to obtain the values for a set of polynomial functions given a set of coefficients as I increase the number of coefficients and the number of functions. There will be many more functions than coefficients. After evaluating the set of functions, I would rank them.

Contrary to what I have seen in Nvidia’s ‘Intro to Parallelization’ course, I would expect to have many more functions than coefficients. There, I saw good parallelism indicated by a large vector of values, whereas I expect a large number of polynomials and few values.

Thank you for any additional comment you would offer.


From your description, this does sound like some sort of approximation or optimization problem to me, where a fitness function (e.g. error bound) is used to rank polynomials based on an evaluation at a few selected points. So in this case the sets of coefficients for the polynomials are really your data, and there seem to be many such sets, meaning lots of potential parallelism.

You don’t describe how the ranking affects the selection of those coefficients, so there may be a serial dependency there. But maybe ranking can be deferred to increase the parallelism? E.g. increase the number of coefficient sets tried, a bit brute force, before ranking them.

If all the functions are polynomials, they should fit nicely into a non-divergent control flow for evaluating them. My recommendation would simply be to start with a prototype, use the profiler to see what you can learn about the bottlenecks in your code, then devise strategies to alleviate those bottlenecks.

Good afternoon…I think from a mathematical definition, what you are talking about is basically an approximation of a path integral; (so a function maps a point to a point, a functional maps a function to a point-look up Richard Feynman and his path integral approximation methods);

If this is correct, that is a good application for parallel programing because basically you have the same functional being evaluated over an infinite (clearly approximation of infinite here) number of paths.

let us know how it goes!


njuffa and rick, thank you both for your kind attention.

Let me take a final try at characterizing my problem.

A user is interested to know the values of an arbitrarily large set of polynomials that result from a set of coefficients. It is typical that there will be many more polynomials for evaluation than members of the given coefficients. The user may or may not like to reason about the system based on the results of the set of polynomials over successive sets of coefficients. What is a computationally efficient way to deliver the user their values for a given set of coefficients?

I expect this is a poor case for CUDA, because there will be so many more functions than values. In CUDA, I see that a kernel is one function to be evaluated simultaneously for many values. Meanwhile, in the case with which I am concerned, it may be that many, many functions are to be evaluated for a single coefficient.

I allow that my writing on this has been inadequate. I hope this last attempt is useful to finally clarify/confirm.


I am not sure I understand what is meant by “there will be many more polynomials for evaluation than members of the given coefficients”. A set of coefficients defines a polynomial, assuming the set is ordered in some way. For example, under one possible convention the coefficient set {a, b, c} defines the polynomial ax**2 + bx + c, while {q, r, s, t} defines the polynomial qx**3 + rx**2 + s*x + t.

What I am envisioning is this: There is code that evaluates a polynomial (using Horner’s scheme, Estrin’s scheme, or whatever). Each thread reads one set of coefficients and then loops over the small set of specified input values to produce output values by evaluating the polynomials with those coefficients. There are as many threads as there are sets of coefficients. Since you stated there are a lot of such sets (which I take to means thousands) you have sufficient parallelism. As long as the polynomials have roughly the same degree (number of terms to evaluate) performance impact from thread divergence should be minor.