Best practice for function-scope tabular data on the device

More newbie “how do I port X” questions, I’m afraid…

I have a lot of functions which use tabular data, they look something like:

template <class T>
T foo(T arg)
{
   static const T data[] = { /* Initializer list */ };
   return data[0] * arg;   /* stand-in for whatever computation uses data */
}

The problem I have is that nvcc doesn’t allow the static const declaration at function scope - it gives an error message saying the variable should be marked __shared__, but when I do that it complains about the initializer list :(

I can remove the static keyword when compiling for the device, and that works fine, but it is presumably less efficient and certainly uses up valuable stack space. So is there a better way of doing this?
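
For reference, a rough sketch of the de-static’d workaround I’m using at the moment (the macro name and the __host__ __device__ decoration are purely illustrative; the real code hides these behind the library’s own macros):

#ifdef __CUDA_ARCH__
#  define TABLE_STORAGE             /* device pass: plain automatic array */
#else
#  define TABLE_STORAGE static      /* host pass: keep the static storage */
#endif

template <class T>
__host__ __device__ T foo(T arg)
{
   TABLE_STORAGE const T data[] = { T(1), T(2), T(3) };  /* stand-in coefficients */
   return data[0] * arg;                                 /* stand-in computation */
}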

GPUs offer massive floating-point throughput, so on-the-fly computation is frequently preferred over tabulation. How big are these tables? What kind of access pattern do you expect? Will all threads in a warp typically access the same table element? If so, consider placing the table(s) in constant memory.

Use of __constant__ does indeed look like what I’m looking for, but in this context yields:

error: a “__constant__” variable declaration is not allowed inside a function body

That is a documented limitation of constant memory: __constant__ variables can only be declared at file scope, not function scope. The other restriction is total size, which is limited to 64 KB. Why is function scope essential to your use case? What is the exact purpose of the tables being used?
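
For what it’s worth, the file-scope pattern looks roughly like this (names and values are just placeholders); __constant__ arrays can be statically initialized, and reads are fast when all threads of a warp access the same element:

__constant__ double coeffs[4] = { 1.0, 0.5, 0.25, 0.125 };  /* file scope, statically initialized */

__device__ double eval_poly(double x)
{
   /* Horner's rule over the constant-memory table */
   double sum = 0.0;
   for (int i = 3; i >= 0; --i)
      sum = sum * x + coeffs[i];
   return sum;
}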

That’s a long story… but it’s generic, header-only code… so to move the tables to global scope I’d have to rewrite it as non-template separate source files, which sort of defeats the whole object of the exercise (I’m trying to make an existing generic library usable under CUDA as well as on the regular CPU).

The tables have various uses, many are coefficients to rational approximations, others are special values - factorials, prime numbers, Bernoulli numbers and so on.

Sounds like an ambitious project; you would want to make sure to use the latest CUDA version (7.5 as of this writing), which has C++11 support. Beyond that, it is difficult to give specific advice for a vaguely described use case. Is this a well-known template library like Boost?

Nod. It’s Boost.Math.

At present this is at the “let’s see if this works and is worthwhile” stage: I have most of the support/utility code working on CUDA, plus a couple of trivial proof-of-concept tests (cbrt/erf/tgamma). In other words, nothing that CUDA doesn’t provide itself so far. Ideally, I’d like to be able to get all the stats functions ported, which means the incomplete beta/gamma, but I suspect they may be too large for the device (they compile, but try to access memory at address 0x000000; I need to break them down and see what the issue is).

I assume the interest is not necessarily in porting Boost.Math in all its generality, but rather having particular statistical functions available in CUDA for use in a particular application?

If so, you may also consider filing an RFE (enhancement request) with NVIDIA to add the incomplete gamma and incomplete beta functions to CUDA. RFEs can be filed via the bug reporting form linked from the CUDA registered developer website; simply prefix the subject line with “RFE:”.

I am not familiar with the Boost code for the incomplete gamma and incomplete beta functions, but from having taken a quick look at these functions in the past, I cannot imagine why the code would be “too large” to run on the GPU. It would certainly be large enough that one would never want to inline these functions. Arriving at an implementation that is both high-performance and accurate would be challenging as well.