CUDA global static data alternative?

I’m building a toolkit that offers different algorithms in CUDA. However, many of these algorithms use static constant global data that will be used by all threads, declared this way for example:

static __device__ __constant__ real buf[MAX_NB];

My problem is that if I include all the .cuh files in the library, then as soon as the library is loaded all of this memory will be allocated on the device, even though the user might want to use only one of the algorithms. Is there any way around this? Will I absolutely have to fall back to ordinary dynamically allocated global memory?

I want the fastest read-only memory access possible, usable by all threads, with contents that can be set at runtime. Any ideas?
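For reference, here is a minimal sketch of the pattern in question, showing that the contents can be uploaded at runtime even though the storage itself is static (assuming `real` is a typedef for `float` and `MAX_NB` is 256; names and error checking are illustrative):

```cuda
typedef float real;
#define MAX_NB 256

// Statically allocated constant memory: reserved as soon as the module loads.
static __device__ __constant__ real buf[MAX_NB];

__global__ void use_buf(real *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = buf[i % MAX_NB];  // broadcast reads served by the constant cache
}

void upload_table(const real *host_table)
{
    // The *contents* are set at runtime; only the allocation is static.
    cudaMemcpyToSymbol(buf, host_table, sizeof(real) * MAX_NB);
}
```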

Thanks!

I can offer several ideas:

  • If you intend to run your toolkit on Fermi-based (or newer) cards only, the hardware cache may provide sufficient memory-access acceleration on its own, so you might not need constant memory at all. How well this works depends on the access pattern: constant memory shines when all threads of a warp read the same address, while the L1/L2 cache handles more general read patterns. In any case, this solution won’t work on pre-Fermi cards.
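As a sketch of this first option: pass the table as an ordinary global-memory pointer and let the cache absorb the repeated reads. The `const` and `__restrict__` qualifiers are a hint that can help the compiler pick a read-only data path (the kernel name and table size below are illustrative):

```cuda
// Table lives in plain global memory; repeated reads are served from cache.
__global__ void scale_by_table(const float * __restrict__ table,
                               float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * table[i % 256];
}
```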

  • You can use plain old #ifdef USE_FEATURE_BLA to surround both the static __device__ variable declarations and the kernels that access this memory. The user of your toolkit will compile your toolkit together with their code, passing the appropriate -DUSE_FEATURE_BLA options to nvcc, or, more naturally, having #defined USE_FEATURE_BLA in their code before #including bla.cuh.
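A sketch of this second option (USE_FEATURE_BLA and the identifiers below are illustrative):

```cuda
// bla.cuh
#ifdef USE_FEATURE_BLA

// Only compiled (and hence only allocated) when the user opts in.
static __device__ __constant__ float bla_table[1024];

__global__ void bla_kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = bla_table[i % 1024];
}

#endif  // USE_FEATURE_BLA
// Users who never define USE_FEATURE_BLA pay no constant-memory cost.
```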

  • Your toolkit can ship as a set of CPU-level functions, with all the GPU code already compiled into them. In that case the user of your toolkit won’t have to #include bla.cuh at all, only bla.h, and hence the problem simply goes away.
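A sketch of this third option (file names and signatures are illustrative): the public header stays CUDA-free, and the constant buffer lives in a single .cu translation unit inside the library.

```cuda
// bla.h -- plain C++ header shipped to the user; no CUDA types leak out
void bla_run(const float *host_table, float *host_out, int n);

// bla.cu -- compiled with nvcc into the library
static __device__ __constant__ float bla_table[1024];

__global__ void bla_kernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = bla_table[i % 1024];
}

void bla_run(const float *host_table, float *host_out, int n)
{
    // Upload the table, launch, copy back (error checking omitted).
    float *d_out;
    cudaMalloc(&d_out, sizeof(float) * n);
    cudaMemcpyToSymbol(bla_table, host_table, sizeof(float) * 1024);
    bla_kernel<<<(n + 255) / 256, 256>>>(d_out, n);
    cudaMemcpy(host_out, d_out, sizeof(float) * n, cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}
```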

Typically, a library author offers several usage patterns, ranging from lower-level to higher-level use cases. For example, NVIDIA offers both a lower-level (driver) API and a higher-level (runtime) API, for more and less demanding users, respectively. The last two approaches listed above fit this convention hand in hand.