Can I allocate constant memory dynamically?

I am using multiple host threads with multiple GPUs.
So, as shown in the manual, I have to declare the constant memory at file scope, so it is set up when the program is loaded. The result is that I can only use constant memory on device #0. If I define the constant memory in each thread, the definition will occur before cudaSetDevice, and all the threads will share it (which, of course, must fail). That is not what I want.
Am I right?

Does anyone know how I can dynamically allocate constant memory in each thread so that it is not shared among threads?
Thank you.

What are you trying to do?

Constant memory is a lot slower if the threads do not all access the same data. You can statically assign memory to each thread from the 64 KB constant memory pool if you want to.

Slower? The manual says constant memory is cached, so it is as fast as registers.

I want to put some constant variables (whose values differ between threads but stay the same within a thread) into constant memory.

Reads from constant memory in the cache are fast, but the cache is optimized for broadcast to a warp. Based on the performance guidance given in the manual, I think this is because the constant cache can only service one read at a time (i.e., it has only one “bank”, to abuse a term from the shared memory description). Having threads in a warp read random locations will slow things down quite a bit, as the cache will have to serialize the requests.
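
To make the broadcast point concrete, here is a minimal sketch that times two kernels: one where every thread in a warp reads the same constant element, and one where neighbouring threads read different elements. The names coeffs, uniformRead, scatteredRead, kCoeffs and kIters are made up for this illustration, and error checking is left out. How large the gap is (if any) depends on the GPU, so no numbers are claimed here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kCoeffs = 256;   // hypothetical table size
constexpr int kIters  = 4096;  // reads per thread, to make timing visible

__constant__ float coeffs[kCoeffs];

// All threads in a warp read the same address in each iteration,
// so the constant cache can broadcast a single read.
__global__ void uniformRead(float *out) {
    float sum = 0.0f;
    for (int j = 0; j < kIters; ++j) sum += coeffs[j % kCoeffs];
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

// Neighbouring threads read different addresses in each iteration,
// which is the access pattern the post expects to be serialized.
__global__ void scatteredRead(float *out) {
    float sum = 0.0f;
    for (int j = 0; j < kIters; ++j) sum += coeffs[(threadIdx.x + j) % kCoeffs];
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum;
}

int main() {
    float host[kCoeffs];
    for (int i = 0; i < kCoeffs; ++i) host[i] = 1.0f;
    cudaMemcpyToSymbol(coeffs, host, sizeof(host));

    const int threads = 256, blocks = 1024;
    float *out = nullptr;
    cudaMalloc(&out, threads * blocks * sizeof(float));

    cudaEvent_t start, mid, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&mid);
    cudaEventCreate(&stop);

    uniformRead<<<blocks, threads>>>(out);    // warm-up launch, not timed

    cudaEventRecord(start);
    uniformRead<<<blocks, threads>>>(out);
    cudaEventRecord(mid);
    scatteredRead<<<blocks, threads>>>(out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float msUniform = 0.0f, msScattered = 0.0f;
    cudaEventElapsedTime(&msUniform, start, mid);
    cudaEventElapsedTime(&msScattered, mid, stop);
    printf("uniform reads:   %.3f ms\n", msUniform);
    printf("scattered reads: %.3f ms\n", msScattered);

    cudaEventDestroy(start);
    cudaEventDestroy(mid);
    cudaEventDestroy(stop);
    cudaFree(out);
    return 0;
}
```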

There has been discussion about this before. The solution is to define the constant memory at global (file) scope and use it from multiple threads.

You would think it all resides on GPU #0, but it does not. Constant memory behaves like an implicit thread-local variable.

nvcc must be doing some trick to make that work.
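
A minimal sketch of that approach, assuming a CUDA runtime where cudaSetDevice applies to the calling host thread and a C++11 compiler for std::thread. The names scale, readScale and worker are invented for this example, and error checking is omitted. Each host thread selects its device first and then writes its own value into the single file-scope symbol; the expectation is that every device reads back the value its own thread wrote.

```cuda
#include <cstdio>
#include <thread>
#include <vector>
#include <cuda_runtime.h>

// One declaration at file scope; each device gets its own instance.
__constant__ float scale;

__global__ void readScale(float *out) { *out = scale; }

// Runs in its own host thread, bound to one device.
void worker(int dev, float *result) {
    cudaSetDevice(dev);                        // select the device first
    float value = 10.0f * (dev + 1);           // per-thread constant value
    cudaMemcpyToSymbol(scale, &value, sizeof(value));  // writes this device's copy

    float *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(float));
    readScale<<<1, 1>>>(d_out);
    cudaMemcpy(result, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);

    std::vector<float> results(count, 0.0f);
    std::vector<std::thread> threads;
    for (int d = 0; d < count; ++d) threads.emplace_back(worker, d, &results[d]);
    for (auto &t : threads) t.join();

    for (int d = 0; d < count; ++d)
        printf("device %d read scale = %f\n", d, results[d]);
    return 0;
}
```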


I guess someone could write some memory pooling subroutines that maintain per-GPU state. The pool would manage the 64 KB of constant memory but keep all linked lists on the host side (to conserve the precious 64 KB as much as possible). The constant memory would be declared internally as one big array of 65536 bytes in the pooling library, and pointers to areas inside the constant memory on the device would only be handed out by the library as return values of the respective malloc() calls. The library could also support copying data to and from constant memory.
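
A rough sketch of what such a pool might look like, not a finished library. Everything named here (ConstPool, constPool, kPoolBytes, sumFloats) is hypothetical. Two deliberate simplifications: the pool is 32 KB rather than the full 65536 bytes, to leave room for other constants in the same module, and alloc() hands out byte offsets instead of device pointers. All bookkeeping lives in a host-side list; one ConstPool object per GPU would hold the per-GPU state.

```cuda
#include <cstdio>
#include <cstddef>
#include <iterator>
#include <list>
#include <cuda_runtime.h>

constexpr size_t kPoolBytes = 32 * 1024;  // assumption: half of the 64 KB
__constant__ __align__(16) unsigned char constPool[kPoolBytes];

struct Block { size_t offset; size_t size; bool used; };

class ConstPool {
    std::list<Block> blocks_;  // host-side bookkeeping only
public:
    ConstPool() { blocks_.push_back(Block{0, kPoolBytes, false}); }

    // First-fit allocation; returns a byte offset, or kPoolBytes on failure.
    size_t alloc(size_t bytes) {
        if (bytes == 0) return kPoolBytes;
        bytes = (bytes + 15) & ~size_t(15);  // round up to 16 bytes
        for (auto it = blocks_.begin(); it != blocks_.end(); ++it) {
            if (it->used || it->size < bytes) continue;
            if (it->size > bytes) {          // split off the unused tail
                blocks_.insert(std::next(it),
                               Block{it->offset + bytes, it->size - bytes, false});
                it->size = bytes;
            }
            it->used = true;
            return it->offset;
        }
        return kPoolBytes;
    }

    // Frees a block and merges it with free neighbours.
    void release(size_t offset) {
        for (auto it = blocks_.begin(); it != blocks_.end(); ++it) {
            if (it->offset != offset || !it->used) continue;
            it->used = false;
            auto next = std::next(it);
            if (next != blocks_.end() && !next->used) {
                it->size += next->size;
                blocks_.erase(next);
            }
            if (it != blocks_.begin()) {
                auto prev = std::prev(it);
                if (!prev->used) { prev->size += it->size; blocks_.erase(it); }
            }
            return;
        }
    }

    // Copies host data into the current device's pool at the given offset.
    cudaError_t upload(size_t offset, const void *src, size_t bytes) {
        return cudaMemcpyToSymbol(constPool, src, bytes, offset,
                                  cudaMemcpyHostToDevice);
    }
};

// Example consumer: sums n floats stored at a pool offset.
__global__ void sumFloats(size_t offset, int n, float *out) {
    const float *p = reinterpret_cast<const float *>(constPool + offset);
    float s = 0.0f;
    for (int i = 0; i < n; ++i) s += p[i];
    *out = s;
}

int main() {
    ConstPool pool;                    // one pool per GPU / host thread
    float data[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    size_t off = pool.alloc(sizeof(data));
    if (off == kPoolBytes) return 1;
    pool.upload(off, data, sizeof(data));

    float *d_out = nullptr, h_out = 0.0f;
    cudaMalloc(&d_out, sizeof(float));
    sumFloats<<<1, 1>>>(off, 4, d_out);
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum read from constant pool: %f\n", h_out);

    cudaFree(d_out);
    pool.release(off);
    return 0;
}
```

Handing out offsets rather than raw pointers keeps the host code simple, since the kernel can add the offset to constPool itself. A fuller version would also need a download path and error handling.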

Is anyone up to this? ;-)