I am using multiple host threads and multiple GPUs.
As shown in the manual, I must declare the constant memory at file scope, so it is defined when the program loads. But that means I can only use constant memory on device #0. If I define the constant memory in each thread instead, the definition still occurs before cudaSetDevice(), and all the threads will share it (which of course fails). That is not what I want.
Am I right?
Does anyone know how I can dynamically allocate non-shared constant memory per thread, in each thread?
Thank you.
Constant memory is a lot slower if all threads do not access the same data. You can statically assign each thread a slice of the 64 KB constant memory pool if you want to.
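Also note that a file-scope `__constant__` symbol is instantiated separately in each device's context, so after cudaSetDevice() each host thread reads and writes its own device's copy of the symbol. A minimal sketch of the static-partitioning idea (names and the one-thread-per-GPU setup are my assumptions, not from your code):

```cuda
// Sketch, assuming one host thread per GPU. The file-scope __constant__
// symbol exists once per device context, so after cudaSetDevice() each
// thread touches only its own device's copy.
__constant__ char const_pool[65536];          // the full 64 KB

void thread_main(int device, const void *data, size_t bytes, size_t offset)
{
    cudaSetDevice(device);                    // select this thread's GPU
    // Copy into this device's copy of const_pool, at a fixed offset.
    // Threads (or kernels) can use disjoint offsets to partition the pool.
    cudaMemcpyToSymbol(const_pool, data, bytes, offset,
                       cudaMemcpyHostToDevice);
}
```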
Reads from constant memory in the cache are fast, but the cache is optimized for broadcast to a warp. Based on the performance guidance given in the manual, I think this is because the constant cache can only service one read at a time (i.e., it has only one “bank”, to abuse a term from the shared memory description). Having threads in a warp read random locations will slow things down quite a bit, as the cache will have to serialize the requests.
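To make the access-pattern point concrete, here is a sketch of the two extremes (array name and sizes are made up for illustration):

```cuda
__constant__ float c_coeffs[256];

// Fast: every thread in the warp reads the same address, so the constant
// cache can broadcast one value to the whole warp in a single transaction.
__global__ void broadcast_read(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = c_coeffs[0] * i;             // uniform address across the warp
}

// Slow: each thread reads a different address, so the single-ported
// constant cache has to serialize the warp's requests.
__global__ void divergent_read(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = c_coeffs[threadIdx.x % 256] * i;  // per-thread address
}
```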
I guess someone could write some memory pooling subroutines that maintain a per-GPU state. The pool would manage the 64 KB of constant memory, but keep all linked lists on the host side (to conserve the precious 64 KB as much as possible). The constant memory would internally be declared as one big array of 65536 bytes in the pooling library, but any pointers to areas inside the constant memory on the device would only be handed out by the library as return values of the respective malloc() calls. The library could also support copying data to and from constant memory.
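The host-side bookkeeping part of such a library could look something like this — a first-fit free list over the 64 KB region, kept entirely on the host, handing out byte offsets into the device-side array (all names here are my own invention; one such pool would be kept per GPU):

```c
/* Host-side sketch of a constant-memory pool: a first-fit free list over
 * the 64 KB region. Only byte offsets into the device-side __constant__
 * array are handed out; the caller would pass the offset to
 * cudaMemcpyToSymbol() to actually move data. */
#include <stddef.h>
#include <stdlib.h>

#define CONST_POOL_SIZE 65536

typedef struct Block {
    size_t offset, size;
    int    used;
    struct Block *next;
} Block;

typedef struct {
    Block *head;                  /* host-side bookkeeping only */
} ConstPool;

ConstPool *pool_create(void)
{
    ConstPool *p = malloc(sizeof *p);
    p->head = malloc(sizeof *p->head);
    *p->head = (Block){0, CONST_POOL_SIZE, 0, NULL};
    return p;
}

/* Returns a byte offset into the device-side constant array, or
 * (size_t)-1 when the pool is exhausted. */
size_t pool_alloc(ConstPool *p, size_t size)
{
    for (Block *b = p->head; b; b = b->next) {
        if (!b->used && b->size >= size) {
            if (b->size > size) {             /* split the free block */
                Block *rest = malloc(sizeof *rest);
                *rest = (Block){b->offset + size, b->size - size, 0, b->next};
                b->next = rest;
                b->size = size;
            }
            b->used = 1;
            return b->offset;
        }
    }
    return (size_t)-1;
}

void pool_free(ConstPool *p, size_t offset)
{
    for (Block *b = p->head; b; b = b->next)
        if (b->offset == offset) { b->used = 0; return; }
}
```

A real version would also coalesce adjacent free blocks on pool_free() and wrap the cudaMemcpyToSymbol() calls, but the point is that the 64 KB on the device stays pure data.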