I’ve been trying to understand the various memory types in CUDA, but it’s unclear to me which one is right for my use case. I have a table of 512 floats that I computed in advance. Each thread randomly selects one of them with equal probability and uses it as a seed value for its calculations. I tried using constant memory, but it was surprisingly slow.
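For reference, here's roughly what my constant-memory version looks like (the names are placeholders, and I'm using cuRAND for the random pick, but any per-thread RNG would do):

```cuda
#include <curand_kernel.h>

// 512-entry precomputed table, copied in with cudaMemcpyToSymbol
__constant__ float seedTable[512];

__global__ void myKernel(unsigned long long rngSeed, float *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    curandState state;
    curand_init(rngSeed, tid, 0, &state);

    // Each thread picks a random slot. Since neighboring threads in a
    // warp end up reading different addresses, I wonder if this defeats
    // the constant-memory broadcast and is why it's slow?
    int idx = curand(&state) & 511;  // uniform over 0..511
    float seed = seedTable[idx];

    out[tid] = seed;  // ...the rest of the calculation uses `seed`
}
```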

Would shared memory be the right way to do this? If so, it’s unclear to me how to allocate it. I understand that it must be allocated from within the kernel, but if it’s shared, how do I know which thread should allocate it and which threads should assume it’s already been allocated?
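Is something like this the right pattern? This is just my guess at what a cooperative load into shared memory would look like (`tableGlobal` is a placeholder name for the table in device memory):

```cuda
__global__ void myKernel(const float *tableGlobal, float *out, int n)
{
    // Declared by every thread, but only one 512-float array exists
    // per block; no single thread "allocates" it.
    __shared__ float table[512];

    // Cooperative load: each thread copies a strided slice of the table.
    for (int i = threadIdx.x; i < 512; i += blockDim.x)
        table[i] = tableGlobal[i];

    // Wait until the whole block has finished loading before reading.
    __syncthreads();

    // ...each thread then reads table[someRandomIndex]...
}
```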