Suppose I have a kernel with 16x16 blocks on a 32x32 grid (262,144 threads in total), and inside that kernel I declare a per-thread array of 1 KB.
Assuming it’s put into local memory (which I believe it always should be), how much memory would be allocated? Would it allocate the entire 1 KB for every thread? And if I wanted to preallocate global memory myself so I could coalesce accesses, is there a better way than allocating 256 MB up front?
I can’t comment on how it is actually done, but there is no reason to reallocate the memory for every CUDA thread. Ideally, only enough memory for every hardware thread should be allocated. When some threads finish and others take their place, they should reuse the existing memory that was allocated for the previous thread.
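To put rough numbers on the two cases, here is a small host-side C++ sketch of the arithmetic. The launch figures come from the question. The device figures (30 multiprocessors with 1024 resident threads each) are placeholder assumptions, not a statement about any particular card or about what the driver actually does.

```cpp
#include <cstddef>

// Bytes needed if every launched thread gets its own private array.
constexpr std::size_t per_launch_bytes(std::size_t threads_per_block,
                                       std::size_t blocks,
                                       std::size_t bytes_per_thread) {
    return threads_per_block * blocks * bytes_per_thread;
}

// Bytes needed if only the threads resident at one time get a slot.
constexpr std::size_t per_resident_bytes(std::size_t multiprocessors,
                                         std::size_t max_threads_per_mp,
                                         std::size_t bytes_per_thread) {
    return multiprocessors * max_threads_per_mp * bytes_per_thread;
}

// Launch from the question: 16x16 threads per block, 32x32 blocks, 1 KB each.
constexpr std::size_t kLaunchBytes = per_launch_bytes(16 * 16, 32 * 32, 1024);
static_assert(kLaunchBytes == 256u * 1024 * 1024, "256 MB, as in the question");

// ASSUMPTION: placeholder device with 30 multiprocessors x 1024 resident threads.
constexpr std::size_t kResidentBytes = per_resident_bytes(30, 1024, 1024);
static_assert(kResidentBytes == 30u * 1024 * 1024, "30 MB under that assumption");
```

Under those assumed device figures, a per-resident-thread scheme would need roughly an eighth of the 256 MB that a per-launched-thread scheme needs.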
I don’t see how you could do it like that, since there’s no way to index by hardware thread. I’ve tried allocating just enough for every thread in a block, but of course that gets overwritten when another block gets swapped in on the multiprocessor.
To give a little more context on why I’m confused by this: I did exactly what I’m talking about with one array (preallocated 256 MB and indexed by (array index)*(total num of threads)+(thread index)) and got about a 10% speed increase (~35 ms to ~32 ms). However, when I did the same thing with an array of shorts (so 128 MB), it slowed down by about 30% (~55 ms). (If I preallocate just the array of shorts and not the array of floats, it’s ~58 ms.)
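For reference, a minimal sketch of that indexing, written as host-side C++ so it can be checked without a GPU. The function names are made up for illustration. In a kernel, `thread` would be the flattened global thread id computed from `blockIdx`, `blockDim` and `threadIdx`, and how the original code flattens it is not shown in this thread. The blocked layout is included only for contrast.

```cpp
#include <cassert>
#include <cstddef>

// Interleaved layout: element e of thread t lives at e * total_threads + t,
// so threads t and t+1 touch adjacent addresses when they read the same element.
inline std::size_t interleaved_index(std::size_t element,
                                     std::size_t thread,
                                     std::size_t total_threads) {
    assert(thread < total_threads);
    return element * total_threads + thread;
}

// Blocked layout, for contrast: each thread's elements are contiguous,
// so neighbouring threads are elements_per_thread apart in memory.
inline std::size_t blocked_index(std::size_t element,
                                 std::size_t thread,
                                 std::size_t elements_per_thread) {
    assert(element < elements_per_thread);
    return thread * elements_per_thread + element;
}
```

With 262,144 threads, threads 0 and 1 reading element 0 land on indices 0 and 1 in the interleaved layout, which is the pattern coalescing wants. In the blocked layout with, say, 256 elements per thread (an assumed size), the same two reads are 256 elements apart.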
If you are up for something fancy, you can run your own memory allocation scheme on the GPU. If you are on a compute capability 1.3 device, you can read the multiprocessor ID from the %smid register in PTX. Unfortunately, I don’t know of any way to tell apart different blocks on the same multiprocessor, so you would still have to run an allocation scheme between them, e.g. using atomic bit operations.
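A rough sketch of what such a bitmask allocator could look like, written as host-side C++ with `std::atomic` so the logic can be tested on the CPU. This is an illustration under assumptions, not a tested GPU implementation: `SlotPool` and its members are hypothetical names, and on the device the same idea would presumably use `atomicOr` and `atomicAnd` on a word in global memory, with one word per multiprocessor and one thread per block claiming a slot on behalf of the block.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical pool of up to 32 scratch slots tracked by one bitmask word.
struct SlotPool {
    std::atomic<std::uint32_t> used{0};

    // Try each slot in turn. fetch_or returns the mask as it was before the OR,
    // so the slot is ours only if its bit was clear. Returns -1 if all are taken.
    int acquire(int slot_count = 32) {
        assert(slot_count >= 0 && slot_count <= 32);
        for (int s = 0; s < slot_count; ++s) {
            const std::uint32_t bit = std::uint32_t{1} << s;
            if ((used.fetch_or(bit) & bit) == 0) return s;
        }
        return -1;
    }

    // Clear the bit so that a later block can reuse the slot.
    void release(int slot) {
        assert(slot >= 0 && slot < 32);
        used.fetch_and(~(std::uint32_t{1} << slot));
    }
};
```

A block would call `acquire` once when it starts, use the returned slot to index its scratch region, and call `release` when it finishes. The number of slots per multiprocessor would need to match the maximum number of blocks that can be resident there at once.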