I have some data that won't fit into shared memory but must be shared by all threads in a block. This data is also generated by the threads themselves, and once the thread block is done it is no longer needed.
The problem is that I do not know how many blocks will be running concurrently (this also depends on which GPU is used), so I cannot simply allocate some global memory before launching the kernel, because I will either reserve too much or too little.
Does anybody perhaps have an idea how to solve this?
I think the only safe way to do it is to allocate a global memory scratch buffer that is big enough to accommodate every block and index into it by block ID, which will potentially consume a lot of global memory and be very slow compared to using shared memory.
Are you certain there aren't any algorithm changes you could make to get each block to run on shared memory alone?
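For concreteness, a minimal sketch of that scratch-per-block layout might look like the following; SCRATCH_PER_BLOCK, NUM_BLOCKS and myKernel are placeholders for whatever your real sizes and kernel are:

#include <cuda_runtime.h>

// Placeholder sizes -- substitute the real per-block requirement.
#define SCRATCH_PER_BLOCK (64 * 1024)   // bytes of scratch each block needs
#define NUM_BLOCKS        2048          // total blocks in the grid

__global__ void myKernel(char *scratch)
{
    // Every block gets its own slice of the buffer, indexed by block ID.
    char *blockScratch = scratch + (size_t)blockIdx.x * SCRATCH_PER_BLOCK;
    // ... threads of this block generate and share data via blockScratch ...
}

int main()
{
    char *scratch = 0;
    // One slice per block in the grid -- this is where the memory cost comes from.
    cudaMalloc((void **)&scratch, (size_t)NUM_BLOCKS * SCRATCH_PER_BLOCK);
    myKernel<<<NUM_BLOCKS, 256>>>(scratch);
    cudaDeviceSynchronize();
    cudaFree(scratch);
    return 0;
}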
Thanks for the reply. Allocating global memory could indeed work, but it would be a big waste of memory if you have a couple of thousand blocks and only, say, 64 of them can run concurrently. In that situation there would also not be enough global memory available for my needs.
I could fit my problem into shared memory, but that would severely limit the input length of the algorithm, which I would, of course, rather avoid.
For myself, I would consider this kind of strategy (a host-side sketch follows the list):
1 - get the actual number of SPs (MP * 8) of the GPU
2 - get (with the same call) the actual video memory available on this GPU
3 - calculate how many threads may be launched, which is the minimum given:
    a - the scalar-processor registers used (this will probably be the limit!)
    b - MP * 192 (192 threads per MP, 24 per SP)
    c - the maximum number of threads allowed by the video card's memory size
4 - try to coalesce accesses as much as possible, using shared memory as a scratch pad
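A sketch of steps 1-3 on the host side, assuming cudaGetDeviceProperties is used; the per-thread scratch figure is a made-up placeholder you would replace with your real requirement:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Step 1: multiprocessor count; SPs = MP * 8 on current hardware.
    int numMP = prop.multiProcessorCount;
    int numSP = numMP * 8;

    // Step 2: total device memory, from the same property structure.
    size_t totalMem = prop.totalGlobalMem;

    // Step 3: cap the launch at the minimum of
    //   b) MP * 192 (192 threads per MP, 24 per SP), and
    //   c) how many threads the device memory can feed, given some
    //      per-thread scratch requirement (placeholder value here).
    // Register usage (a) is reported per kernel by nvcc (--ptxas-options=-v)
    // and would lower this further.
    const size_t scratchPerThread = 4096;               // placeholder: bytes per thread
    size_t memLimit   = totalMem / scratchPerThread;
    size_t occupLimit = (size_t)numMP * 192;
    size_t maxThreads = memLimit < occupLimit ? memLimit : occupLimit;

    printf("MPs: %d, SPs: %d, memory: %zu bytes, thread cap: %zu\n",
           numMP, numSP, totalMem, maxThreads);
    return 0;
}

From that thread cap you can then decide how many blocks' worth of scratch are actually worth reserving on the device at hand, instead of reserving one slice per block in the whole grid.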