I have some data that won’t fit into shared memory but must be shared by all threads in a block. The data is also generated by those threads, and once the thread block is done, it is not needed anymore.
The problem is that I do not know how many blocks will be running concurrently (this also depends on which GPU you use), so I cannot simply allocate some global memory up front and then launch the kernel, because I will reserve either too much or too little.
Does anybody perhaps have an idea how to solve this?
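
To make the question concrete, here is a rough sketch of one possible direction, not a confirmed solution: ask the runtime how many blocks can be resident at once, launch only that many, and let each block loop over the work while reusing one scratch slot in global memory. The kernel name `worker`, the sizes, and the placeholder computation are all assumptions made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each block owns one scratch slot in global memory
// and loops over work items, reusing the slot for every item.
__global__ void worker(float* scratch, size_t scratchPerBlock,
                       int numItems, float* out)
{
    float* mine = scratch + (size_t)blockIdx.x * scratchPerBlock;

    for (int item = blockIdx.x; item < numItems; item += gridDim.x) {
        // The threads generate the block-shared data (placeholder values).
        for (size_t i = threadIdx.x; i < scratchPerBlock; i += blockDim.x)
            mine[i] = (float)item;
        __syncthreads();  // scratch writes are now visible to the whole block

        if (threadIdx.x == 0)
            out[item] = mine[scratchPerBlock - 1];
        __syncthreads();  // all reads finish before the slot is overwritten
    }
}

int main()
{
    const int    threads         = 256;      // assumed block size
    const size_t scratchPerBlock = 1 << 16;  // assumed floats per block
    const int    numItems        = 1000;     // assumed amount of work

    int device = 0, numSMs = 0, blocksPerSM = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, worker,
                                                  threads, 0);

    // Launch no more blocks than can be resident on this GPU at once.
    int grid = numSMs * blocksPerSM;
    if (grid > numItems) grid = numItems;
    if (grid <= 0) { printf("kernel cannot be launched\n"); return 1; }

    float *scratch = nullptr, *out = nullptr;
    cudaMalloc(&scratch, (size_t)grid * scratchPerBlock * sizeof(float));
    cudaMalloc(&out, numItems * sizeof(float));

    worker<<<grid, threads>>>(scratch, scratchPerBlock, numItems, out);
    cudaError_t err = cudaDeviceSynchronize();
    printf("grid = %d blocks, status = %s\n", grid, cudaGetErrorString(err));

    cudaFree(scratch);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Because each slot is indexed by `blockIdx.x`, the sketch stays correct even if fewer blocks are actually resident than the occupancy query reports. The query only keeps the allocation from being larger than the GPU can use.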