I just started using CUDA, but I bumped into a little issue.
I have a grid of blocks, and within each block all threads access the same block of shared memory to perform operations. I really need to make full use of my shared memory since I have to operate on quite large data sets. I want to be able to store a 1D array of up to 4K float values (or slightly fewer, to avoid coming too close to the 16 KB limit). Splitting this up is not trivial because of the nature of my algorithm.
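Concretely, the allocation I have in mind looks something like this (just a sketch; the kernel name is a placeholder and the array size is an example chosen to stay under 16 KB):

```
#define BLOCK_DATA 3840  // 3840 floats * 4 bytes = 15 KB, safely under the 16 KB limit

__global__ void myKernel(/* ... */)
{
    // one array shared by all threads of the block
    __shared__ float data[BLOCK_DATA];

    // ... all threads in the block operate on data[] ...
}
```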
Now my questions:
1. If I have more blocks than multiprocessors (which is always the case), can this cause execution to fail because the runtime schedules several blocks concurrently onto one multiprocessor and runs out of shared memory? I suppose the blocks will just be serialized, but I'm not sure.
2. Is there a way to initialize the shared memory? All threads within one block will add values (probably read from texture memory) to the shared memory, in a non-trivial sequence (therefore requiring __syncthreads()). However, the memory should be initialized to 0 before any of these operations are performed. I cannot simply let each thread initialize a certain chunk of shared memory to 0, since this might erase values already accumulated by other threads.
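To make the second question concrete, the naive pattern I'm unsure about looks like this (a sketch; the kernel name and size are placeholders):

```
__global__ void accumulate(/* ... */)
{
    __shared__ float data[3840];

    // each thread zeroes a strided slice of the shared array
    for (int i = threadIdx.x; i < 3840; i += blockDim.x)
        data[i] = 0.0f;

    __syncthreads();  // is this barrier enough to guarantee the zeroing
                      // cannot erase values accumulated afterwards?

    // ... non-trivial accumulation into data[],
    //     with __syncthreads() between phases ...
}
```

Is a single barrier between the zeroing and the accumulation phases the right way to do this, or is there a dedicated initialization mechanism I'm missing?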
Hope you guys can help me out a bit. Thanks in advance.