Shared memory issues: Initialization of shared memory

Hello all,

I just started using CUDA, but I bumped into a little issue.

I have a grid of blocks, and inside each block all threads access the same shared memory block to perform operations. I really need to make full use of my shared memory since I need to operate on quite large data sets. I want to be able to store a 1D array of up to 4K float values (or slightly fewer, to avoid coming too close to the 16 KB limit). Splitting this is not trivial because of the nature of my algorithm.

Now my questions:

  • If I have more blocks than multiprocessors (which is always the case), can this cause execution to fail because the runtime tries to assign several blocks concurrently to one multiprocessor and runs out of shared memory? I suppose the blocks will just be serialized, but I am not sure.

  • Is there a way to initialize the shared memory? All threads within one block will add values (probably from texture memory) to the shared memory in a non-trivial sequence (therefore requiring __syncthreads()). However, the memory should be initialized to 0 before any of these operations are performed. I cannot simply let each thread set its own portion of shared memory to 0, since a slower thread might then erase values that a faster thread has already added.

Hope you guys can help me out a bit. Thanks in advance.

CUDA will vary the number of blocks that are executed concurrently on a multiprocessor. If your kernel uses close to 16KB of shared memory per block, then only 1 block can run on a multiprocessor at a time, and the remaining blocks are serialized. If your kernel uses less than 8KB, then CUDA will schedule 2 blocks to run concurrently, assuming other constraints are met (e.g. enough registers). And so forth. I guess the only time it fails to launch is when a single block uses more than 16KB of shared memory.
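The arithmetic in this reply can be sketched as a small host-side helper. This is only an estimate under a simplifying assumption: shared memory is treated as the sole limit, whereas the real scheduler also accounts for registers, thread counts, and a small amount of shared memory the runtime may reserve. The names `blocks_per_sm_by_shared_mem` and `print_shared_mem_estimate` are hypothetical, made up for illustration.

```cuda
#include <cstddef>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: upper bound on the number of blocks that can be
// resident on one multiprocessor when shared memory is the only limit.
// A result of 0 means a single block does not fit, so the launch fails.
int blocks_per_sm_by_shared_mem(size_t smem_per_sm, size_t smem_per_block)
{
    if (smem_per_block == 0) return -1;  // shared memory imposes no limit
    return static_cast<int>(smem_per_sm / smem_per_block);
}

// Hypothetical helper: prints the estimate for device 0, using the
// shared memory size that the runtime reports for that device.
void print_shared_mem_estimate(size_t smem_per_block)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        printf("could not query device 0\n");
        return;
    }
    printf("shared memory per multiprocessor: %zu bytes\n",
           prop.sharedMemPerMultiprocessor);
    printf("estimated resident blocks: %d\n",
           blocks_per_sm_by_shared_mem(prop.sharedMemPerMultiprocessor,
                                       smem_per_block));
}
```

With the 16 KB (16384 byte) figure from this thread, a block using 16000 bytes gives 1, a block using 8000 bytes gives 2, and a block using 17000 bytes gives 0.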

Shared memory is shared only within 1 block. It is not shared between different blocks. So you cannot store values in shared memory for subsequent blocks to use.
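A minimal sketch of that per-block scoping, assuming a toy kernel (`tag_kernel` and `per_block_check` are hypothetical names, not from the posts above): every block writes its own index into a `__shared__` variable, and each thread then reads back only the value written within its own block.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each block gets its own private copy of `tag`.
__global__ void tag_kernel(int *out)
{
    __shared__ int tag;

    if (threadIdx.x == 0)
        tag = blockIdx.x;   // one thread per block writes the block's copy
    __syncthreads();        // make the write visible to the whole block

    out[blockIdx.x * blockDim.x + threadIdx.x] = tag;
}

// Hypothetical host-side check: returns true if every thread saw the
// tag of its own block and nothing from any other block.
bool per_block_check()
{
    const int blocks = 8, threads = 64;
    const size_t n = static_cast<size_t>(blocks) * threads;

    int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(int)) != cudaSuccess) return false;

    tag_kernel<<<blocks, threads>>>(d_out);

    std::vector<int> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (int b = 0; b < blocks; ++b)
        for (int t = 0; t < threads; ++t)
            if (h_out[b * threads + t] != b) return false;
    return true;
}
```

Anything that has to outlive a block or be seen by another block would need to go through global memory instead.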

Thanks for this information.

Maybe I did not make myself entirely clear. The shared memory is only accessed within the block itself; it just needs to be initialised before anything else happens. But after looking a little more into the docs, it seems it should not be too difficult to let each thread set SHARED_MEM_SIZE/NUMBER_OF_THREADS_PER_BLOCK elements to zero, followed by a __syncthreads() call.
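A minimal sketch of that idea, under a few assumptions: SHARED_MEM_SIZE is taken to be 4000 floats (16000 bytes, just under the 16 KB limit), the accumulation step is a placeholder rather than the real algorithm, and `accumulate_kernel` and `zero_init_check` are hypothetical names. It uses a strided loop rather than one contiguous chunk per thread, so the array size does not have to be a multiple of the number of threads per block.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define SHARED_MEM_SIZE 4000  // assumed size in floats (16000 bytes)

__global__ void accumulate_kernel(float *out)
{
    __shared__ float acc[SHARED_MEM_SIZE];

    // Thread t clears elements t, t + blockDim.x, t + 2 * blockDim.x, ...
    for (int i = threadIdx.x; i < SHARED_MEM_SIZE; i += blockDim.x)
        acc[i] = 0.0f;
    __syncthreads();  // no thread continues until the whole array is zero

    // Placeholder for the real accumulation: add 1 to every element.
    for (int i = threadIdx.x; i < SHARED_MEM_SIZE; i += blockDim.x)
        acc[i] += 1.0f;
    __syncthreads();  // needed in general before reading other threads' results

    // Copy this block's result to global memory so the host can inspect it.
    for (int i = threadIdx.x; i < SHARED_MEM_SIZE; i += blockDim.x)
        out[blockIdx.x * SHARED_MEM_SIZE + i] = acc[i];
}

// Hypothetical host-side check: every output element should be exactly 1,
// which only holds if the array was zero before the accumulation started.
bool zero_init_check()
{
    const int blocks = 4, threads = 256;
    const size_t n = static_cast<size_t>(blocks) * SHARED_MEM_SIZE;

    float *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(float)) != cudaSuccess) return false;

    accumulate_kernel<<<blocks, threads>>>(d_out);

    std::vector<float> h_out(n);
    cudaError_t err = cudaMemcpy(h_out.data(), d_out, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    if (err != cudaSuccess) return false;

    for (float v : h_out)
        if (v != 1.0f) return false;
    return true;
}
```

The barrier after the clearing loop is what removes the worry about erasing earlier values: no thread can start adding until every thread has finished zeroing its share.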