Some beginner questions on CUDA (help needed)

Hi everyone, I'm a beginner with CUDA and I've been trying to read the matrixMul example in the SDK. I still have some problems understanding it, which I have listed below. Would anyone kindly give some advice?

  1. Do all the blocks in one grid execute at the same time, or one by one?
  2. Taking matrixMul as an example, does shared memory hold all the data of the two matrices? If so, since shared memory is only 16 KB, there would be frequent data transfers when the data is large, for example when multiplying two 512*512 matrices (each element is a float). Or could you give me some description of how to manage shared memory?
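
To make question 2 concrete, here is the arithmetic behind my concern as a small host-side C++ sketch. The 16x16 tile size is my assumption (I believe it matches the block size used in the SDK sample, but I may be misreading it), and the helper names are made up for illustration; they are not part of CUDA or the SDK.

```cpp
#include <cassert>
#include <cstddef>

// Assumes a 4-byte float, which holds on the usual CUDA host platforms.
static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");

// Shared memory size as stated in the question: 16 KB.
constexpr std::size_t kSharedBytes = 16 * 1024;

// Hypothetical helper: bytes needed to store one n x n matrix of floats.
constexpr std::size_t matrix_bytes(std::size_t n) {
    return n * n * sizeof(float);
}

// Hypothetical helper: bytes needed to store one t x t tile of each
// of the two input matrices at the same time.
constexpr std::size_t two_tiles_bytes(std::size_t t) {
    return 2 * t * t * sizeof(float);
}

// One 512 x 512 float matrix is 1 MB, far larger than 16 KB,
// so both full matrices cannot sit in shared memory at once.
static_assert(matrix_bytes(512) == 1024 * 1024, "1 MB per matrix");
static_assert(2 * matrix_bytes(512) > kSharedBytes, "does not fit");

// A pair of 16 x 16 tiles (assumed tile size) needs only 2 KB,
// which does fit in 16 KB.
static_assert(two_tiles_bytes(16) == 2048, "2 KB for two tiles");
static_assert(two_tiles_bytes(16) <= kSharedBytes, "fits");
```

So the two full matrices clearly cannot both be in shared memory, while a pair of small tiles can. What I do not understand is how and when those tiles get loaded and replaced.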

Thanks again.