Hi everyone, I'm a beginner with CUDA and I've been trying to read the matrixMul example in the SDK. I still have some problems understanding it, which I have listed below. Would anyone kindly give some advice?
- Do all the blocks in one grid execute at the same time, or one by one?
- Taking matrixMul as an example, does shared memory hold all the data of the two matrices? If so, since the shared memory size is only 16 KB, there would be frequent data transfers when the amount of data is large, for example when multiplying two 512*512 matrices (each element is a float). Or could you give me some description of how to manage shared memory?
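
To make the size concern in the last question concrete, here is a rough back-of-the-envelope sketch in plain host-side C++ (no CUDA needed). It only compares byte counts against the 16 KB figure above. The names `matrix_bytes`, `kSharedBytes`, `kTile` and the 16x16 tile width are my own assumptions for illustration, not something I have confirmed from the sample.

```cpp
#include <cstddef>

// Bytes needed to store one n x n matrix of float.
constexpr std::size_t matrix_bytes(std::size_t n) {
    return n * n * sizeof(float);
}

// This sketch assumes a 4-byte float, which holds on common platforms.
static_assert(sizeof(float) == 4, "sketch assumes 4-byte float");

// Shared memory size as stated in the question (16 KB).
constexpr std::size_t kSharedBytes = 16 * 1024;

// Hypothetical tile width: an assumption for illustration only.
constexpr std::size_t kTile = 16;

// Both full 512 x 512 inputs: 2 * 512 * 512 * 4 bytes = 2 MiB.
constexpr std::size_t kFullInputs = 2 * matrix_bytes(512);

// One kTile x kTile tile from each input: 2 * 16 * 16 * 4 bytes = 2 KiB.
constexpr std::size_t kTilePair = 2 * matrix_bytes(kTile);

// The full inputs cannot fit in 16 KB, but a pair of small tiles can.
static_assert(kFullInputs > kSharedBytes, "full inputs exceed shared memory");
static_assert(kTilePair <= kSharedBytes, "a tile pair fits in shared memory");
```

So the two full matrices are about 128 times larger than 16 KB, which is why I suspect only small pieces can be held in shared memory at a time, if I am reading the numbers correctly.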