The more I learn, the more questions I have… :rolleyes:
I have studied the official matrix multiplication example. To test the performance with different data tiling sizes, I changed BLOCK_SIZE to 1, 2, 4, 8, 16 and 32. The computer crashes when the tiling size is 32 (32 × 32 = 1024 threads in a block). I have a Quadro FX 1700 graphics card.
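For reference, this is roughly the kernel I am experimenting with — a simplified sketch of the SDK matrixMul kernel, with BLOCK_SIZE as the tiling size I vary (host code and setup left out, and it assumes the matrix dimensions are multiples of BLOCK_SIZE):

```cuda
// Simplified sketch of the SDK matrixMul kernel.
// BLOCK_SIZE is the tiling size I change between runs (1, 2, 4, 8, 16, 32).
#define BLOCK_SIZE 16

__global__ void matrixMul(float *C, const float *A, const float *B, int wA, int wB)
{
    // Shared-memory tiles holding one sub-block of A and one of B
    __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

    int tx  = threadIdx.x, ty = threadIdx.y;
    int row = blockIdx.y * BLOCK_SIZE + ty;
    int col = blockIdx.x * BLOCK_SIZE + tx;

    float Csub = 0.0f;

    // Walk over all tiles of A and B needed for C[row][col]
    for (int t = 0; t < wA / BLOCK_SIZE; ++t) {
        // Each thread copies one element of each tile into shared memory
        As[ty][tx] = A[row * wA + (t * BLOCK_SIZE + tx)];
        Bs[ty][tx] = B[(t * BLOCK_SIZE + ty) * wB + col];
        __syncthreads();

        // Multiply the two tiles from shared memory
        for (int k = 0; k < BLOCK_SIZE; ++k)
            Csub += As[ty][k] * Bs[k][tx];
        __syncthreads();
    }

    C[row * wB + col] = Csub;
}
```

The kernel is launched with dim3 threads(BLOCK_SIZE, BLOCK_SIZE), so at BLOCK_SIZE = 32 each block has 32 × 32 = 1024 threads. My questions: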
1. Where can I find out the maximum tiling size per block that my graphics card supports? (In other words: how much data can I copy from device memory to shared memory at once?) See also the sketch after the questions.
2. Is the following right? Parallel computation in CUDA happens only within a block (thread parallelism), and the blocks themselves are processed sequentially, i.e. a block only starts running on the GPU after another block has completely finished.
3. How do blocks relate to the GPU architecture? A figure in the programming guide shows blocks running in parallel. Doesn't that contradict statement 2 (if that statement is right at all)?
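Regarding question 1: I assume the limits can be read out with cudaGetDeviceProperties (or the deviceQuery SDK sample), something like the sketch below — the field names are what I found in the runtime API reference. Is this the right way to check, or is there a table per card in the documentation?

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // properties of device 0
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("Device:                %s\n", prop.name);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dimensions:  %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Shared memory / block: %d bytes\n", (int)prop.sharedMemPerBlock);
    printf("Multiprocessors:       %d\n", prop.multiProcessorCount);
    return 0;
}
```

If the crash at BLOCK_SIZE = 32 simply means I exceeded maxThreadsPerBlock on my card, that would already explain part of it — but I would like to understand where these limits come from.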