The more I learn, the more questions I have… :rolleyes:
I have studied the official matrix multiplication example. To test the performance under different tile sizes, I changed BLOCK_SIZE to 1, 2, 4, 8, 16 and 32. The computer crashes when the tile size is 32 (32 × 32 = 1024 threads in a block). I have a Quadro FX 1700 graphics card. My questions are:
- Where can I find out the maximum tile size per block that my graphics card supports? (How much data can I copy from device memory to shared memory at once?)
- Is this right: parallel computation in CUDA happens only within a block (thread parallelism), while the blocks themselves are executed sequentially. That is, a block runs on the GPU only after another block has completely finished.
- How do blocks relate to the GPU architecture? A figure in the programming guide shows blocks running in parallel. Doesn't that contradict my second statement (if it is right)?
- If I have a larger matrix than the one in the example, how can I improve the example?
*Here are the specs of my FX 1700:
Quadro FX 1700
Memory Size 512MB
Memory Interface 128-bit
Graphic Memory Bandwidth 12.8 GB/sec.
Graphics Bus PCI Express 2.0
CUDA Parallel Processor Cores 32
Thx a lot!!