In the SDK project matrixMul, the default BLOCK_SIZE is 16. I set matrices A and B to both be 4096*4096, as follows:
#define BLOCK_SIZE 16
#define WA 4096
#define HA 4096
#define WB 4096
The time is 2152 ms.
But when I set
#define BLOCK_SIZE 32
#define WA 4096
#define HA 4096
#define WB 4096
the time drops to 104 ms.
My question is:
The maximum number of threads per block is 512, but with BLOCK_SIZE=32 each block would have 32*32 = 1024 threads, which is more than 512.
How does it work?
The platform is a PC (AMD Athlon 64 X2 Dual Core Processor 4400+) with an NVIDIA Tesla D870.
OS: Fedora 7.