Question about setting BLOCK_SIZE in matrixMul

In the SDK project matrixMul, the default BLOCK_SIZE is 16. I set matrices A and B to both be 4096*4096, as follows:
#define BLOCK_SIZE 16
#define WA 4096
#define HA 4096
#define WB 4096

The time is 2152 ms.
But when I set
#define BLOCK_SIZE 32
#define WA 4096
#define HA 4096
#define WB 4096

the time drops to 104 ms.
My question is:
The maximum number of threads per block is 512, but with BLOCK_SIZE=32 a block has 32*32 = 1024 threads, which is more than 512.
How does it work?
The platform is a PC (AMD Athlon 64 X2 Dual Core Processor 4400+) + NVIDIA Tesla D870.
OS: Fedora 7.

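It is an illegal configuration: 32*32 = 1024 threads per block exceeds the 512-thread limit, so the kernels never start. The 104 ms you measured therefore does not include any actual matrix multiplication.
If you check the error code after the launch, you should see “failure to launch”.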
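
For what it's worth, here is a minimal sketch of how the error check could look. It is not the SDK's matrixMul code: dummyKernel, threadsPerBlock and tryLaunch are hypothetical names made up for this example, and the kernel does nothing. It assumes a toolkit that provides cudaDeviceSynchronize (older toolkits used cudaThreadSynchronize instead). The per-block limit is read from the device rather than hard-coded, because newer GPUs allow more than 512 threads per block, in which case the 32x32 launch would succeed. On toolkits I am aware of, an oversized block is typically reported as cudaErrorInvalidConfiguration, but the exact error string may differ by version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the matrixMul kernel; it does no work.
__global__ void dummyKernel() {}

// Number of threads in a square BLOCK_SIZE x BLOCK_SIZE block.
inline int threadsPerBlock(int blockSize) { return blockSize * blockSize; }

// Launch one block of blockSize x blockSize threads and return the
// first error reported by the launch or by the kernel's execution.
cudaError_t tryLaunch(int blockSize) {
    dim3 threads(blockSize, blockSize);
    dim3 grid(1, 1);
    dummyKernel<<<grid, threads>>>();
    cudaError_t err = cudaGetLastError();   // catches an invalid launch configuration
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();      // catches errors raised while the kernel runs
    }
    return err;
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cannot query device 0: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);

    const int sizes[] = {16, 32};
    for (int i = 0; i < 2; ++i) {
        int bs = sizes[i];
        err = tryLaunch(bs);
        printf("BLOCK_SIZE=%d (%d threads per block): %s\n",
               bs, threadsPerBlock(bs), cudaGetErrorString(err));
    }
    return 0;
}
```

If the launch with BLOCK_SIZE=32 reports an error on your card, any timing taken around it only measures the failed launch, so it should not be compared with the BLOCK_SIZE=16 result.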