Matrix multiplication ERRORS & few thoughts on CUDA: Basic programming errors need correction

You should have a look at the matrix multiplication example in the CUDA programming guide. As was already pointed out, your kernel has issues.

clock() is an extremely imprecise time measurement. Your kernel call is probably completing in milliseconds or less, so you need a higher-resolution timer, e.g. gettimeofday on Linux or QueryPerformanceCounter on Windows. These are wrapped in a cross-platform way by cutil, which comes with the CUDA SDK. If you want extremely high-precision timing of just the kernel launch, use the timers in the CUDA event API (read the programming guide).
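For reference, here is a minimal sketch of event-based timing. The kernel scaleKernel and the helper timeKernelMs are hypothetical names made up for illustration, not anything from your code; only the cudaEvent* calls are the actual runtime API. Kernel launches return to the host immediately, which is why the sketch waits on the stop event before reading the elapsed time.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; replace it with the kernel you want to measure.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Returns the GPU time of one launch in milliseconds, or -1.0f on error.
float timeKernelMs(float *d_data, int n)
{
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }

    int threads = 256;                         // illustrative block size
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n elements

    cudaEventRecord(start, 0);
    scaleKernel<<<blocks, threads>>>(d_data, n, 2.0f);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // block the host until the stop event has completed

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;
    float *d_data = NULL;
    if (cudaMalloc((void **)&d_data, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    printf("kernel time: %.3f ms\n", timeKernelMs(d_data, n));

    cudaFree(d_data);
    return 0;
}
```

For a very short kernel it usually helps to launch it many times between the two events and divide the result by the launch count.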

What do you mean? There is no FSB on the GPU. The GTX 280, for instance, has a 512-bit memory bus connecting the RAM directly to the memory controller on the GPU. It is capable of feeding the GPU with ~140 GB/s of bandwidth (read or write).
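If you are wondering where that bandwidth figure comes from, it is just bus width times memory clock times transfers per clock. A rough sketch of the arithmetic follows; the function name peakBandwidthGBs is made up for illustration, and the 1107 MHz GDDR3 memory clock is an assumption taken from the published GTX 280 specs rather than anything stated above.

```cuda
#include <cstdio>

// Peak theoretical memory bandwidth in GB/s (10^9 bytes per second).
// busWidthBits:      width of the memory bus in bits
// memClockMHz:       memory clock in MHz
// transfersPerClock: 2 for double-data-rate memory such as GDDR3
double peakBandwidthGBs(int busWidthBits, double memClockMHz, int transfersPerClock)
{
    double bytesPerTransfer = busWidthBits / 8.0;
    return bytesPerTransfer * memClockMHz * 1e6 * transfersPerClock / 1e9;
}

int main()
{
    // 512-bit bus; 1107 MHz memory clock is an assumed spec-sheet value.
    // 64 bytes * 1107e6 Hz * 2 = 141.696e9 bytes/s
    printf("%.1f GB/s\n", peakBandwidthGBs(512, 1107.0, 2));  // prints "141.7 GB/s"
    return 0;
}
```

This is a theoretical peak; what a real kernel achieves depends on its access pattern.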

Because the hardware scheduler was designed that way.

Note that you can’t always run 512 threads in a block: all threads of a block share one multiprocessor’s register file, so the more registers your kernel uses per thread, the fewer threads fit in a block.
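One way to check this for your own kernel is to ask the runtime, roughly as sketched below. The kernel saxpyKernel and the helper maxThreadsByRegisters are hypothetical names for illustration; cudaFuncGetAttributes and cudaGetDeviceProperties are the actual runtime calls. The helper only does the plain division and ignores register allocation granularity, so treat its result as an upper bound.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, present only so there is something to query.
__global__ void saxpyKernel(float *y, const float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

// Upper bound on threads per block imposed by registers alone.
// The real limit can be lower because registers are allocated in chunks.
int maxThreadsByRegisters(int regsPerBlock, int regsPerThread)
{
    if (regsPerThread <= 0) return 0;
    return regsPerBlock / regsPerThread;
}

int main()
{
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, saxpyKernel) != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed\n");
        return 1;
    }

    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed\n");
        return 1;
    }

    printf("registers per thread:          %d\n", attr.numRegs);
    printf("registers available per block: %d\n", prop.regsPerBlock);
    printf("register-bound threads/block:  %d\n",
           maxThreadsByRegisters(prop.regsPerBlock, attr.numRegs));
    printf("runtime max threads/block:     %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```

As a worked example, with 16384 registers available per block, a kernel using 32 registers per thread can just reach 512 threads, while one using 40 registers per thread is capped at 409, which in practice means 384 once you round down to a multiple of the warp size. Compiling with nvcc --ptxas-options=-v also prints the per-thread register count at build time.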