My guess is that you are not using the correct kernel launch syntax for the runtime API. The correct syntax is
Kernel <<< number of blocks, number of threads per block >>>(arguments)
So if you are passing a number larger than 512 to the first launch parameter, you are not running more than 512 threads per block. If you pass a big number as the second parameter, the should be a kernel launch failure.
I am comparing the result with a cpu matrix multiplication. About that I have another question: when I init 1k x 1k matrices with random numbers between 0-1e4 or more I get wrong results in CUDA. What is exactly the numeric limit of a float in CUDA and why is it different from the cpu one?
Depending on which type of GPU you use, single precision is either fully IEEE-754 2008 compliant, or “almost”. Unless your CPU reference result is being done in double precision or better, you really can’t judge which results might be “wrong”. More often than not, the structure of algorithms like dot products in CUDA actually produces better results than a pure serial CPU result done in the same precision, because of the use of multiply-add instructions which don’t suffer from intermediate rounding, algorithms which employ increased numbes of partial sums used to compute the result, and ordering phenomena that tends to make adjacent summation terms closer in magnitude, which reduces precision losses.
But there is still the question of how you are running code with more than 512 threads per block when the API reports 512 is the limit. Can you post a concise repro case demonstrating this amazing feat?
That code indeed doesn’t run when threadsPerBlock exceeds the limit the API reports. It will, however, happily return results from a previous, successful run from device memory, should such results exist, and that will make your checking code report a pass when nothing actually happened. This is mostly because you are resetting the random number seed in your array creation code which guarantees that every matrix of a given size will be identical. You can confirm this for yourself by (1) seeding the generator to something that changes, like something from the system clock, (2) adding a call to cudaGetLastError() after the kernel launch and checking for a non-zero result, which will indicate a launch failure, and (3) calling cudaMemset on d_C before you launch the kernel to clear any old results out of memory.