I am a newbie just starting to learn CUDA Fortran (version 14.4, though I think the same thing happened with 14.9), so please excuse me if this is a stupid question.
When I compile and run the Matmul example on Windows 7 Pro with VS 2010 in x64 debug mode on a Tesla C2070, I get the following output:
arrays sized 512 by 1024 by 512
Kernel time excluding data xfer: 0.000000 microseconds
Megaflops excluding data xfer: Inf
Total time including data xfer: 109000.0 microseconds
Megaflops including data xfer: 2462.711
C(1,1) = 3.5791874E+11
C(2,2) = 3.5739933E+11
No errors found
Press any key to continue . . .
You can imagine my excitement to learn that I have a card with infinite speed, but I suspect there is something wrong. To my novice eye, it looks like the host routine is blowing right through cudathreadsynchronize() and stopping the timer as soon as the kernel is launched. This seems like such an obvious bug (and doesn’t cause any numerical error) that I must be missing something. Any help is appreciated.
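
To check that the Inf is just a consequence of the zero timer reading, I reproduced the arithmetic in a few lines of Python. This is my own sketch, not code from the Matmul example: the function name megaflops is made up, and I am assuming the example counts N*M*L operations and divides by the elapsed time in microseconds, since that assumption reproduces the 2462.711 figure above.

```python
import math

def megaflops(n, m, l, elapsed_us):
    """Operations per microsecond, i.e. millions of operations per second."""
    ops = n * m * l  # assumed operation count (hypothetical, inferred from the output)
    if elapsed_us == 0:
        # Python raises on float division by zero, so return the IEEE result
        # explicitly; this is the value the example prints as Inf.
        return math.inf
    return ops / elapsed_us

N, M, L = 512, 1024, 512
print(f"{megaflops(N, M, L, 109000.0):.3f}")  # total time reported above
print(megaflops(N, M, L, 0.0))                # kernel time reported above
```

So the megaflops arithmetic itself looks fine, and the only suspicious input is the zero kernel time.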