Potential block size bug

First off, I am completely new to parallel programming. For an MPI class I am taking, we wrote a program to calculate pi by finding the area under a curve using the midpoint rule. I decided to do the same in CUDA to compare results with the 64-node Beowulf cluster we have at school.

I set the block size to (512, 1, 1) with a grid size of (1, 1, 1) just to get started. Everything worked fine, computing the result in 0.3 ms with 1,000,000 panels. But then, while changing the block size around, I accidentally set it to (513, 1, 1). The code still ran and posted a result in 0.05 ms.
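For context, here is a minimal sketch of the kind of kernel I mean. This is illustrative, not my exact code: it assumes the classic 4/(1+x^2) integrand on [0,1] (whose integral is pi), and names like midpointPi are made up.

```
// Sketch: midpoint-rule estimate of pi with one block of 512 threads.
#include <cstdio>
#include <cuda_runtime.h>

#define NUM_PANELS 1000000
#define BLOCK_SIZE 512   // hardware limit: 512 threads per block

__global__ void midpointPi(float *result, int panels)
{
    __shared__ float cache[BLOCK_SIZE];
    float width = 1.0f / panels;
    float sum = 0.0f;

    // Each thread sums a strided subset of the panels.
    for (int i = threadIdx.x; i < panels; i += blockDim.x) {
        float x = (i + 0.5f) * width;            // midpoint of panel i
        sum += 4.0f / (1.0f + x * x) * width;    // panel area
    }
    cache[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction across the block's shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        result[0] = cache[0];
}

int main()
{
    float *d_result, h_result;
    cudaMalloc((void **)&d_result, sizeof(float));

    // Matches the launch described above: grid (1,1,1), block (512,1,1).
    midpointPi<<<1, BLOCK_SIZE>>>(d_result, NUM_PANELS);

    cudaMemcpy(&h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);
    printf("pi ~= %f\n", h_result);
    cudaFree(d_result);
    return 0;
}
```

A single block strides across all 1,000,000 panels and then reduces in shared memory, which is why only the block dimension matters here.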

So I saw a 6x speedup by going to a block size that isn't supported (513 exceeds the 512-threads-per-block limit). Shouldn't the compiler stop me from setting a block size of 513? And why did performance appear to increase?

System Information
CUDA Toolkit and SDK 1.0
XP Pro SP2, VS2005
AMD64 3700+, 512 MB
Quadro 4600

Any insight would be appreciated!

The kernel failed to launch. If you check the result, it will be wrong.

We are improving the error reporting/detection.
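In the meantime, you can detect a failed launch yourself with the runtime API. A sketch, reusing the hypothetical midpointPi kernel and d_result pointer from the code above in place of the launch line in main():

```
// Sketch: detecting a failed kernel launch with the runtime API.
midpointPi<<<1, 513>>>(d_result, NUM_PANELS);   // 513 > 512: launch fails

cudaError_t err = cudaGetLastError();           // catches the launch error
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

// Errors from inside the kernel only surface after synchronization.
err = cudaThreadSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```

With an invalid configuration the kernel never runs, so the 0.05 ms you measured is just the cost of the failed launch, not a computation.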

That’s what I figured; I just wanted to make sure NVIDIA knew this was happening.