First off, I am completely new to parallel programming. For an MPI class I am taking, we wrote a program that calculates pi by finding the area under a curve using the midpoint rule. I decided to do the same in CUDA to compare results against the 64-node Beowulf cluster we have at school.
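For reference, the approximation I'm computing is the standard midpoint-rule sum (assuming the usual 4/(1+x^2) integrand here; the exact curve doesn't matter for the question):

\[
\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; \Delta x \sum_{i=0}^{N-1} \frac{4}{1+x_i^2},
\qquad x_i = \Bigl(i+\tfrac{1}{2}\Bigr)\Delta x,\quad \Delta x = \frac{1}{N}.
\]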
To get started I used a block size of (512, 1, 1) with a grid size of (1, 1, 1). Everything worked fine, and the computation over 1,000,000 panels finished in 0.3 ms. But then, while experimenting with the block size, I accidentally set it to (513, 1, 1). The code still ran and reported a time of 0.05 ms.
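In case it helps, here is a minimal sketch of the kind of kernel and launch I'm describing. It is not my exact code, and the names (piKernel, blockSum) are just illustrative; each thread sums a strided subset of the panels, then a shared-memory reduction combines the per-thread partial sums.

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative sketch, not the exact code from the post.
// Midpoint-rule integration of 4/(1+x^2) on [0,1].
__global__ void piKernel(float *blockSum, int n)
{
    __shared__ float partial[512];           // one slot per thread in the block
    int tid = threadIdx.x;
    int nthreads = blockDim.x * gridDim.x;
    int gid = blockIdx.x * blockDim.x + tid;

    float dx = 1.0f / n;
    float sum = 0.0f;
    for (int i = gid; i < n; i += nthreads) {
        float x = (i + 0.5f) * dx;           // panel midpoint
        sum += 4.0f / (1.0f + x * x);
    }
    partial[tid] = sum;
    __syncthreads();

    // Tree reduction within the block (blockDim.x is a power of two here).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) blockSum[blockIdx.x] = partial[0] * dx;
}

int main()
{
    const int N = 1000000;                   // number of panels
    float *dSum, hSum;
    cudaMalloc((void**)&dSum, sizeof(float));

    // Block (512,1,1), grid (1,1,1) -- the configuration described above.
    piKernel<<<dim3(1, 1, 1), dim3(512, 1, 1)>>>(dSum, N);

    // Check whether the launch actually succeeded before trusting the timing.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    cudaMemcpy(&hSum, dSum, sizeof(float), cudaMemcpyDeviceToHost);
    printf("pi ~= %f\n", hSum);
    cudaFree(dSum);
    return 0;
}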
So I apparently got a 6x speedup by going to a block size that isn't supported. Shouldn't the compiler (or runtime) refuse to let me set a block size of 513? And why would it show an increase in performance?
System Information
CUDA Toolkit and SDK 1.0
XP Pro SP2, VS2005
AMD64 3700+, 512 MB
Quadro FX 4600
Any insight would be appreciated!