Threads Per Block Issue

Hi,

I am new to CUDA programming and am confused about the threads-per-block limit. I am using a GTX 285, which supports at most 512 threads per block.

During testing, I ran my program with 1024 threads per block and it worked without any errors! Can anyone tell me why? I am also getting much better performance when I use 1024 threads per block.
Thank You

Regards,
M. Awais

Failure to find errors usually stems from a failure to look for them. If you check for errors straight after the kernel launch with a cudaGetLastError() call, I am willing to bet you will find that your kernels never launch at all and that each launch fails with an invalid configuration argument error. This is the reason for your observed “much better performance”: the kernels are not running, and a failed launch returns much faster than a kernel that actually executes. If you are seeing good-looking results in the memory you copy back from the device, they are probably left over from an earlier run with 512 or fewer threads per block. Device memory isn’t cleared or touched from context to context.
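For what it is worth, here is a minimal sketch of that check. The kernel `fill` and the helper `tryLaunch` are made-up names for illustration only, not anything from your program, and the sketch assumes a reasonably recent toolkit (older toolkits use cudaThreadSynchronize() instead of cudaDeviceSynchronize()). It queries the device limit, then launches once at the limit and once at twice the limit, printing the status of each launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: each thread stores its own index.
__global__ void fill(int *out)
{
    out[threadIdx.x] = threadIdx.x;
}

// Hypothetical helper: launches one block of the requested size and
// returns the first error seen, or cudaSuccess if there was none.
static cudaError_t tryLaunch(int *d_out, int threadsPerBlock)
{
    fill<<<1, threadsPerBlock>>>(d_out);

    // Errors in the launch configuration show up here, right after the launch.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        return err;

    // Errors raised while the kernel executes show up once it has finished.
    return cudaDeviceSynchronize();
}

int main()
{
    // Ask the device (device 0 is assumed) for its threads-per-block limit.
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);

    // Allocate enough room for the larger of the two launches.
    int *d_out = NULL;
    err = cudaMalloc((void **)&d_out, 2 * prop.maxThreadsPerBlock * sizeof(int));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // One launch at the limit, one at twice the limit.
    int sizes[2] = { prop.maxThreadsPerBlock, 2 * prop.maxThreadsPerBlock };
    for (int i = 0; i < 2; ++i) {
        err = tryLaunch(d_out, sizes[i]);
        printf("%d threads per block: %s\n", sizes[i], cudaGetErrorString(err));
    }

    cudaFree(d_out);
    return 0;
}
```

Build it with nvcc and run it on your card. The launch above the limit should report an error while the launch at the limit reports none. If your own program never prints or checks the launch status like this, a failed launch goes completely unnoticed.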
