OpenCL example oclReduction fails CPU validation for certain values of --n and --threads

Greetings,

I have found that the oclReduction program in the NVIDIA OpenCL example code fails the host/OpenCL comparison for certain combinations of --n (the number of elements to reduce) and --threads (the number of threads per block).

My computer specifications are as follows:
GPU: GeForce GTX 570
NVIDIA driver: 304.43
NVIDIA CUDA SDK: 4.2
OS: Ubuntu Linux 12.04 (x86_64), running X 11.0 with GNOME 3.4 as the shell

To illustrate the issue, oclReduction fails for every value of --n in [33-64] when --threads=128 (the default):

$ ./oclReduction --n=33 --threads=128
[oclReduction] starting...

./oclReduction Starting...

Reducing array of type int.
GeForce GTX 570
 33 elements
 128 threads (max)
 1 blocks


Comparing against Host/C++ computation...
 GPU result = 0
 CPU result = 4861

FAILED

[oclReduction] test results...
FAILED

> exiting in 3 seconds: 3...2...1...done!

I haven’t done an exhaustive test, but so far I’ve found oclReduction fails under these conditions (n = number of elements, t = number of threads):

t: 16, n: [1024-2016] fails
t: 32, n: [33-*) fails
t: 64, n: [33-64] fails, [1025-1920] fails
t: 128, n: [33-64] fails
t: 256, n: [33-64] fails
t: 512, n: [33-64] fails
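
To make the test exhaustive, a sweep along the following lines could be used. This is a minimal sketch, not part of the SDK: the helper names (run_failed, failing_ranges, sweep) are hypothetical, and it assumes only what the output above shows, namely that a failing run prints a line consisting of FAILED.

```python
import subprocess


def run_failed(output: str) -> bool:
    """Return True if the captured output contains a line that is exactly FAILED."""
    return any(line.strip() == "FAILED" for line in output.splitlines())


def failing_ranges(failed_ns):
    """Collapse a sorted list of failing n values into inclusive (start, end) ranges."""
    ranges = []
    for n in failed_ns:
        if ranges and n == ranges[-1][1] + 1:
            # n directly extends the most recent range
            ranges[-1] = (ranges[-1][0], n)
        else:
            ranges.append((n, n))
    return ranges


def sweep(binary, threads, n_values):
    """Run the binary once per n (in ascending order) and return the failing ranges."""
    failed = []
    for n in n_values:
        result = subprocess.run(
            [binary, f"--n={n}", f"--threads={threads}"],
            capture_output=True,
            text=True,
        )
        if run_failed(result.stdout):
            failed.append(n)
    return failing_ranges(failed)
```

For example, sweep("./oclReduction", 128, range(1, 129)) would return the failing ranges for --threads=128 over n = 1..128. Each run includes the exit countdown shown above, so a wide sweep is slow.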

Failures for non-power-of-two sizes are not that unexpected, since the example code explicitly states that it assumes the input size is a power of two; in practice, though, oclReduction often works correctly for non-power-of-two sizes. What is more interesting is that it fails for n=64, a power of two, at every --threads value from 32 to 512 that I tried, and that it always fails for --n > 32 when --threads=32.

I’ve tried turning on the --cpufinal option with various values of --cputhresh=X (this instructs the program to compute the sum of the last X blocks on the CPU rather than the GPU), but this does not appear to change the results. For example, running

./oclReduction --threads=128 --n=64 --cpufinal --cputhresh=1

should force the entire computation to be done on the CPU (as there will be only one block, n < threads), but the computation still fails.

I think the issue resides in the computation of the number of blocks and threads (getNumBlocksAndThreads, oclReduction.cpp:208); however, I’m not sure how to fix it.
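
To see what launch configuration the failing inputs might map to, here is a minimal sketch of the block/thread calculation. It is an assumption, not a copy of oclReduction.cpp: it supposes that the default kernel has each thread load two elements and that getNumBlocksAndThreads uses the same formula as the CUDA SDK reduction sample, with a hypothetical default cap of 64 blocks.

```python
def next_pow2(x: int) -> int:
    """Smallest power of two that is >= x (for x >= 1)."""
    p = 1
    while p < x:
        p *= 2
    return p


def num_blocks_and_threads(n: int, max_threads: int, max_blocks: int = 64):
    """Return (blocks, threads) for reducing n elements.

    ASSUMPTION: every thread loads two elements, so a block of `threads`
    threads covers 2 * threads inputs; max_blocks=64 is a guessed default.
    """
    if n < 2 * max_threads:
        # Small input: shrink the block to the next power of two >= n/2
        threads = next_pow2((n + 1) // 2)
    else:
        threads = max_threads
    blocks = (n + 2 * threads - 1) // (2 * threads)
    return min(max_blocks, blocks), threads


for n in (32, 33, 64, 65):
    print(n, num_blocks_and_threads(n, 128))
```

Under these assumptions, --n=33 and --n=64 with --threads=128 both come out as 1 block of 32 threads, while --n=32 gets 16 threads and --n=65 gets 64. With --threads=32, every n > 32 also ends up with 32 threads per block. That would tie the [33-64] failures to the --threads=32 failures; it does not account for the larger-n failures at --threads=16 and --threads=64.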

Can anyone else confirm similar issues? Are there any known fixes/patches?

Regards,
Brian