Greetings,
I have found that the oclReduction program in the NVIDIA OpenCL example code fails the host/OpenCL comparison for certain combinations of --n (the number of elements to reduce) and --threads (the maximum number of threads per block).
My computer specifications are as follows:
GPU: GeForce GTX 570
NVIDIA driver: 304.43
NVIDIA CUDA SDK: 4.2
OS: Ubuntu Linux 12.04 (x86_64), running X 11.0 with Gnome 3.4 as the shell
To illustrate the issue, oclReduction fails for every value of --n from 33 to 64 when --threads=128 (the default):
$ ./oclReduction --n=33 --threads=128
[oclReduction] starting...
./oclReduction Starting...
Reducing array of type int.
GeForce GTX 570
33 elements
128 threads (max)
1 blocks
Comparing against Host/C++ computation...
GPU result = 0
CPU result = 4861
FAILED
[oclReduction] test results...
FAILED
> exiting in 3 seconds: 3...2...1...done!
I haven’t done an exhaustive test, but so far I’ve found oclReduction fails under these conditions (n = number of elements, t = number of threads):
t: 16, n: [1024-2016] fails
t: 32, n: [33-*) fails
t: 64, n: [33-64] fails, [1025-1920] fails
t: 128, n: [33-64] fails
t: 256, n: [33-64] fails
t: 512, n: [33-64] fails
The failures on non-power-of-two sizes are not that unexpected (the example code explicitly states that it assumes the input size is a power of two); however, oclReduction frequently works correctly for non-power-of-two input sizes. What is more interesting is that it fails for n=64, a power of two, at most thread counts, and that it always fails for --n > 32 when --threads=32.
I’ve tried turning on the --cpufinal option with various values of --cputhresh=X (this instructs the program to compute the sum of the last X blocks on the CPU rather than the GPU), but this does not appear to change the results. For example, running
./oclReduction --threads=128 --n=64 --cpufinal --cputhresh=1
should force the entire computation to be done on the CPU (there will be only one block, since n < threads), but the comparison still fails.
I think the issue resides in the computation of the number of blocks and threads (getNumBlocksAndThreads, oclReduction.cpp:208); however, I’m not sure how to fix it.
Can anyone else confirm similar issues? Are there any known fixes/patches?
Regards,
Brian