OpenCL example oclReduction fails CPU validation for certain values of --n and --threads

bkloppenborg · December 21, 2012, 10:55am

Greetings,

I have found that the oclReduction program in the NVIDIA OpenCL example code fails its host/OpenCL comparison for certain combinations of --n (the number of elements to reduce) and --threads (the number of threads per block).

My computer specifications are as follows:
GPU: GeForce GTX 570
NVIDIA driver: 304.43
NVIDIA CUDA SDK: 4.2
OS: Ubuntu Linux 12.04 (x86_64), running X 11.0 with Gnome 3.4 as the shell

To illustrate the issue, oclReduction fails for every value of --n in [33-64] when --threads=128 (the default):

$ ./oclReduction --n=33 --threads=128
[oclReduction] starting...

./oclReduction Starting...

Reducing array of type int.
GeForce GTX 570
 33 elements
 128 threads (max)
 1 blocks


Comparing against Host/C++ computation...
 GPU result = 0
 CPU result = 4861

FAILED

[oclReduction] test results...
FAILED

> exiting in 3 seconds: 3...2...1...done!

I haven’t done an exhaustive test, but so far I’ve found oclReduction fails under these conditions (n = number of elements, t = number of threads):

t: 16, n: [1024-2016] fails
t: 32, n: [33-*) fails
t: 64, n: [33-64] fails, [1025-1920] fails
t: 128, n: [33-64] fails
t: 256, n: [33-64] fails
t: 512, n: [33-64] fails
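
A sweep along these lines could automate the table above and make it exhaustive. It is only a minimal sketch: the function names are hypothetical, the binary is assumed to sit in the current directory, and a run is counted as a failure whenever the string "FAILED" appears in its output, as in the transcript above. Only the two flags used in this post are passed.

```python
import subprocess


def validation_failed(output):
    # The transcript above prints "FAILED" when the GPU and CPU sums differ.
    # ASSUMPTION: any output without that string counts as a pass.
    return "FAILED" in output


def run_case(n, threads, binary="./oclReduction"):
    # Run one configuration and report whether validation failed.
    # ASSUMPTION: the binary path; adjust it to your SDK build directory.
    proc = subprocess.run(
        [binary, "--n=%d" % n, "--threads=%d" % threads],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return validation_failed(proc.stdout)


def collapse(values):
    # Turn a collection of n values into inclusive (first, last) ranges.
    ranges = []
    for v in sorted(values):
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], v)
        else:
            ranges.append((v, v))
    return ranges


def sweep(threads, n_values, run=run_case):
    # Collect every n that fails for a given --threads value.
    return collapse([n for n in n_values if run(n, threads)])
```

Calling sweep(128, range(1, 2049)) would return the failing ranges of n for --threads=128. Each run pauses for the three-second exit countdown shown above, so a full sweep is slow.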

The failures on non-powers of two are not that unexpected (the example code explicitly states that it assumes the input size is a power of two); however, oclReduction often works correctly for non-power-of-two input sizes. What is more interesting is why it fails for n=64 at so many thread counts, and why it always fails for --n > 32 when --threads=32.

I’ve tried turning on the --cpufinal option with various values of --cputhresh=X (this instructs the program to compute the sum for the last X blocks on the CPU, rather than the GPU), but this does not appear to change the results. For example, running

./oclReduction --threads=128 --n=64 --cpufinal --cputhresh=1

should force the entire computation to be done on the CPU (as there will be only one block, n < threads), but the computation still fails.

I think the issue resides in the computation of the number of blocks and threads (getNumBlocksAndThreads, oclReduction.cpp:208); however, I’m not sure how to fix it.
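
To make that suspicion concrete, here is a sketch of the kind of sizing arithmetic such a function might perform. It is an assumption, not a copy of oclReduction.cpp: it supposes that each work-item loads two elements and that the thread count is rounded up to a power of two, as the later kernels of the CUDA reduction sample do. The names next_pow2 and launch_size are hypothetical, and the sketch ignores any cap on the number of blocks.

```python
def next_pow2(x):
    # Smallest power of two that is >= x (for x >= 1).
    p = 1
    while p < x:
        p *= 2
    return p


def launch_size(n, max_threads):
    # ASSUMPTION: two elements are loaded per work-item, so only about
    # n / 2 work-items are needed; this is not taken from oclReduction.cpp.
    if n < 2 * max_threads:
        threads = next_pow2((n + 1) // 2)
    else:
        threads = max_threads
    # Ceiling division: each block covers 2 * threads elements.
    blocks = (n + 2 * threads - 1) // (2 * threads)
    return blocks, threads


print(launch_size(33, 128))  # (1, 32)
print(launch_size(64, 128))  # (1, 32)
print(launch_size(65, 128))  # (1, 64)
```

If the sample sizes its launch this way, every n in [33-64] resolves to 32 work-items whenever --threads is 32 or more, and with --threads=32 every n above 32 does too. That would be consistent with the [33-64] rows and the t: 32 row of the table, and would point at the 32-work-item case rather than at the sizing arithmetic itself. It does not account for the t: 16 row or the [1025-1920] range.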

Can anyone else confirm similar issues? Are there any known fixes/patches?

Regards,
Brian
