I wonder what they intended by “dividing” the N (6.5x1024x1024 ~ 6.5M) elements into “batches” and “arrays” (batchSize x arrayLength always equals N).
If I set all input elements to 1 (as a check), I expect the exclusive prefix scan result to look like:
0, 1, 2, 3, ..., N-1
(N ~ 6.5M)
However, I noticed that for any arrayLength, for example 4096, the output looks like:
0, 1, 2, 3, ..., 4095, 0, 1, 2, 3, ..., 4095, ..., ..., ..., 0, 1, 2, 3, ..., 4095
So it does not calculate the prefix sum across all N (~6.5M) elements; each 4096-element array is scanned independently.
I am confused about the batchSizes and arrayLengths. Does anyone see what is going on here? What is missing to make it possible to scan arbitrarily sized arrays (like 32M ints and similar sizes which, even with the auxiliary buffers, still fit into VRAM)? To me it looks like one loop is completely missing from the host code!
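For what it's worth, here is the usual way the per-array results are stitched into a global scan, sketched as sequential CPU code in Python (this is my own illustration, not the sample's actual host code): scan each arrayLength-sized chunk, take the per-chunk totals, exclusive-scan those totals, and add each resulting offset back onto its chunk. On the GPU this is the classic scan / scan-block-sums / uniform-add pattern, and the second scan is exactly the step that seems to be missing.

```python
def exclusive_scan(xs):
    # Simple sequential exclusive prefix scan: out[i] = sum(xs[:i]).
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out

def batched_scan(data, array_length):
    # Phase 1: scan each array_length-sized chunk independently.
    # (This is what the batched GPU kernel already produces:
    # 0..4095 repeated over and over.)
    chunks = [data[i:i + array_length] for i in range(0, len(data), array_length)]
    scanned = [exclusive_scan(c) for c in chunks]
    # Phase 2: exclusive-scan the per-chunk totals to get each
    # chunk's starting offset. For very many chunks this scan can
    # itself be done on the GPU (recursively).
    totals = [sum(c) for c in chunks]
    offsets = exclusive_scan(totals)
    # Phase 3: "uniform add" -- add each chunk's offset to all of
    # its elements, yielding the global exclusive scan.
    return [v + off for chunk, off in zip(scanned, offsets) for v in chunk]

data = [1] * (4 * 4096)
result = batched_scan(data, 4096)
# result == [0, 1, 2, ..., 16383], i.e. the global 0..N-1 ramp
```

With all-ones input, each chunk scans to 0..4095, the chunk totals are all 4096, their exclusive scan gives offsets 0, 4096, 8192, ..., and the uniform add turns the repeated ramps into the single 0..N-1 ramp you expected.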
I’ve read most of the papers on parallel reductions and prefix scans I could find, including the CUDA samples and the accompanying papers, yet I still cannot get scanning huge arrays right :-(