I wonder what they intended by “dividing” the N (6.5x1024x1024, ~6.5M) elements into “batches” and “arrays” (batchSize x arrayLength == N, always).

If I set all input elements to 1 (for check), I expect the exclusive prefix scan result to look like:

0, 1, 2, 3, 4, ..., N-1

(N ~ 6.5M)
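To make the expected result concrete, here is a plain serial exclusive scan (a hypothetical reference helper, not from the sample) that shows what the GPU output should match for an all-ones input:

```c
/* Reference serial exclusive prefix scan: out[i] = in[0] + ... + in[i-1].
   For an all-ones input this yields 0, 1, 2, ..., n-1. */
static void exclusive_scan(const int *in, int *out, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        out[i] = sum;
        sum += in[i];
    }
}
```

Comparing the GPU result against this on small sizes is an easy sanity check before scaling up.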

However, I noticed that for any arrayLength, for example 4096, the output looks like:

0, 1, 2, 3, ..., 4095, 0, 1, 2, 3, ..., 4095, ..., ..., ..., 0, 1, 2, 3, ..., 4095

So it does not calculate the prefix sum across the whole array of N (~6.5M) elements; each arrayLength-sized block is scanned independently.

I am confused about batchSize and arrayLength. Does anyone understand this? What is missing to make it possible to scan arbitrarily sized arrays (like 32M ints and similar sizes, which, even with the auxiliary buffers, still fit into VRAM)? To me it looks like one loop is completely missing from the host code!
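For what it's worth, the standard way to extend a per-block scan to arbitrary sizes is a second pass: collect each block's total, exclusively scan those totals, and add block k's scanned total to every element of block k. Below is a minimal CPU sketch of that fix-up step (my own illustration, not the sample's actual host code; `fix_block_scans` and its parameters are hypothetical names). It assumes the kernel has already produced exactly the repeating per-block output shown above:

```c
/* Second pass of a hierarchical scan (CPU sketch).
   data:         per-block exclusive scans, num_blocks * block_len elements
   block_totals: the sum of each original block
   After this call, data holds the exclusive scan of the whole array. */
static void fix_block_scans(int *data, const int *block_totals,
                            int num_blocks, int block_len) {
    int offset = 0;  /* running exclusive scan of the block totals */
    for (int k = 0; k < num_blocks; ++k) {
        for (int i = 0; i < block_len; ++i)
            data[k * block_len + i] += offset;
        offset += block_totals[k];
    }
}
```

On the GPU this becomes: one kernel to scan blocks and write out totals, a scan of the totals (recursing if they don't fit in one block), and one kernel to add the offsets back, which is the loop that seems to be missing from the sample's host code.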

I’ve read most of the papers on parallel reductions and prefix scans I could find, including CUDA samples and papers, yet I cannot get scanning huge arrays right :-(

I have the same issue with this sample. It’s time to write our own!

What I ended up doing was downloading PyOpenCL and then extracting the scan OpenCL/C code from that. It works perfectly.

The forum chokes if I try to paste the code, so you can find the kernel source here:

PS: this is the code for an inclusive scan, which is what I use. PyOpenCL supports both exclusive and inclusive scans.
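For anyone unsure of the difference: an inclusive scan counts the current element, an exclusive scan does not, and you can convert an inclusive result to an exclusive one by shifting it right and inserting a 0. A serial sketch (my own illustration, not PyOpenCL's kernel):

```c
/* Serial inclusive prefix scan: out[i] = in[0] + ... + in[i].
   Shifting this result right one slot and writing 0 into out[0]
   gives the corresponding exclusive scan. */
static void inclusive_scan(const int *in, int *out, int n) {
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        sum += in[i];
        out[i] = sum;
    }
}
```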