I wonder what they intended by “dividing” the N (6.5x1024x1024 ~ 6.5M) elements into “batches” and “arrays” (batchSize x arrayLength always equals N).
If I set all input elements to 1 (as a check), I expect the exclusive prefix scan result to look like:
0, 1, 2, 3, ..., N-1
(N ~ 6.5M)
However, I noticed that for any arrayLength, for example 4096, the output looks like:
0, 1, 2, 3, ..., 4095, 0, 1, 2, 3, ..., 4095, ..., ..., ..., 0, 1, 2, 3, ..., 4095
So it does not calculate the prefix sum across all N (~6.5M) elements; each 4096-element array is scanned independently.
I am confused about the batchSizes and arrayLengths. Does anyone see what is going on here? What is missing to make it possible to scan arbitrarily sized arrays (like 32M ints and similar sizes which, even with the auxiliary buffers, still fit into VRAM)? To me it looks like one loop is completely missing from the host code!
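For what it's worth, here is the usual way the per-array results are stitched into a global scan, sketched as sequential CPU code in Python (this is my own illustration, not the sample's actual host code): scan each arrayLength-sized chunk, take the per-chunk totals, exclusive-scan those totals, and add each resulting offset back onto its chunk. On the GPU this is the classic scan / scan-block-sums / uniform-add pattern, and the second scan is exactly the step that seems to be missing.

```python
def exclusive_scan(xs):
    # Simple sequential exclusive prefix scan: out[i] = sum(xs[:i]).
    out, running = [], 0
    for x in xs:
        out.append(running)
        running += x
    return out

def batched_scan(data, array_length):
    # Phase 1: scan each array_length-sized chunk independently.
    # (This is what the batched GPU kernel already produces:
    # 0..4095 repeated over and over.)
    chunks = [data[i:i + array_length] for i in range(0, len(data), array_length)]
    scanned = [exclusive_scan(c) for c in chunks]
    # Phase 2: exclusive-scan the per-chunk totals to get each
    # chunk's starting offset. For very many chunks this scan can
    # itself be done on the GPU (recursively).
    totals = [sum(c) for c in chunks]
    offsets = exclusive_scan(totals)
    # Phase 3: "uniform add" -- add each chunk's offset to all of
    # its elements, yielding the global exclusive scan.
    return [v + off for chunk, off in zip(scanned, offsets) for v in chunk]

data = [1] * (4 * 4096)
result = batched_scan(data, 4096)
# result == [0, 1, 2, ..., 16383], i.e. the global 0..N-1 ramp
```

With all-ones input, each chunk scans to 0..4095, the chunk totals are all 4096, their exclusive scan gives offsets 0, 4096, 8192, ..., and the uniform add turns the repeated ramps into the single 0..N-1 ramp you expected.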
I’ve read most of the papers on parallel reductions and prefix scans I could find, including the CUDA samples and the accompanying papers, yet I still cannot get scanning huge arrays right :-(