I need to compute the prefix sum of very large arrays, over 20 million elements. I've been digging through the SDK sample, but I can't get it to work above 16777218 elements: all the tests fail.
Is there any limit? As far as I understand, the limit should be 33553920 (65535*512), right?
You should try the Thrust library, or CUDPP. Their scan primitives will work for large inputs and they should also be significantly faster than the scan from the SDK.
It looks like the scan sample uses a block size of 256 threads, so that would explain your 16M limit (65535 * 256).
You could modify the code to use 512-thread blocks (which might affect the performance), or use a 2D grid (allowing 65535 * 65535 * 256 items).
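To make the arithmetic above concrete, here is a minimal host-side sketch in plain C++. It assumes one element per thread and a limit of 65535 blocks per grid dimension, as the figures in this thread imply. The names `maxElements1D`, `chooseGrid`, and `globalIndex` are hypothetical helpers for illustration, not part of the SDK sample.

```cpp
#include <cstdint>

// Assumption taken from the thread: at most 65535 blocks per grid dimension.
constexpr std::uint64_t kMaxGridDim = 65535;

struct GridShape {
    std::uint64_t x;  // number of blocks along grid x
    std::uint64_t y;  // number of blocks along grid y
};

// Largest element count a 1D grid covers, assuming one element per thread.
constexpr std::uint64_t maxElements1D(std::uint64_t threadsPerBlock) {
    return kMaxGridDim * threadsPerBlock;
}

// Hypothetical helper: pick a grid shape with enough blocks for n elements.
// Uses a 1D grid when possible, otherwise a 2D grid. Returns {0, 0} if n
// does not fit even in a 2D grid.
inline GridShape chooseGrid(std::uint64_t n, std::uint64_t threadsPerBlock) {
    const std::uint64_t blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    if (blocks <= kMaxGridDim) {
        return {blocks, 1};
    }
    const std::uint64_t rows = (blocks + kMaxGridDim - 1) / kMaxGridDim;
    if (rows > kMaxGridDim) {
        return {0, 0};
    }
    return {kMaxGridDim, rows};
}

// Flattened element index for a thread in a 2D grid of 1D blocks. Inside a
// kernel the same expression would read
// (blockIdx.y * gridDim.x + blockIdx.x) * blockDim.x + threadIdx.x.
constexpr std::uint64_t globalIndex(std::uint64_t blockX, std::uint64_t blockY,
                                    std::uint64_t gridX,
                                    std::uint64_t threadsPerBlock,
                                    std::uint64_t threadX) {
    return (blockY * gridX + blockX) * threadsPerBlock + threadX;
}
```

With 256-thread blocks, `chooseGrid(20000000, 256)` needs 78125 blocks and so returns a 65535 x 2 grid. That grid has more blocks than required, so a kernel using this layout would need to skip any thread whose flattened index is not below the element count.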
Have you looked at CUDPP? It might not have this restriction.