problem with scan algorithm

Dear all,
I have programmed a simple, inefficient scan algorithm for
teaching purposes.

I apply it to an integer array filled with 1s and use sum as the operator.
It works fine for array sizes from 128 to 8192, but for 16384
and higher values I get inexact results.

if size=128 I have :
CUDA max threads per block=1024
block size=1024
grid size=8
shared memory size=4096
and the result is 8192

but for size=16384
CUDA max threads per block=1024
block size=1024
grid size=16
shared memory size=4096
I get a result of 8193 !!! instead of 16384

The kernel is basically the following; I can give the full program if needed:

__global__ void reduction_v0(int *ind, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // perform reduction in global mem
    for (int stride = 1; stride < size; stride *= 2) {
        // modulo is slow
        if ((i % (2 * stride)) == 0) {
            ind[i] += ind[i + stride];
        }
        __syncthreads();
    }
}

I can’t figure out why I get this inexact result. Can you give me some insight?
I am working on a GeForce GTX 560 Ti
Regards,
JMR

code_reduction_example.cu (3.3 KB)

“// perform reduction in global mem”

one cannot in good faith do reductions/scans in global memory, particularly with the scan/reduction spilling over into multiple blocks and grids, as it violates the underlying premise of the algorithm

the scan/reduction as a whole assumes that it proceeds in cohesion:
when thread i is to increment its element pointer by stride, it assumes that that element has already been updated and is ready to be read:

ind[i] += ind[i + stride]; >> ind[i + stride] must be ready

global memory does not add synchronization by itself, and you add none yourself;
and blocks/grids promise neither ordering nor synchronization between one another

i am surprised it works for the smaller counts
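the usual fix is to keep the tree reduction inside each block, where __syncthreads() is valid, and then combine the per-block partial sums in a second step. a minimal sketch of that structure (my own names — `reduce_block`, `in`, `out` — not the OP's code; it assumes `*out` is zeroed before launch):

```cuda
// per-block reduction in shared memory, then one atomicAdd per
// block to fold the partial sums into a single global result
__global__ void reduce_block(const int *in, int *out, int size) {
    extern __shared__ int sdata[];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // each thread loads one element into shared memory
    sdata[tid] = (i < size) ? in[i] : 0;
    __syncthreads();

    // tree reduction within the block; __syncthreads() is valid here
    // because every participating thread belongs to the same block
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            sdata[tid] += sdata[tid + stride];
        __syncthreads();
    }

    // one thread per block publishes the block's partial sum
    if (tid == 0)
        atomicAdd(out, sdata[0]);
}
```

launched as `reduce_block<<<grid, block, block * sizeof(int)>>>(d_in, d_out, size);` with `*d_out` initialized to 0 beforehand. `atomicAdd` on `int` in global memory is available on the GTX 560 Ti (compute capability 2.1); a second single-block reduction kernel over the partial sums works just as well if you want to avoid atomics.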

Sorry, I made a mistake: it is not a scan but a reduction algorithm.

I was expecting that __syncthreads() would let the writes to global memory execute and finish, so that the data are all updated, but you tell me that is not the case.

Would a __threadfence() allow me to solve the problem?

“Sorry I made a mistake it is not a scan but reduction algorithm”

and the difference being…?
scans and reductions are very much birds of a feather
regardless of whether it is a scan or a reduction, you would violate the same underlying premise

“I was expecting that the __syncthreads() would let the writes to global memory execute and finish so that the data are all updated but you tell me it is not the case.”

__syncthreads() only synchronizes threads within a block, not across blocks

“Would a __threadfence() allow me to solve the problem?”

to the extent that __threadfence() differs from __syncthreads(), no: a fence only orders this thread’s own memory writes, it does not make other blocks wait or guarantee their writes have happened

let’s bring in numbers
i will use elementary numbers to make it easier; the principle applies nevertheless
you are using device x, with y SMs, under the constraints and conditions of device x’s compute capability
you wish to scan/reduce 2000 elements
your device can only seat 10 blocks of 100 threads each at a time, meaning 1000 elements

ind[i] += ind[i + stride]

when thread i gets to a stride that implies i + stride = 1500, it expects element 1500 to be ready
but element 1500 cannot be ready; in fact, elements 1001 - 2000 would not be ready
and so the reduction/scan breaks down, because its premise is violated

“Sorry I made a mistake it is not a scan but reduction algorithm”

“and the difference being…?”

scan is a little more complex than reduction ;-) I have implemented the up-sweep and down-sweep phases for scan.

Thanks for the explanation about the data not being ready; it is clear now.