Parallel Sum Reduction Problem

Hello everybody

I am trying to get familiar with OpenCL. To do so, I tried to write a kernel that calculates the sum of a vector's elements. I used the “Parallel reduction without shared memory bank conflicts” example from the “OpenCL Programming for the CUDA Architecture” document provided by NVIDIA.

My GPU: NVIDIA Quadro NVS 140M

The Kernel

__kernel void sum(__global const float *A, __global float *C, uint size, __local float *L) {
	float sum = 0;
	for (int i = get_global_id(0); i < size; i += get_local_size(0))
		sum += A[i];
	L[get_local_id(0)] = sum;

	for (uint c = get_local_size(0) / 2; c > 0; c /= 2) {
		barrier(CLK_LOCAL_MEM_FENCE);
		if (c < get_local_id(0))
			L[get_local_id(0)] += L[get_local_id(0) + c];
	}
	if (get_local_id(0) == 0)
		C[0] = L[0];
	barrier(CLK_LOCAL_MEM_FENCE);
}

The local work size is 256, and the global work size is the next multiple of 256 at or above the number of elements.

If I run my program on a vector containing 1000 elements that are all 1, the kernel returns 4 (instead of 1000, of course). I assume it has something to do with the barrier.

Additionally, the kernel is at least an order of magnitude slower than the CPU.

I would be happy for any advice on what the problem could be.

I got it to work now. I made two mistakes. The kernel is now the following:

__kernel void sum(__global const float *A, __global float *C, uint size, __local float *L) {
	float sum = 0;
	for (int i = get_local_id(0); i < size; i += get_local_size(0))
		sum += A[i];
	L[get_local_id(0)] = sum;

	for (uint c = get_local_size(0) / 2; c > 0; c /= 2) {
		barrier(CLK_LOCAL_MEM_FENCE);
		if (c > get_local_id(0))
			L[get_local_id(0)] += L[get_local_id(0) + c];
	}
	if (get_local_id(0) == 0)
		C[0] = L[0];
	barrier(CLK_LOCAL_MEM_FENCE);
}

However, it is still an order of magnitude slower than the sequential implementation. Any suggestions? Thank you.

This kernel appears to work with a single work group only. How are you using it?
Also, I could not find any differences between the code of the two postings. What changed?

Regarding performance comparison with a sequential implementation, try running both with an increasing number of elements and see what happens. Up to as much as your device memory allows.

Thank you for your answer Martin.

I hope I understood correctly what information you need:

clEnqueueNDRangeKernel(OCL_Command_Queue, _Sum, 1, NULL, &global_item_size, &local_item_size, 0, NULL, &event);

local_item_size is 512

global_item_size is the next multiple of 512 at or above the number of elements in the vector.

The first for loop changed from

for (int i = get_global_id(0); i < size; i += get_local_size(0))

to

for (int i = get_local_id(0); i < size; i += get_local_size(0))

“get_global_id(0)” -> “get_local_id(0)”

Then the “<” changed to “>” in

if (c > get_local_id(0))

I already tried that. However, the kernel scales worse than the sequential implementation.

I looked through the OpenCL specs and realised why the example given in “OpenCL Programming for the CUDA Architecture” runs on only one work group: synchronisation is only possible between work items of the same work group.

Does using more work groups with less work items result in a speed up?

Furthermore, I did some additional benchmarks and found that the kernel, once it is running, is faster than the sequential implementation. Unfortunately, for 95% of the time between enqueueing the kernel and a successful finish, the kernel is just CL_QUEUED. What are the factors influencing the speed at which a kernel is submitted after being enqueued? This effect is also the reason why the parallel implementation scales worse than the sequential one.

GPUs are built for massively parallel numerically intensive computations. Using a single work group to simply sum the elements of a vector does not belong to this category. The next step in order to improve the performance of your implementation is to launch several work groups that each sum a section of the vector and store the result to a new, smaller vector. Then run the kernel again on the smaller vector and repeat until a single element remains.

One has to be careful when searching for bottlenecks in an asynchronous system, which an OpenCL machine is. How are you measuring the time between enqueueing and finish? Have you used any of the profiling capabilities that OpenCL has to determine how much of the time is spent running the kernel and how much is overhead elsewhere? See “Profiling Operations on Memory Objects and Kernels” in the specification.