Parallel prefix scan with multiple blocks of GPU cores

Hi All,

I was trying to implement a parallel scan with multiple blocks of GPU cores. There are 3 blocks, and each block performs the parallel scan separately. After the up-sweep operation I store the last element, which will be added to the first element of the next block. I am storing it in a __device__ int named last_element.

But last_element is not storing the value. Can anyone please tell me why it is not working? Or is there any alternative solution?

I am attaching my code.

__device__ int last_element = 0;

__global__ void parallel_scan(int *a)
{
    extern __shared__ int temp[];
    int idx;
    int b;
    int d;
    int offset;
    int dummy;

    for (b = 0; b < 3; b++) {
        offset = 1;

        if (blockIdx.x == b) {  // exactly one block performs the parallel scan operation at a time
            idx = threadIdx.x;
            temp[idx] = a[blockIdx.x * blockDim.x + threadIdx.x];  // loading into shared memory

            if (idx == 0) {
                // adding the last element from the previous block (stored after its up-sweep) to the first element of the current block
                temp[idx] = last_element + temp[idx];
            }

            for (d = blockDim.x >> 1; d > 0; d >>= 1) {  // up-sweep
                __syncthreads();
                if (idx < d) {
                    temp[offset * (2 * idx + 2) - 1] = temp[offset * (2 * idx + 2) - 1] + temp[offset * (2 * idx + 1) - 1];
                }
                offset = offset << 1;
            }

            if (idx == 0) {
                last_element = temp[blockDim.x - 1];  // storing the last element after the up-sweep operation
                temp[blockDim.x - 1] = 0;             // clearing the last element
            }

            for (d = 1; d < blockDim.x; d <<= 1) {  // down-sweep operation
                offset = offset >> 1;
                __syncthreads();
                if (idx < d) {
                    dummy = temp[offset * (2 * idx + 1) - 1];
                    temp[offset * (2 * idx + 1) - 1] = temp[offset * (2 * idx + 2) - 1];
                    temp[offset * (2 * idx + 2) - 1] += dummy;
                }
            }

            a[blockIdx.x * blockDim.x + threadIdx.x] = temp[idx];  // storing from shared memory back to global memory
        }
    }
}
parallel_scan_with_multiple_gpu_block.cu (1.77 KB)

I did not follow your complete code because it is so unreadable. Why should someone take their time to help if you don't take your time to make it at least readable?

The first thing that came to my mind about your code:

Blocks are executed in parallel! How do you know which is the previous block from a single integer?

If you do a reduction at block level using multiple blocks, you should save the results in an array whose size equals the number of blocks used.

Terminate your kernel, and if you used so many blocks that another GPU reduction is needed, do a new GPU reduction on that result array.
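Roughly what I mean, as a minimal sketch (the kernel name block_reduce and the block_sums array are just illustrative names, not code from your post): each block reduces its own chunk in shared memory and writes one partial sum per block into a global array of length gridDim.x, which the host or a follow-up kernel can then process.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void block_reduce(const int *in, int *block_sums, int n)
{
    extern __shared__ int smem[];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    smem[tid] = (gid < n) ? in[gid] : 0;   // pad with the identity element
    __syncthreads();

    // standard tree reduction in shared memory (blockDim.x must be a power of two)
    for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        if (tid < stride)
            smem[tid] += smem[tid + stride];
        __syncthreads();
    }

    if (tid == 0)
        block_sums[blockIdx.x] = smem[0];  // one partial sum per block
}

int main()
{
    const int n = 24, threads = 8, blocks = 3;
    int h[n];
    for (int i = 0; i < n; i++) h[i] = 1;

    int *d_in, *d_sums;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_sums, blocks * sizeof(int));
    cudaMemcpy(d_in, h, n * sizeof(int), cudaMemcpyHostToDevice);

    block_reduce<<<blocks, threads, threads * sizeof(int)>>>(d_in, d_sums, n);

    int sums[blocks];
    cudaMemcpy(sums, d_sums, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    printf("block sums: %d %d %d\n", sums[0], sums[1], sums[2]);

    cudaFree(d_in);
    cudaFree(d_sums);
    return 0;
}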

Thanks for your reply. I apologize for my mistake. Initially my code was indented, but the indentation was lost after submission because I didn't notice the code-formatting option. I am new to this forum.

The following code segment is meant to ensure that exactly one block of GPU cores performs the parallel scan operation at a time.

for (block = 0; block < 3; block++)
{
    // This condition ensures that blocks 0 to 2 perform the parallel scan operation one by one in ascending order.
    if (blockIdx.x == block)
    {
    }
}

Actually, my goal is to perform a parallel prefix-sum operation over 24 integers at block level. Initially, the elements are stored in device memory. In this implementation, I have used 24 GPU cores divided into 3 blocks, and I need to use shared memory for better performance. First, I load the first 8 elements into shared memory so that the 8 GPU cores belonging to block 0 can perform the parallel prefix-sum operation. I need to store the last element after the up-sweep operation, because the sum of the first 8 elements must be added to the first element of the next 8 elements.

After the parallel prefix sum in block 0, I load the next 8 elements from device memory into the shared memory array temp, and this time the 8 GPU cores belonging to block 1 perform the parallel prefix sum in parallel. Before starting, however, I add the last element from the up-sweep operation, which is stored in the last_element device variable, so that the sum of the previous 8 elements propagates to the current 8 elements. Again, I store the last element after the up-sweep operation, so that the sum of these 8 elements can be added to the first element of the final 8, which eventually get their prefix sums from the parallel prefix-sum operation of the 8 GPU cores of block 2.

I am attaching my code with proper indentation and comments. Hopefully this time you won't be disappointed.

The problem I am facing is that last_element is not getting the value.

The following code segment stores the last element, temp[blockDim.x-1], into the device variable last_element after the up-sweep operation.

if (idx == 0)
{
    // storing the last element after the up-sweep operation
    last_element = temp[blockDim.x - 1];

    // clearing the last element
    temp[blockDim.x - 1] = 0;
}

The following is my kernel implementation.

#include <stdio.h>
#include <assert.h>
#include <cuda.h>
#include <math.h>

#define N 24

__device__ int last_element = 0;

__global__ void parallel_scan(int *a)
{
    extern __shared__ int temp[];
    int idx;
    int block;
    int d;
    int offset;
    int dummy;

    for (block = 0; block < 3; block++)
    {
        offset = 1;

        // exactly one block of GPU cores performs the parallel scan operation at a time
        if (blockIdx.x == block)
        {
            idx = threadIdx.x;

            // loading elements from device memory into shared memory
            temp[idx] = a[blockIdx.x * blockDim.x + threadIdx.x];

            if (idx == 0)
            {
                // adding the last element from the previous block (stored after its up-sweep) to the first element of the current block
                temp[idx] = last_element + temp[idx];
            }

            // up-sweep operation
            for (d = blockDim.x >> 1; d > 0; d >>= 1)
            {
                __syncthreads();
                if (idx < d)
                {
                    temp[offset * (2 * idx + 2) - 1] = temp[offset * (2 * idx + 2) - 1] + temp[offset * (2 * idx + 1) - 1];
                }
                offset = offset << 1;
            }

            if (idx == 0)
            {
                // storing the last element after the up-sweep operation
                last_element = temp[blockDim.x - 1];

                // clearing the last element
                temp[blockDim.x - 1] = 0;
            }

            // down-sweep operation
            for (d = 1; d < blockDim.x; d <<= 1)
            {
                offset = offset >> 1;
                __syncthreads();
                if (idx < d)
                {
                    dummy = temp[offset * (2 * idx + 1) - 1];
                    temp[offset * (2 * idx + 1) - 1] = temp[offset * (2 * idx + 2) - 1];
                    temp[offset * (2 * idx + 2) - 1] += dummy;
                }
            }

            // storing the array elements from shared memory back to global memory
            a[blockIdx.x * blockDim.x + threadIdx.x] = temp[idx];
        }
    }
}

You can’t be sure that blocks are launched one after another. What will happen instead is that CUDA will launch all three blocks at the same time, and put them on different cores, so there are no guarantees that, for example, block #0 will write into last_element before block #1 tries to read it.

Is there any way to do this? Or is there any alternative solution?

The only way I know is to launch them as separate kernels.
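Something along these lines (a rough sketch only; scan_chunk and carry are made-up names, and the per-chunk scan is done serially by one thread just to keep the example short): kernel launches issued to the same stream execute in order, so you can scan one 8-element chunk per launch and carry the running total across launches through a __device__ variable.

#include <cstdio>
#include <cuda_runtime.h>

__device__ int carry = 0;   // running total of all previously scanned chunks

__global__ void scan_chunk(int *a, int base)
{
    extern __shared__ int temp[];
    int idx = threadIdx.x;

    temp[idx] = a[base + idx];
    __syncthreads();

    // simple exclusive scan of the chunk by thread 0 (the chunks here are tiny)
    if (idx == 0) {
        int sum = carry;
        for (int i = 0; i < blockDim.x; i++) {
            int v = temp[i];
            temp[i] = sum;
            sum += v;
        }
        carry = sum;   // safe: only one block per launch, and launches are serialized
    }
    __syncthreads();

    a[base + idx] = temp[idx];
}

int main()
{
    const int n = 24, chunk = 8;
    int h[n];
    for (int i = 0; i < n; i++) h[i] = i + 1;

    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    // one kernel launch per chunk; launches on the default stream run in order
    for (int base = 0; base < n; base += chunk)
        scan_chunk<<<1, chunk, chunk * sizeof(int)>>>(d, base);

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("%d ", h[i]);
    printf("\n");

    cudaFree(d);
    return 0;
}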

I wrote some articles on this problem. This is the basic one that introduces scan from scratch:
http://www.moderngpu.com/intro/scan.html

The ‘multiscan’ part establishes scan as a hierarchical process.

I continued this with a global scan algorithm that does an upsweep (count), reduce, and downsweep pass:
http://www.moderngpu.com/scan/globalscan.html

That latter page is going to be heavily edited when I get a moment. The bank conflict discussion at the bottom is better covered here:
http://www.moderngpu.com/intro/workflow.html#Transpose

But in a nutshell, you can spawn as many blocks as you want, but you don’t know when they get evaluated. Have each block write its total to global mem, launch another kernel to scan all of those, then launch a third kernel to downsweep. You have the general idea but are trying to do too much in a single kernel.
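A condensed sketch of that structure (all names are mine, and the per-block scans are done serially by thread 0 only to keep the example short; a real version would use a parallel up-sweep/down-sweep per block, as in your kernel): kernel 1 writes one total per block, kernel 2 turns the small array of block totals into exclusive block offsets with a single block, and kernel 3 scans each chunk and adds its block offset.

#include <cstdio>
#include <cuda_runtime.h>

// Kernel 1: each block writes the sum of its chunk to block_sums.
__global__ void upsweep(const int *in, int *block_sums)
{
    extern __shared__ int s[];
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    for (int stride = blockDim.x >> 1; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }
    if (tid == 0) block_sums[blockIdx.x] = s[0];
}

// Kernel 2: a single block converts the block totals into exclusive block offsets.
__global__ void scan_block_sums(int *block_sums, int num_blocks)
{
    if (threadIdx.x == 0) {
        int sum = 0;
        for (int i = 0; i < num_blocks; i++) {
            int v = block_sums[i];
            block_sums[i] = sum;
            sum += v;
        }
    }
}

// Kernel 3: each block scans its own chunk and adds its block offset.
// (Serial per-chunk scan by thread 0 for brevity.)
__global__ void downsweep(int *data, const int *block_offsets)
{
    extern __shared__ int s[];
    int tid = threadIdx.x;
    s[tid] = data[blockIdx.x * blockDim.x + tid];
    __syncthreads();
    if (tid == 0) {
        int sum = block_offsets[blockIdx.x];
        for (int i = 0; i < blockDim.x; i++) {
            int v = s[i];
            s[i] = sum;
            sum += v;
        }
    }
    __syncthreads();
    data[blockIdx.x * blockDim.x + tid] = s[tid];
}

int main()
{
    const int n = 24, threads = 8, blocks = 3;
    int h[n];
    for (int i = 0; i < n; i++) h[i] = 1;

    int *d_data, *d_sums;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_sums, blocks * sizeof(int));
    cudaMemcpy(d_data, h, n * sizeof(int), cudaMemcpyHostToDevice);

    upsweep<<<blocks, threads, threads * sizeof(int)>>>(d_data, d_sums);
    scan_block_sums<<<1, 1>>>(d_sums, blocks);
    downsweep<<<blocks, threads, threads * sizeof(int)>>>(d_data, d_sums);

    cudaMemcpy(h, d_data, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("%d ", h[i]);   // exclusive prefix sums
    printf("\n");

    cudaFree(d_data);
    cudaFree(d_sums);
    return 0;
}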

Also if your sequence is very small, you should just use a single block (or even just a single warp in a block) to perform the scan.
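For 24 elements, for instance, a single warp can do the whole scan with shuffle instructions; here is a rough sketch (requires CUDA 9+ for __shfl_up_sync; it produces an inclusive scan, so shift the result by one lane if you need the exclusive form your kernel computes):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void warp_scan(int *a, int n)
{
    int lane = threadIdx.x;                    // launched with one warp of 32 threads
    int v = (lane < n) ? a[lane] : 0;          // pad with the identity element

    // Kogge-Stone style inclusive scan within the warp
    for (int offset = 1; offset < 32; offset <<= 1) {
        int up = __shfl_up_sync(0xffffffff, v, offset);
        if (lane >= offset) v += up;
    }

    if (lane < n) a[lane] = v;                 // inclusive prefix sums
}

int main()
{
    const int n = 24;
    int h[n];
    for (int i = 0; i < n; i++) h[i] = 1;

    int *d;
    cudaMalloc(&d, n * sizeof(int));
    cudaMemcpy(d, h, n * sizeof(int), cudaMemcpyHostToDevice);

    warp_scan<<<1, 32>>>(d, n);                // a single warp handles all 24 elements

    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++) printf("%d ", h[i]);
    printf("\n");

    cudaFree(d);
    return 0;
}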