Vector Reduction

Hi all! I’m starting my Master’s thesis on CUDA programming, with the goal of achieving faster acquisition times in a tomosynthesis machine. The problem is that I don’t come from a computer science background (I’m a biomedical engineer), so my programming experience is not that broad, and I think I’ll be spending a lot of time on this forum.

I’m writing my first CUDA programs and I’m having a hard time getting good results. The program should receive a vector on the GPU and add up its elements (it is already written to avoid branch divergence):

__global__ void vecsum(float *vec){
	__shared__ float temp[16];

	int i;

	int idx = threadIdx.x + blockIdx.x * blockDim.x;
	temp[idx] = vec[idx];
	__syncthreads();

	for(i = blockDim.x; i > 0; i = i/2)
		if(idx < blockDim.x)
			temp[idx] += temp[idx + i];
	__syncthreads();

	if(idx == 0)
		vec[0] = temp[0];
}

int main(){
	float N[16]={0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
	float *vec_h=(float*)malloc(sizeof(float)*16);
	float *vec_d;
	cudaMalloc((void**)&vec_d,16*sizeof(float));
	cudaMemcpy(vec_d,N,16*sizeof(float),cudaMemcpyHostToDevice);
	vecsum<<<2,8>>>(vec_d);
	cudaMemcpy(vec_h,vec_d,16*sizeof(float),cudaMemcpyDeviceToHost);
	cudaFree(vec_d);
	printVec(vec_h,16);
}

The weird thing is that, besides the result being wrong, it also changes from execution to execution:
{506.242615 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 }
{520.242615 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 }

I know that a zero-filled vector is a silly test case, but I had already tried with a vector of [1-15] and all the elements came out right except, of course, the important one: the first.

Thanks in advance.

There are a number of problems with the device code, but the most obvious error occurs because you are storing values from uninitialized shared memory in global memory. Try asking yourself what will happen in the kernel when blockIdx.x is not 0.
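To make that concrete: shared memory is private to each block, so with a <<<2,8>>> launch block 0 only ever fills temp[0..7] and then reads the uninitialized temp[8..15] in its first loop iteration. A minimal sketch of a per-block variant (my own illustration, not code from the thread; the kernel and array names are made up) that indexes shared memory with threadIdx.x and writes one partial sum per block could look like this:

// Hedged sketch: each block reduces blockDim.x elements and writes one partial sum.
// Names (vecsum_per_block, partial) are illustrative, not from the original post.
__global__ void vecsum_per_block(float *vec, float *partial){
	extern __shared__ float temp[];              // dynamic size, set at launch

	int tid = threadIdx.x;
	int idx = threadIdx.x + blockIdx.x * blockDim.x;

	temp[tid] = vec[idx];                        // shared memory indexed per block
	__syncthreads();

	for(int stride = blockDim.x / 2; stride > 0; stride /= 2){
		if(tid < stride)
			temp[tid] += temp[tid + stride];
		__syncthreads();                         // reached by all threads, inside the loop
	}

	if(tid == 0)
		partial[blockIdx.x] = temp[0];           // one partial sum per block
}

// Possible launch for 16 elements in 2 blocks of 8:
//   vecsum_per_block<<<2, 8, 8 * sizeof(float)>>>(vec_d, partial_d);

The per-block partial sums in partial[] would then still need to be added together, either on the host or with a second kernel launch.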

Thank you for the reply, avidday!

I got it working, although I admit I cheated: I basically reworked the problem for just 1 block of 16 threads, and the "for" loop becomes

for(i=blockDim.x/2;i>0;i=i/2)
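For context, a minimal sketch of that single-block version (my own reconstruction, assuming 16 elements, a single block of 16 threads, and a power-of-two block size) would look something like this:

// Hedged sketch of a single-block reduction; names are illustrative.
__global__ void vecsum_single_block(float *vec){
	__shared__ float temp[16];

	int idx = threadIdx.x;                 // one block, so no blockIdx term
	temp[idx] = vec[idx];
	__syncthreads();

	for(int i = blockDim.x / 2; i > 0; i = i / 2){
		if(idx < i)
			temp[idx] += temp[idx + i];
		__syncthreads();
	}

	if(idx == 0)
		vec[0] = temp[0];
}

// Launched as, e.g.: vecsum_single_block<<<1, 16>>>(vec_d);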

I was checking some lectures about CUDA and have only now realized that they were working with a single block of variable dimension. I was trying to do something more dynamic (although I was completely off the mark). Wouldn't I get better performance using 2 blocks of 8 threads instead of 1 block of 16 to compute the vector?

Also, could you tell me more about the uninitialized shared memory problem? I wrote the kernel while looking at some exercises I got from a summer school course. As for blockIdx.x, I don't see where the problem lies, since I only use blockIdx.x to fill the shared memory vector.

Hi again. Continuing with the vector reduction algorithm, I am now coding it so that I can use multiple blocks, which lets me work with arrays larger than 512 elements. For arrays with a power-of-two number of elements I got it working, but now I am trying to make it work for an array with any number of elements (even or odd, power of two or not).

I know the code is almost right, because I have a working version that belongs to a summer school course. My program is composed of CODE 1.1 (kernel) and CODE 2.1 (kernel loop), whereas the summer school code is composed of CODE 1.2 and CODE 2.2.

My program works if I use CODE 1.2 with either CODE 2.1 or 2.2 (so the problem is presumably in the kernel). The weird thing is that if I swap partial pieces of CODE 1.1 with 1.2, it never works, so I can't pin down the error; I know it is somewhere in CODE 1.1, but I don't know what is wrong with it. Thanks in advance.

__global__ void vecsumkernel(float *input, float *output, int nelements){

	//CODE 1.1
	__shared__ float psum[blocksize];

	int tid = threadIdx.x, stride;

	if(tid + blockDim.x * blockIdx.x < nelements)
		psum[tid] = input[tid + blockDim.x * blockIdx.x];
	else
		psum[tid] = 0.0f;
	__syncthreads();

	for(stride = blockDim.x / 2; stride > 0; stride >>= 1){
		if(tid < stride)
			psum[tid] += psum[tid + stride];
		__syncthreads();
	}

	if(tid == 0)
		output[blockIdx.x] = psum[0];

/* CODE 1.2
	__shared__ float scratch[BLOCK_SIZE * 2];

	int tid = threadIdx.x;
	int stride;

	// Coalesced load to SHM
	int gidx = (blockIdx.x * BLOCK_SIZE * 2 + threadIdx.x);
	scratch[tid]              = gidx < nelements ? input[gidx] : 0.0f;
	scratch[tid + BLOCK_SIZE] = (gidx + BLOCK_SIZE) < nelements ? input[gidx + BLOCK_SIZE] : 0.0f;
	__syncthreads();

	// build the sum tree (in-place)
	for(stride = BLOCK_SIZE; stride > 0; stride >>= 1){
		if(threadIdx.x < stride){
			scratch[threadIdx.x] += scratch[threadIdx.x + stride];
		}
		__syncthreads();
	}

	// write results to global memory
	if(threadIdx.x == 0)
		output[blockIdx.x] = scratch[0];
*/
}

void vecsumv2(float *input, float *output, int nelements){

	// CODE 2.1
	dim3 dimGrid((int)ceil((double)(nelements) / (double)blocksize), 1, 1);
	dim3 dimBlock(blocksize, 1, 1);
	int i = nelements;

	for(i = nelements; i >= blocksize; i = (int)ceil(i / blocksize)){
		dim3 dimGrid((int)ceil((double)i / (double)blocksize));
		vecsumkernel<<<dimGrid, dimBlock>>>(i == nelements ? input : output, output, i);
	}

	if(i != 1)
		vecsumkernel<<<1, i>>>(i == nelements ? input : output, output, i);

/*
	//CODE 2.2
	dim3 grid((numElements + BLOCK_SIZE - 1) / (BLOCK_SIZE), 1, 1);
	dim3 block(BLOCK_SIZE, 1, 1);
	int i = 0;

	for(i = numElements; i > 1; i = grid.x){
		grid.x = (i + BLOCK_SIZE * 2 - 1) / (BLOCK_SIZE * 2);
		vecsumkernel<<<grid, block>>>(i == numElements ? input : output, output, i);
	}
*/
}
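For reference, here is a minimal sketch of how a host program might drive vecsumv2 (my own illustration, not code from the thread; it assumes blocksize is a compile-time constant such as 512, and the variable names are made up):

// Hedged usage sketch (assumes e.g. #define blocksize 512 and that
// vecsumkernel/vecsumv2 are compiled in the same .cu file).
#include <cstdio>
#include <cstdlib>

int main(){
	const int nelements = 1000;                      // any size, not a power of two
	size_t bytes = nelements * sizeof(float);

	float *in_h = (float*)malloc(bytes);
	for(int k = 0; k < nelements; ++k) in_h[k] = 1.0f;   // expected sum: 1000

	float *in_d, *out_d;
	cudaMalloc((void**)&in_d,  bytes);
	cudaMalloc((void**)&out_d, bytes);               // large enough for the partial sums
	cudaMemcpy(in_d, in_h, bytes, cudaMemcpyHostToDevice);

	vecsumv2(in_d, out_d, nelements);                // repeated kernel launches

	float result;
	cudaMemcpy(&result, out_d, sizeof(float), cudaMemcpyDeviceToHost);
	printf("sum = %f\n", result);

	cudaFree(in_d);
	cudaFree(out_d);
	free(in_h);
	return 0;
}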