Vector Reduction

Hi all! I’m starting my Master’s thesis on CUDA programming, with the goal of achieving faster acquisition times in a tomosynthesis machine. The problem is that I don’t come from a computer science background (I’m a biomedical engineer), so my programming experience is not that broad, and I think I’ll be spending a lot of time on this forum.

I’m writing my first CUDA programs and I’m having a hard time getting good results. The program should receive a vector on the GPU and add up its elements (it is already written to avoid branch divergence):

__global__ void vecsum(float *vec){
	__shared__ float temp[16];

	int i;

	int idx = threadIdx.x + blockIdx.x * blockDim.x;
	temp[idx] = vec[idx];
	__syncthreads();

	for(i = blockDim.x; i > 0; i = i/2)
		if(idx < blockDim.x)
			temp[idx] += temp[idx + i];
	__syncthreads();

	if(idx == 0)
		vec[0] = temp[0];
}

int main(){
	float N[16]={0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
	float *vec_h=(float*)malloc(sizeof(float)*16);
	float *vec_d;
	cudaMalloc((void**)&vec_d,16*sizeof(float));
	cudaMemcpy(vec_d,N,16*sizeof(float),cudaMemcpyHostToDevice);
	vecsum<<<2,8>>>(vec_d);
	cudaMemcpy(vec_h,vec_d,16*sizeof(float),cudaMemcpyDeviceToHost);
	cudaFree(vec_d);
	printVec(vec_h,16);
}

The weird thing is that, besides the result being wrong, it also changes from execution to execution:
{506.242615 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 }
{520.242615 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 }

I know that a zero-filled vector is a silly test case, but I had already tried with a vector of [1-15] and all the elements came out right except, of course, the important one: the first.

Thanks in advance.

There are a number of problems with the device code, but the most obvious error occurs because you are storing values from uninitialized shared memory in global memory. Try asking yourself what will happen in the kernel when blockIdx.x is not 0.
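To make that concrete: shared memory is private to each block, so with a <<<2,8>>> launch block 0 only ever fills temp[0..7] and then reads the uninitialized temp[8..15] in its first loop iteration. A minimal sketch of a per-block variant (my own illustration, not code from the thread; the kernel and array names are made up) that indexes shared memory with threadIdx.x and writes one partial sum per block could look like this:

// Hedged sketch: each block reduces blockDim.x elements and writes one partial sum.
// Names (vecsum_per_block, partial) are illustrative, not from the original post.
__global__ void vecsum_per_block(float *vec, float *partial){
	extern __shared__ float temp[];              // dynamic size, set at launch

	int tid = threadIdx.x;
	int idx = threadIdx.x + blockIdx.x * blockDim.x;

	temp[tid] = vec[idx];                        // shared memory indexed per block
	__syncthreads();

	for(int stride = blockDim.x / 2; stride > 0; stride /= 2){
		if(tid < stride)
			temp[tid] += temp[tid + stride];
		__syncthreads();                         // reached by all threads, inside the loop
	}

	if(tid == 0)
		partial[blockIdx.x] = temp[0];           // one partial sum per block
}

// Possible launch for 16 elements in 2 blocks of 8:
//   vecsum_per_block<<<2, 8, 8 * sizeof(float)>>>(vec_d, partial_d);

The per-block partial sums in partial[] would then still need to be added together, either on the host or with a second kernel launch.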

Thank you for the reply, avidday!

I got it working, although I admit I cheated: I basically reworked the problem for just 1 block of 16 threads, and the "for" loop becomes

for(i=blockDim.x/2;i>0;i=i/2)
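For context, a minimal sketch of that single-block version (my own reconstruction, assuming 16 elements, a single block of 16 threads, and a power-of-two block size) would look something like this:

// Hedged sketch of a single-block reduction; names are illustrative.
__global__ void vecsum_single_block(float *vec){
	__shared__ float temp[16];

	int idx = threadIdx.x;                 // one block, so no blockIdx term
	temp[idx] = vec[idx];
	__syncthreads();

	for(int i = blockDim.x / 2; i > 0; i = i / 2){
		if(idx < i)
			temp[idx] += temp[idx + i];
		__syncthreads();
	}

	if(idx == 0)
		vec[0] = temp[0];
}

// Launched as, e.g.: vecsum_single_block<<<1, 16>>>(vec_d);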

I was checking some lectures about CUDA and have only now realized that they were working with a single block of variable dimension. I was trying to do something more dynamic (although I was completely off the mark). Wouldn't I get better performance using 2 blocks of 8 threads instead of 1 block of 16 to compute the vector?

Also, could you tell me more about the uninitialized shared memory problem? I wrote the kernel while looking at some exercises I got from a summer school course. As for blockIdx.x, I don't see where the problem lies, since I only use blockIdx.x to fill the shared memory vector.

Hi again. Continuing with the vector reduction algorithm, I am now coding it so that I can use multiple blocks, which lets me work with arrays larger than 512 elements. For arrays with a power-of-two number of elements I got it working, but now I am trying to make it work for an array with any number of elements (even or odd, power of two or not).

I know the code is almost right, because I have a working version that belongs to a summer school course. My program is composed of CODE 1.1 (kernel) and CODE 2.1 (kernel loop), whereas the summer school code is composed of CODE 1.2 and CODE 2.2.

My program works if I use CODE 1.2 with either CODE 2.1 or 2.2 (so the problem is presumably in the kernel). The weird thing is that if I swap partial pieces of CODE 1.1 with 1.2, it never works, so I can't pin down the error; I know it is somewhere in CODE 1.1, but I don't know what is wrong with it. Thanks in advance.

__global__ void vecsumkernel(float *input, float *output, int nelements){

	//CODE 1.1
	__shared__ float psum[blocksize];

	int tid = threadIdx.x, stride;

	if(tid + blockDim.x * blockIdx.x < nelements)
		psum[tid] = input[tid + blockDim.x * blockIdx.x];
	else
		psum[tid] = 0.0f;
	__syncthreads();

	for(stride = blockDim.x / 2; stride > 0; stride >>= 1){
		if(tid < stride)
			psum[tid] += psum[tid + stride];
		__syncthreads();
	}

	if(tid == 0)
		output[blockIdx.x] = psum[0];

/* CODE 1.2
	__shared__ float scratch[BLOCK_SIZE * 2];

	int tid = threadIdx.x;
	int stride;

	// Coalesced load to SHM
	int gidx = (blockIdx.x * BLOCK_SIZE * 2 + threadIdx.x);
	scratch[tid]              = gidx < nelements ? input[gidx] : 0.0f;
	scratch[tid + BLOCK_SIZE] = (gidx + BLOCK_SIZE) < nelements ? input[gidx + BLOCK_SIZE] : 0.0f;
	__syncthreads();

	// build the sum tree (in-place)
	for(stride = BLOCK_SIZE; stride > 0; stride >>= 1){
		if(threadIdx.x < stride){
			scratch[threadIdx.x] += scratch[threadIdx.x + stride];
		}
		__syncthreads();
	}

	// write results to global memory
	if(threadIdx.x == 0)
		output[blockIdx.x] = scratch[0];
*/
}

void vecsumv2(float *input, float *output, int nelements){

	// CODE 2.1
	dim3 dimGrid((int)ceil((double)(nelements) / (double)blocksize), 1, 1);
	dim3 dimBlock(blocksize, 1, 1);
	int i = nelements;

	for(i = nelements; i >= blocksize; i = (int)ceil(i / blocksize)){
		dim3 dimGrid((int)ceil((double)i / (double)blocksize));
		vecsumkernel<<<dimGrid, dimBlock>>>(i == nelements ? input : output, output, i);
	}

	if(i != 1)
		vecsumkernel<<<1, i>>>(i == nelements ? input : output, output, i);

/*
	//CODE 2.2
	dim3 grid((numElements + BLOCK_SIZE - 1) / (BLOCK_SIZE), 1, 1);
	dim3 block(BLOCK_SIZE, 1, 1);
	int i = 0;

	for(i = numElements; i > 1; i = grid.x){
		grid.x = (i + BLOCK_SIZE * 2 - 1) / (BLOCK_SIZE * 2);
		vecsumkernel<<<grid, block>>>(i == numElements ? input : output, output, i);
	}
*/
}
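For reference, here is a minimal sketch of how a host program might drive vecsumv2 (my own illustration, not code from the thread; it assumes blocksize is a compile-time constant such as 512, and the variable names are made up):

// Hedged usage sketch (assumes e.g. #define blocksize 512 and that
// vecsumkernel/vecsumv2 are compiled in the same .cu file).
#include <cstdio>
#include <cstdlib>

int main(){
	const int nelements = 1000;                      // any size, not a power of two
	size_t bytes = nelements * sizeof(float);

	float *in_h = (float*)malloc(bytes);
	for(int k = 0; k < nelements; ++k) in_h[k] = 1.0f;   // expected sum: 1000

	float *in_d, *out_d;
	cudaMalloc((void**)&in_d,  bytes);
	cudaMalloc((void**)&out_d, bytes);               // large enough for the partial sums
	cudaMemcpy(in_d, in_h, bytes, cudaMemcpyHostToDevice);

	vecsumv2(in_d, out_d, nelements);                // repeated kernel launches

	float result;
	cudaMemcpy(&result, out_d, sizeof(float), cudaMemcpyDeviceToHost);
	printf("sum = %f\n", result);

	cudaFree(in_d);
	cudaFree(out_d);
	free(in_h);
	return 0;
}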