Reduction questions (newbie-ish)

Ok, I've been pulling my hair out for a while now trying to get a small CUDA program to work. I've run into more than a couple of roadblocks, not the least of which is my lack of C programming ability. What I'm trying to do with CUDA is multiply two equal-sized arrays element by element and reduce the result. As plainly and simply as I can put it: I have two arrays of floats with the same number of elements; I need to multiply the elements with the same index from the two arrays together (x[i] * a[i]) and then sum/reduce the products. I hope that's clear. I've been all through this message board and every other CUDA resource I could find, and I've been all over the reduction example provided in the SDK, but I still only have a tenuous grasp of the concepts (more than likely due to my ineptitude with C). I've tried to work from the source of the reduction example and mold it to my purposes, to no avail…

Details:

H/W - octo-Xeon setup @ 2 GHz, 4 GB RAM, GTX 260 (192)

S/W - WinXP SP2, Visual Studio 2005, 178.28 drivers

Code - this is what I have, boiled down to the CUDA stuff only. The code is changing practically by the minute now as I try different things almost at random to see what works (yeah… I am now that desperate to get this simple program to work right).

void CUDAexec(xmlNodePtr node, int sets, float data1[], float data2[], xmlNodeSetPtr nodeList){

	int N = sets;
	printf("syns: %i \n", N);

	float nullResult = 0.00f;
	float* nullResultPtr = &nullResult;

	size_t size = N * sizeof(float);
	size_t size2 = sizeof(float);

	float* d_data1;
	cudaMalloc((void**) &d_data1, size);
	float* d_data2;
	cudaMalloc((void**) &d_data2, size);
	float* d_temp;
	cudaMalloc((void**) &d_temp, size);
	float* d_result;
	cudaMalloc((void**) &d_result, size2);

	cudaMemcpy(d_data1, data1, size, cudaMemcpyHostToDevice);
	cudaMemcpy(d_data2, data2, size, cudaMemcpyHostToDevice);
	// d_result holds a single float, so only size2 bytes should be copied here
	cudaMemcpy(d_result, nullResultPtr, size2, cudaMemcpyHostToDevice);

	int blockSize = 4;
	// round up so a partial block covers the tail when N is not a multiple of blockSize
	int nBlocks = N/blockSize + (N % blockSize == 0 ? 0 : 1);

	testKernel<<< nBlocks, blockSize >>>(d_data1, d_data2, d_temp, d_result, N);

	float* h_result = (float*) malloc(sizeof(float));
	cudaThreadSynchronize();
	cudaMemcpy(h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);

	printf("result:%f \n", h_result[0]);
	// (rest of the function trimmed)
}

//kernel, the closest working one anyways...

__global__ void
testKernel(float* g_data1, float* g_data2, float* g_temp, float* g_result, int N)
{
	const unsigned int tid = threadIdx.x;

	if (tid == 0){
		g_data1[0] += g_data1[tid] * g_data2[tid];
		g_result[blockIdx.x] = g_data1[0];
	}
	else {
		g_data1[tid-1] = g_data1[tid] * g_data2[tid];
	}
	__syncthreads();
}

I loathe having to post to a message board to ask for help, but I have read everything available and really am out of ideas here. Any help would be greatly appreciated!!

P.S. I almost get the sequential addressing version in the reduction example, and it makes a lot of sense, but how exactly does this loop work?

for (unsigned int s = blockDim.x/2; s > 0; s >>= 1) {
	if (tid < s) {
		sdata[tid] += sdata[tid + s];
	}
	__syncthreads();
}

I'm struggling with that bitwise shift there; how does that work exactly?? More of a basic C question, I know, but still…

It is dividing s (which starts at blockDim.x/2) by 2 at each step: s >>= 1 is a right shift by one bit, which for an unsigned integer is the same as s = s/2. So at each step, half as many threads are doing work as in the step before.
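If it helps, here is a tiny standalone C snippet (just an illustration, not taken from the SDK) showing that the shift walks s down through the powers of two:

#include <stdio.h>

int main(void)
{
	// For an unsigned value, s >>= 1 shifts the bits one place to the right,
	// which is exactly integer division by 2: 8 -> 4 -> 2 -> 1, then 0 ends the loop.
	for (unsigned int s = 8; s > 0; s >>= 1)
		printf("s = %u\n", s);	/* prints 8, 4, 2, 1 */
	return 0;
}

In the reduction, s starts at blockDim.x/2, so with a 16-thread block it takes the values 8, 4, 2, 1, and only the threads with tid < s do an addition at each step.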

There is a pdf about the reduction example.

Thanks for the quick reply! As I said, I've been all over the reduction example, including reading the pdf front to back (multiple times); I guess it just hasn't snapped into place and made sense yet. Still going nowhere here, and I'm about to call it quits for the night (morning, yikes!)… Before I do, I wanted to share a few things that may help somebody help me…

Here's my CPU code to check the results; it spits back correct answers:

//CPU equivalent
for (int i = 0; i < N; i++){
	nullResult += data1[i] * data2[i];
}

Now, I have learned enough to know that I can't write a CUDA kernel the same way I'd write it for the CPU; all the multiplications and sums have to happen in parallel. So keeping a running total in a single float is out, because CUDA doesn't work like that; OK, I get that. Which brings me, inexorably, back to the NVIDIA reduction sample, which, as I said before, I somewhat get. But even though I can see how it all works in theory, it is definitely still not working for me in practice. I've been trying every which way to get that simple multiply/sum above to work with CUDA, and all I've been getting are wrong answers. I really thought I was close with this latest one:

for (unsigned int s = blockDim.x/2; s > 0; s = s/2){
	if (tid < s)
	{
		g_data1[tid] = (g_data1[tid] * g_data2[tid]) + (g_data1[tid+s] * g_data2[tid+s]);
	}
	__syncthreads();
}

if (tid == 0) g_result[blockIdx.x] = g_data1[0];

Unfortunately this kernel still outputs the wrong result: not wildly off from the CPU's result, but not that close to it either. What am I missing here? What am I doing wrong? Again, any help would be very much appreciated; please, CUDA gurus, show a young padawan his mistake(s)…

Umm, read through the reduction pdf again and pay very careful attention to when they are accessing global memory and when they are accessing shared memory. There is a BIG difference (specifically for this case, not even __syncthreads() can synchronize access to global memory).

You’ll learn more when you finally figure it out for yourself, but if you want the answer, just say the word.

OK, I think I'm getting it. I'm trying to use shared memory now but I can't seem to get it to work; the only results I get are "INF000"! I think the problem is the size of the shared memory being allocated, the Ns term in the <<<Dg, Db, Ns>>> part of the kernel invocation. What should that value be? Right now I have it as:

(numThreadsPerBlock * sizeof(float)) * 2

2 for the 2 arrays, right? I've also tried 4, because technically each thread multiplies two pairs of numbers together and then sums the products, so that's 4 values per thread, but it still doesn't work (my current understanding of how Ns maps onto the extern __shared__ array is sketched below). Thanks for the replies so far; I feel like I'm close to getting this to finally work correctly…
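If I'm reading the programming guide right, Ns is just the total number of bytes of dynamic shared memory each block gets, and the extern __shared__ array is that single chunk, which the kernel has to carve up itself, so the carving has to match the size passed in. Here's my sketch of the layout I'm assuming for two per-block float arrays (assuming numThreadsPerBlock threads per block):

// Host side: Ns = total bytes of dynamic shared memory per block.
int sharedMemSize = 2 * numThreadsPerBlock * sizeof(float);	// room for two float arrays
sum<<< numBlocks, numThreadsPerBlock, sharedMemSize >>>(d_data1, d_data2, d_result, N);

// Device side: one extern __shared__ block of Ns bytes, partitioned by hand;
// the partitioning below uses exactly 2 * blockDim.x floats, matching Ns above.
extern __shared__ float sData[];
float* sdata1 = sData;					// first blockDim.x floats
float* sdata2 = &sData[blockDim.x];		// next blockDim.x floats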

I apologize for being a dunce and having to post again, but apparently I am just not understanding shared memory sizing/addressing.

This is what I have so far:

void CUDAexec(xmlNodePtr node, int sets, float data1[], float data2[], xmlNodeSetPtr nodeList){

	int N = sets;
	int numThreadsPerBlock = 4;
	int numBlocks = N/numThreadsPerBlock + (N % numThreadsPerBlock == 0 ? 0 : 1);
	int sharedMemSize = (numThreadsPerBlock * sizeof(float)) * 4;
	printf("Shared memory allocated per block: %i bytes \n", sharedMemSize);

	float nullResult = 0.00f;
	float* nullResultPtr = &nullResult;

	size_t size = numBlocks * numThreadsPerBlock * sizeof(float);
	size_t size2 = numBlocks * sizeof(float);

	float* d_data1;
	cudaMalloc((void**) &d_data1, size);
	float* d_data2;
	cudaMalloc((void**) &d_data2, size);
	float* d_result;
	cudaMalloc((void**) &d_result, size2);

	// copy host memory to device
	cudaMemcpy(d_data1, data1, size, cudaMemcpyHostToDevice);
	cudaMemcpy(d_data2, data2, size, cudaMemcpyHostToDevice);
	cudaMemcpy(d_result, nullResultPtr, size2, cudaMemcpyHostToDevice);

	// do calculation on device:
	// Part 1 of 2. Compute execution configuration
	dim3 dimGrid(numBlocks);
	dim3 dimBlock(numThreadsPerBlock);

	sum<<< dimGrid, dimBlock, sharedMemSize >>>(d_data1, d_data2, d_result, N);

	// allocate mem for the result on host side
	float* h_result = (float*) malloc(sizeof(float));
	cudaThreadSynchronize();

	//CPU equivalent
	for (int i = 0; i < N; i++){
		//printf("%f", data1[i]);
		nullResult += data1[i] * data2[i];
	}

	// copy result from device to host
	cudaMemcpy(h_result, d_result, sizeof(float), cudaMemcpyDeviceToHost);

	printf("GPU/CUDA result:%f \n", h_result[0]);
	printf("CUDA error: %s\n", cudaGetErrorString(cudaGetLastError()));
	printf("CPU result:%f \n", nullResult);
	printf("---------------------------------------------------------------- \n");

	// cleanup memory
	free(h_result);
	cudaFree(d_data1);
	cudaFree(d_data2);
	cudaFree(d_result);
}

and the kernel:

__global__ void
sum(float* g_data1, float* g_data2, float* g_result, int N)
{
	// load shared mem
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

	extern __shared__ float sData[];
	float* sdata1 = sData;
	float* sdata2 = &sdata1[blockDim.x];

	sdata1[tid] = g_data1[i];
	sdata2[tid] = g_data2[i];
	__syncthreads();

	// do reduction in shared mem
	for (unsigned int s = blockDim.x/2; s > 0; s /= 2)
	{
		if (tid < s) {
			sdata1[tid] = (sdata1[tid] * sdata2[tid]) + (sdata1[tid + s] * sdata2[tid + s]);
		}
		__syncthreads();
	}

	// write result for this block to global mem
	if (tid == 0){
		__syncthreads();
		g_result[blockIdx.x] = sData[0];
	}
}

I've been wasting entirely too much time trying to get this to work correctly and am becoming increasingly frustrated. This was supposed to be a few simple feasibility tests to see if a project I've been working on would transfer over well (easily?!?!) to CUDA, and it's turned into a week-long-plus nightmare of late nights and no results. I've been reading everything I could find on shared memory, and I've re-read the reduction example paying close attention to the shared memory usage. I'm pretty sure it's an addressing problem I'm having here. I've been trying to keep everything REALLY simple: the two arrays being multiplied and summed are both of equal length (4, again to keep it simple); one array is full of 1.0 values and the other is random values < 1.0. You would think (I would, at least) that 4 elements per array and 4 threads would make things easy, but it hasn't. The kernel is still spitting out incorrect results. I was getting really outrageous garbage results, letters and really long numbers; I finally got it to produce results that aren't garbage, but they are still off by quite a bit. Like I said, I must have a really simple, stupid mistake somewhere that's messing up my addressing. Please, please, please can someone throw me a lifeline and show me my mistake(s)? I'm about to throw in the towel here.

I haven't thought it through completely, but I don't think your reduction is doing what you want it to do. Values are going to get multiplied several times, I think.
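To sketch what I mean (a trace worked by hand, assuming blockDim.x = 4 so s takes the values 2 and then 1):

// s = 2, tid = 0:  sdata1[0] = sdata1[0]*sdata2[0] + sdata1[2]*sdata2[2]
//                  -> sdata1[0] now holds a sum of products, not a raw data value.
// s = 1, tid = 0:  sdata1[0] = sdata1[0]*sdata2[0] + sdata1[1]*sdata2[1]
//                  -> that sum of products is multiplied by sdata2[0] a second time,
//                     and sdata1[1] (itself a sum of products from the s = 2 step)
//                     picks up an extra factor of sdata2[1], so the final value is not
//                     the dot product sum(data1[i]*data2[i]) you are checking against.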

Why not just make life simple and do the multiplication independently of the reduction:

__global__ void
sum(float* g_data1, float* g_data2, float* g_result, int N)
{
	// load shared mem
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;

	extern __shared__ float sData[];
	float* sdata1 = sData;

	sdata1[tid] = g_data1[i] * g_data2[i];
	__syncthreads();

	// do reduction in shared mem
	for (unsigned int s = blockDim.x/2; s > 0; s /= 2)
	{
		if (tid < s) {
			sdata1[tid] = sdata1[tid] + sdata1[tid + s];
		}
		__syncthreads();
	}

	// write result for this block to global mem
	if (tid == 0){
		// __syncthreads(); <--- This is causing your kernel to crash! remove it!
		g_result[blockIdx.x] = sdata1[0];
	}
}

Oh, and are you checking for errors after the kernel launch? (Call cudaThreadSynchronize() and then cudaGetLastError(), or use the CUT_CHECK_ERROR macro in a debug build.) You had a __syncthreads() inside an if, which would cause a deadlock. Thus the g_result output was likely never written in your tests, and you were reading garbage memory.
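Something along these lines right after the launch (just a sketch; CUT_CHECK_ERROR does roughly the same thing in debug builds):

sum<<< dimGrid, dimBlock, sharedMemSize >>>(d_data1, d_data2, d_result, N);

// Block until the kernel has finished, then see whether the launch or execution failed.
cudaThreadSynchronize();
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
	printf("CUDA error after kernel launch: %s\n", cudaGetErrorString(err));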

That got it to work, thank you MisterAnderson!!