cuda reduction kernel from example doesn't run

Hello,

I’ve copied a reduction kernel example from

http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf

I made a complete program out of this example, but it doesn’t work.

I want to compute the sum of a 256-element array of integers, just to see how this reduction works.

This is the kernel extracted from the example, with a small workaround:

#include <stdio.h>
#include <stdlib.h>

#include <cutil.h>

#define N 256

unsigned int gpuBytes;


__global__ void reduce0(int *g_idata, int *g_odata) {
	extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
	printf("in kernel: i = %d\n", i);
	sdata[tid] = g_idata[i];
	printf("sdata[%d] = g_idata[%d] --- %d = %d \n", tid, i, sdata[tid], g_idata[i]);
	__syncthreads();
// do reduction in shared mem

	for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
		if (tid < s) {
			printf("sdata[%d],  sdata[%d] = %d, %d \n", tid, tid + s, sdata[tid], sdata[tid + s]);
			//sdata[tid] += sdata[tid + s];
		}
		printf("s = %d\n", s);
		__syncthreads();
	}



// write result for this block to global mem
	if (tid == 0)
		g_odata[blockIdx.x] = sdata[0];
}


int main() {


	int j[N];
	int k[N];
	int sum = 0;
	for (int  i = 0; i < N; i++) {
		sum += i;
		j[i] = i;
	}
	printf("sum on host: %d\n", sum);



	int *gpu__j;
	int *gpu__k;

	gpuBytes = N * sizeof(int);
	CUDA_SAFE_CALL(cudaMalloc((void **)&gpu__j, gpuBytes));
	CUDA_SAFE_CALL(cudaMemcpy(gpu__j, j, gpuBytes, cudaMemcpyHostToDevice));

	CUDA_SAFE_CALL(cudaMalloc((void **)&gpu__k, gpuBytes));
	CUDA_SAFE_CALL(cudaMemcpy(gpu__k, k, gpuBytes, cudaMemcpyHostToDevice));

	reduce0<<<1, N, 0, 0>>>(gpu__j, gpu__k);

	gpuBytes = N * sizeof(int);

	CUDA_SAFE_CALL(cudaMemcpy(j, gpu__j, gpuBytes, cudaMemcpyDeviceToHost));
	CUDA_SAFE_CALL(cudaFree(gpu__j));
	CUDA_SAFE_CALL(cudaMemcpy(k, gpu__k, gpuBytes, cudaMemcpyDeviceToHost));
	CUDA_SAFE_CALL(cudaFree(gpu__k));

	printf("sum on device : %d\n", k[0]);

}

In the kernel I have outputs like this:

sdata[64], sdata[192] = 0, 0
sdata[65], sdata[193] = 0, 0
sdata[66], sdata[194] = 0, 0
sdata[67], sdata[195] = 0, 0


It seems that the extern shared variable doesn’t have the correct values; they are all 0. Running the program also produces an "unspecified launch failure" in the first cudaMemcpy after the kernel launch. Can someone try to run this? That would be great.

Thanks a lot
sw

It’s a bit hard to tell because of how the forum handles less-than and greater-than signs, but what are your kernel launch parameters? It appears that you’re launching with 0 bytes of shared memory (the 3rd parameter of your kernel launch configuration), which will of course cause trouble when you try to use dynamically sized shared memory in your kernel. That would be consistent with your "unspecified launch failure" error message.
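For what it’s worth, a minimal sketch of a corrected launch, assuming one block of N threads and one int of dynamic shared memory per thread:

```cuda
// The 3rd launch parameter is the dynamic shared-memory size in bytes;
// it backs the "extern __shared__ int sdata[]" declaration in reduce0.
reduce0<<<1, N, N * sizeof(int)>>>(gpu__j, gpu__k);
```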

Thanks, the 3rd parameter was always 0 in all the examples I’ve ever seen, so I assumed that was the standard value and didn’t know what it meant. That was the essential hint. Thanks again.

Glad it helped. As a follow-up, I should point out that the 3rd parameter is only needed if you’re using shared memory in your kernel without a statically defined size, e.g. as you have it:

extern __shared__ int sdata[];

If you define your shared memory statically, e.g.

__shared__ int sdata[NUM_ELEMENTS];

then you don’t need the 3rd kernel launch parameter. You can read more in the CUDA documentation.
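For comparison, here is a sketch of the same kernel with a statically sized shared array, assuming the block size is the compile-time constant N (as in the posted program); it needs no 3rd launch parameter:

```cuda
#define N 256

__global__ void reduce0_static(int *g_idata, int *g_odata) {
	__shared__ int sdata[N];   // size fixed at compile time
	unsigned int tid = threadIdx.x;
	sdata[tid] = g_idata[blockIdx.x * blockDim.x + tid];
	__syncthreads();
	// tree reduction in shared memory, same as in reduce0
	for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
		if (tid < s)
			sdata[tid] += sdata[tid + s];
		__syncthreads();
	}
	if (tid == 0)
		g_odata[blockIdx.x] = sdata[0];
}

// launched without a shared-memory size:
// reduce0_static<<<1, N>>>(gpu__j, gpu__k);
```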

Hello,

Is the example working without any changes? So far I haven’t managed to run this CUDA example without problems.

I’m not sure about the example in the link (I haven’t looked at it), but the code posted by stevewilson would need some changes, namely fixing the launch parameters and uncommenting the actual reduction step in the kernel. Also note that this reduction is only done within a single block, so if you use multiple blocks, you’ll have to further sum their per-block results to get the complete reduction. Your data and block size will also have to be a power of 2.
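As a sketch of that multi-block case (the block and thread counts here are illustrative, not from the original post, and they assume reduce0 with the reduction line uncommented): launch one block per chunk of the input, then sum the per-block partial results on the host:

```cuda
#define THREADS 256
#define BLOCKS  4
#define N       (BLOCKS * THREADS)   // data size: a power-of-2 multiple of the block size

// reduce0<<<BLOCKS, THREADS, THREADS * sizeof(int)>>>(gpu__j, gpu__k);
// gpu__k now holds one partial sum per block; finish the reduction on the host:
int partial[BLOCKS];
cudaMemcpy(partial, gpu__k, BLOCKS * sizeof(int), cudaMemcpyDeviceToHost);
int sum = 0;
for (int b = 0; b < BLOCKS; b++)
	sum += partial[b];
```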