SDK Parallel Reduction Help, different result between EmuDebug and Release...

Dear all,

I’m studying the SDK’s parallel reduction sample. I’m rewriting the host code myself but leaving the kernel as it is. However, I ran into these problems:

  1. EmuDebug gives this error:

[codebox]HEAP[CUDAReduce.exe]: Heap block at 00201110 modified at 0020112B past requested size of 13

Windows has triggered a breakpoint in CUDAReduce.exe.

This may be due to a corruption of the heap, and indicates a bug in CUDAReduce.exe or any of the DLLs it has loaded.

The output window may have more diagnostic information.[/codebox]


The error above occurs just before the kernel ends.

  2. Debug and Release don’t emit any error, but the reduction results from the CPU code and the GPU code differ.

Has anybody encountered this kind of error before? What should I do?

Any help will be appreciated.

Thank you very much…


Post your host code. I would guess it’s a pointer error.

All right. RunReduce is the application entry point…

[codebox]// kernel invocation
void RunReduce(int argc, char** argv)
{
	unsigned int size = 1<<6; // number of elements to be reduced

	// create random input data on CPU
	size_t bytes = size * sizeof(int);
	int* toReduce = (int*)malloc(bytes);

	// fill the numbers
	for (unsigned int i = 0; i < size; i++)
	{
		toReduce[i] = 1;
	}

	int gpuResult2 = ReduceGPU(toReduce, size, 2, false);

	// cleanup
	free(toReduce);
}[/codebox]



… ReduceGPU is the host function…

[codebox]int ReduceGPU( int* inputData,
               size_t inputCount,
               int funcVersion,
               bool multiGpu)
{
	int gpuResult = 0;
	size_t bytes = inputCount * sizeof(int);

	// allocate input and output data in global memory
	int* iDataD = NULL;
	CUDA_SAFE_CALL(cudaMalloc((void**)&iDataD, bytes));

	// copy data directly to device's global memory
	CUDA_SAFE_CALL(cudaMemcpy(iDataD, inputData, bytes, cudaMemcpyHostToDevice));

	dim3 dimBlock;
	dim3 dimGrid;

	if (funcVersion == 1)
	{
		// this block won't be invoked
		int* oDataD = NULL;
		CUDA_SAFE_CALL(cudaMalloc((void**)&oDataD, 1));
		dimBlock.x = 1;
		dimGrid.x = 1;
		reduceKernelV1<<<dimGrid, dimBlock>>>(iDataD, oDataD, inputCount);
		CUDA_SAFE_CALL(cudaMemcpy(&gpuResult, oDataD, sizeof(int), cudaMemcpyDeviceToHost));
		CUDA_SAFE_CALL(cudaFree(oDataD));
	}
	else
	{
		if (multiGpu)
		{
			// not implemented
		}
		else
		{
			// this block will be invoked
			int* oDataD = NULL;
			CUDA_SAFE_CALL(cudaMalloc((void**)&oDataD, bytes));

			// check device properties
			int deviceCount = 0;

			// assume that there is at least 1 GPU
			cudaDeviceProp deviceProp;
			CUDA_SAFE_CALL(cudaGetDeviceProperties(&deviceProp, 0));

			// in case inputCount < maxThreadsPerBlock
			int maxThreadsPerBlock = min(deviceProp.maxThreadsPerBlock, (int)inputCount);

			dimBlock.x = maxThreadsPerBlock;
			dimGrid.x = inputCount / maxThreadsPerBlock;
			reduceKernelV2<<<dimGrid, dimBlock>>>(iDataD, oDataD);

#ifdef _DEBUG
			// if debugging, check the content of oData
			int* oData = (int*)malloc(bytes / dimGrid.x);
			CUDA_SAFE_CALL(cudaMemcpy(oData, oDataD, bytes / dimGrid.x, cudaMemcpyDeviceToHost));
			free(oData);
#endif

			// total number of partial sums (one per block)
			int j = dimGrid.x;
			while (j > 1)
			{
				dimGrid.x = j / maxThreadsPerBlock;
				reduceKernelV2<<<dimGrid, dimBlock>>>(oDataD, oDataD);
				j /= maxThreadsPerBlock;
			}

			// copy last element
			CUDA_SAFE_CALL(cudaMemcpy(&gpuResult, oDataD, sizeof(int), cudaMemcpyDeviceToHost));

			// free data
			CUDA_SAFE_CALL(cudaFree(oDataD));
		}
	}

	CUDA_SAFE_CALL(cudaFree(iDataD));
	return gpuResult;
}[/codebox]


And here is the kernel code, as in the SDK Samples…


[codebox]// Version 2: using n threads
__global__ void reduceKernelV2(int* iData, int* oData)
{
	extern __shared__ int sdata[];

	unsigned int tid = threadIdx.x;
	unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
	sdata[tid] = iData[i];
	__syncthreads();

	// do reduction in shared mem
	for (unsigned int s = 1; s < blockDim.x; s *= 2)
	{
		if (tid % (2*s) == 0)
		{
			sdata[tid] += sdata[tid + s];
			printf("adding %d and %d, total = %d\n", tid, tid + s, sdata[tid]);
		}
		__syncthreads();
	}

	// write (intermediate) result for this block to global memory
	if (tid == 0)
	{
		oData[blockIdx.x] = sdata[0];
		printf("Block %d finished!\n", blockIdx.x);
	}
}[/codebox]


A quick scan of your code reveals a couple of problems.

First, in the first if block of ReduceGPU (labeled "this block wont be invoked"), you allocate only one byte of memory for the output buffer and later attempt to copy sizeof(int) bytes (more than one byte on virtually every system) out of it. Given your comment, I assume you are not trying to debug that path yet.

In the second block of code, I notice that you are not passing an item count to reduceKernelV2, so I speculate that you are computing the item count in the kernel from the launch parameters. My guess is that you are doing something like "[font="Courier New"]inputCount = blockDim.x * gridDim.x[/font]" in the kernel. Note, however, that this number will be too large (resulting in access to invalid memory) if the actual [font="Courier New"]inputCount[/font] is less than [font="Courier New"]deviceProp.maxThreadsPerBlock[/font], and will be too small (resulting in an invalid calculation) if [font="Courier New"]inputCount[/font] is not an integral multiple of [font="Courier New"]maxThreadsPerBlock[/font].

Check your code closely for these types of errors (and please excuse me if I have made incorrect assumptions). It might be useful to step through your code in emulation mode and verify that all of your calculations produce the numbers you expect.

Finally, note that I believe the in-place reductions after the initial pass are an error in the example (though one that apparently is not causing problems with current drivers and hardware). See my post here for details.

Good luck!

Wow, I didn’t know my error was that bad :)

But now I have found the root of the error: the dynamic shared-memory size is never specified in the kernel launch, so sdata is never allocated. Thanks!
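For anyone who finds this thread later: because the kernel declares `extern __shared__ int sdata[]`, the launch has to pass the per-block shared-memory size as the third execution-configuration argument. A minimal sketch of the corrected launch, using the same variable names as my host code above:

```cuda
// Each block needs blockDim.x ints of dynamic shared memory for sdata[].
size_t smemBytes = dimBlock.x * sizeof(int);

// The third <<<>>> argument is the dynamic shared-memory bytes per block.
reduceKernelV2<<<dimGrid, dimBlock, smemBytes>>>(iDataD, oDataD);
```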