Problem with getting data from blocks

Hello

While writing a program I ran into a problem. Here is simplified code that reproduces it:

/*
 * Host code.
 */

// includes, system
#include <stdlib.h>
#include <stdio.h>

// includes, project
#include <cutil.h>

// includes, kernels
#include <test_kernel.cu>

#define BLOCKS 256

int main(int argc, char** argv)
{
    CUT_DEVICE_INIT();

    int noMemSizeR = sizeof(int) * BLOCKS;

    // allocate host's arrays
    int* hostResult = (int*) malloc(noMemSizeR);

    // allocate device's arrays
    int* dResultData;
    CUDA_SAFE_CALL( cudaMalloc((void**) &dResultData, noMemSizeR) );

    dim3 dimGrid(BLOCKS, 1, 1);
    dim3 dimBlock(1, 1, 1);

    findMaxKernel<<<dimGrid, dimBlock>>>(dResultData);
    CUDA_SAFE_CALL( cudaThreadSynchronize() );

    CUDA_SAFE_CALL( cudaMemcpy( hostResult, dResultData, BLOCKS, cudaMemcpyDeviceToHost) );

    for(int t=0; t<BLOCKS; t++)
        printf("\tBlock: %d result: %d\n",t,hostResult[t]);
    printf("\n");

    CUT_EXIT(argc, argv);
}

/*
 * Device code.
 */

__global__ void findMaxKernel(int *g_odata) {
    g_odata[blockIdx.x] = blockIdx.x;
}

What I am trying to do is:

  • do some calculations on a big array (the array is split across multiple blocks for synchronization purposes)

  • each block returns 1 integer result

  • when all blocks have executed, repeat the calculation (by running the same kernel again) on the returned results

  • repeat until only 1 element is left
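The steps above can be sketched on the host in plain C, without any CUDA calls. This is only an illustration of the multi-pass idea, not the poster's actual kernel: the per-block operation is assumed to be a maximum (suggested by the name findMaxKernel), and the names reduce_pass, reduce_to_one and the chunk parameter are hypothetical. Each chunk plays the role of one block.

```c
#include <stddef.h>

/* One pass: split `in` (n elements) into chunks of `chunk` elements and
 * write the maximum of each chunk to `out`. Returns the number of results.
 * On the GPU each chunk would correspond to one block; here it is a plain
 * host loop. Assumes n >= 1 and chunk >= 1. */
static size_t reduce_pass(const int *in, size_t n, size_t chunk, int *out)
{
    size_t results = 0;
    for (size_t start = 0; start < n; start += chunk) {
        size_t end = (start + chunk < n) ? start + chunk : n;
        int best = in[start];
        for (size_t i = start + 1; i < end; i++) {
            if (in[i] > best)
                best = in[i];
        }
        /* results <= start, so writing here never clobbers unread input,
         * which makes it safe to call this with out == in. */
        out[results++] = best;
    }
    return results;
}

/* Repeat passes in place until a single element remains.
 * Assumes n >= 1 and chunk >= 2 (with chunk == 1 the array never shrinks). */
static int reduce_to_one(int *data, size_t n, size_t chunk)
{
    while (n > 1)
        n = reduce_pass(data, n, chunk, data);
    return data[0];
}
```

For example, calling reduce_to_one on a 9-element array with chunk = 4 runs two passes: the first leaves 3 partial results, the second leaves 1. In the CUDA version, each pass would be one kernel launch whose grid size equals the number of chunks.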

In the code above, the problem is that I only get results back for the first 64 elements; all the others are 0. Can anyone explain what I am missing here?

Any help would be appreciated, thanks.

You are reading back only 256 bytes = 64 elements. The third argument of cudaMemcpy is a size in bytes, not a number of elements:
cudaMemcpy( hostResult, dResultData, BLOCKS, cudaMemcpyDeviceToHost)

It should be:
cudaMemcpy( hostResult, dResultData, noMemSizeR, cudaMemcpyDeviceToHost)
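The same effect can be reproduced on the host with memcpy, which also takes its size in bytes. The sketch below is only an analogy: memcpy stands in for cudaMemcpy, and the helper name elements_copied is hypothetical. It counts how many ints actually arrive in the destination for a given byte count.

```c
#include <stddef.h>
#include <string.h>

#define BLOCKS 256

/* Copies `bytes` bytes from a source array of BLOCKS ints (src[i] = i + 1,
 * so every element is non-zero) into a zeroed destination, then returns
 * how many destination elements ended up non-zero.
 * Assumes bytes <= sizeof(int) * BLOCKS. */
static size_t elements_copied(size_t bytes)
{
    int src[BLOCKS];
    int dst[BLOCKS] = {0};

    for (int i = 0; i < BLOCKS; i++)
        src[i] = i + 1;

    /* Like cudaMemcpy, memcpy counts bytes, not elements. */
    memcpy(dst, src, bytes);

    size_t copied = 0;
    for (int i = 0; i < BLOCKS; i++) {
        if (dst[i] != 0)
            copied++;
    }
    return copied;
}
```

Passing BLOCKS as the size fills only BLOCKS / sizeof(int) elements (64 where int is 4 bytes), which matches the symptom in the question, while passing sizeof(int) * BLOCKS fills all 256.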

Wow… what a stupid mistake I made.

Well, it is working great now. :)

Thanks for such a fast response.

My mistake was even worse. I called my kernel like this:

testkernel<<<memory_size, num_threads>>> with memory_size = num_blocks * sizeof(float)
instead of testkernel<<<num_blocks, num_threads>>>

The strange thing is that everything worked OK until num_blocks was bigger than 1023…