Hello
While writing a program I ran into a problem. Here is simplified code that reproduces it:
/*
* Host code.
*/
// includes, system
#include <stdlib.h>
#include <stdio.h>
// includes, project
#include <cutil.h>
// includes, kernels
#include <test_kernel.cu>
#define BLOCKS 256
int main(int argc, char** argv)
{
    CUT_DEVICE_INIT();

    int noMemSizeR = sizeof(int) * BLOCKS;

    // allocate host's arrays
    int* hostResult = (int*) malloc(noMemSizeR);

    // allocate device's arrays
    int* dResultData;
    CUDA_SAFE_CALL( cudaMalloc((void**) &dResultData, noMemSizeR) );

    dim3 dimGrid(BLOCKS, 1, 1);
    dim3 dimBlock(1, 1, 1);
    findMaxKernel<<<dimGrid, dimBlock>>>(dResultData);
    CUDA_SAFE_CALL( cudaThreadSynchronize() );

    CUDA_SAFE_CALL( cudaMemcpy(hostResult, dResultData, BLOCKS, cudaMemcpyDeviceToHost) );

    for (int t = 0; t < BLOCKS; t++)
        printf("\tBlock: %d result: %d\n", t, hostResult[t]);
    printf("\n");

    CUT_EXIT(argc, argv);
}
/*
* Device code.
*/
__global__ void findMaxKernel(int *g_odata) {
    g_odata[blockIdx.x] = blockIdx.x;
}
What I am trying to do is:
- do some calculations on a big array (the array is split across multiple blocks, for synchronization purposes)
- each block returns 1 integer result
- when all blocks have finished, repeat the calculation (by running the same kernel again) on those returned results
- repeat until only 1 element is left
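To make the scheme above concrete, here is a minimal CPU-only sketch of it in plain C. It is not the CUDA code and does not touch the problem I am asking about. The names block_max and reduce_max_in_passes are made up for illustration, and I am assuming the per-block calculation is a maximum (going by the kernel name) and that each block handles a fixed-size chunk of the array. Each iteration of the inner loop stands in for one block, and each iteration of the outer loop stands in for one kernel launch.

```c
#include <stddef.h>

/* Hypothetical stand-in for the work of one block:
 * returns the maximum of data[begin..end). Assumes begin < end. */
static int block_max(const int *data, size_t begin, size_t end)
{
    int best = data[begin];
    for (size_t i = begin + 1; i < end; i++) {
        if (data[i] > best)
            best = data[i];
    }
    return best;
}

/* Reduces data in place, one pass per outer-loop iteration, until a single
 * element is left, and returns it.
 * Assumes n >= 1 and block_size >= 2 (with block_size == 1 the element
 * count would never shrink). */
static int reduce_max_in_passes(int *data, size_t n, size_t block_size)
{
    while (n > 1) {
        /* Number of "blocks" needed for this pass (rounded up). */
        size_t blocks = (n + block_size - 1) / block_size;

        for (size_t b = 0; b < blocks; b++) {
            size_t begin = b * block_size;
            size_t end = begin + block_size < n ? begin + block_size : n;
            /* Each block writes exactly one result. Slot b is never read
             * by a later block in the same pass, so in-place is safe. */
            data[b] = block_max(data, begin, end);
        }

        /* The next pass runs on the per-block results only. */
        n = blocks;
    }
    return data[0];
}
```

For example, with 10 input elements and a chunk size of 4, the first pass leaves 3 results and the second pass leaves 1. In the GPU version the inner loop would be the kernel launch, and the results would stay in device memory between passes.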
In the code above, the problem is that I only get results back for 64 elements; all the others are 0. Can anyone explain what I am missing here?
Any help would be appreciated, thanks.