Hi,
I’m trying to launch a kernel multiple times and get a single result, as follows:
// cudaMalloc(d_result)
// fill d_result with '\0'
for (i = 0; i < N; i++)
    mykernel<<<512, 64>>>(myArray, i, d_result);
cudaThreadSynchronize();
cudaMemcpy(result, d_result, 8 * sizeof(char), cudaMemcpyDeviceToHost);
printf("%s\n", result);
In the above code, d_result is an array of 8 chars that is filled with '\0' before the kernel is launched. The kernel performs some computation based on myArray (constant) and i (counter). In this example, i goes from 0 to N-1 and only one value of i can modify the d_result array.
__global__ void mykernel(int myArray, int i, char *d_result)
{
    if (i == 7)
    {
        d_result[0] = 'o';
        d_result[1] = 'k';
    }
}
Because of the cudaThreadSynchronize() call, I expected that once all the kernels had returned, the launch with i=7 would have modified the d_result array, but in fact the printf displays nothing (\0\0\0…\0, the initial value of d_result).
Is this normal behavior for such code? My goal is to launch one kernel N times. Of those N kernel launches, only one can return a “result”, so how should I do this if my approach is not correct? :o)
If I’m right, d_result is in global memory (allocated with cudaMalloc), so every block in the grid and every thread in each block has access to the same memory space.
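To make the question easier to reproduce, here is a minimal self-contained sketch of the same pattern with an error check after every CUDA call, so that a failed launch or copy would be reported instead of silently leaving d_result untouched. The CHECK macro, the value N = 16, and the placeholder value for myArray are assumptions for illustration only, not my real code. It also uses cudaDeviceSynchronize(), which replaces the deprecated cudaThreadSynchronize() in newer toolkits.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): abort with a message if a call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err = (call);                                    \
        if (err != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err));                        \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

// Same kernel as above: only the launch with i == 7 writes to d_result.
__global__ void mykernel(int myArray, int i, char *d_result)
{
    if (i == 7)
    {
        d_result[0] = 'o';
        d_result[1] = 'k';
    }
}

int main()
{
    const int N = 16;        // assumed launch count; must be > 7 for i == 7 to occur
    const int myArray = 0;   // placeholder for the real constant input
    char result[8];
    char *d_result = NULL;

    // Allocate the 8-char result buffer and fill it with '\0'.
    CHECK(cudaMalloc((void **)&d_result, 8 * sizeof(char)));
    CHECK(cudaMemset(d_result, 0, 8 * sizeof(char)));

    for (int i = 0; i < N; i++)
    {
        mykernel<<<512, 64>>>(myArray, i, d_result);
        CHECK(cudaGetLastError());   // reports launch/configuration errors
    }

    // Wait for all launches; also surfaces errors raised during execution.
    CHECK(cudaDeviceSynchronize());

    CHECK(cudaMemcpy(result, d_result, 8 * sizeof(char), cudaMemcpyDeviceToHost));
    printf("%s\n", result);

    CHECK(cudaFree(d_result));
    return 0;
}
```

If any of the checks fires, the problem is in the launch or the copy rather than in the idea of writing to global memory from one of the N launches.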
Thanks a lot,
n0mad