Hello there!
I have an application that at some point needs to evaluate an unsigned 32bit integer array in two different ways:
A) - The output of __popc(array[/some_index_within_array_range/]) are summed
B) - The output of array[/some_index_within_array_range/] are OR’d, after which they go through __popc() and are summed
To do so I changed a little bit the reduction algorithm in the SDK, namely the reduction6. The more straightforward changes are:
From original to A) and to B)
mySum += g_idata[i+blockSize]; //original
mySum += __popc((0x0000FFFF & g_idata[i])); //A) (and also I'm only interested in the LS 16 bits)
mySum|=(0x0000FFFF & g_idata[i]); //B) (again I'm only interested in the LS 16 bits)
Then on the warp synchronized portion of the code it reads from original to A) and B)
if (tid == 0)
g_odata[blockIdx.x] = sdata[0]; //original
if (tid == 0)
atomicAdd(g_odata,sdata[0]); //A)
if (tid == 0)
atomicMax(g_odata, __popc(sdata[0])); //B)
Furthermore, I perform the last stage of the reduction in the GPU.
g_odata is a unsigned int defined on the host side as
cudaHostAlloc((void**)&h_reduction,sizeof(unsigned int),cudaHostAllocMapped | cudaHostAllocPortable);
and on the device side as
cudaHostGetDevicePointer((void**)&m_reduction,(void*)h_reduction,0);
(In the GPU context preamble I’ve declared cudaSetDeviceFlags(cudaDeviceMapHost) on a Tesla C2050, so it’s definitely supported).
But I’m getting and odd thing. If I printf from the device the m_reduction variable it prints a non-zero value, when I printf from host side it prints 0. Since my kernels are called synchronously, what can be the cause for this?
I can include the source code if required.
Thank you.