I have an application that at some point needs to evaluate an unsigned 32-bit integer array in two different ways:
A) - The outputs of __popc(array[some_index_within_array_range]) are summed
B) - The values of array[some_index_within_array_range] are OR'd together, after which the results go through __popc() and are summed
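In plain serial code, what A) and B) amount to is roughly the following (a reference helper I'd write on the host, reading B) as "OR everything first, then take a single popcount"; __builtin_popcount is the host-side counterpart of __popc, and the 16-bit masking I mention below is left out here):

// Host-side reference for what the two reductions should compute.
void reference(const unsigned int *array, unsigned int n,
               unsigned int *outA, unsigned int *outB)
{
    unsigned int sumA = 0, orAll = 0;
    for (unsigned int k = 0; k < n; ++k) {
        sumA  += __builtin_popcount(array[k]);   // A) sum of per-element popcounts
        orAll |= array[k];                       // B) OR everything together first...
    }
    *outA = sumA;
    *outB = __builtin_popcount(orAll);           // ...then a single popcount
}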
To do so, I made a few small changes to the reduce6 kernel from the SDK's reduction sample. The more straightforward changes are, going from the original to A) and to B):
mySum += g_idata[i+blockSize];                 // original
mySum += __popc((0x0000FFFF & g_idata[i]));    // A) (and also I'm only interested in the LS 16 bits)
mySum |= (0x0000FFFF & g_idata[i]);            // B) (again, only the LS 16 bits)
Then, at the end of the warp-synchronized portion of the code, the final write-out changes from the original to A) and B):
if (tid == 0) g_odata[blockIdx.x] = sdata;        // original
if (tid == 0) atomicAdd(g_odata, sdata);          // A)
if (tid == 0) atomicMax(g_odata, __popc(sdata));  // B)
Furthermore, I perform the last stage of the reduction on the GPU (hence the atomics) instead of on the host.
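To make the overall structure clearer, here is a stripped-down sketch of what the kernel looks like for variant A after these changes (simplified: no templates, no warp-synchronous unrolling, the block size is assumed to be a power of two, and the kernel name is just for this post):

__global__ void reducePopcA(const unsigned int *g_idata, unsigned int *g_odata, unsigned int n)
{
    extern __shared__ unsigned int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    unsigned int gridSize = blockDim.x * 2 * gridDim.x;

    // Grid-stride accumulation; for B) the "+= __popc(...)" lines become "|= (0x0000FFFF & ...)".
    unsigned int mySum = 0;
    while (i < n) {
        mySum += __popc(0x0000FFFF & g_idata[i]);
        if (i + blockDim.x < n)
            mySum += __popc(0x0000FFFF & g_idata[i + blockDim.x]);
        i += gridSize;
    }
    sdata[tid] = mySum;
    __syncthreads();

    // Shared-memory tree reduction (for B) the "+=" here would likewise be "|=").
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // Last stage of the reduction done on the GPU: one atomic per block
    // instead of the original per-block store to g_odata[blockIdx.x].
    if (tid == 0)
        atomicAdd(g_odata, sdata[0]);    // A); for B): atomicMax(g_odata, __popc(sdata[0]))
}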
g_odata is an unsigned int, defined on the host side as
cudaHostAlloc((void**)&h_reduction, sizeof(unsigned int), cudaHostAllocMapped | cudaHostAllocPortable);
and on the device side as the corresponding mapped device pointer. (In the GPU context preamble I've called cudaSetDeviceFlags(cudaDeviceMapHost), and this is on a Tesla C2050, so mapped memory is definitely supported.)
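In simplified form (error checking, the input allocation, and the real launch configuration omitted; d_input, n, numBlocks and numThreads are placeholders), the host-side flow is roughly:

unsigned int *h_reduction = NULL;   // host-visible result
unsigned int *g_odata     = NULL;   // device alias of the same zero-copy memory

cudaSetDeviceFlags(cudaDeviceMapHost);                  // in the GPU context preamble
cudaHostAlloc((void**)&h_reduction, sizeof(unsigned int),
              cudaHostAllocMapped | cudaHostAllocPortable);
cudaHostGetDevicePointer((void**)&g_odata, h_reduction, 0);
*h_reduction = 0;                                       // reset before the launch

reducePopcA<<<numBlocks, numThreads, numThreads * sizeof(unsigned int)>>>(d_input, g_odata, n);
cudaThreadSynchronize();                                // I wait for the kernel before reading
printf("from the host: %u\n", *h_reduction);            // this is the value that prints as 0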
But I'm seeing an odd thing: if I printf the m_reduction variable from the device it prints a non-zero value, but when I printf it from the host side it prints 0. Since my kernels are called synchronously, what could be the cause of this?
I can include the source code if required.