Memory incoherence between memory spaces: portable mapped memory is not synchronized

Hello there!

I have an application that at some point needs to evaluate an unsigned 32-bit integer array in two different ways:

A) The outputs of __popc(array[/some_index_within_array_range/]) are summed

B) The values of array[/some_index_within_array_range/] are OR'd, after which the OR'd results go through __popc() and are summed
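For clarity, here is a rough host-side sketch of what the two evaluations compute, ignoring the per-block grouping. The array name, its length n, and the use of __builtin_popcount() as a CPU stand-in for __popc() are my own placeholders; the 16-bit mask matches what the kernel snippets below do:

unsigned int sumA   = 0;   // result of A)
unsigned int orMask = 0;   // intermediate for B)
for (unsigned int i = 0; i < n; ++i)
{
    sumA   += __builtin_popcount(0x0000FFFF & array[i]);   // A) sum of per-element popcounts
    orMask |= (0x0000FFFF & array[i]);                      // B) OR all the (masked) elements
}
unsigned int resultB = __builtin_popcount(orMask);          // B) popcount of the combined mask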

To do so I slightly modified the reduction kernel from the SDK, namely reduce6. The more straightforward changes are:

From the original to A) and to B):

mySum += g_idata[i+blockSize];                 // original
mySum += __popc((0x0000FFFF & g_idata[i]));    // A) (and also I'm only interested in the LS 16 bits)
mySum |= (0x0000FFFF & g_idata[i]);            // B) (again I'm only interested in the LS 16 bits)
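For context, here is roughly where the A) change lands inside reduce6's grid-stride load loop (names follow the SDK sample; this is only a sketch, not the full kernel):

unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x * blockDim.x * 2 + threadIdx.x;
unsigned int gridSize = blockDim.x * 2 * gridDim.x;
unsigned int mySum = 0;
while (i < n)
{
    mySum += __popc(0x0000FFFF & g_idata[i]);                   // A)
    if (i + blockDim.x < n)
        mySum += __popc(0x0000FFFF & g_idata[i + blockDim.x]);  // A), second element per iteration
    // for B) the two additions above become: mySum |= (0x0000FFFF & g_idata[...]);
    i += gridSize;
}
sdata[tid] = mySum;   // the shared-memory reduction of the per-thread values follows
__syncthreads();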

Then, in the final write-out after the warp-synchronized portion of the code, it goes from the original to A) and B):

if (tid == 0)
    g_odata[blockIdx.x] = sdata[0];         // original

if (tid == 0)
    atomicAdd(g_odata, sdata[0]);           // A)

if (tid == 0)
    atomicMax(g_odata, __popc(sdata[0]));   // B)

Furthermore, I perform the last stage of the reduction on the GPU.

g_odata points to a single unsigned int, allocated on the host side as

cudaHostAlloc((void**)&h_reduction,sizeof(unsigned int),cudaHostAllocMapped | cudaHostAllocPortable);

and on the device side as

cudaHostGetDevicePointer((void**)&m_reduction,(void*)h_reduction,0);

(In the GPU context preamble I call cudaSetDeviceFlags(cudaDeviceMapHost), and this is on a Tesla C2050, so mapped memory is definitely supported.)
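For completeness, the whole setup and call sequence looks roughly like this; reduceA, d_idata, numBlocks, numThreads and smemSize are placeholder names, and error checking is omitted:

unsigned int *h_reduction = NULL;   // host-side pointer to the mapped allocation
unsigned int *m_reduction = NULL;   // device-side alias of the same allocation

cudaSetDeviceFlags(cudaDeviceMapHost);              // must come before the context is created
cudaHostAlloc((void**)&h_reduction, sizeof(unsigned int),
              cudaHostAllocMapped | cudaHostAllocPortable);
cudaHostGetDevicePointer((void**)&m_reduction, (void*)h_reduction, 0);

*h_reduction = 0;                                   // reset the accumulator before each launch
reduceA<<<numBlocks, numThreads, smemSize>>>(d_idata, m_reduction, n);

Since every block atomicAdd()s into the same single word, the reset before each launch matters; otherwise successive launches keep accumulating into the old value.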

But I'm seeing an odd thing: if I printf the m_reduction value from the device it prints a non-zero value, but when I printf h_reduction from the host side it prints 0. Since my kernels are called synchronously, what can be the cause of this?

I can include the source code if required.

Thank you.

This is weird. Something similar has happened to me before, and a reboot solved the memory incoherence; the same test now comes out clean!

Do you check error codes to see if the kernel has actually run?

I use a checkCUDAError function provided in a Dr. Dobb's tutorial:

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if (cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

and all CUDA functions return cudaSuccess.
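One thing worth noting is that cudaGetLastError() right after a launch only reports launch errors; to also catch errors that occur during kernel execution, the launch has to be followed by a synchronization, for example (placeholder names again):

reduceA<<<numBlocks, numThreads, smemSize>>>(d_idata, m_reduction, n);
checkCUDAError("reduceA launch");
cudaDeviceSynchronize();              // wait for the kernel to actually finish
checkCUDAError("reduceA execution");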

I'm using CUDA 4.0, but my code was developed with CUDA 3.2. I've seen the webinar on the new CUDA features and read about some of the new Unified Virtual Addressing functions, but I haven't made any changes to my code.

There had to be some kind of memory fence instruction involved (I guess the requirement is a memory fence). Basically, if there wasn't a cudaDeviceSynchronize() after the reduction kernels, the CPU pointers only printed their correct values with an offset of one reduction kernel launch.
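In other words, the pattern that fixed it was (placeholder names):

reduceA<<<numBlocks, numThreads, smemSize>>>(d_idata, m_reduction, n);
cudaDeviceSynchronize();                    // wait for the kernel; the mapped-memory writes are now visible
printf("reduction = %u\n", *h_reduction);   // reads the up-to-date value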

All is well now. However, since I'm calling the kernels synchronously, shouldn't the CPU only print its host pointers after the kernel has finished executing? It seems like control is given back to the CPU before the mapped memory space is written, i.e. a read-after-write hazard.