Mapped Memory, CPU->GPU Example

I’ve tried everything I can to get communication into the kernel from the CPU using mapped/pinned memory (zero copy).

Incrementing a mapped integer from the GPU shows up on the CPU just fine. However if I increment the mapped integer on the CPU and watch for it to breach a number on the GPU before returning, I get an infinite loop in the kernel.

For example, the GPU hangs here:

__global__ void kernel(int* d_number) {

    while(*d_number < 10) { } // also tried many variations in case the loop was discarded

    return;
}

int main() {

    int *h_number;
    int *d_number;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaMallocHost(&h_number, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer(&d_number, h_number, 0);
    
    *h_number = 0;

    dim3 blocks(1,1);
    dim3 threads(1,1);
    kernel<<<blocks, threads>>>>(d_number);

    sleep(1); // allow the kernel to get started
    *h_number = 100; // the should trigger the kernel to break from while loop

    cudaThreadSynchronize(); // here it blocks for return of kernel, which never comes

    return 0;
}

Could someone provide me with a working example of this? Or have I misinterpreted something, and this isn’t possible?

Thanks in advance for your time and any help

In general, making a kernel spin-wait for anything is a Bad Idea. It is inefficient, fragile, and error prone. It’s terrible programming practice. You are asking for pain. You have been warned.
The right solution is to use events and/or streams to guarantee ordering and correctness.

But that said, the main issue in your example is that the host memory read over PCIe is likely being cached ondie by the device and therefore the changed memory on the host isn’t seen.

The solution is to tell CUDA that the host memory shouldn’t be cached, but set as write combined. Add the cudaHostAllocWriteCombined flag in your cudaMallocHost to set this mode for the allocated memory.

Of course I haven’t tried this in your example. Because, as I will repeat, spinwaits on memory changes are Bad, even for CPU only code, but even more so in CUDA, and triply so for host-device shared memory. Don’t do that.

Thanks for the response. This isn’t production code or anything, I’m just hacking around to learn, and answers like yours are helpful to that cause.

I added the write combined flag with no luck. I did a little debugging and the device is seeing the value passed in before the kernel launch and maintains that as the value even as the host is changing it, through each iteration of the while loop. *d_number is read on the host and changing *h_number does indeed map itself to *d_number as is checked from CPU code, but it isn’t changing *d_number in the kernel. It’s as if it isn’t truly mapped…

Hopefully since it’s a small program someone with more experience than me can get a demo of this working.

OK, again without trying it, I’ll give yet more dangerous advice. The “volatile” keyword is probably needed to inform the compiler that the memory location may be changed behind its back. That will likely defeat any code-level assumptions the compiler is making in the compare loop itself. You can look at the generated SASS code via cudaobjdump to see what’s actually being executed.

I’m not sure if you can define the argument as volatile, but you can cast it in the actual indirection call.

(I should really try this myself before giving advice, but I’m not on any of my CUDA development boxes now.)

Thanks! That worked. I was able to mark the argument as volatile, and now everything works the way I expected (albeit dangerous). I appreciate the help.