I’m developing a program where the host sends data to the device after a kernel invocation. The kernel runs some code and then loops, waiting for data from the host through the mapped memory. After the kernel reads the data, the host loops waiting for the kernel to finish processing it, and then reads the processed data back from the mapped memory.
When I run the code I get an infinite loop. I know that when using mapped memory, contexts, events, or streams should be used to synchronize access to the data. But context- and stream-based synchronization requires the kernel to finish running, so I may need to use events instead.
How can I successfully run this code? And how can I use events to synchronize using mapped memory while kernel is still running?
Sounds like zero-copy memory would be useful. You can have the GPU write directly into host memory across the PCI-e bus. The host can poll a status word until the kernel signals it is finished. It is, however, only supported on GT200s and the Ion/9300M/9400M/MCP79a family.
Reading/writing zero-copy memory from the host while a kernel is being executed that accesses the same region is undefined; the PCIe controller may arbitrarily reorder transactions, so there’s nothing you can do to guarantee that behavior.
Well, there goes that idea, then. I must admit I have only used zero-copy for pushing reduction output back from the device while the host thread was sitting behind a synchronization barrier. So zero-copy isn’t a faux interprocess communication mechanism in the making…
Every so often I try some new idea to make this work, and every so often it fails miserably again. So no, currently it’s not capable of doing anything really spooky.
Isn’t there any other way to transfer data back and forth between host and device (such as sending queries and getting the results back) while the kernel is still running?
This issue is really dependent on the work I’m doing so any help is appreciated.
Then what guarantees do you have when using more controlled access, such as atomics? Specifically, what is guaranteed when zero-copy memory is accessed via atomicAdd? If 10,000 threads in different blocks all atomicAdd to the same zero-copy location, is the final result (after the kernel has completed) guaranteed to be correct? Atomic access should be tolerant of reordering.
I ask because I have used zero-copy memory with atomics and they seem to work.
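To make the scenario concrete, here is a minimal sketch of what I mean; the mapped-memory setup via cudaHostAlloc/cudaHostGetDevicePointer is assumed, and the launch shape is just an example:

```cuda
// Every thread atomically increments a counter that lives in zero-copy
// (pinned, mapped) host memory; 'counter' is the device-side alias of it.
__global__ void accumulate(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Host side (sketch):
//   accumulate<<<100, 100>>>(devCounterPtr);   // 10000 increments in total
//   cudaThreadSynchronize();
//   // question: is *hostCounterPtr now guaranteed to be exactly 10000?
```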
So long as there is a single writer, it’s guaranteed to be coherent. It’s once you start reading/writing from both CPU and GPU at the same time that everything breaks.
What about some sort of packet protocol, which would compute a checksum or a hash of some sort to tell when the entire packet has arrived across the bus? It wouldn’t be very efficient, but would it work, at least in principle?
struct Packet {
    long long hashcode;
    int id;
    Data data;
};

// incomingPacket lives in mapped memory and should be volatile-qualified
// so the polling loops actually re-read it
while (incomingPacket.id == lastPacket.id) {
    // no packet has been sent yet
}
while (incomingPacket.hashcode != computeHashCode(&incomingPacket)) {
    // the packet is still not fully transferred
}
The computeHashCode would obviously have to include the packet id, so for two different packets, neither the hashcode nor the id should ever match.
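In principle, yes. Here is a minimal sketch of such a hash in plain C; the Data payload type and the FNV-1a constants are just illustrative assumptions, the only requirement being that the id participates in the hash:

```c
#include <stddef.h>

struct Data { int payload[4]; };      /* placeholder payload type */

struct Packet {
    long long hashcode;
    int       id;
    struct Data data;
};

/* Hypothetical hash: FNV-1a over id and data together, so neither a stale
   id nor stale data alone can produce a matching hashcode by accident.
   (Assumes id and data are contiguous in memory; padding would need care.) */
long long computeHashCode(const struct Packet *p)
{
    unsigned long long h = 1469598103934665603ULL;    /* FNV offset basis */
    const unsigned char *bytes = (const unsigned char *)&p->id;
    size_t n = sizeof(p->id) + sizeof(p->data);
    for (size_t i = 0; i < n; ++i) {
        h ^= bytes[i];
        h *= 1099511628211ULL;                        /* FNV prime */
    }
    return (long long)h;
}
```

The sender would fill in id and data first and store the hash last, so a partially transferred packet is overwhelmingly unlikely to pass the check.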
It would be nice if there were a way to flush the PCIe bus.
Ok, I modified the code so that the host writes to the mapped memory just once, and the kernel busy-waits until the data has been written. Then the kernel processes the data and returns the result when it exits.
I’m still getting wrong results, even when I just return the same data the kernel read from mapped memory. I don’t know if synchronization is required, but all the synchronization methods for mapped memory that I’ve read about assume the data is written by the kernel, not the host. So how can I synchronize it the other way around? I’m essentially already synchronizing by busy waiting: the threads loop until the data in mapped memory changes, and then the kernel starts processing.
There is a much easier way to do this, similar to how you synchronise across all blocks: basically, have your kernel exit when it’s done rather than loop waiting for some event, then have the event call the kernel again. Since the overhead for kernel calls is very low, this is a good way to go.
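A sketch of that relaunch pattern; names like processQuery, getNextQuery, and d_state are illustrative, and persistent state lives in global memory so it survives across launches:

```cuda
// Host-side loop: the kernel returns instead of spinning; the host feeds
// it the next query and launches again. The synchronous cudaMemcpy after
// the launch doubles as the "kernel finished" event.
while (getNextQuery(&query)) {                       // hypothetical query source
    cudaMemcpy(d_query, &query, sizeof(query), cudaMemcpyHostToDevice);
    processQuery<<<gridDim, blockDim>>>(d_state, d_query, d_result);
    cudaMemcpy(&result, d_result, sizeof(result), cudaMemcpyDeviceToHost);
    handleResult(&result);                           // hypothetical consumer
}
```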
The reason I’m doing it like this is that I’m constructing a structure in the device’s on-chip memory, so I want to avoid multiple kernel calls; otherwise I would have to move the structure to global memory to preserve it between launches. That’s why I want the host to send query data to the device through zero-copy memory, with the kernel waiting for the data, processing it, and returning the result.
But it seems zero-copy memory is better suited to the device sending data to the host than the other way around. That’s what I’m getting from reading about synchronizing data transfers through zero-copy memory and from testing it.
Can someone comment on whether this is still the case with Fermi? (i.e., that it is impossible for the CPU to transfer data to the GPU through mapped memory while a kernel is running)
1.) kernel writes its results to the mapped host memory
2.) __threadfence_system(); // block until all changes are visible to the host, Fermi only
3.) set hostThreadOkToRead bool flag in mapped host mem to true
4.) the host thread, polling the hostThreadOkToRead flag in a while loop, will eventually detect that the flag changed to true, and can then process the data in host mem
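The sequence above might look like this in a kernel. This is a sketch only, for a single block; __threadfence_system() requires Fermi, and multi-block kernels would need a different completion check:

```cuda
__global__ void producer(volatile float *mappedData,
                         volatile int   *hostThreadOkToRead)
{
    // each thread writes its result to mapped host memory
    mappedData[threadIdx.x] = threadIdx.x * 2.0f;

    __syncthreads();            // make sure every thread has written
    if (threadIdx.x == 0) {
        // block until all changes are visible to the host (Fermi only)
        __threadfence_system();
        // publish the flag
        *hostThreadOkToRead = 1;
    }
}

// host thread:
//    while (*hostThreadOkToRead == 0) { /* poll */ }
//    ... process mappedData ...
```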
The scenario you describe has the GPU writing to the mapped memory, and ensures consistency between the order of the writes as observed by the GPU and by the host, even while the kernel is running.
But even in theory it is not clear what happens if the host writes to the mapped memory during kernel execution and the GPU attempts to read it. It is even less clear what happens if the host writes to GPU global memory during kernel execution and the GPU attempts to read that. And finally, the combination of the two: the host first writes to GPU global memory, then writes a flag to the write-shared (mapped) region. In what order will the GPU see these two writes, if at all, assuming the kernel is running continuously throughout?
There is an equivalent of “__threadfence_system()” for the CPU which might help: the “sfence” SSE instruction. (Blocks until all pending writes have been completed.)
I’m not sure whether you can build a bidirectional CPU<->GPU protocol with those instructions… it would not be very efficient, I guess.
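A minimal host-side sketch; the function name is mine, and note that on x86 sfence mainly matters for write-combining/non-temporal stores, since ordinary cached stores are already ordered:

```c
#include <xmmintrin.h>   /* _mm_sfence, the SSE sfence intrinsic */

/* Publish data to the device through mapped memory: store the payload,
   fence so no pending store can be reordered past the flag, then raise
   the flag; the mirror image of __threadfence_system() on the GPU. */
void publish(volatile float *mappedData, volatile int *flag, float value)
{
    *mappedData = value;   /* 1) the payload                    */
    _mm_sfence();          /* 2) drain/order all pending stores */
    *flag = 1;             /* 3) only now signal the kernel     */
}
```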