I’m developing a program where the host sends data to the device after a kernel invocation. The kernel runs some code and then loops, waiting for data from the host through the mapped memory. After the kernel reads the data, the host loops waiting for the kernel to finish processing it, and then reads the processed data back from the mapped memory.
When I run the code I get an infinite loop. I know that when using mapped memory, contexts, events, or streams should be used to synchronize access to the data. But context- and stream-based synchronization requires the kernel to finish running, so I may need to use events instead.
How can I successfully run this code? And how can I use events to synchronize using mapped memory while kernel is still running?
Sounds like zero-copy memory would be useful. You can have the GPU write directly into host memory across the PCI-e bus. The host can poll a status word until the kernel signals it is finished. It is, however, only supported on GT200s and the Ion/9300M/9400M/MCP79a family.
Reading/writing zero-copy memory from the host while a kernel is being executed that accesses the same region is undefined; the PCIe controller may arbitrarily reorder transactions, so there’s nothing you can do to guarantee that behavior.
Well, there goes that idea, then. I must admit I have only used zero-copy for pushing reduction output back from the device while the host thread was sitting behind a synchronization barrier. So zero-copy isn’t a faux interprocess communication mechanism in the making…
Every so often I try some new idea to make this work, and every so often it fails miserably again. So no, currently it’s not capable of doing anything really spooky.
Isn’t there any other way to transfer data back and forth between host and device (such as sending queries and getting the results back) while the kernel is still running?
This issue is really dependent on the work I’m doing so any help is appreciated.
Then what guarantees do you have when using more controlled access, such as atomics? Specifically, what is guaranteed when zero-copy memory is accessed via atomicAdd? If 10,000 threads in different blocks all atomicAdd to the same zero-copy location, is the final result (after the kernel has completed) guaranteed to be correct? Atomic access should be tolerant of reordering.
I ask because I have used zero-copy memory with atomics and they seem to work.
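To make the scenario concrete, here is a minimal sketch of what I mean; the mapped-memory setup via cudaHostAlloc/cudaHostGetDevicePointer is assumed, and the launch shape is just an example:

```cuda
// Every thread atomically increments a counter that lives in zero-copy
// (pinned, mapped) host memory; 'counter' is the device-side alias of it.
__global__ void accumulate(unsigned int *counter)
{
    atomicAdd(counter, 1u);
}

// Host side (sketch):
//   accumulate<<<100, 100>>>(devCounterPtr);   // 10000 increments in total
//   cudaThreadSynchronize();
//   // question: is *hostCounterPtr now guaranteed to be exactly 10000?
```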
So long as there is a single writer, it’s guaranteed to be coherent. It’s once you start reading/writing from both CPU and GPU at the same time that everything breaks.
What about some sort of packet protocol, which would compute a checksum or a hash of some sort to tell when the entire packet has arrived across the bus? It wouldn’t be very efficient, but would it work, at least in principle?
struct Packet {
    long long hashcode;
    int id;
    Data data;
};

// incomingPacket lives in mapped memory and should be volatile-qualified
// so the polling loops actually re-read it
while (incomingPacket.id == lastPacket.id) {
    // no packet has been sent yet
}
while (incomingPacket.hashcode != computeHashCode(&incomingPacket)) {
    // the packet is still not fully transferred
}
The computeHashCode would obviously have to include the packet id, so for two different packets, neither the hashcode nor the id should ever match.
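In principle, yes. Here is a minimal sketch of such a hash in plain C; the Data payload type and the FNV-1a constants are just illustrative assumptions, the only requirement being that the id participates in the hash:

```c
#include <stddef.h>

struct Data { int payload[4]; };      /* placeholder payload type */

struct Packet {
    long long hashcode;
    int       id;
    struct Data data;
};

/* Hypothetical hash: FNV-1a over id and data together, so neither a stale
   id nor stale data alone can produce a matching hashcode by accident.
   (Assumes id and data are contiguous in memory; padding would need care.) */
long long computeHashCode(const struct Packet *p)
{
    unsigned long long h = 1469598103934665603ULL;    /* FNV offset basis */
    const unsigned char *bytes = (const unsigned char *)&p->id;
    size_t n = sizeof(p->id) + sizeof(p->data);
    for (size_t i = 0; i < n; ++i) {
        h ^= bytes[i];
        h *= 1099511628211ULL;                        /* FNV prime */
    }
    return (long long)h;
}
```

The sender would fill in id and data first and store the hash last, so a partially transferred packet is overwhelmingly unlikely to pass the check.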
It would be nice if there were a way to flush the PCIe bus.
Ok, I modified the code so that the host writes to the mapped memory just once, and the kernel busy-waits until the data has been written. Then the kernel processes the data and returns the result when it exits.
I’m still getting wrong results, even when I just return the same data the kernel read from mapped memory. I don’t know if synchronization is required, but all the synchronization methods for mapped memory that I’ve read about assume the data is written by the kernel, not the host. So how can I synchronize it the other way around? I’m essentially already synchronizing by busy waiting: the threads loop until the data in mapped memory changes, and then the kernel starts processing.
There is a much easier way to do this, similar to how you synchronise across all blocks: basically, have your kernel exit when it’s done rather than loop waiting for some event, then have the event call the kernel again. Since the overhead for kernel calls is very low, this is a good way to go.
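A sketch of that relaunch pattern; names like processQuery, getNextQuery, and d_state are illustrative, and persistent state lives in global memory so it survives across launches:

```cuda
// Host-side loop: the kernel returns instead of spinning; the host feeds
// it the next query and launches again. The synchronous cudaMemcpy after
// the launch doubles as the "kernel finished" event.
while (getNextQuery(&query)) {                       // hypothetical query source
    cudaMemcpy(d_query, &query, sizeof(query), cudaMemcpyHostToDevice);
    processQuery<<<gridDim, blockDim>>>(d_state, d_query, d_result);
    cudaMemcpy(&result, d_result, sizeof(result), cudaMemcpyDeviceToHost);
    handleResult(&result);                           // hypothetical consumer
}
```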
The reason I’m doing it like this is that I’m constructing a structure in the device’s on-chip memory, so I want to avoid multiple kernel calls; otherwise I would have to move the structure to global memory to preserve it between launches. That’s why I want the host to send query data to the device through zero-copy memory, with the kernel waiting for the data, processing it, and returning the result.
But it seems zero-copy memory is better suited to the device sending data to the host than the other way around. That’s what I’m getting from reading about synchronizing data transfers through zero-copy memory and from testing it.
Can someone comment on whether this is still the case with Fermi? (i.e., that it is impossible for the CPU to transfer data to the GPU through mapped memory while a kernel is running)
1.) kernel writes its results to the mapped host memory
2.) __threadfence_system(); // block until all changes are visible to the host, Fermi only
3.) set hostThreadOkToRead bool flag in mapped host mem to true
4.) the host thread, polling the hostThreadOkToRead flag in a while loop, will eventually detect that the flag changed to true, and can then process the data in host mem
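The sequence above might look like this in a kernel. This is a sketch only, for a single block; __threadfence_system() requires Fermi, and multi-block kernels would need a different completion check:

```cuda
__global__ void producer(volatile float *mappedData,
                         volatile int   *hostThreadOkToRead)
{
    // each thread writes its result to mapped host memory
    mappedData[threadIdx.x] = threadIdx.x * 2.0f;

    __syncthreads();            // make sure every thread has written
    if (threadIdx.x == 0) {
        // block until all changes are visible to the host (Fermi only)
        __threadfence_system();
        // publish the flag
        *hostThreadOkToRead = 1;
    }
}

// host thread:
//    while (*hostThreadOkToRead == 0) { /* poll */ }
//    ... process mappedData ...
```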
The scenario you describe has the GPU writing to the mapped memory, and ensures consistency between the order of the writes as observed by the GPU and by the host, even while the kernel is running.
But even in theory it is not clear what happens if the host writes to the mapped memory during kernel execution and the GPU attempts to read it. It is even less clear what happens if the host writes to GPU global memory during kernel execution and the GPU attempts to read that. And finally, the combination of the two: the host first writes to GPU global memory, then writes a flag to the write-shared (mapped) region. In what order will the GPU see these two writes, if at all, assuming the kernel is running continuously throughout?
There is an equivalent of “__threadfence_system()” for the CPU which might help: the “sfence” SSE instruction. (Blocks until all pending writes have been completed.)
I’m not sure whether you can build a bidirectional CPU<->GPU protocol with those instructions… it would not be very efficient, I guess.
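A minimal host-side sketch; the function name is mine, and note that on x86 sfence mainly matters for write-combining/non-temporal stores, since ordinary cached stores are already ordered:

```c
#include <xmmintrin.h>   /* _mm_sfence, the SSE sfence intrinsic */

/* Publish data to the device through mapped memory: store the payload,
   fence so no pending store can be reordered past the flag, then raise
   the flag; the mirror image of __threadfence_system() on the GPU. */
void publish(volatile float *mappedData, volatile int *flag, float value)
{
    *mappedData = value;   /* 1) the payload                    */
    _mm_sfence();          /* 2) drain/order all pending stores */
    *flag = 1;             /* 3) only now signal the kernel     */
}
```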