In my cuqu project (http://code.google.com/p/cuqu/), I use a chunk of pinned memory (allocated with cudaHostAlloc) to exchange messages between host and device.
You can see the host->device data path at http://code.google.com/p/cuqu/source/browse/cuqu/detail/host_queue_inl.h#176
In one test program I get weird behaviour: the device (Fermi C2050) 'sees' the incremented memory location at line 179 but not the one at line 176.
Of course it could be something else, but I have a 100% reproducible case in which the problem above shows up after 5 iterations of a loop.
This leaves me speculating, out of ignorance, about how pinned memory that is mapped on the device is treated, e.g. from the point of view of caching:
- Does every access to that memory translate into a PCIe transaction that reads/writes host DRAM, or are the Fermi L1/L2 caches used?
- How long does it take for an update made on the host to be seen on the device, e.g. during a polling read of that memory?
- Are [mls]fence instructions on the host enough to guarantee ordering as seen by the device?
- Or do I need a full host cache flush instead?
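
The ordering concern behind these questions can be illustrated with a minimal host-only sketch of a "write payload, fence, then publish counter" handshake. This is an assumption-laden stand-in, not the cuqu code: a second host thread plays the role of the device, ordinary heap memory replaces the pinned buffer, and the names Mailbox, publish, consume and run_exchange are hypothetical. It therefore shows only the ordering the host side intends, and says nothing about PCIe transactions or Fermi L1/L2 behaviour.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical single-slot mailbox; field names are illustrative only.
struct Mailbox {
    std::uint32_t payload = 0;           // message body, written first
    std::atomic<std::uint32_t> seq{0};   // publication counter, written last
    std::atomic<std::uint32_t> ack{0};   // last message the consumer has read
};

// Producer: wait for the slot to be free, write the payload, then publish it.
// The release fence keeps the payload store ordered before the seq store.
inline void publish(Mailbox& box, std::uint32_t value, std::uint32_t n) {
    while (box.ack.load(std::memory_order_acquire) != n - 1) {
        std::this_thread::yield();
    }
    box.payload = value;
    std::atomic_thread_fence(std::memory_order_release);
    box.seq.store(n, std::memory_order_relaxed);
}

// Consumer (stand-in for the device): poll seq, then read the payload.
// The acquire fence keeps the payload load ordered after the seq load.
inline std::uint32_t consume(Mailbox& box, std::uint32_t n) {
    while (box.seq.load(std::memory_order_relaxed) != n) {
        std::this_thread::yield();
    }
    std::atomic_thread_fence(std::memory_order_acquire);
    std::uint32_t value = box.payload;
    box.ack.store(n, std::memory_order_release);  // hand the slot back
    return value;
}

// Exchanges `count` messages and returns how many arrived with a stale or
// wrong payload; with correct ordering this should be zero.
inline std::uint32_t run_exchange(std::uint32_t count) {
    assert(count > 0);
    Mailbox box;
    std::uint32_t mismatches = 0;
    std::thread consumer([&] {
        for (std::uint32_t n = 1; n <= count; ++n) {
            if (consume(box, n) != (n ^ 0xA5A5A5A5u)) {
                ++mismatches;
            }
        }
    });
    for (std::uint32_t n = 1; n <= count; ++n) {
        publish(box, n ^ 0xA5A5A5A5u, n);
    }
    consumer.join();  // join makes mismatches safe to read here
    return mismatches;
}
```

Calling run_exchange(1000) from a test harness exercises the handshake between two host threads. Whether the equivalent host-side fence is also sufficient when the consumer is a GPU kernel polling mapped pinned memory is exactly what the questions above ask.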