Custom CPU to GPU ringbuffer

As mentioned before:

  • using “volatile” for each mapped host memory pointer bypasses the L2 cache when reading through that pointer (this behaviour is not explicitly guaranteed by the programming manual, but it seems to suffice for now).

  • the GPU (while polling) might see host data updates in a different order than the CPU (while writing) issued them; this weak ordering can be fixed by an _mm_mfence() on the CPU side between the data write and the counter update (not sure if this is a real-world problem).
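To illustrate the second point, here is a minimal sketch of the CPU-side producer. The names (Slot, push_command, the 64-slot buffer) are illustrative, not from the thread; in the real setup the arrays would live in mapped host memory from cudaHostAlloc(..., cudaHostAllocMapped). The fence ensures the payload write is globally visible before the counter the GPU polls on is bumped.

```cpp
#include <cstdint>
#include <emmintrin.h>  // _mm_mfence (x86 SSE2 intrinsic mentioned in the thread)

// Hypothetical command slot layout.
struct Slot { uint32_t payload; };

// Stand-ins for mapped host memory (would come from cudaHostAlloc in practice).
static Slot slots[64];
static volatile uint32_t command_counter = 0;

// CPU-side producer: write the command data first, fence, then publish by
// incrementing the counter the GPU polls. Without the fence, the GPU could
// in principle observe the new counter value before the payload write.
void push_command(uint32_t payload) {
    uint32_t idx = command_counter % 64;
    slots[idx].payload = payload;
    _mm_mfence();                     // order payload write before counter write
    command_counter = command_counter + 1;  // publish
}
```

The GPU side would poll command_counter through a volatile pointer and consume slots up to the observed value.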

As tmurray mentioned, the whole idea is rather experimental; expect bad performance.

Idea for a possible performance improvement: after polling the mapped host memory for the command counter (and deciding that no command is likely to be ready for this thread block soon), perform several high-latency operations before polling again, to reduce PCI-E load. (Use texture accesses as a pseudo-yield, for example, since they always have high latency.)
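A rough sketch of what that GPU-side poll loop with a texture-fetch pseudo-yield could look like. Everything here is an assumption for illustration: the kernel name, the 64-slot buffer, the quit convention, and the texture object are not from the thread, and the dummy-fetch trick only helps if the compiler cannot eliminate the loads.

```cuda
// Hypothetical consumer kernel. The volatile qualifier forces each read of
// the counter to actually go out to mapped host memory over PCIe.
__global__ void poll_commands(volatile unsigned int *cmd_counter,
                              volatile unsigned int *slots,
                              cudaTextureObject_t yield_tex)
{
    unsigned int seen = 0;
    for (;;) {
        unsigned int current = *cmd_counter;   // poll mapped host memory
        if (current != seen) {
            // ...consume commands seen..current-1 from slots[] here...
            unsigned int cmd = slots[(current - 1) % 64];
            seen = current;
            if (cmd == 0)                      // assumed quit convention
                return;
        } else {
            // Pseudo-yield: issue a few high-latency texture fetches so the
            // block stalls in hardware instead of hammering the PCIe bus.
            float sink = 0.0f;
            for (int i = 0; i < 4; ++i)
                sink += tex1Dfetch<float>(yield_tex, i);
            if (sink == -123456.0f)            // never taken; keeps the
                return;                        // fetches from being optimized out
        }
    }
}
```

How much this actually reduces PCIe traffic would need measuring; a fixed busy-loop over registers would also delay the next poll but keeps the SM fully occupied, whereas texture fetches stall cheaply.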

How did you know this? Are you sure?