I am currently memory-mapping a Netlist EV3 card to the GPU via:
cudaHostRegister(ptr, size, cudaHostRegisterIoMemory);
Currently, if I pass a device pointer from GPU A into a kernel running on GPU B, CUDA performs an automatic peer-to-peer (P2P) DMA transfer from GPU A to GPU B without staging the data in CPU memory. I want the same behaviour with the Netlist EV3 card: pass the memory-mapped pointer into a GPU kernel, and when the GPU reads that address, have the data transferred directly over PCIe to the GPU without being staged in CPU memory.
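For reference, the GPU-to-GPU path I am describing looks roughly like this (a minimal sketch with error handling omitted; `devA` and `devB` are placeholder device ordinals, and it assumes two P2P-capable GPUs):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int devA = 0, devB = 1;  // placeholder device ordinals

    // Ask the driver whether GPU B can directly access GPU A's memory.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, devB, devA);
    if (!canAccess) { printf("P2P not supported between these GPUs\n"); return 1; }

    // Allocate a buffer on GPU A.
    cudaSetDevice(devA);
    void* srcA = nullptr;
    cudaMalloc(&srcA, 1 << 20);

    // Enable peer access from GPU B to GPU A; after this, a kernel on B
    // dereferencing srcA (or cudaMemcpyPeer) moves data GPU-to-GPU over
    // the fabric without a CPU staging buffer.
    cudaSetDevice(devB);
    cudaDeviceEnablePeerAccess(devA, 0);

    void* dstB = nullptr;
    cudaMalloc(&dstB, 1 << 20);
    cudaMemcpyPeer(dstB, devB, srcA, devA, 1 << 20);

    cudaFree(dstB);
    cudaSetDevice(devA);
    cudaFree(srcA);
    return 0;
}
```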
I benchmarked reads through the mapped pointer from the Netlist EV3 card and measured ~12 GB/s. However, the card sits in a PCIe 3.0 x4 slot, whose link tops out at roughly 3.9 GB/s, so a 12 GB/s result can only mean the reads are being served from CPU-side caching rather than from the device itself.
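The back-of-the-envelope check, assuming standard PCIe 3.0 numbers (8 GT/s per lane, 128b/130b line encoding):

```python
raw_gbit_per_lane = 8.0       # PCIe 3.0: 8 GT/s, one bit per transfer
encoding = 128.0 / 130.0      # 128b/130b line-encoding efficiency
lanes = 4                     # x4 slot

# Theoretical per-direction link limit in GB/s (before protocol overhead).
max_gbyte_s = raw_gbit_per_lane * encoding * lanes / 8.0

measured = 12.0
assert measured > max_gbyte_s  # the benchmark exceeds what the link can carry
print(round(max_gbyte_s, 2))   # ~3.94 GB/s
```

Since the measured 12 GB/s is about three times the link's theoretical maximum, the reads cannot all be crossing the PCIe link.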
Is there a way to bypass this caching?