I’ve come across a very troubling issue when profiling one of my image processing functions, which does some processing and writes about 640 bytes of results out to a few memory locations (6 unique memory ranges, where each set of 6 writes is completely coalesced).
When I use device (global) memory for storing the results of this image processing function, I see spikes in kernel execution time of up to 10ms (average 5-6ms).
When I use client memory (page-locked, accessed via zero-copy memory access) for storing the results of the function, I ‘occasionally’ see spikes of up to 4ms at most - the majority of spikes are around 500us.
My question is, why would I ever see latency spikes on writing to device memory, especially when it’s only 640 bytes… and when every MP on the GPU except the one executing my single-block kernel is idle?
I had always assumed writing small amounts of data to device memory would be much faster than writing to client memory via DMA (PCI bus)… but this doesn’t appear to be the case?
Am I right to assume I should ‘always’ use zero-copy memory (where possible) to write ‘results’ (data that will ‘eventually’ be copied back to the client), and only use device memory for old hardware?
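For reference, here’s a minimal sketch of the zero-copy path I’m describing. The kernel name, buffer size, and launch configuration are illustrative placeholders, not my actual code - the point is just the mapped page-locked allocation and the device pointer the kernel writes through:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative single-block kernel writing a small, coalesced result set.
__global__ void writeResults(float *out)
{
    int i = threadIdx.x;
    if (i < 160)              // 160 floats = 640 bytes
        out[i] = (float)i;
}

int main(void)
{
    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *hostPtr, *devPtr;
    cudaHostAlloc((void **)&hostPtr, 160 * sizeof(float),
                  cudaHostAllocMapped);
    cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0);

    // Kernel writes go straight over the PCI bus into host memory,
    // so no cudaMemcpy is needed afterwards.
    writeResults<<<1, 160>>>(devPtr);
    cudaDeviceSynchronize();  // results now visible in hostPtr

    printf("last result: %f\n", hostPtr[159]);
    cudaFreeHost(hostPtr);
    return 0;
}
```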