Is it possible to overlap small IOs and computations?

Basically, I have the following I/O-bound code that I'm trying to improve:

copy 0.5MiB host to device (0.13ms)
CV pattern recognition & histogram (0.3ms)
copy 0.5MiB device to host (0.13ms)

I know you're going to say it's not worth doing such a small computation on the GPU, but this is mostly a curiosity (I already have a SIMDized, multithreaded version that's just as fast once the GPU memcpy cost is included).

I saw the idea of staged execution in the CUDA manuals and tried it, but didn't get any improvement. I wrote a benchmark just to test staged execution and was able to completely overlap IO with computation when their individual times are ~10ms. For smaller inputs, staged execution doesn't help at all.
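For readers without the attachment, here is a minimal sketch of what staged execution usually looks like. It is not the attached staged.cu: the `process` kernel, `STAGES = 4`, and the chunking scheme are placeholders chosen for illustration. It also assumes page-locked host buffers (`cudaMallocHost`), since `cudaMemcpyAsync` generally cannot overlap with kernels when the host memory is pageable.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report a CUDA error and bail out of main().
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(e_));                            \
            return 1;                                                   \
        }                                                               \
    } while (0)

// Placeholder kernel standing in for the real pattern recognition work.
__global__ void process(const unsigned char* in, unsigned char* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (unsigned char)(255 - in[i]);
}

int main() {
    const int N = 512 * 1024;      // 0.5 MiB, as in the post
    const int STAGES = 4;          // hypothetical stage count
    const int CHUNK = N / STAGES;  // bytes handled per stage

    unsigned char *h_in, *h_out, *d_in, *d_out;
    // Page-locked host buffers: an assumption of this sketch.
    CHECK(cudaMallocHost((void**)&h_in, N));
    CHECK(cudaMallocHost((void**)&h_out, N));
    CHECK(cudaMalloc((void**)&d_in, N));
    CHECK(cudaMalloc((void**)&d_out, N));
    for (int i = 0; i < N; ++i) h_in[i] = (unsigned char)(i & 0xFF);

    cudaStream_t streams[STAGES];
    for (int s = 0; s < STAGES; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Each stage copies its chunk in, processes it, and copies it out,
    // all queued on its own stream so stages are free to overlap.
    const int threads = 256;
    const int blocks = (CHUNK + threads - 1) / threads;
    for (int s = 0; s < STAGES; ++s) {
        const int off = s * CHUNK;
        CHECK(cudaMemcpyAsync(d_in + off, h_in + off, CHUNK,
                              cudaMemcpyHostToDevice, streams[s]));
        process<<<blocks, threads, 0, streams[s]>>>(d_in + off, d_out + off, CHUNK);
        CHECK(cudaMemcpyAsync(h_out + off, d_out + off, CHUNK,
                              cudaMemcpyDeviceToHost, streams[s]));
    }
    CHECK(cudaGetLastError());
    CHECK(cudaDeviceSynchronize());

    // Verify the result on the host.
    int errors = 0;
    for (int i = 0; i < N; ++i)
        if (h_out[i] != (unsigned char)(255 - h_in[i])) ++errors;
    printf("mismatches: %d\n", errors);

    for (int s = 0; s < STAGES; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    CHECK(cudaFree(d_in));
    CHECK(cudaFree(d_out));
    return errors != 0;
}
```

Whether the stages actually overlap depends on the device (copy engine count) and on per-call launch overhead, which can dominate at sub-millisecond sizes like the ones above.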

See staged.cu

Even when STAGES = 1, the staged version takes 1.9ms while unstaged only takes 0.6ms. Does anyone know if this code can be staged successfully? Maybe there's some arcane usage of streams? (I noticed that changing streams[i] to 0 restores performance to that of the unstaged version, albeit with no overlap.)
Attachment: staged.cu (1.84 KB)

Uncle,
Use pinned memory to ACCELERATE your copies…
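A rough way to check this suggestion on your own machine is to time the same 0.5 MiB host-to-device copy from a pageable buffer (`malloc`) and from a pinned one (`cudaMallocHost`). This is only a sketch: the `timeCopy` helper and the repetition count are made up for illustration, and the numbers will vary with the GPU, driver, and bus.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <cuda_runtime.h>

// Report a CUDA error and abort.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(e_));                            \
            exit(1);                                                    \
        }                                                               \
    } while (0)

// Hypothetical helper: average time in ms of one host-to-device copy.
static float timeCopy(void* dst, const void* src, size_t bytes, int reps) {
    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    // One untimed warm-up copy.
    CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));

    CHECK(cudaEventRecord(start, 0));
    for (int r = 0; r < reps; ++r)
        CHECK(cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice));
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    return ms / reps;
}

int main() {
    const size_t N = 512 * 1024;  // 0.5 MiB, as in the original post
    const int REPS = 100;         // arbitrary repetition count

    void* d_buf = NULL;
    CHECK(cudaMalloc(&d_buf, N));

    // Pageable host memory from the ordinary allocator.
    void* h_pageable = malloc(N);
    if (h_pageable == NULL) return 1;
    memset(h_pageable, 1, N);

    // Pinned (page-locked) host memory from the CUDA runtime.
    void* h_pinned = NULL;
    CHECK(cudaMallocHost(&h_pinned, N));
    memset(h_pinned, 1, N);

    printf("pageable: %.4f ms per copy\n", timeCopy(d_buf, h_pageable, N, REPS));
    printf("pinned:   %.4f ms per copy\n", timeCopy(d_buf, h_pinned, N, REPS));

    free(h_pageable);
    CHECK(cudaFreeHost(h_pinned));
    CHECK(cudaFree(d_buf));
    return 0;
}
```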

Zero-copy is almost certainly the way to go for small transfers.
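For anyone who hasn't used zero-copy: the kernel reads and writes mapped, page-locked host memory directly, so there are no explicit memcpy calls at all. A minimal sketch, assuming a device that supports mapped host memory; the `process` kernel is again a placeholder, not the original pattern recognition code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Report a CUDA error and bail out of main().
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t e_ = (call);                                        \
        if (e_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,          \
                    cudaGetErrorString(e_));                            \
            return 1;                                                   \
        }                                                               \
    } while (0)

// Placeholder kernel; both pointers refer to mapped host memory.
__global__ void process(const unsigned char* in, unsigned char* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = (unsigned char)(255 - in[i]);
}

int main() {
    const int N = 512 * 1024;  // 0.5 MiB, as in the original post

    // Ask for mapped host memory support before any other runtime call.
    CHECK(cudaSetDeviceFlags(cudaDeviceMapHost));

    cudaDeviceProp prop;
    CHECK(cudaGetDeviceProperties(&prop, 0));
    if (!prop.canMapHostMemory) {
        printf("device 0 cannot map host memory\n");
        return 0;
    }

    // Host buffers that are page-locked and mapped into the device address space.
    unsigned char *h_in, *h_out;
    CHECK(cudaHostAlloc((void**)&h_in, N, cudaHostAllocMapped));
    CHECK(cudaHostAlloc((void**)&h_out, N, cudaHostAllocMapped));
    for (int i = 0; i < N; ++i) h_in[i] = (unsigned char)(i & 0xFF);

    // Device-side aliases of the same buffers.
    unsigned char *d_in, *d_out;
    CHECK(cudaHostGetDevicePointer((void**)&d_in, h_in, 0));
    CHECK(cudaHostGetDevicePointer((void**)&d_out, h_out, 0));

    // The kernel pulls input and pushes output over the bus as it runs.
    const int threads = 256;
    const int blocks = (N + threads - 1) / threads;
    process<<<blocks, threads>>>(d_in, d_out, N);
    CHECK(cudaGetLastError());
    // The host must wait for the kernel before reading h_out.
    CHECK(cudaDeviceSynchronize());

    int errors = 0;
    for (int i = 0; i < N; ++i)
        if (h_out[i] != (unsigned char)(255 - h_in[i])) ++errors;
    printf("mismatches: %d\n", errors);

    CHECK(cudaFreeHost(h_in));
    CHECK(cudaFreeHost(h_out));
    return errors != 0;
}
```

This tends to work best when each byte is read or written only once with coalesced accesses, because every access to mapped memory crosses the bus.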