What is the running time of a host-to-device or device-to-host transfer of an array? Is it O(n), or is there some hardware feature that lets it go faster? I'm just trying to determine when the transfer cost will outweigh the benefit of parallelizing a small loop. Thanks!
Oh, but if only some sneaky hardware designer could invent a way to transfer n bytes in less than O(n) scaling! Memory walls would be a thing of the past!
It might not be possible to do better than O(n), but with some parallel data transfer it might be possible to cut the time by a constant factor of 4, 8, etc. That detail would be lost in the big-O model.
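For the small-loop question, the constant terms matter more than the asymptotic class. Here is a rough sketch of a linear cost model: a fixed per-copy overhead plus bytes divided by bandwidth. The function names and the default numbers (10 microseconds of overhead, 12 GB/s) are placeholders I made up for illustration, not measurements of any particular hardware, and the model ignores kernel launch overhead. Time your own copies and plug in what you see.

```python
def transfer_time(n_bytes, latency_s=10e-6, bandwidth_bytes_per_s=12e9):
    """Estimated time for one host<->device copy.

    Model: fixed overhead + n_bytes / bandwidth. Both defaults are
    assumed placeholder values, not measured ones.
    """
    return latency_s + n_bytes / bandwidth_bytes_per_s


def offload_pays_off(n_elems, bytes_per_elem, cpu_s_per_elem, gpu_s_per_elem,
                     latency_s=10e-6, bandwidth_bytes_per_s=12e9):
    """True if copy-in + device compute + copy-out beats the host loop.

    cpu_s_per_elem and gpu_s_per_elem are per-element costs you would
    have to measure yourself; they are hypothetical inputs here.
    """
    n_bytes = n_elems * bytes_per_elem
    # One copy to the device and one copy back, same size each way.
    copy_total = 2 * transfer_time(n_bytes, latency_s, bandwidth_bytes_per_s)
    gpu_total = copy_total + n_elems * gpu_s_per_elem
    cpu_total = n_elems * cpu_s_per_elem
    return gpu_total < cpu_total


if __name__ == "__main__":
    # Small loop: the fixed copy overhead alone exceeds the host loop time.
    print(offload_pays_off(1_000, 4, 1e-9, 1e-10))
    # Large, expensive loop: the linear terms dominate the fixed overhead.
    print(offload_pays_off(10_000_000, 4, 1e-8, 1e-10))
```

With these made-up inputs the first case stays on the host and the second is worth offloading. The point is only that the break-even size depends on the fixed overhead and the bandwidth, both of which big-O hides.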