Hi, we have a bandwidth/latency problem.
We have a stack of images, typically ~500, each 512×512 pixels at 2 bytes per pixel (512 KB apiece). These are scattered about in host memory. We want to upload them into a single 3D array on the card, and we want to do it fast. Worst case, we'll want to upload them again for every rendering we do. The overhead of copying each image in the stack to the GPU is about 230 microseconds, which for 500 slices adds up to a massive ~115 milliseconds. There doesn't seem to be any kind of queueing or overlap between these calls either.
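For reference, the per-slice pattern we're using looks roughly like this (a sketch only; it assumes a `cudaArray` destination filled via one `cudaMemcpy3D` call per slice, and the dimensions and function name are just illustrative):

```cuda
#include <cuda_runtime.h>

// Illustrative dimensions matching the post: 500 slices, 512x512, 2 bytes/pixel.
enum { W = 512, H = 512, D = 500 };

// Upload one scattered host slice into depth `z` of a 3D cudaArray.
// One cudaMemcpy3D call per slice -- this is the slow pattern described above.
void uploadSlice(cudaArray_t volume, const unsigned short* hostSlice, int z)
{
    cudaMemcpy3DParms p = {0};
    p.srcPtr   = make_cudaPitchedPtr((void*)hostSlice,
                                     W * sizeof(unsigned short), W, H);
    p.dstArray = volume;
    p.dstPos   = make_cudaPos(0, 0, z);     // target depth within the array
    p.extent   = make_cudaExtent(W, H, 1);  // copy a single slice
    p.kind     = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&p);
}

// The array itself is allocated once, e.g.:
//   cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned short>();
//   cudaMalloc3DArray(&volume, &desc, make_cudaExtent(W, H, D));
```

Each call carries the full launch/setup overhead, and they appear to serialize.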
I've tried memcpying the slices on the host into a contiguous block first, and this results in a refreshingly fast upload (a pinned contiguous block is even better, but less practical). However, the host-to-host memcpy is really slow and rules this out. (Not to mention that doubling the memory footprint of these image stacks is rather unfriendly.)
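The staged version looks roughly like this (again a sketch: `cudaHostAlloc` for the pinned buffer and a single whole-stack `cudaMemcpy3D`; the names and the slice-pointer layout are our assumptions):

```cuda
#include <cuda_runtime.h>
#include <cstring>

enum { W = 512, H = 512, D = 500 };

// Stage the scattered host slices into one pinned buffer, then issue a
// single transfer. The host-to-host memcpy loop is the slow part.
void uploadStaged(cudaArray_t volume, unsigned short* const* slices)
{
    const size_t sliceBytes = (size_t)W * H * sizeof(unsigned short);

    unsigned short* staging = 0;
    cudaHostAlloc((void**)&staging, sliceBytes * D, cudaHostAllocDefault);

    for (int z = 0; z < D; ++z)            // slow host-to-host pass
        std::memcpy(staging + (size_t)z * W * H, slices[z], sliceBytes);

    cudaMemcpy3DParms p = {0};
    p.srcPtr   = make_cudaPitchedPtr(staging, W * sizeof(unsigned short), W, H);
    p.dstArray = volume;
    p.extent   = make_cudaExtent(W, H, D); // whole stack in one call
    p.kind     = cudaMemcpyHostToDevice;
    cudaMemcpy3D(&p);

    cudaFreeHost(staging);
}
```

The single device transfer itself is fast; it's the 250 MB of host-side copying per frame (plus keeping a second copy of the stack resident) that makes this unattractive.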
So, I'm looking for any possible solutions to the problem! Any suggestions are greatly appreciated. (Or comments from NVIDIA on why multiple transfers in this pattern won't overlap nicely?)