We are planning to use CUDA to speed up real-time video processing.
If video images are HD, each frame is about 1.5 MB.
I am wondering what the setup time for texture operations is if we make each frame's data a texture and use texture fetches instead of global memory reads.
If the texture setup time is large, then it might be better to do plain global memory reads instead. Any experience with setup times for large textures?
Binding a texture only adds roughly 4-10 microseconds (it's been a while since I ran that benchmark, so I don't remember the exact number), so it really isn't much of an overhead.
Using a texture read in the kernel can increase register usage by 2 or 3 registers, which may affect the performance of your code. So if you can use fully coalesced global memory reads, do so. But if you access memory in a slightly random pattern with 1D or 2D locality, the benefits of textures will more than pay for the costs.
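To make the binding cost concrete, here is a rough sketch of per-frame binding with the legacy texture reference API (the one current when this was written; it was removed in CUDA 12). The names `frameTex`, `bindFrame`, and `d_frame`, and the assumption of a pitch-linear RGBA (`uchar4`) frame buffer, are mine, not from the post:

```cuda
#include <cuda_runtime.h>

// Legacy texture reference; one RGBA pixel = one uchar4.
texture<uchar4, 2, cudaReadModeElementType> frameTex;

// Rebind the texture to the current frame's pitch-linear device buffer.
// This call is the per-frame "setup cost" being asked about.
void bindFrame(uchar4 *d_frame, size_t pitch, int width, int height)
{
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<uchar4>();
    cudaBindTexture2D(0, frameTex, d_frame, desc, width, height, pitch);
}

__global__ void readThroughTexture(uchar4 *out, size_t outPitch,
                                   int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Cached 2D fetch; neighbouring threads hit nearby cache lines.
    uchar4 p = tex2D(frameTex, x, y);
    ((uchar4 *)((char *)out + y * outPitch))[x] = p;
}
```

Note that rebinding each frame is only needed if each frame lives in a different buffer; with a fixed ring of buffers you could bind once per buffer up front.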
I am planning to do 3x3 median filtering on a 4-channel bitmap for now.
I think I am going to need to read nine 32-bit integers from memory for each output pixel. Since there is so much locality in this operation, I think binding to a texture is worth it.
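For what it's worth, a sketch of what that kernel might look like over a texture-bound frame, taking the median independently per channel (the texture name and the insertion-sort `median9` helper are illustrative assumptions, not tested production code):

```cuda
#include <cuda_runtime.h>

texture<uchar4, 2, cudaReadModeElementType> frameTex;  // bound to the frame elsewhere

// Median of 9 values via insertion sort; small enough for a device function.
__device__ unsigned char median9(unsigned char v[9])
{
    for (int i = 1; i < 9; ++i) {
        unsigned char key = v[i];
        int j = i - 1;
        while (j >= 0 && v[j] > key) { v[j + 1] = v[j]; --j; }
        v[j + 1] = key;
    }
    return v[4];  // middle element after sorting
}

__global__ void median3x3(uchar4 *out, size_t outPitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    unsigned char r[9], g[9], b[9], a[9];
    int k = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            // tex2D clamps out-of-range coordinates by default,
            // so no explicit border handling is needed.
            uchar4 p = tex2D(frameTex, x + dx, y + dy);
            r[k] = p.x; g[k] = p.y; b[k] = p.z; a[k] = p.w;
            ++k;
        }

    uchar4 result = make_uchar4(median9(r), median9(g), median9(b), median9(a));
    ((uchar4 *)((char *)out + y * outPitch))[x] = result;
}
```

The free border clamping is a nice side benefit of textures for stencil filters like this one.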
It would be better to use shared memory to reduce the number of reads even further. One of the SDK examples shows how to write a filter using shared memory, I believe.
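A minimal sketch of that shared-memory approach, assuming a tightly packed `uchar4` frame (pitch equal to `width * sizeof(uchar4)`); the names and the 16x16 tile size are my choices, and the median computation itself is elided:

```cuda
#define TILE 16

__global__ void median3x3Shared(const uchar4 *in, uchar4 *out,
                                int width, int height)
{
    // Each 16x16 block stages an 18x18 tile (one-pixel halo) in shared
    // memory, so each input pixel is read from global memory about once
    // per block instead of up to nine times.
    __shared__ uchar4 tile[TILE + 2][TILE + 2];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;

    // Cooperative load: edge threads load a second tile element for the halo.
    for (int ty = threadIdx.y; ty < TILE + 2; ty += TILE)
        for (int tx = threadIdx.x; tx < TILE + 2; tx += TILE) {
            int gx = min(max((int)(blockIdx.x * TILE) + tx - 1, 0), width - 1);
            int gy = min(max((int)(blockIdx.y * TILE) + ty - 1, 0), height - 1);
            tile[ty][tx] = in[gy * width + gx];
        }
    __syncthreads();

    if (x >= width || y >= height) return;

    // The 3x3 neighbourhood is now tile[threadIdx.y + dy][threadIdx.x + dx]
    // for dy, dx in {0, 1, 2}; take the per-channel median of those 9
    // values here. As a placeholder this just writes the centre pixel.
    out[y * width + x] = tile[threadIdx.y + 1][threadIdx.x + 1];
}
```

The per-block redundancy that remains is only the halo overlap between neighbouring blocks, which is small compared to re-reading every neighbour from global memory.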