Does it make sense to use texture memory to copy a whole block into shared memory? Using the CUDA profiler I discovered that the transfer from host to device global memory is faster than the upload from host to device texture memory. On the other hand, fetching from texture memory is faster than fetching from global memory. Now I’m a little bit confused about which strategy makes more sense:
upload from host to texture memory and then to shared memory
upload from host to global memory and then to shared memory
I would use either textures or shared memory, but I don’t think I would use both together.
The big advantage of textures is that you get a read cache, so for data with relatively tight spatial locality you can get a useful speed-up over global memory alone, without needing read patterns that coalesce. But there can also be cache misses, which add an additional penalty and can make textures slower than global memory. On average, textures are usually faster than “naked” global memory loads. The fact that you can also do filtering/interpolation for free at the same time can yield big performance wins, if you need it.
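To make the texture path concrete, here is a minimal sketch of reading a byte buffer through the texture cache. It assumes the texture object API (CUDA 5.0 or later, compute capability 3.0 or later), which may be newer than the toolkit used in this thread; the names `copyViaTexture` and `runCopyViaTexture` are made up for illustration, and error checking is left out for brevity. Note that a texture bound to linear memory like this is read with `tex1Dfetch`, which does no filtering, so this only shows the cached read, not the interpolation mentioned above.

```cuda
#include <cuda_runtime.h>
#include <cstring>
#include <vector>

// Hypothetical kernel: each thread fetches one byte through the texture path.
__global__ void copyViaTexture(cudaTextureObject_t tex, unsigned char* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<unsigned char>(tex, i);  // integer index, no filtering
}

// Hypothetical host helper: uploads the data, reads it back via the texture.
std::vector<unsigned char> runCopyViaTexture(const std::vector<unsigned char>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<unsigned char> h_out(n);
    if (n == 0) return h_out;

    unsigned char *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_out, n);
    cudaMemcpy(d_in, h_in.data(), n, cudaMemcpyHostToDevice);

    // Describe the linear device buffer as a texture resource.
    cudaResourceDesc resDesc;
    std::memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_in;
    resDesc.res.linear.desc = cudaCreateChannelDesc<unsigned char>();
    resDesc.res.linear.sizeInBytes = n;

    // Return the raw element values (no normalization to float).
    cudaTextureDesc texDesc;
    std::memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr);

    const int threads = 256;  // assumed block size
    copyViaTexture<<<(n + threads - 1) / threads, threads>>>(tex, d_out, n);
    cudaMemcpy(h_out.data(), d_out, n, cudaMemcpyDeviceToHost);

    cudaDestroyTextureObject(tex);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Calling `runCopyViaTexture` on a small vector should give back the same bytes. In this variant the data still sits in ordinary global memory and only the read goes through the texture cache.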
On the other hand, coalesced reads into shared memory are usually worthwhile when you need non-linear global memory reads which can be assembled block-wise into coalesced reads, and you need to re-use data more than once across several threads within the same block. Fully coalesced global memory loads are basically the fastest off-chip memory access method there is. If you can use them, you probably ought to prefer them over textures (unless you can also exploit filtering).
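As a rough illustration of the shared memory route, the sketch below stages a tile of bytes in shared memory and then lets every thread read its neighbours from there. Consecutive threads load consecutive addresses, which is the access pattern that coalesces. The 3-tap box filter, the names `boxFilter3` and `runBoxFilter3`, and the tile size of 256 are assumptions made for this example (it is not a demosaicing filter), and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

#define TILE 256  // threads per block (assumed value); must match the launch

// Hypothetical kernel: 3-tap box filter over a byte array with clamped borders.
__global__ void boxFilter3(const unsigned char* in, unsigned char* out, int n)
{
    __shared__ unsigned char tile[TILE + 2];  // one halo element on each side

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int lid = threadIdx.x + 1;                        // index inside the tile

    // Thread k of the block reads byte k of the block's segment, so the
    // threads of a warp touch one contiguous piece of global memory.
    tile[lid] = in[min(gid, n - 1)];

    // The first and last thread of the block also fetch the halo elements.
    if (threadIdx.x == 0)
        tile[0] = in[max(gid - 1, 0)];
    if (threadIdx.x == blockDim.x - 1)
        tile[lid + 1] = in[min(gid + 1, n - 1)];

    __syncthreads();  // the tile must be complete before anyone reads it

    // Each input byte is now re-used by up to three threads from shared memory.
    if (gid < n)
        out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3;
}

// Hypothetical host helper: upload, filter, download.
std::vector<unsigned char> runBoxFilter3(const std::vector<unsigned char>& h_in)
{
    int n = static_cast<int>(h_in.size());
    std::vector<unsigned char> h_out(n);
    if (n == 0) return h_out;

    unsigned char *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n);
    cudaMalloc(&d_out, n);
    cudaMemcpy(d_in, h_in.data(), n, cudaMemcpyHostToDevice);

    boxFilter3<<<(n + TILE - 1) / TILE, TILE>>>(d_in, d_out, n);

    cudaMemcpy(h_out.data(), d_out, n, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

The point of the design is that global memory is read in one contiguous sweep per block, and all the overlapping neighbour reads hit shared memory instead. Whether this beats a texture fetch for a given filter is something only profiling on the actual hardware can settle.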
Knowing which one is most suitable requires analysis of your global memory access patterns.
In my case I would like to apply a demosaicing filter to a char array (IplImage from OpenCV). So I’ll iterate over all the data and won’t access the texture randomly. Unfortunately I’m a bloody beginner in CUDA and for that reason not familiar with coalescing… but I promise to know more about this next time.
Do you think you can tell me (bloody beginner) in a few words what coalescing is? In my program I do the following steps:
Later on in the kernel I use three IF-statements to calculate the colors. That’s not very ladylike and results in bad performance… BUT: it works! (for now)