Copy from texture memory to shared memory - confused about the best transfer strategy

Hello Forum,

does it make sense to use texture memory to copy a whole block into shared memory? Using the CUDA profiler I discovered that the transfer from the host to device global memory is faster than the upload from the host to device texture memory. On the other hand, fetching from texture memory is faster than fetching from global memory. Now I'm a little confused about which strategy makes more sense:

  1. upload from host to texture memory and then to shared memory

  2. upload from host to global memory and then to shared memory

Thanks in advance for any helpful answers!

Best regards,

It mostly depends on your memory access pattern. If you can manage coalescing, use global memory; if your accesses are scattered and "random", use textures.

In any case, if you re-use the data over and over, or need all threads in a thread block to iterate over all the data you've loaded, then use shared memory as well.


I would use either textures or shared memory, but I don’t think I would use both together.

The big advantage with textures is that you get a read cache, so for data with relatively tight spatial locality you can get a useful speed-up over global memory alone, without needing read patterns that coalesce. But there can also be cache misses, which add an additional penalty and can make textures slower than global memory. On average, textures are usually faster than "naked" global memory loads. The fact that you also get filtering/interpolation for free at the same time can yield big performance wins, if you need it.
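As a rough sketch of the texture path (using the legacy texture reference API; the kernel, its name, and the filter it applies are purely illustrative, not anyone's actual code):

```
// Bind a linear byte buffer to a 1D texture reference so that scattered
// reads go through the texture cache instead of raw global memory.
texture<unsigned char, 1, cudaReadModeElementType> texGray;

__global__ void smoothTex(unsigned char *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Neighbouring fetches hit the texture cache even though they are
    // not coalesced in the global-memory sense.
    int c = tex1Dfetch(texGray, y * width + x);
    int l = tex1Dfetch(texGray, y * width + max(x - 1, 0));
    int r = tex1Dfetch(texGray, y * width + min(x + 1, width - 1));
    out[y * width + x] = (unsigned char)((l + 2 * c + r) / 4);
}

// Host side, before the launch (deviceGray is a cudaMalloc'd buffer):
// cudaBindTexture(0, texGray, deviceGray, size);
```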

On the other hand, coalesced reads into shared memory are usually worthwhile when you have non-linear global memory reads that can be assembled block-wise into coalesced reads, and you need to re-use the data more than once across several threads within the same block. Fully coalesced global memory loads are basically the fastest off-chip memory access method there is. If you can use them, you probably ought to prefer them over textures (unless you can also exploit filtering).
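The staged-load pattern looks roughly like this (a minimal sketch, assuming a 1D launch with 256 threads per block; all names are illustrative):

```
// Each thread does one coalesced global read into shared memory, then all
// threads in the block re-use the staged data without going off-chip again.
__global__ void rowFilter(const unsigned char *in, unsigned char *out, int n)
{
    __shared__ unsigned char tile[256];   // one byte per thread in the block

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Consecutive threads read consecutive addresses -> coalesced.
    tile[threadIdx.x] = in[i];
    __syncthreads();

    // Re-read a neighbour from shared memory instead of global memory.
    unsigned char left = (threadIdx.x > 0) ? tile[threadIdx.x - 1]
                                           : tile[threadIdx.x];
    out[i] = (unsigned char)((left + tile[threadIdx.x]) / 2);
}
```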

Knowing which one is most suitable requires analysis of your global memory access patterns.

Hello Eyal,

thank your very much for your quick reply.

In my case I would like to apply a demosaicing filter to a char array (an IplImage from OpenCV). So I'll iterate over all the data and won't access the texture randomly. Unfortunately I'm a bloody beginner in CUDA and therefore not familiar with coalescing… but I promise to know more about it next time.

Do you think you can tell me (a bloody beginner) in a few words what coalescing is? In my program I do the following steps:


// Size of the single-channel input image in bytes
int size = sizeof(unsigned char) * iplIn->width * iplIn->height;

// Output buffer: three channels per input pixel
cudaMalloc((void**)&DEVres, size * 3);

// Copy the input image to the device
// (DEVgray must have been allocated with cudaMalloc beforehand)
cudaMemcpy(DEVgray, iplIn->imageData, size, cudaMemcpyHostToDevice);


Later on, in the kernel, I use three IF-statements to calculate the colors. That's not very ladylike and results in bad performance… BUT: it works! (for now)
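For illustration, a purely hypothetical sketch of the kind of kernel described - assuming an RGGB Bayer layout and omitting the neighbour interpolation, so it will not match the actual code. Note that the per-pixel IFs make threads within the same warp take different branches, which serializes them and is one likely source of the slowdown:

```
__global__ void demosaic(const unsigned char *gray, unsigned char *rgb,
                         int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int idx = y * width + x;
    unsigned char r = 0, g = 0, b = 0;

    if ((y & 1) == 0 && (x & 1) == 0) {        // red site
        r = gray[idx];                          // neighbours omitted here
    } else if ((y & 1) == 1 && (x & 1) == 1) { // blue site
        b = gray[idx];
    } else {                                   // green site
        g = gray[idx];
    }

    rgb[3 * idx + 0] = r;
    rgb[3 * idx + 1] = g;
    rgb[3 * idx + 2] = b;
}
```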

I’ll be very happy about any hints.



Search the programming guide for coalescing. In simple words, you need your threads (or a half-warp, which is a group of 16 threads) to access contiguous memory addresses. If your access is not coalesced you'll suffer bad performance, and the hardware will make needless trips to global memory (trips that are not needed when your access pattern is coalesced). Different GPU cards have different rules (the newer ones have more relaxed rules) - the easiest would be to look at the programming guide :)
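A minimal illustration of the rule (kernel names are made up): with the first kernel, the 16 threads of a half-warp touch 16 consecutive addresses and the hardware can serve them in one transaction; with the second, each thread lands in a different memory segment and the accesses get serialized.

```
__global__ void copyCoalesced(const float *in, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];            // thread k reads base + k: coalesced
}

__global__ void copyStrided(const float *in, float *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];   // thread k reads base + k*stride: not coalesced
}
```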

As for your code - maybe you can post the kernel so we can have a look at it and see how to improve it…