Shared Memory usage slows kernel with texture fetch

Hey,

I just noticed a very unexpected behaviour of CUDA:

I have a kernel that fetches from a 2D texture, loads some variables from global memory, and does some computations. Since the variables need to be read from global memory very often, I tried copying them from global to shared memory first and then reading them from shared memory.

The results for two kernels doing roughly the same thing (one uses texture fetching, one does not):

  • no textures: speed-up of about 30-40% (shared memory vs. global memory)
  • textures: slowdown - computation time roughly doubled (shared memory vs. global memory)

The slowdown is purely due to the allocation of shared memory (1920*sizeof(float)).

extern __shared__ float somearray[];
size_t shared_mem_size = 1920 * sizeof(float);
kernel<<< blocks, threads_per_block, shared_mem_size >>>(arguments…);

or

__shared__ float somearray[1920];

What causes this extreme slowdown? Texture memory is cached somehow, but I don’t know the details. Why does texture memory “interact” with shared memory? I am thankful for any hints.

Cheers

Peter

Settings:
Tesla C1060
CUDA Version 2.3
WinXP
GPU Architecture sm_13

Depending on how your data is laid out in shared memory, you might have bank conflicts.
Are you using texture fetches to move the data from global memory to shared memory? If not, you might read some data twice if it is not properly aligned. Try using textures and shared memory.
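
For illustration, here is a hypothetical kernel (not your code, just a sketch of the bank layout on sm_13, which has 16 banks of 4 bytes):

__global__ void bankConflictSketch(const float *gmem, float *out)
{
    __shared__ float s[1920];   // assumes blockDim.x <= 1920
    int tid = threadIdx.x;

    // Conflict-free: consecutive threads access consecutive 4-byte words,
    // so each thread of a half-warp hits a different bank.
    s[tid] = gmem[tid];
    __syncthreads();

    // 16-way bank conflict: a stride of 16 floats maps every thread of a
    // half-warp onto the same bank, so the accesses are serialized.
    out[tid] = s[(tid * 16) % 1920];
}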

Actually, I just allocated an array in shared memory; I don’t write anything into it yet. So there shouldn’t be any bank conflicts up to this point. That’s why I find this slowdown so strange.

Actually, I intend to use the array in shared memory to copy some data from global memory (not from the texture) into it and then read it without bank conflicts. But if merely allocating the array slows the computation down by 1 or 2 seconds, using shared memory is pointless. This is not how it should be, is it?
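
Roughly what I have in mind (just a sketch; the kernel name and parameter names are made up, only the 1920-float size is the one from above):

#define NUM_PARAMS 1920   // placeholder, matching the 1920 floats mentioned above

__global__ void simulationKernel(const float *params_gmem /*, other arguments */)
{
    __shared__ float params[NUM_PARAMS];

    // Each block copies the parameters from global to shared memory once.
    // Threads stride over the array, so the global loads are coalesced and
    // the shared-memory writes hit different banks.
    for (int i = threadIdx.x; i < NUM_PARAMS; i += blockDim.x)
        params[i] = params_gmem[i];
    __syncthreads();

    // ... afterwards every thread reads params[] instead of global memory ...
}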

That is a strong indication that, in your case, the use of shared memory decreases occupancy. Requesting 4*1920 bytes of shared memory per block allows only two blocks to run per SM, while apparently more blocks can run without it.
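
The arithmetic behind the two-blocks figure (16 KB of shared memory per SM on sm_13; the small per-block overhead, e.g. kernel arguments being passed through shared memory on sm_1x, is ignored here):

const size_t smem_per_sm    = 16 * 1024;             // 16384 bytes per SM on sm_13
const size_t smem_per_block = 1920 * sizeof(float);  // 7680 bytes requested per block
const int    blocks_per_sm  = smem_per_sm / smem_per_block;  // = 2

// For 3 resident blocks, each block could use at most 16384 / 3 ≈ 5461 bytes,
// i.e. roughly 1365 floats of shared memory.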

Can you use more threads per block without increasing the shared memory needed per block? Or can you reduce the amount of shared memory so that 3 blocks fit on an SM? Otherwise, the texture version might indeed be the way to go.

Note that the texture cache is even smaller than shared memory, so if your data does not fit into shared memory, it won’t completely fit into the texture cache either. In that case you might still achieve large gains by reordering texture accesses to maximize reuse of data before it is evicted from the texture cache.

That sounds like a reason! I am able to run 3 blocks on one SM, or to increase the number of threads per block - but only if I accept bank conflicts, which again reduces the speed…

I guess my explanation was a bit awkward. I have two kernels simulating something (let’s call them A and B). The simulation needs some parameters which are stored in global memory. Because every thread has to access those parameters several times, I wanted to copy them to shared memory, simply to avoid the global memory accesses.

Now, for kernel A I got a huge speed up.

Kernel B also uses texture fetches (the texture is really used as a surface), but otherwise does the same thing as kernel A. However, B is faster if it reads the parameters from global memory instead of shared memory, so the idea didn’t work here.

Somehow the texture caching seems to influence the shared memory access speed. Does texture fetching even require parts of the shared memory? Is that possible?