I just noticed some very unexpected behaviour in CUDA:
I have a kernel that fetches from a 2D texture, loads some variables from global memory, and performs some computations. Since the variables are read from global memory very often, I tried copying them from global to shared memory first and then reading them from shared memory.
The results for two kernels that do roughly the same work (one uses texture fetches, the other does not):
- no textures: speedup of about 30-40% (shared memory vs. global memory)
- textures: slowdown, computation time doubled (shared memory vs. global memory)
The slowdown is purely due to the allocation of shared memory (1920*sizeof(float)).
extern __shared__ float somearray[];
shared_mem_size = something
kernel<<<blocks, threads_per_block, shared_mem_size>>>(arguments…)
__shared__ float somearray
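To make the setup concrete, here is a minimal sketch of the staging pattern I mean: copy the variables from global to shared memory once per block, then read them from shared memory. The names kernel_shared, run_shared_sum, vars, out, n_vars and n_out are placeholders made up for this example, the summation only stands in for the real computation, and the texture fetch is left out, so this is not my actual kernel.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Placeholder kernel: stages n_vars floats from global into shared memory,
// then every thread reads them back from shared memory.
__global__ void kernel_shared(const float *vars, int n_vars, float *out, int n_out)
{
    // Dynamically sized; the size comes from the third launch parameter.
    extern __shared__ float somearray[];

    // Cooperative copy: each thread loads a strided subset of the variables.
    for (int i = threadIdx.x; i < n_vars; i += blockDim.x)
        somearray[i] = vars[i];
    __syncthreads();  // make all loads visible to the whole block before use

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n_out) {
        float acc = 0.0f;
        for (int i = 0; i < n_vars; ++i)  // stand-in for the real computation
            acc += somearray[i];
        out[idx] = acc;
    }
}

// Placeholder host helper: runs the kernel on all-ones input and returns
// out[0], or -1.0f if any CUDA call fails.
float run_shared_sum(int n_vars, int n_out)
{
    std::vector<float> h_vars(n_vars, 1.0f);
    std::vector<float> h_out(n_out, 0.0f);
    float *d_vars = NULL, *d_out = NULL;

    if (cudaMalloc((void **)&d_vars, n_vars * sizeof(float)) != cudaSuccess)
        return -1.0f;
    if (cudaMalloc((void **)&d_out, n_out * sizeof(float)) != cudaSuccess) {
        cudaFree(d_vars);
        return -1.0f;
    }
    cudaError_t err = cudaMemcpy(d_vars, h_vars.data(), n_vars * sizeof(float),
                                 cudaMemcpyHostToDevice);

    if (err == cudaSuccess) {
        const int threads_per_block = 256;
        const int blocks = (n_out + threads_per_block - 1) / threads_per_block;
        const size_t shared_mem_size = n_vars * sizeof(float);  // bytes per block

        kernel_shared<<<blocks, threads_per_block, shared_mem_size>>>(
            d_vars, n_vars, d_out, n_out);
        err = cudaGetLastError();  // catches launch failures, e.g. too much shared memory
    }
    if (err == cudaSuccess)
        err = cudaMemcpy(h_out.data(), d_out, n_out * sizeof(float),
                         cudaMemcpyDeviceToHost);  // also waits for the kernel

    cudaFree(d_vars);
    cudaFree(d_out);
    return err == cudaSuccess ? h_out[0] : -1.0f;
}

int main()
{
    // 1920 floats per block, as in the allocation described above.
    printf("out[0] = %f\n", run_shared_sum(1920, 1 << 16));
    return 0;
}
```

The third launch parameter is a size in bytes per block, which is why it is computed as n_vars * sizeof(float) here.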
What causes this extreme slowdown? Texture memory is cached in some way, but I don’t know how. Why does texture memory “interact” with shared memory? I would be thankful for any hints.
CUDA Version 2.3
GPU Architecture sm_13