I am trying to optimize my CUDA implementation. After investigating with the CUDA Visual Profiler, I found that my problem is uncoalesced memory access.

I have, let's say, 2000 blocks with 128 threads each.
The threads of each block share some common data (a data structure of 13 integer variables).
How can I coalesce the reads from global memory when I first load this data into shared memory?

I read this thread…st=#entry452234
The problem is that I am confused by the way he is reading the data (int and struct hello) and by what he means by smemAOS.
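
To make the question concrete, here is a minimal sketch of the kind of load being asked about. It is not the code from the linked thread. The names BlockParams, useBlockParams and run_shared_params_demo, the fill pattern, and the placeholder "work" are all assumptions made up for illustration. The idea is to view each block's 13-int struct as a flat int array so that thread i reads word i, stage it in shared memory, and synchronize before use.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical layout: 13 ints of per-block data, as described in the question.
constexpr int kFieldsPerBlock = 13;
struct BlockParams { int field[kFieldsPerBlock]; };

__global__ void useBlockParams(const BlockParams* params, int* out)
{
    __shared__ int sParams[kFieldsPerBlock];

    // View this block's struct as a flat int array so that consecutive
    // threads read consecutive 4-byte words from global memory.
    const int* src = reinterpret_cast<const int*>(&params[blockIdx.x]);
    if (threadIdx.x < kFieldsPerBlock)
        sParams[threadIdx.x] = src[threadIdx.x];
    __syncthreads();  // make the staged copy visible to the whole block

    // Placeholder work: every thread reads all 13 values from shared memory.
    int sum = 0;
    for (int i = 0; i < kFieldsPerBlock; ++i)
        sum += sParams[i];
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum + threadIdx.x;
}

// Host-side driver. Returns 0 on success, 1 on a wrong result, 2 on a CUDA error.
int run_shared_params_demo()
{
    const int numBlocks = 2000, threadsPerBlock = 128;
    const size_t outCount = size_t(numBlocks) * threadsPerBlock;

    // Assumed test pattern: field f of block b holds b + f.
    std::vector<BlockParams> hParams(numBlocks);
    for (int b = 0; b < numBlocks; ++b)
        for (int f = 0; f < kFieldsPerBlock; ++f)
            hParams[b].field[f] = b + f;

    BlockParams* dParams = nullptr;
    int* dOut = nullptr;
    if (cudaMalloc(&dParams, numBlocks * sizeof(BlockParams)) != cudaSuccess) return 2;
    if (cudaMalloc(&dOut, outCount * sizeof(int)) != cudaSuccess) { cudaFree(dParams); return 2; }
    cudaMemcpy(dParams, hParams.data(), numBlocks * sizeof(BlockParams), cudaMemcpyHostToDevice);

    useBlockParams<<<numBlocks, threadsPerBlock>>>(dParams, dOut);
    cudaError_t err = cudaGetLastError();

    std::vector<int> hOut(outCount);
    if (err == cudaSuccess)
        err = cudaMemcpy(hOut.data(), dOut, outCount * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dParams);
    cudaFree(dOut);
    if (err != cudaSuccess) return 2;

    // Sum of (b + f) for f = 0..12 is 13*b + 78; each thread adds its own index.
    for (int b = 0; b < numBlocks; ++b)
        for (int t = 0; t < threadsPerBlock; ++t)
            if (hOut[size_t(b) * threadsPerBlock + t] != 13 * b + 78 + t) return 1;
    return 0;
}
```

Calling run_shared_params_demo() from main() runs the check. Note that a 13-int struct is 52 bytes, so each block's data starts at an address that is not a multiple of 64 or 128 bytes. On early hardware with strict alignment rules for coalescing, padding the struct to 16 ints may be needed to get fully coalesced loads.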

Coalescing issues can be mitigated by binding that memory to a texture on the GPU, because the reads then go through the texture cache. Note that texture memory is read-only.
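
A rough sketch of the texture route, assuming the per-block data is a flat array of 13 ints per block. It uses the texture object API (cudaCreateTextureObject, tex1Dfetch), which needs a newer toolkit and GPU than the older bind-a-texture-reference style. The names sumViaTexture and run_texture_demo, the fill pattern, and the placeholder work are made up for illustration.

```cuda
#include <cstdio>
#include <cstring>
#include <vector>
#include <cuda_runtime.h>

constexpr int kIntsPerBlock = 13;  // assumed: 13 ints of shared data per block

__global__ void sumViaTexture(cudaTextureObject_t tex, int* out)
{
    // Every thread fetches its block's 13 ints through the texture cache.
    int sum = 0;
    for (int i = 0; i < kIntsPerBlock; ++i)
        sum += tex1Dfetch<int>(tex, blockIdx.x * kIntsPerBlock + i);
    out[blockIdx.x * blockDim.x + threadIdx.x] = sum + threadIdx.x;
}

// Host-side driver. Returns 0 on success, 1 on a wrong result, 2 on a CUDA error.
int run_texture_demo()
{
    const int numBlocks = 2000, threadsPerBlock = 128;
    const size_t inCount = size_t(numBlocks) * kIntsPerBlock;
    const size_t outCount = size_t(numBlocks) * threadsPerBlock;

    // Assumed test pattern: int f of block b holds b + f.
    std::vector<int> hIn(inCount);
    for (int b = 0; b < numBlocks; ++b)
        for (int f = 0; f < kIntsPerBlock; ++f)
            hIn[size_t(b) * kIntsPerBlock + f] = b + f;

    int* dIn = nullptr;
    int* dOut = nullptr;
    if (cudaMalloc(&dIn, inCount * sizeof(int)) != cudaSuccess) return 2;
    if (cudaMalloc(&dOut, outCount * sizeof(int)) != cudaSuccess) { cudaFree(dIn); return 2; }
    cudaMemcpy(dIn, hIn.data(), inCount * sizeof(int), cudaMemcpyHostToDevice);

    // Describe the linear device buffer as a 1D texture of ints.
    cudaResourceDesc resDesc;
    memset(&resDesc, 0, sizeof(resDesc));
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = dIn;
    resDesc.res.linear.desc = cudaCreateChannelDesc<int>();
    resDesc.res.linear.sizeInBytes = inCount * sizeof(int);

    cudaTextureDesc texDesc;
    memset(&texDesc, 0, sizeof(texDesc));
    texDesc.readMode = cudaReadModeElementType;  // return raw ints, no conversion

    cudaTextureObject_t tex = 0;
    if (cudaCreateTextureObject(&tex, &resDesc, &texDesc, nullptr) != cudaSuccess) {
        cudaFree(dIn); cudaFree(dOut);
        return 2;
    }

    sumViaTexture<<<numBlocks, threadsPerBlock>>>(tex, dOut);
    cudaError_t err = cudaGetLastError();

    std::vector<int> hOut(outCount);
    if (err == cudaSuccess)
        err = cudaMemcpy(hOut.data(), dOut, outCount * sizeof(int), cudaMemcpyDeviceToHost);
    cudaDestroyTextureObject(tex);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return 2;

    // Sum of (b + f) for f = 0..12 is 13*b + 78; each thread adds its own index.
    for (int b = 0; b < numBlocks; ++b)
        for (int t = 0; t < threadsPerBlock; ++t)
            if (hOut[size_t(b) * threadsPerBlock + t] != 13 * b + 78 + t) return 1;
    return 0;
}
```

The kernel only reads through the texture and writes its results to an ordinary global buffer, which is consistent with the read-only restriction mentioned above.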